Why are data teams switching to Apache Iceberg?

Let's dive into the hype behind Apache Iceberg and break down what it is, why data teams are using it, and whether or not you should too.

February 12, 2025


Apache Iceberg has been one of the hottest topics in the data world for the past year. There's been plenty of opinionated discourse all over the internet, from LinkedIn to Reddit to Twitter/X, but how many teams are actually making the switch to Iceberg, and why did they decide to do it?

After scouring the internet for takes and talking to Metaplane customers who made the switch themselves, here’s why data teams are switching to Apache Iceberg.

Wait, what is Apache Iceberg?

Yeah, we should probably cover that first, in case you've been seeing the name everywhere but aren't sure what it actually is.

At its core, Apache Iceberg is an open source, high-performance table format for massive analytic datasets. Think of it as a smart layer that sits between your data lake storage (like S3 or HDFS) and your query engines.

First developed by the data team at Netflix, Iceberg is now a top-level project of the Apache Software Foundation.

Unlike traditional file organization methods, Iceberg brings database-like capabilities to your data lake: safe schema changes, time travel and rollback, ACID transactions at scale, and faster queries.

Plus, since it's open source and isn't bound to platform-specific formats, you can query the same data from different engines, helping you centralize your data and save costs by not storing copies in multiple warehouses.

The architecture of Apache Iceberg

Without getting too bogged down in Iceberg architecture, I think it’s still helpful to understand the different layers that make up the Iceberg format.

[Diagram: the architecture of an Apache Iceberg table]

The catalog layer: Your source of truth

Think of the catalog layer as a directory that keeps track of all your tables. It's the first point of contact when you want to work with your data, and it maintains two essential pieces of information:

  • Where each table is located
  • The current metadata pointer for each table

The catalog can be implemented in different ways depending on your setup. You might use AWS Glue Catalog, Hive Metastore, or a custom solution like Nessie. Regardless of which implementation you choose, the catalog ensures that all table operations (like creates, renames, and deletes) happen atomically.
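
To make that concrete, here's a minimal sketch of wiring a Spark session to an Iceberg catalog. This is an illustration, not a prescription: it assumes PySpark with the Iceberg Spark runtime jar on the classpath, and the catalog name `lake`, the namespace `db`, and the local warehouse path are all placeholders you'd swap for your own setup.

```python
from pyspark.sql import SparkSession

# A minimal sketch: one Spark session wired to an Iceberg catalog.
# Assumes the Iceberg Spark runtime jar is on the classpath; "lake" and
# the warehouse path below are illustrative placeholders.
spark = (
    SparkSession.builder.appName("iceberg-demo")
    # Enable Iceberg's SQL extensions (time travel syntax, procedures, etc.)
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register a catalog named "lake", backed by Iceberg
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    # A filesystem ("hadoop") catalog for local testing; in production this
    # is where you'd point at Hive Metastore, AWS Glue, or Nessie instead
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.db")
spark.sql("CREATE TABLE IF NOT EXISTS lake.db.events (id INT, ts TIMESTAMP) USING iceberg")
```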

The metadata layer: Where the magic happens

The metadata layer is what sets Iceberg apart from traditional table formats. It uses a hierarchy of JSON files to track everything about your table's structure and history. Here's how the metadata is organized:

Snapshots are at the top of the hierarchy. They're like photographs of your table at specific points in time, making features like time travel possible. Each snapshot points to a manifest list.

Manifest lists act as an index of all the manifest files for a particular snapshot. They include partition-level statistics that help Iceberg quickly identify which data files might be relevant for a query.

Manifest files are the detailed record-keepers. They track:

  • Individual data files and their locations
  • File-level statistics
  • Partition information
  • Which files have been added or deleted

The beauty of this layer is that all metadata files are immutable—once created, they're never modified. When changes happen, Iceberg creates new metadata files and updates the pointers accordingly. This approach makes concurrent operations much safer and more reliable.
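
You don't have to take this on faith, either: Iceberg exposes the metadata layer as queryable metadata tables. A quick sketch, reusing the hypothetical `lake.db.events` table from the catalog example above:

```python
# Snapshots: one row per committed table state
spark.sql("SELECT committed_at, snapshot_id, operation FROM lake.db.events.snapshots").show()

# Manifests: the manifest files behind the current snapshot
spark.sql("SELECT path, added_data_files_count FROM lake.db.events.manifests").show()

# Files: individual data files with their per-file statistics
spark.sql("SELECT file_path, record_count FROM lake.db.events.files").show()
```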

The data layer: Your actual data

The data layer is straightforward: it's where your actual data files live. While Iceberg supports multiple formats, most teams use Apache Parquet files. These files are immutable and contain the row-level data that your queries ultimately need to access.

How it all works together

Let's look at what happens during everyday operations:

[Diagram: how write and read operations work with Apache Iceberg]

When reading data:

  1. The query engine asks the catalog where to find the table
  2. The catalog provides the current metadata location
  3. The engine reads the metadata files, using statistics at each level to figure out exactly which files it needs
  4. Only the necessary data files are read, making queries much faster
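
From the caller's side, all of those hops are invisible; they happen behind an ordinary query. A sketch against the hypothetical `lake.db.events` table from earlier:

```python
# Behind this single query, the engine asks the catalog for the current
# metadata, prunes manifests and data files using their statistics, and
# reads only the Parquet files that could contain matching rows.
recent = spark.sql(
    "SELECT id, ts FROM lake.db.events WHERE ts >= TIMESTAMP '2025-01-01 00:00:00'"
)
recent.show()
```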

When writing data:

  1. New data files are written
  2. New manifest files are created to track these files
  3. A new snapshot is created
  4. The catalog is updated to point to the new snapshot
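
The write path is just as unremarkable to the caller; every successful commit quietly produces a new snapshot. Continuing the same sketch:

```python
# Each write commits atomically: new data files, new manifests, a new
# snapshot, and finally an updated pointer in the catalog.
spark.sql("INSERT INTO lake.db.events VALUES (1, TIMESTAMP '2025-02-01 09:30:00')")
spark.sql("INSERT INTO lake.db.events VALUES (2, TIMESTAMP '2025-02-02 10:15:00')")

# Two inserts -> two new snapshots in the table's metadata
spark.sql("SELECT committed_at, snapshot_id, operation FROM lake.db.events.snapshots").show()
```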

This layered approach has a handful of benefits:

  • Excellent query performance through multiple levels of filtering
  • Reliable concurrent operations
  • Easy schema evolution
  • Point-in-time queries and rollbacks

The power of Iceberg's architecture becomes particularly clear when dealing with large-scale data operations. Suddenly, adding a column to a petabyte-scale table or querying last week's snapshot of your data becomes much easier.

What does Apache Iceberg replace?

Historically, data teams have relied on Hive tables or direct file access patterns to manage their data lakes. These approaches worked fine when data volumes were smaller and requirements simpler, but they're showing their age now that data teams are trying to squeeze more out of larger data sets.

Iceberg effectively replaces:

  • Apache Hive table formats
  • Raw file management in data lakes
  • Custom solutions for managing partitioning and schema evolution
  • Complex workarounds for maintaining data consistency

Why data teams are making the switch to Iceberg

We’ve covered Iceberg architecture and how it works, but let’s talk about why real teams have actually been making the switch.

1. Save on data warehouse cost

Data teams are always looking for ways to cut down their spend, and for teams storing large amounts of data in multiple places, Iceberg helps you do just that. When talking to some of our own customers who’ve switched to Iceberg, this is usually the first thing that comes up. We know data warehouse spend can creep up fast, and if you can eliminate the redundancy of having to store data in multiple locations, it can save a lot of money.

2. Better query performance

The next most common Iceberg benefit we heard from our customers is how much better their query performance is. Depending on how large your data sets are, the performance improvements of switching to Iceberg can be massive, up to 10x faster in some cases. Iceberg's intelligent file pruning and metadata handling mean you're scanning way less data to get the answers you need.

3. Schema evolution that won’t break your pipeline

Iceberg handles schema changes much more gracefully than traditional table formats, allowing data teams to add, drop, or reorder columns without breaking existing queries. When schema changes occur, Iceberg maintains backward compatibility by tracking schema versions in the metadata layer, so existing queries and pipelines continue to run smoothly while new data adopts the updated schema.
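
As a sketch of what that looks like in practice, using Spark SQL with the Iceberg extensions from earlier (the `country` column is invented for illustration):

```python
# Schema changes in Iceberg are metadata-only: no data files get rewritten.
spark.sql("ALTER TABLE lake.db.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE lake.db.events RENAME COLUMN ts TO event_ts")
# Safe type widening is supported too, e.g. int -> bigint
spark.sql("ALTER TABLE lake.db.events ALTER COLUMN id TYPE BIGINT")
```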

4. Time travel without the complexity

Querying data at a certain point in time is usually a pretty complex operation, but thanks to Iceberg's snapshot architecture, it's much more straightforward. The ease of rolling back changes has been an especially popular feature among data teams.
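
A minimal sketch of both, using the time-travel syntax from Iceberg's Spark integration and a snapshot ID pulled from the table's own metadata:

```python
# Grab the oldest snapshot ID from the table's metadata
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM lake.db.events.snapshots ORDER BY committed_at"
).first()["snapshot_id"]

# Query the table exactly as it looked at that snapshot...
spark.sql(f"SELECT * FROM lake.db.events VERSION AS OF {first_snapshot}").show()

# ...or as of a wall-clock time
spark.sql("SELECT * FROM lake.db.events TIMESTAMP AS OF '2025-02-01 12:00:00'").show()

# Rolling back is a single stored-procedure call
spark.sql(f"CALL lake.system.rollback_to_snapshot('db.events', {first_snapshot})")
```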

5. Multi-table transactions that actually work

If you've ever tried to maintain consistency across multiple tables in a data lake, you know it's traditionally been somewhere between painful and impossible. Iceberg's ACID transactions make this a non-issue, which is a game-changer for teams dealing with interconnected data systems.

Should you switch to Iceberg, too?

For all the chatter there's been around Iceberg for the past year, the reality is that not many teams are in a position to drop everything and make the switch. Sure, companies like Netflix, Stripe, and Pinterest are doing it, but they have a lot of resources at their disposal. Plus, they're probably handling more data than 99.9% of companies.

Consider switching if:

  • Your data lake is growing beyond 10TB and query performance is becoming a concern
  • You're dealing with frequent schema changes that cause downstream issues
  • You need better point-in-time analysis capabilities
  • You're building a lakehouse architecture from scratch
  • Your team is spending too much time managing partitioning strategies

Maybe hold off if:

  • Your data volumes are relatively small (<1TB)
  • You're heavily invested in a proprietary data warehouse solution that's meeting your needs
  • Your team doesn't have the bandwidth for a migration right now
  • Your use cases are simple and don't require Iceberg's advanced features

While it's not the right choice for every team yet, the benefits are compelling enough that we're seeing a clear trend toward adoption, especially among teams dealing with large-scale analytics workloads.

If you're feeling the pain of managing a growing data lake, dealing with schema evolution headaches, or just want more reliable and performant analytics, Iceberg might be worth a closer look. The learning curve is surprisingly low, and the benefits become apparent pretty quickly.

Check out the Apache Iceberg documentation, or better yet, spin up a test environment and see for yourself what all the buzz is about.
