Iceberg-Powered Unification: Why Table/Stream Duality Will Redefine ETL in 2025!
Traditional ETL has long revolved around batch processing pipelines—Spark jobs ingest static, bounded datasets from data lakes or warehouses, perform transformations, and write results to another table or file system. Meanwhile, real-time ingestion platforms such as Apache Kafka, Apache Pulsar, and Redpanda have specialized in high-throughput, low-latency pipelines but often struggled with storing and reprocessing data at a massive scale.
Fast-forward to 2024: multiple vendors began embracing the concept of table/stream duality, promising a revolution that will change the ETL market starting in 2025. At the center of this evolution sits Apache Iceberg—no longer just a “sink” for streaming data but a fully integrated storage layer that provides both the continuous stream and a table-oriented data view.
This post will explore how Apache Iceberg unlocks unbounded data storage, merging batch and streaming pipelines under one unified architecture and heralding a new era for ETL.
Why Was ETL Unconquered Territory for Streaming Systems?
Stream processing systems—Kafka, Pulsar, Redpanda, etc.—excel at real-time data ingestion but have historically faced severe limitations around unbounded data retention. Storing entire data histories in logs or topics becomes impractical at petabyte scale, forcing teams to rely on external data lakes or warehouses for reprocessing. While these streaming platforms have expanded their storage capabilities (Kafka’s tiered storage, Pulsar’s BookKeeper, etc.), unbounded storage still poses a challenge and significantly increases the complexity of managing these systems at scale.
This bounded retention meant streaming engines primarily focused on the “real-time window” of data rather than the entire data estate. In other words, streaming vendors couldn’t truly conquer the ETL world, because ETL usually requires processing all historical data at any given time.
With these limitations in mind, let’s see how the ETL landscape looked before Iceberg/stream unification—and why batch processing engines took center stage.
ETL Workloads: The Domain of Batch Processing
Until now, ETL has been synonymous with batch-based workflows. Hadoop, Spark, and similar engines crunch massive datasets by reading them from data lakes—like HDFS or object stores—and producing new tables or files. This reliance on large-scale historical data is precisely why batch engines have dominated the ETL space. For instance:
- Databricks’ Bronze-Silver-Gold (medallion) architecture: Data is ingested into Bronze tables, cleansed into Silver tables, and aggregated into Gold tables (a minimal code sketch follows this list).
- High-latency batch runs: Large transformations occurred periodically (hourly, daily, weekly).
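For a sense of what this looks like in practice, here is a minimal sketch of a Bronze-to-Silver batch step, assuming a Spark session configured with an Iceberg catalog named `demo` and hypothetical `bronze_events` / `silver_events` tables with illustrative columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes a Spark session configured with an Iceberg catalog named "demo"
# and a hypothetical raw table demo.db.bronze_events with illustrative columns.
spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze = spark.read.format("iceberg").load("demo.db.bronze_events")

# Cleanse: drop rows without an event id, deduplicate, and normalize the timestamp.
silver = (
    bronze
    .filter(F.col("event_id").isNotNull())
    .dropDuplicates(["event_id"])
    .withColumn("event_ts", F.to_timestamp("event_ts"))
)

# Periodic batch run: replace the Silver table with the latest cleansed view.
silver.writeTo("demo.db.silver_events").createOrReplace()
```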
In contrast, real-time engines like Apache Flink typically focused on bounded windows of recent data, making them well suited to streaming analytics but not to reprocessing large histories in a single run.
The advent of Apache Iceberg opened up several opportunities for streaming vendors at once. Here's how they cashed in!
Introducing Apache Iceberg as the De Facto Storage Layer
Apache Iceberg is a high-performance table format that leverages object storage (S3, GCS, Azure Blob, etc.) to maintain datasets of virtually any scale. It was originally designed for batch analytics, but it’s increasingly recognized as the ideal foundation for table/stream duality. Why?
- Metadata and Snapshot Management: Iceberg tracks data via immutable snapshots, enabling time travel, efficient schema evolution, and consistent reads/writes (a short code sketch later in this section shows this in action).
- Near-Real-Time Reflection: New data can be committed to Iceberg in small, frequent batches, so the table closely mirrors the continuous stream. The trade-off is extra compaction overhead, because keeping latency low means you cannot batch for long and end up writing many small files.
- Deep Integration vs. “Just a Sink”: In 2024, major stream vendors like Confluent (Kafka) and StreamNative (Pulsar) announced deep integrations where Iceberg isn’t merely a final destination (sink) for data but the primary storage engine.
By making Iceberg native to their storage layers, these vendors can unify operational streaming with analytical tables. Instead of duplicating data, the same bytes in object storage serve both a streaming interface (through Kafka or Pulsar APIs) and a SQL-friendly table interface (Iceberg).
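To ground this, here is a minimal sketch of working with those snapshots from Spark, assuming a session wired to an Iceberg catalog named `demo` and a hypothetical `demo.db.events` table that receives frequent small commits; the timestamp and snapshot ID are illustrative:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session configured with an Iceberg catalog named "demo"
# and a hypothetical table demo.db.events receiving frequent small commits.
spark = SparkSession.builder.appName("iceberg-snapshots").getOrCreate()

# Every commit creates an immutable snapshot; inspect them via the metadata table.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots"
).show(truncate=False)

# Time travel: read the table exactly as it looked at an earlier point in time.
past = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", "1735689600000")   # epoch millis, illustrative
    .load("demo.db.events")
)

# Or pin a specific snapshot for fully reproducible batch reads.
pinned = (
    spark.read.format("iceberg")
    .option("snapshot-id", "8924581234567890123")  # illustrative snapshot id
    .load("demo.db.events")
)
```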
This near-real-time capability paves the way for table/stream duality, which we’ll explore next.
Table/Stream Duality Explained
At its core, table/stream duality means:
- A stream can be projected as a table. You can see the continuous flow of events in a relational way.
- A table’s changes can be materialized as a stream. Each row-level update can be re-emitted as an event.
Because unbounded data is just a superset of bounded data, we can think of batch as a special case of streaming (when you limit the time window). With Iceberg, the underlying files remain one dataset—while snapshots, manifests, and metadata management allow either:
- Real-time streaming consumption, or
- Batch snapshot reads at any point in time.
Small, continuous writes to Iceberg create a near real-time “stream,” and bigger snapshot queries treat those same underlying Parquet files as a consistent table.
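Here is a minimal sketch of that duality with Spark, assuming an Iceberg catalog named `demo` and a hypothetical `demo.db.events` table: the same files are consumed incrementally as a stream and read as a consistent snapshot.

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg catalog "demo" and a hypothetical table demo.db.events
# that a streaming writer keeps appending to.
spark = SparkSession.builder.appName("duality-demo").getOrCreate()

# Stream view: incrementally consume new snapshots as they are committed.
events_stream = (
    spark.readStream
    .format("iceberg")
    .option("stream-from-timestamp", "1735689600000")  # start point, epoch millis
    .load("demo.db.events")
)

query = (
    events_stream.writeStream
    .format("console")  # illustrative sink; any streaming sink works
    .option("checkpointLocation", "/tmp/chk/duality-demo")
    .start()
)

# Table view: a consistent batch snapshot over the very same Parquet files.
events_snapshot = spark.read.format("iceberg").load("demo.db.events")
print(events_snapshot.count())
```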
In the next section, we'll explore some of the announcements and ongoing efforts by streaming vendors in this direction.
Industry Announcements and Integrations
Several new integrations are accelerating this trend:
- Confluent: Deep integration of Iceberg into Kafka’s tiered storage, turning Kafka topics into first-class Iceberg tables.
- StreamNative (Pulsar): Similar approach using BookKeeper for short-term retention and Iceberg for unbounded storage.
- RisingWave and Others: Real-time analytics tools that rely on Iceberg for a consistent, snapshot-based view of streaming data.
The critical difference between simply writing to Iceberg through a connector and building Iceberg into the storage engine lies in how compaction, schema evolution, and real-time snapshots are handled. With native integration, you avoid data duplication, keep latency low, and still get features like schema evolution out of the box.
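As a small illustration of that out-of-the-box schema evolution, here is a sketch using Spark SQL against a hypothetical `demo.db.events` table; Iceberg records the change in table metadata without rewriting existing data files:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with an Iceberg catalog "demo" and a hypothetical
# table demo.db.events; Iceberg applies the change as a metadata-only operation.
spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Add a column; existing Parquet data files are not rewritten.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (device_type string)")

# New writes can populate the column; existing rows simply read it as null.
spark.sql("DESCRIBE TABLE demo.db.events").show(truncate=False)
```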
With these integrations, let’s look at how the ETL market stands to be completely transformed.
The Changing ETL Landscape
By leveraging unbounded data storage via Iceberg, today’s stream processing engines can do what once required separate batch pipelines:
- Full Historical Context: Any stream job can reprocess entire historical datasets because the data is persistently stored in Iceberg.
- Real-Time + Batch in One: You can run real-time transformations on the latest events without losing access to older data.
In 2025, streaming vendors like Confluent and StreamNative can directly challenge traditional ETL incumbents, offering a unified experience:
- Bounded vs. Unbounded Data: Bounded data sets (e.g., last 24 hours) are a subset of the overall unbounded data.
- Developer Workflow: Rather than building complex data pipelines across multiple systems, developers can use Kafka topics as both the “extract” source and the “load” destination. This might look like the following (a code sketch follows this list):
- Extract: Subscribe to an existing topic, which is simultaneously an Iceberg table. Because the data is stored as Parquet files, streaming jobs only need to read the relevant columns, boosting performance significantly.
- Transform: Filter or aggregate using a stream processing layer (Flink, Spark, or Confluent’s ksqlDB).
- Load: Write the results back to another topic that is automatically an Iceberg table as well, with no extra duplication steps.
This synergy eliminates the need for separate data-lake ingestion jobs!
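Here is a minimal sketch of that loop written with Spark Structured Streaming over Iceberg tables. The catalog (`demo`), table names, and columns are hypothetical, and exactly how a topic surfaces as an Iceberg table depends on the vendor integration; the point is that extract, transform, and load all operate on the same storage layer:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical names: demo.db.orders_topic is a topic exposed as an Iceberg
# table, and demo.db.orders_by_region is the destination topic/table.
spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

# Extract: subscribe to the topic through its Iceberg table; selecting only the
# needed columns lets Parquet column pruning skip everything else.
orders = (
    spark.readStream
    .format("iceberg")
    .load("demo.db.orders_topic")
    .select("order_id", "region", "amount", "event_ts")
)

# Transform: aggregate amounts per region over event-time windows.
by_region = (
    orders
    .withWatermark("event_ts", "10 minutes")
    .groupBy(F.window("event_ts", "5 minutes"), "region")
    .agg(F.sum("amount").alias("total_amount"))
)

# Load: append the results to another Iceberg table, which the platform can
# expose as a topic again without a separate ingestion job.
query = (
    by_region.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/chk/streaming-etl")
    .toTable("demo.db.orders_by_region")
)
query.awaitTermination()
```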
Another welcome by-product of this new paradigm is a shift in compaction responsibilities. Let's look at it in detail.
Shifting Compaction Responsibilities
One of the biggest operational challenges when using Parquet or ORC files is compaction—merging lots of small files into larger ones for read efficiency. Traditionally, this was:
- Optional or manual in some setups, often run as a separate batch job.
- A core concern of data lake ecosystems, but not of streaming systems.
With native Iceberg integration, compaction becomes a built-in background maintenance job that the storage engine handles automatically.
Now:
- You can write small Parquet files to maintain near-zero latency.
- The system merges these small files into bigger ones behind the scenes, ensuring query performance and cost efficiency.
This is drastically different from using an Iceberg connector, where you had to live with both:
- Low-latency ingestion that produces far too many small files, and
- Separate Spark jobs you run yourself to compact the tables (the sketch below shows what such a maintenance job typically looks like).
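For contrast, here is a sketch of that maintenance job, using Iceberg's Spark procedures (which require the Iceberg SQL extensions) against a hypothetical `demo` catalog and `db.events` table:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Iceberg SQL extensions and a catalog "demo";
# db.events is a hypothetical table that accumulates many small files.
spark = SparkSession.builder.appName("manual-compaction").getOrCreate()

# Merge small data files into larger ones for read efficiency.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table   => 'db.events',
        options => map('target-file-size-bytes', '536870912')  -- roughly 512 MB
    )
""")

# Expire old snapshots so files replaced by compaction can eventually be removed.
spark.sql("CALL demo.system.expire_snapshots(table => 'db.events', retain_last => 10)")
```

With native integration, the storage engine schedules equivalent rewrites in the background, so this job no longer lives in your pipeline.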
Closing Thoughts
Table/stream duality stands ready to reshape the ETL market in 2025 and beyond. By combining unbounded storage (via Iceberg) with real-time data streams (via Kafka, Pulsar, or Redpanda), organizations can now:
- Eliminate the divide between batch and streaming pipelines for ETL.
- Simplify their architecture using a single, consistent data layer for real-time and historical reads.
- Accelerate development timelines, as ETL becomes just another streaming application with immediate access to all historical data.
Looking ahead, we can anticipate bi-directional flows between tables and streams, letting data teams materialize Iceberg tables as real-time topics (and vice versa) with minimal friction. With no ETL duplication and no wasted data movement, the opportunity for innovation and cost savings is vast. Now is the time for data teams to explore these new integrations, adopt real-time-first strategies, and harness the full potential of unbounded data.
In summary, 2025 will be the year when streaming platforms truly step into the ETL spotlight, powered by Apache Iceberg and the table/stream duality that unifies bounded and unbounded data in one modern data architecture. Do you think it's time to embrace the future of ETL? I would love to know your thoughts.
Resources
- Jack Vanlightly's excellent write-up on table/stream duality: https://jack-vanlightly.com/blog/2024/3/19/tableflow-the-stream-table-kafka-iceberg-duality
- Tableflow announcement by Confluent: https://www.confluent.io/blog/introducing-tableflow/
- Ursa whitepaper: https://streamnative.io/whitepapers/ursa-whitepaper-the-kafka-api-compatible-data-streaming-engine-built-on-the-lakehouse
- StreamNative Ursa engine announcement: https://streamnative.io/blog/ursa-reimagine-apache-kafka-for-the-cost-conscious-data-streaming
- RisingWave's GitHub issue on using Iceberg for native storage: https://github.com/risingwavelabs/risingwave/issues/19418