Reimagining Query Planners and Metadata Services in the Age of Iceberg Lakehouse Tables
The Apache Iceberg project introduced an open table format for storing structured data in object storage systems (e.g., AWS S3, MinIO) for analytical workloads. Much like the S3 protocol became the de facto standard for accessing and managing object storage, Iceberg’s catalog API has the potential to become a new standard interface for decoupling and disaggregating the components of a lakehouse query engine. This vision not only separates storage from compute, but also imagines separating components inside the query engine itself, such as the metadata service, enabling new heights of architectural flexibility and independent innovation.
In this blog post, we’ll explore:
- How lakehouse compute engine query planners depend on rich metadata for efficient execution.
- Why decoupling components such as planners and the metadata layer can accelerate innovation.
- How queryable metadata (stored in Iceberg tables) can accelerate the development of tools like cost-based optimizers, query guardrails, compaction planners, and more.
- The challenges of collecting detailed statistics (especially for non-partitioned columns in streaming scenarios).
- A call to evolve the Iceberg Catalog API to realize these possibilities.
Why Query Planners Need Rich Table-Column Statistics
Let's start the discussion of the catalog API's role in accelerating innovation with one of the most critical components of a lakehouse query engine, or of any database system in general: the query planner.
Query planners are central to database and analytical systems, determining the most efficient way to execute queries. They use statistics about row counts, column cardinality, partitioning schemes, and data distribution to decide optimal execution strategies. A well-informed planner can:
- Estimate Costs: Accurately judge the resource cost of various join strategies (e.g., hash join vs. nested loop).
- Optimize Joins: Handle data skew effectively to avoid bottlenecks.
- Push Down Predicates: Filter unnecessary data early, reducing scan overhead.
- Minimize Data Shuffle: Properly sized shuffles are crucial; an inaccurate plan can cause excessive network I/O and degrade performance.
A poor query plan—for example, choosing full table scans on huge data sets or misestimating cardinalities—can lead to massive data shuffles, saturating the network and prolonging execution times. Hence, rich, accurate metadata is essential.
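To make this concrete, here is a minimal sketch of a stats-driven join-strategy choice. The statistics fields, threshold values, and strategy names are illustrative assumptions; real planners weigh far more factors (memory budgets, sort orders, clustering, and so on).

```python
# A minimal, hypothetical sketch of stats-driven join selection.
# The thresholds and strategy names are illustrative only.
from dataclasses import dataclass

@dataclass
class TableStats:
    row_count: int          # total rows in the table
    avg_row_bytes: int      # average serialized row size
    join_key_ndv: int       # number of distinct values of the join key

BROADCAST_LIMIT_BYTES = 10 * 1024 * 1024  # ~10 MB, an assumed engine default

def choose_join_strategy(left: TableStats, right: TableStats) -> str:
    """Pick a join strategy from coarse table statistics."""
    small = min(left, right, key=lambda t: t.row_count * t.avg_row_bytes)
    small_bytes = small.row_count * small.avg_row_bytes
    # Broadcasting the small side avoids shuffling the big side entirely.
    if small_bytes <= BROADCAST_LIMIT_BYTES:
        return "broadcast-hash-join"
    # Low NDV on the join key signals skew: a plain hash partition would
    # overload a few tasks, so prefer a skew-aware strategy.
    if min(left.join_key_ndv, right.join_key_ndv) < 100:
        return "skew-aware-shuffle-join"
    return "shuffle-hash-join"

print(choose_join_strategy(
    TableStats(row_count=50_000, avg_row_bytes=100, join_key_ndv=10_000),
    TableStats(row_count=2_000_000_000, avg_row_bytes=200, join_key_ndv=5_000_000),
))  # -> broadcast-hash-join
```

Note how every branch depends on a statistic: without accurate row counts and NDVs, even this toy planner degrades to guessing.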
Iceberg’s Catalog API: The Next Frontier Enabling Disaggregation
Historically, query planners have been tightly coupled with analytical engines. Apache Iceberg makes it possible to challenge that norm by standardizing how metadata is stored and accessed. This opens the door for:
- Complete Disaggregation: Much as the S3 protocol enabled the separation of storage from compute, Iceberg’s catalog API can separate query planners, optimizers, and metadata services from each other, if it evolves to offer a standard communication interface between these components.
- Cross-Engine Compatibility: A standardized interface means a query planner (or any other component) can operate across multiple engines without special adapters. A lakehouse query engine could be assembled like Lego pieces! Imagine a docker-compose file that wires together a query planner, a metadata stats collector, and a query executor of your choice! Wouldn't that be cool?
- Faster Innovation: By decoupling services like column-level metadata collection from the query engine, tools such as cost-based optimizers, query guardrails, or compaction planners can evolve independently without waiting for monolithic engine releases.
Although today’s Iceberg Catalog API does not fully provide standardized interfaces for serving query plans or rich metadata, the promise is compelling. We’re seeing the beginnings of a future in which data architecture is so modular that each part—storage, metadata, execution—can be swapped or upgraded independently.
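To illustrate the Lego-pieces idea, here is a hypothetical sketch of the kind of standard interfaces such a modular engine could be assembled around. None of these types exist in Iceberg today; they are an assumption about the contracts a disaggregated architecture would need.

```python
# Hypothetical interfaces for a disaggregated lakehouse engine. These
# protocols do NOT exist in Apache Iceberg today; they sketch what a
# standardized contract between swappable components could look like.
from typing import Any, Protocol

class MetadataService(Protocol):
    def column_stats(self, table: str, column: str) -> dict[str, Any]:
        """Return stats such as NDV and histograms for a column."""
        ...

class QueryPlanner(Protocol):
    def plan(self, sql: str, metadata: MetadataService) -> Any:
        """Produce an engine-agnostic plan using the metadata service."""
        ...

class Executor(Protocol):
    def execute(self, plan: Any) -> Any:
        """Run a plan produced by any conforming planner."""
        ...

def run_query(sql: str, planner: QueryPlanner,
              metadata: MetadataService, executor: Executor) -> Any:
    # Any planner/metadata/executor implementation can be swapped in,
    # much like containers in a docker-compose file.
    return executor.execute(planner.plan(sql, metadata))
```

Any vendor's planner, any team's stats collector, and any engine's executor could plug into the same `run_query` wiring, provided all agree on the interfaces.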
Decoupled Metadata as a Queryable Iceberg Table
One fascinating idea is to store detailed metadata in denormalized, queryable Iceberg tables. This would create a standalone “metadata collection layer” with its own life cycle, allowing independent services to:
- Collect and Aggregate Stats: Gather file- and column-level statistics, such as histograms and NDV (number of distinct values), even for non-partitioned columns.
- Provide Query Guardrails: Externally analyze stats to prevent or warn against expensive queries (e.g., large shuffles or skewed joins).
- Leverage Cost-Based Optimizers: Supply advanced stats to third-party optimizers, which can then propose better query plans.
- Enable Stream-Aware Planning: Account for rapidly arriving data in real-time, adjusting query plans on the fly.
With metadata as a first-class data set, you could run an Iceberg table query to glean insights about data distribution, streaming ingestion rates, or partition health—just as you would with any other data set. This is a major paradigm shift: metadata moves from an engine-specific afterthought to a universal resource that any service can query.
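Iceberg already gives a taste of this: engines such as Spark expose a table's metadata as queryable metadata tables. Here is a sketch, assuming a SparkSession wired to an Iceberg catalog and a table named db.events (both names are assumptions for this example):

```python
# Querying Iceberg's built-in metadata tables with Spark SQL.
# Assumes a SparkSession configured with an Iceberg catalog and a
# table named db.events; adjust names for your environment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-file stats: sizes and record counts, useful for spotting small files.
spark.sql("""
    SELECT file_path, file_size_in_bytes, record_count
    FROM db.events.files
    ORDER BY file_size_in_bytes ASC
""").show()

# Partition health: rows and files per partition reveal skew.
spark.sql("""
    SELECT partition, record_count, file_count
    FROM db.events.partitions
""").show()

# Snapshot history: how fast is data arriving?
spark.sql("""
    SELECT committed_at, operation, summary['added-records'] AS added
    FROM db.events.snapshots
    ORDER BY committed_at DESC
""").show()
```

The paradigm-shift question is whether this metadata can be served through a standard interface to any component, rather than only inside engines that embed Iceberg's libraries.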
Tools and Services Leveraging the Decoupled Layer
Once you have queryable metadata, an ecosystem of independent tools can flourish. For instance:
- Compaction Planners
- Why: Periodic compaction is vital for small-file consolidation and efficient future queries.
- How: By querying up-to-date file statistics from the metadata layer, these planners can decide which files to merge for optimal resource use.
- Query Planners
- Why: A universal planner can act as a “brain” for multiple engines.
- How: It retrieves stats from the metadata service (including histograms, NDVs, and partition info) to build cost-based and shuffle-efficient plans.
- Query Guardrails
- Why: Unbounded queries or poorly filtered joins can be extremely expensive.
- How: Before execution, a guardrail service consults the metadata store to check if a query will likely cause large data scans or overwhelming shuffles.
- Advanced Optimizers & Stream-Aware Systems
- Why: Real-time data ingestion and high-frequency streaming can cause metadata to shift rapidly.
- How: With consistent updates to a queryable metadata table, stream-aware optimizers can re-plan or refine execution mid-flight, ensuring queries remain efficient even as data grows or changes distribution.
By disaggregating these modules, each service can innovate on its own timeline, experiment with algorithms, and iterate quickly without waiting for monolithic updates in a single query engine.
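As a concrete illustration of the guardrail idea, here is a minimal pre-flight check built on the files metadata table shown earlier. The table name, the 1 TiB limit, and the scan-size-only cost model are simplifying assumptions; a production guardrail would also model joins and shuffles.

```python
# A minimal pre-flight guardrail: estimate the bytes a query would scan
# by consulting the files metadata table, and block it over a threshold.
# The 1 TiB limit and partition predicate below are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog

SCAN_LIMIT_BYTES = 1 * 1024**4  # 1 TiB

def estimated_scan_bytes(table: str, partition_predicate: str = "true") -> int:
    """Sum file sizes for files that survive the query's partition filter."""
    row = spark.sql(f"""
        SELECT COALESCE(SUM(file_size_in_bytes), 0) AS bytes
        FROM {table}.files
        WHERE {partition_predicate}
    """).first()
    return row["bytes"]

def guardrail(table: str, partition_predicate: str = "true") -> None:
    bytes_scanned = estimated_scan_bytes(table, partition_predicate)
    if bytes_scanned > SCAN_LIMIT_BYTES:
        raise RuntimeError(
            f"Query would scan ~{bytes_scanned / 1024**4:.1f} TiB of {table}; "
            "add a partition filter or raise the limit."
        )

guardrail("db.events")  # raises if a full scan would exceed the limit
```

Crucially, nothing in this service needs to live inside the query engine: it only needs read access to the metadata layer.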
Challenges of Streaming Ingestion and Detailed Stats
Decoupling the metadata layer does not come without its challenges, especially in streaming environments:
- High Write Velocity
- Rapid data ingestion can outpace metadata updates. Maintaining low-latency metadata pipelines is essential but not trivial.
- Histogram and NDV Statistics for Non-Partitioned Columns
- Collecting and updating advanced stats (like column histograms and NDV) is expensive for columns that don't align with the partitioning scheme, since there is no pruning to narrow the work. Streaming systems often ingest data in micro-batches, which makes it hard to keep such stats fresh without constant recomputation (see the sketch after this list).
- Compaction Delays
- Many small files are a common side effect of streaming ingest. Compaction is necessary to maintain query efficiency and accurate stats, but doing so frequently can be resource-intensive.
- Real-Time Constraints
- Traditional batch-oriented approaches to stats gathering may not suit use cases demanding near-real-time insights. A new wave of streaming-aware metadata services is needed to handle the velocity of modern data pipelines.
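One practical answer to the streaming NDV problem is a mergeable sketch: each micro-batch updates the sketch cheaply, and sketches from parallel writers can be combined. Below is a toy K-Minimum-Values (KMV) estimator in pure Python to illustrate the principle; production systems use HyperLogLog or theta sketches (which Iceberg's Puffin files can already carry), not this implementation.

```python
# A toy K-Minimum-Values (KMV) sketch for streaming NDV estimation.
# Illustration only; real systems use HyperLogLog or theta sketches.
import hashlib

class KMVSketch:
    def __init__(self, k: int = 256):
        self.k = k
        self.mins: list[float] = []  # k smallest hash values seen, sorted

    def _hash01(self, value) -> float:
        """Hash a value to a uniform number in [0, 1)."""
        h = hashlib.sha1(str(value).encode()).digest()
        return int.from_bytes(h[:8], "big") / 2**64

    def update(self, value) -> None:
        x = self._hash01(value)
        if len(self.mins) < self.k:
            if x not in self.mins:
                self.mins.append(x)
                self.mins.sort()
        elif x < self.mins[-1] and x not in self.mins:
            self.mins[-1] = x  # replace the current largest minimum
            self.mins.sort()

    def estimate_ndv(self) -> float:
        if len(self.mins) < self.k:
            return float(len(self.mins))  # saw fewer than k distinct values
        # With k minimums of uniform hashes, the k-th smallest is ~ k / NDV,
        # giving the standard unbiased estimator (k - 1) / x_k.
        return (self.k - 1) / self.mins[-1]

sketch = KMVSketch()
for batch in (range(0, 5000), range(2500, 10000)):  # overlapping micro-batches
    for v in batch:
        sketch.update(v)
print(round(sketch.estimate_ndv()))  # ~10000 true distinct values, ±few %
```

Because sketches like this are small and mergeable, a decoupled metadata service could update them per micro-batch and publish them to a queryable stats table without rescanning history.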
Separation of Compute from Compute: A New Architectural Paradigm
Much like the decoupling of storage and compute, we now witness a “separation of compute from compute” movement. Each complex function—such as query planning, optimization, or guardrails—can become a microservice that communicates via standard interfaces like Iceberg’s Catalog API. This separation:
- Fosters Ecosystem Growth: Encourages specialized services to address niche problems (e.g., real-time compaction or AI-driven optimization).
- Enhances Interoperability: Consistent APIs lower the barrier to integrating new tools and engines seamlessly.
- Speeds Up Innovation: When services evolve independently, each can adopt the latest techniques without forcing downstream engine upgrades.
Realizing this idea would be great for the ecosystem: any number of compute and metadata services hooking into a standard storage and catalog layer.
Existing Precedents in Traditional Databases
In traditional relational databases, query optimization and metadata collection are typically unified within a single engine, which can gather and update extensive statistics as data changes. This holistic approach has historically given databases powerful query performance.
Iceberg already stores some metadata in “metadata tables”, but it remains limited in scope compared to the robust stats found in traditional databases. By extending these metadata tables to include richer, column-level Parquet stats (including real-time updates in streaming environments), lakehouses move one step closer to the capabilities of mature database systems—without surrendering their cloud-native, open architecture.
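Here is a sketch of what an external stats-collection job could look like: computing richer column stats with Spark and appending them to a dedicated Iceberg stats table. The db.column_stats table, its schema, and the column names are assumptions for illustration, not part of the Iceberg spec.

```python
# Sketch of an external "metadata collection" job: compute richer column
# stats with Spark and append them to a dedicated Iceberg table. The
# stats table (db.column_stats) and its schema are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog

source = "db.events"
columns = ["user_id", "country"]  # assumed non-partitioned columns

df = spark.table(source)
row = df.select(
    [F.approx_count_distinct(c).alias(f"{c}_ndv") for c in columns]
    + [F.count(F.lit(1)).alias("row_count")]
).collect()[0]

stats_df = spark.createDataFrame(
    [(source, c, row[f"{c}_ndv"], row["row_count"]) for c in columns],
    schema="table_name string, column_name string, ndv long, row_count long",
).withColumn("collected_at", F.current_timestamp())

# Append so the stats table keeps its own history, queryable like any table.
stats_df.writeTo("db.column_stats").append()
```

Because the stats land in an ordinary Iceberg table, any planner, guardrail, or compaction service can consume them with plain SQL.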
Why Tight Coupling Slows Innovation
In the monolithic model, where the query planner, metadata stats, and execution engine all exist within a single system, innovation in one area is often held back by dependencies in another. A new query optimization algorithm might require changes in how metadata is stored or updated, triggering a massive coordinated release. This slows down progress and forces teams to move in lockstep.
By decoupling these components around Iceberg’s standard interfaces, each part can iterate on its own schedule. This means:
- Faster feature delivery for cost-based optimizers and query guardrails.
- Fewer blocking dependencies and integration challenges.
- A thriving ecosystem of specialized services, each pushing the boundaries of performance and functionality.
A Call to Evolve the Iceberg Catalog API
Right now, the Iceberg Catalog API is transformative but incomplete. It does not officially specify how to serve query plans, handle real-time advanced statistics, or orchestrate metadata services for streaming data. However, we believe it can grow into a standard that defines more robust interfaces for:
- Exposing Rich Metadata: Standard calls to retrieve histograms, NDV, partition cardinalities, and other key stats.
- Plan-Oriented Endpoints: Potentially returning optimized query plans (or plan fragments) to any requesting engine.
- Streaming Metadata Interfaces: APIs designed for low-latency streaming ingestion stats, bridging the gap between batch and real-time data.
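To make the call concrete, here is a hypothetical sketch of what such calls against an evolved REST catalog might look like. None of these endpoints exist in today's Iceberg REST catalog specification; the URLs, payloads, and response fields are all illustrative assumptions.

```python
# Hypothetical calls against an evolved Iceberg REST catalog. NONE of
# these endpoints exist in the current REST catalog specification; they
# illustrate the shape a richer, standardized API might take.
import requests

CATALOG = "https://catalog.example.com/v1"  # assumed catalog base URL
TABLE = "namespaces/db/tables/events"

# 1. Rich column statistics (hypothetical endpoint).
stats = requests.get(f"{CATALOG}/{TABLE}/statistics/columns/user_id").json()
print(stats["ndv"], stats["histogram"])

# 2. Plan-oriented endpoint: ask a catalog-attached planner for a plan
#    fragment that any conforming engine could execute (hypothetical).
plan = requests.post(
    f"{CATALOG}/{TABLE}/plan",
    json={"sql": "SELECT country, COUNT(*) FROM events GROUP BY country"},
).json()

# 3. Streaming stats: poll low-latency ingestion metrics between
#    snapshots (hypothetical).
ingest = requests.get(f"{CATALOG}/{TABLE}/statistics/stream").json()
print(ingest["rows_per_second"], ingest["pending_small_files"])
```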
Such an evolution would further cement Iceberg’s role as the definitive lakehouse standard, fueling a new wave of disaggregated and modular data architectures.
Conclusion
Apache Iceberg has already revolutionized how we manage data by decoupling the storage layer from compute. The next frontier is decoupling and disaggregating query planners, optimizers, metadata services, and more. The critical enabler for this extreme decoupling of concerns is a standard interface through which these components can communicate.
A future where Iceberg’s Catalog API evolves into a robust interface for serving not just data but comprehensive metadata—and even query plans—stands to transform the data ecosystem. This new wave of “separation of compute from compute” will foster an entire ecosystem of specialized tools, accelerating innovation and driving the lakehouse beyond its current limits.
The promise is there, and the future is now in the hands of the amazing Iceberg community!