SQL to Relational Algebra to Physical Plans: How Queries Truly Run (Across Databases, Warehouses, and Lakehouses)

How queries run inside databases, warehouses and lakehouses

This blog post is part of my ongoing series dissecting key concepts from the CMU Databases Courses and presenting them at the Bangalore systems meetup group. Join the Discord channel to be part of the community and stay tuned for these sessions. The current series of sessions covers query planners and optimizers. If you're curious to dive deeper into the academic foundations behind these systems, I highly recommend watching this CMU lecture on Query Optimization.

When we write an SQL query, we usually think about what data we need rather than how to retrieve it—this “what, not how” mindset is the hallmark of a declarative language. Meanwhile, behind the scenes, your data system translates that human-friendly SQL into a logical plan (often expressed in relational algebra) and refines it into an efficient physical plan—the specific set of steps and operations that fetch the data.

Interestingly, while logical plans can look pretty similar across different engines—whether it’s a traditional relational database like PostgreSQL or an analytics platform running on top of Apache Iceberg—physical plans can vary drastically based on things like available indexes, storage format, network constraints, and the hardware or cluster setup. That’s where the real magic (and complexity) happens.

In this blog post, we’ll take a deep dive into the stages of query processing:

  1. SQL (Declarative): A user-friendly way of expressing what data is required.
  2. Relational Algebra (Logical Plan): The internal, procedural representation of which relational operations must occur.
  3. Physical Plan: The final blueprint that determines how and where those operations will be executed.

In this journey, we'll explore how systems like databases, data warehouses, and lakehouses (e.g., those built on Apache Iceberg) can produce radically different physical execution strategies—even if their logical plans look very similar.


1. The Journey from SQL to Physical Execution

1.1 Overview of the Process

  1. Parsing and Validation: Once you submit an SQL statement, the system first checks for syntax errors and validates references to tables and columns.
  2. Logical Plan Generation: It then converts the validated SQL into a tree of relational algebra operators (the logical plan).
  3. Logical Optimization: The logical plan may be transformed via rules (e.g., pushing down filters) and possibly cost-based heuristics that reorder or rewrite parts of the plan for optimal performance.
  4. Physical Plan Generation: The system chooses specific algorithms (e.g., hash join vs. sort-merge join) and data access methods (e.g., index scan vs. full table scan) to form a physical plan.
  5. Physical Optimization & Execution: Final cost-based decisions (like parallelism, partitioning, and scheduling) are made to produce the most efficient plan for the target environment. Execution then begins.
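To make step 3 concrete, here is a toy sketch (not any real engine's API) that represents a logical plan as nested tuples and applies one classic rewrite rule, predicate push-down: a filter is moved below a projection whenever the predicate only references columns the projection keeps.

```python
# Toy logical plan: ("Filter", child, predicate_text, predicate_columns)
# and ("Project", child, columns). Names and shapes here are illustrative.

def push_filter_below_project(plan):
    """Rewrite Filter(Project(child, cols), pred) -> Project(Filter(child, pred), cols)."""
    if plan[0] == "Filter":
        _, child, pred, pred_cols = plan
        if child[0] == "Project":
            _, grandchild, cols = child
            # Safe only if the predicate's columns survive the projection.
            if set(pred_cols) <= set(cols):
                return ("Project", ("Filter", grandchild, pred, pred_cols), cols)
    return plan

before_plan = ("Filter",
               ("Project", ("Scan", "departments"),
                ["department_name", "location"]),
               "location = 'Headquarters'",
               ["location"])

optimized = push_filter_below_project(before_plan)
print(optimized[0])  # Project -- the filter now runs before the projection
```

Filtering earlier shrinks the number of rows every downstream operator has to touch, which is why rule-based optimizers apply this rewrite almost unconditionally.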

1.2 Why Is This Important?

Understanding these stages helps database professionals, data engineers, and developers to:

  • Diagnose slow queries by knowing where the system might be spending time.
  • Write SQL that encourages the optimizer to pick optimal strategies.
  • Leverage advanced features (like partitioning and indexing) more effectively.
  • Recognize how different backends (database vs. data warehouse vs. lakehouse) might produce different execution plans.

2. The Declarative Nature of SQL

SQL’s design philosophy is simplicity for the user. A typical SQL query:

SELECT d.department_name, COUNT(e.employee_id) AS total_employees
FROM employees e
JOIN departments d ON e.department_id = d.department_id
WHERE d.location = 'Headquarters'
GROUP BY d.department_name;

  • Declarative: We only specify what we want (a join, a condition, an aggregation).
  • Set-Oriented: The query operates on entire sets of rows in employees and departments.
  • High-Level: We don’t specify how to traverse indexes or in which order to process rows.

However, the database engine eventually must decide on those how details. That’s where relational algebra and the logical plan come in.


3. Understanding Relational Algebra (The Logical Plan)

Relational algebra provides the procedural model that underpins SQL. By rewriting an SQL query into a relational algebra expression, the query engine gets a well-defined set of operators to manipulate data step by step. Below is a handy table mapping standard SQL components to their relational algebra counterparts and explaining how they appear in the logical plan.

3.1 Table of Relational Algebra Operators

Symbol | Operator Name | SQL Equivalent | Role in Logical Plan
σ (sigma) | Selection | WHERE <condition> | Filters rows based on a predicate. Often appears early in the plan if push-down is possible.
π (pi) | Projection | SELECT <columns> | Keeps only the named columns. Minimizes data volume if done early.
× (times) | Cartesian Product | Implicit in some joins (when no join condition) | Rarely used in optimized plans unless needed for certain joins or tests.
⋈ (bowtie) | Join | JOIN <table> ON <condition> | Combines rows from two relations based on a join condition. Key operator in multi-table queries.
∪ (union) | Union | UNION | Combines two result sets, removing duplicates (unless UNION ALL is used).
∩ (intersection) | Intersection | INTERSECT | Returns rows common to both results. Less common in typical queries, but still part of the algebra.
− (minus) | Set Difference | EXCEPT or MINUS | Returns rows in one set that are not present in the other.
γ (gamma) | Aggregation / Grouping | GROUP BY with aggregates (e.g., SUM, COUNT) | Groups rows and computes aggregate values. Crucial in analytical queries.
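To make the algebra concrete, here is a minimal sketch of the core operators over lists of dicts (one dict per row). Real engines use vectorized, columnar kernels; this toy version only shows what each operator computes.

```python
def select(rows, pred):            # sigma: keep rows matching the predicate
    return [r for r in rows if pred(r)]

def project(rows, cols):           # pi: keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

def join(left, right, on):         # equijoin on the shared column `on`
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]

def group_count(rows, key):        # gamma: group by `key` and count rows
    counts = {}
    for r in rows:
        counts[r[key]] = counts.get(r[key], 0) + 1
    return [{key: k, "count": v} for k, v in counts.items()]

# Tiny sample data (hypothetical rows) mirroring the employees/departments query:
employees   = [{"employee_id": 1, "department_id": 10},
               {"employee_id": 2, "department_id": 10}]
departments = [{"department_id": 10, "department_name": "Eng",
                "location": "Headquarters"}]

joined  = join(employees, departments, "department_id")
hq_only = select(joined, lambda r: r["location"] == "Headquarters")
result  = group_count(hq_only, "department_name")
print(result)  # [{'department_name': 'Eng', 'count': 2}]
```

Composing these four functions bottom-up is exactly what evaluating a logical plan tree means.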

3.2 Example of a Logical Plan (Relational Algebra Tree)

Suppose we have the following SQL:

SELECT d.department_name, COUNT(e.employee_id) AS total_employees
FROM employees e
JOIN departments d ON e.department_id = d.department_id
WHERE d.location = 'Headquarters'
GROUP BY d.department_name;

A simplified logical plan could be represented as a tree (in text form) like so:

                 γ (group by department_name, count(employee_id))
                                |
                             σ (d.location = 'Headquarters')
                                |
                         ⋈ (e.department_id = d.department_id)
                            /                \
                           /                  \
            π (employee_id, department_id)    π (department_name, location, department_id)
                from employees (e)                 from departments (d)
  1. Projection (π): Select only the required columns from employees and departments.
  2. Join (⋈): Join these two sets on the matching department_id.
  3. Selection (σ): Filter rows to retain only those in Headquarters.
  4. Aggregation (γ): Finally, group by department_name and count the employee_id.

This logical plan is largely system-agnostic—whether you’re on PostgreSQL, MySQL, or an OLAP engine like Spark SQL, you might see a similar query breakdown at a high level. However, once we move into physical planning, significant differences can emerge.


4. The Importance of Cost-Based (vs. Rule-Based) Optimization

4.1 Rule-Based Optimization

  • Definition: Transformations apply a fixed set of rewrite rules (e.g., “push down all WHERE clauses as early as possible,” “convert cross-join + filter into a join,” etc.).
  • Advantages: Quick and straightforward.
  • Disadvantages: Lacks sophistication when facing large, complex queries. Doesn’t weigh the cost of different approaches.

4.2 Cost-Based Optimization

  • Definition: Uses statistics about table sizes, index selectivity, data distribution, etc., to estimate the cost (CPU, I/O, network) of each potential plan.
  • Advantages: Can choose between multiple valid plans based on predicted performance (e.g., “Hash Join might be faster than a Nested Loop Join if the table is large, but not if it’s small”).
  • Disadvantages: Requires good statistics. Inadequate or outdated stats can mislead the optimizer, producing suboptimal plans.

Multiple Plans for the Same Query

For a given query, there may be many equivalent logical plans, and many physical strategies that produce the same result.
Factors like the system’s metadata and statistics (e.g., cardinalities, histogram distributions, table sizes) guide the cost-based optimizer in picking one plan over another. We'll dig deeper into query optimization in future blog posts.
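The idea of comparing candidate plans by estimated cost can be sketched with a toy cost model. The constants below are illustrative, not any real optimizer's formulas; the point is that the choice flips depending on estimated row counts and whether an index exists.

```python
def nested_loop_cost(outer_rows, inner_rows, inner_has_index=False):
    # With an index on the inner side, each outer row does one cheap probe;
    # without one, the whole inner table is rescanned per outer row.
    probe = 3 if inner_has_index else inner_rows
    return outer_rows * probe

def hash_join_cost(build_rows, probe_rows):
    # Build a hash table once, then probe it once per row.
    return 2 * build_rows + probe_rows

def pick_join(outer_rows, inner_rows, inner_has_index):
    nl = nested_loop_cost(outer_rows, inner_rows, inner_has_index)
    hj = hash_join_cost(inner_rows, outer_rows)
    return "nested_loop" if nl < hj else "hash_join"

# Few outer rows probing a large indexed table: nested loop wins.
print(pick_join(100, 1_000_000, inner_has_index=True))    # nested_loop
# Two large unindexed tables: hash join wins.
print(pick_join(1_000_000, 500_000, inner_has_index=False))  # hash_join
```

This is also why stale statistics hurt: feed `pick_join` a wrong row-count estimate and it happily picks the slower algorithm.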


5. From Logical Plan to Physical Plan

5.1 Physical Operators

In the physical plan, each logical operator is replaced by an actual algorithm:

  • Index Scan vs. Table Scan
  • Hash Join vs. Sort-Merge Join vs. Nested Loop
  • Hash Aggregation vs. Sort-Based Aggregation

The choice depends heavily on factors like the presence of indexes, the expected size of data, and whether the system can parallelize or partition the data.
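The difference between two of these physical operators is easy to see in miniature. Below is a toy sketch of a nested loop join versus a hash join over (key, value) pairs: the former compares every pair of rows, while the latter builds a hash table on one side and probes it, trading memory for speed.

```python
def nested_loop_join(left, right):
    # O(|left| * |right|): compare every pair of rows.
    return [(k, lv, rv) for k, lv in left for rk, rv in right if k == rk]

def hash_join(left, right):
    table = {}                      # build phase: hash the left side
    for k, v in left:
        table.setdefault(k, []).append(v)
    out = []
    for k, rv in right:             # probe phase: one lookup per right row
        for lv in table.get(k, []):
            out.append((k, lv, rv))
    return out

left  = [(10, "alice"), (20, "bob")]
right = [(10, "eng"), (30, "hr")]
assert nested_loop_join(left, right) == hash_join(left, right) == [(10, "alice", "eng")]
```

Both produce the same rows, which is the whole point: physical operators are interchangeable implementations of the same logical join, differing only in cost.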

5.2 Resource & Execution Strategy

  • CPU and Memory: Some join algorithms like Hash Join can be memory-intensive, while Sort-Merge might be more CPU-intensive.
  • Network: In distributed systems, data shuffling can become a bottleneck. Shuffles are often required for joins or aggregations.
  • I/O: If the system can’t fit intermediate data in memory, it may spill to disk, slowing the query.

5.3 Scheduling & Parallelism

In modern distributed query engines (e.g., Spark, Flink, Trino):

  • A single query may spawn thousands or even millions of parallel tasks.
  • There’s typically a task queue to manage concurrency.
  • The scheduler may apply dynamic, cost-based scheduling to allocate resources more efficiently.
  • All these layers influence the physical plan—though the logical plan might remain the same.
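The partition-parallel pattern these engines use can be sketched in a few lines: each task aggregates one partition independently, and a final step merges the partial results. This is only a single-machine stand-in (using threads) for what a distributed scheduler does across nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_count(partition):
    """One 'task': count rows per department within a single partition."""
    return Counter(dept for dept, _ in partition)

# Hypothetical (department, employee_id) rows, pre-split into partitions:
partitions = [
    [("Eng", 1), ("Eng", 2)],
    [("Sales", 3), ("Eng", 4)],
]

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partial_count, partitions))

total = sum(partials, Counter())   # merge step: combine partial aggregates
print(dict(total))  # {'Eng': 3, 'Sales': 1}
```

Swap threads for cluster nodes and the merge step for a shuffle, and you have the shape of a distributed GROUP BY.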

6. Physical Plan Differences Across Systems

Even with the same logical plan, systems can produce very different physical execution strategies. Let’s look at three broad categories:

6.1 Traditional Databases (e.g., PostgreSQL, MySQL, Oracle)

  • Local Storage: Data resides on a local disk.
  • Index Utilization: These systems often heavily emphasize B-tree or bitmap indexes, enabling index scans.
  • Optimizer Focus: Minimizing random I/O and ensuring optimal join order are key.
  • Parallelism: Some queries can run in parallel but not as massively parallel as large-scale analytical engines.
  • Plan Example:
    • A join might use a Nested Loop if the inner table has a good index on the join key.
    • Aggregations might rely on in-memory or temp table structures for group-by operations.

6.2 Data Warehouses (e.g., Amazon Redshift, Snowflake)

  • Columnar Storage: Data is stored in a columnar format to speed up scans of individual columns.
  • Massively Parallel Processing (MPP): Queries are distributed across multiple nodes, each responsible for a portion of the data.
  • Physical Plan:
    • Typically uses distributed Hash Joins or Sort-Merge Joins when dealing with large datasets.
    • Big parallel scans are common; indexing might be less crucial than in OLTP-oriented databases.
  • Minimizing Data Shuffle: The optimizer tries to co-locate data or reduce shuffles by distributing data based on join keys.

6.3 Object Storage + Apache Iceberg (Lakehouses)

  • Cloud Object Storage: Data files reside in object storage systems like Amazon S3 or Azure Blob.
  • Iceberg Table Metadata: Partition information and file-level stats can help skip entire files.
  • Execution Engines: Queries typically run on Spark, E6data, Trino, or similar distributed engines.
  • Physical Plan:
    • Pushdown: Predicates (e.g., partitions or min/max values) are pushed down to skip reading entire data files.
    • Shuffle Handling: Large joins or aggregations require shuffles across the cluster. Careful scheduling is needed to avoid network saturation.
    • Memory Management: If the shuffle or join is too large to fit in memory, the engine will spill to disk, impacting performance.
    • Task Queue & Schedulers: The system might dynamically allocate tasks across the cluster. If a node fails or is slow, the scheduler adjusts tasks accordingly.
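The pushdown point above can be sketched as a toy planner: each data file carries min/max statistics for a column (as Iceberg manifests do), and the planner prunes any file whose range cannot satisfy the predicate, before a single byte of data is read. File paths and column names here are hypothetical.

```python
# Hypothetical file-level metadata, as a planner might read from a manifest:
files = [
    {"path": "s3://bucket/data/f1.parquet", "min_date": "2024-01-01", "max_date": "2024-01-31"},
    {"path": "s3://bucket/data/f2.parquet", "min_date": "2024-02-01", "max_date": "2024-02-29"},
    {"path": "s3://bucket/data/f3.parquet", "min_date": "2024-03-01", "max_date": "2024-03-31"},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate [lo, hi]."""
    return [f["path"] for f in files
            if f["max_date"] >= lo and f["min_date"] <= hi]

# WHERE event_date BETWEEN '2024-02-10' AND '2024-02-20' touches one file:
print(prune(files, "2024-02-10", "2024-02-20"))
# ['s3://bucket/data/f2.parquet']
```

Skipping two of three files here is the lakehouse analogue of an index: the physical plan avoids work by consulting metadata rather than B-trees.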

7. Sample Relational Algebra to Physical Plan (Text Diagram)

Consider a slightly more complex join across three tables (Employees, Departments, Salaries) as a final example. A possible logical plan tree:

                      σ (Departments.location = 'HQ')
                                |
                         ⋈ (Emp.department_id = Dept.department_id)
                           /                 \
                          /                   \
   ⋈ (Emp.emp_id = Sal.emp_id)                π (department_id, location)
           /       \
          /         \
π (emp_id, department_id)   π (emp_id, salary)
   from Employees (Emp)        from Salaries (Sal)

A corresponding physical plan (in a traditional DB):

Nested Loop Join   <-- because the DB found a selective index on department_id
|  -> Index Scan on Departments (using index on department_id, filter location='HQ')
|  -> Index Scan on Employees (using index on department_id)
   -> Nested Loop
      | -> Index Scan on Salaries (using index on emp_id)
      | -> ...

Whereas a Spark/Trino plan in an Iceberg-based lakehouse for the same logical plan might look like:

Distributed Hash Join  <-- because there's no local indexing, but big parallel resources
|  -> Scan Departments Parquet files with partition predicate (location='HQ')
|  -> Broadcast/Shuffle Hash Join
     | -> Scan Employees Parquet files
     | -> Shuffle Salaries on emp_id 
         ...

The logic is identical, yet the physical plans are very different.


Conclusion

The journey of an SQL query—from its declarative statement to its logical plan and finally to the physical plan—is central to efficient data retrieval. While logical plans derived from relational algebra often share much in common across different systems, physical plan execution can differ radically due to available hardware, storage formats, indexing capabilities, and scaling strategies.

  • In a traditional database, the optimizer might choose a nested loop join on an indexed column to minimize random I/O on local storage.
  • In a data warehouse, it might rely on columnar scans and big parallel merges.
  • In a lakehouse scenario with Iceberg, the engine might push down partition filters to skip entire files and then rely on a distributed hash join.

I hope you found this blog helpful as an introduction to logical and physical plans, and to why physical plans differ across data systems. Subscribe to the blog, and follow me on LinkedIn and X to stay tuned for more updates.
