Accelerating Iceberg Analytics: How Apache Arrow Can Help Get the Best Out of SIMD for Breakneck Speed
Modern analytics engines need to squeeze every drop of performance out of the CPU. We often hear that SIMD (Single Instruction Multiple Data) can accelerate computation by processing multiple data points in a single CPU instruction. And that’s true—provided the processor isn’t left twiddling its thumbs waiting for data from memory. To unlock the full benefit of SIMD (or any high-performance CPU operation), we must design our data pipelines to efficiently feed data to the CPU.
This post explores how this principle applies when building analytics on top of object-storage-native open table formats like Apache Iceberg, whose tables store data in Parquet files, and how in-memory formats like Apache Arrow drastically reduce wasted CPU cycles. It also discusses why Rust is becoming a first-class citizen in the data engineering world, tying these components together in next-generation analytics engines.
1. Parquet and Iceberg: An Excellent On-Disk Duo
Parquet at Rest for Apache Iceberg
Most Iceberg tables store data in Apache Parquet format. Parquet is a columnar, compressed file format optimized for efficiently reading large swaths of data from disk. It enables techniques like column pruning (scanning only the columns a query needs) and compresses column data very effectively, which means less I/O when reading from disk. These features make Parquet an excellent “at rest” format.
However, once you bring data from disk into memory to perform aggregations, filters, or other analytic queries, the row groups and column chunks in Parquet aren’t necessarily arranged optimally for in-memory processing. After all, Parquet’s primary mission is to reduce disk I/O, not to perfect CPU cache efficiency once your data is in memory. That’s where Apache Arrow comes in.
2. The Challenge of In-Memory Analytics
Analytical Queries: High Memory and CPU Demands
When performing complex analytics (say, scanning billions of rows to compute aggregates, filters, or joins), your CPU will do a lot of work. Analytical queries touch large chunks of data, often from one or more columns. While you can load these columns into memory, the layout they land in matters enormously for performance.
Why is this so critical? Because modern CPUs run at high clock speeds (e.g., 3 GHz or more), but a single access that misses the CPU’s caches can cost hundreds of cycles. Repeatedly waiting for data from main memory forces the CPU to stall, wasting cycles and undermining the theoretical gains of fast instruction pipelines.
3. SIMD to the Rescue…Almost
How SIMD Accelerates Analytical Computations
SIMD instructions let the CPU operate on multiple data elements at once, treating them as a single vector. For example, if you’re summing all the values in a column, one vectorized instruction can add several numbers in a single step, hugely speeding things up compared to scalar processing. This vectorization shines in analytical workloads where large blocks of homogeneous data (e.g., all floating-point values from a particular column) are operated on identically.
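To make this concrete, here is a minimal Rust sketch of a SIMD-friendly column sum. It relies on the compiler’s auto-vectorization rather than explicit intrinsics: the contiguous slice and independent accumulators give it room to emit vector instructions (the function name and data are illustrative):

```rust
/// Sum a contiguous column of f64 values in a SIMD-friendly way.
/// Eight independent accumulators break the serial dependency chain,
/// so the compiler is free to keep several additions in flight per
/// iteration using vector registers.
fn sum_column(values: &[f64]) -> f64 {
    let mut lanes = [0.0f64; 8];
    let mut chunks = values.chunks_exact(8);
    for chunk in &mut chunks {
        for (lane, v) in lanes.iter_mut().zip(chunk) {
            *lane += *v;
        }
    }
    // Fold the lanes together, then add the tail that didn't fill a chunk.
    lanes.iter().sum::<f64>() + chunks.remainder().iter().sum::<f64>()
}

fn main() {
    let column: Vec<f64> = (0..1_000_000).map(|i| i as f64).collect();
    println!("sum = {}", sum_column(&column));
}
```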
But There’s a Catch: Memory Stalls
Even blazing-fast vector instructions won’t help if the CPU sits idle half the time, waiting for data to arrive from memory. The CPU can only process data loaded into its caches (L1, L2, L3). If your data layout forces random, scattered accesses all over memory, you’ll incur cache misses, and your SIMD advantage won’t be fully realized.
This is where cache-aware data structures and prefetching become crucial. We need to feed data to those SIMD pipelines in a contiguous, cache-friendly manner so that each cache line the CPU loads is packed with relevant data.
4. CPU Caches and the Cost of Memory Fetches
The Hierarchy: L1, L2, L3, and Beyond
Modern CPUs have multiple levels of cache:
- L1 cache: Smallest but fastest (latency ~1 ns).
- L2 cache: Larger, slightly slower (latency ~7–10 ns).
- L3 cache: Larger still and shared across cores (latency ~10–20 ns).
- Main memory (RAM): Dramatically slower again (roughly 50–100 ns).
When the CPU needs data that isn’t in one of these caches (a cache miss), it has to fetch it from the next level (or eventually from RAM). Each miss incurs a penalty in clock cycles—the CPU stalls while data travels over the memory bus.
Why Latency Matters
At a clock speed of 3 GHz, one CPU cycle lasts roughly 0.33 ns, so a single trip to main memory wastes hundreds of cycles (the arithmetic is worked out below). During that time, your CPU can’t do useful work. This reality underscores that data access patterns are as important as raw compute capability, particularly for analytical workloads that scan large columns.
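Concretely, assuming a ~100 ns round trip to main memory:

$$
t_{\text{cycle}} = \frac{1}{3\,\text{GHz}} \approx 0.33\,\text{ns},
\qquad
\text{stall} \approx \frac{100\,\text{ns}}{0.33\,\text{ns/cycle}} \approx 300\,\text{cycles}.
$$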
5. Columnar Data Fetching for Analytics
Why Columnar Layouts Are a Boon
In analytics, we frequently focus on a few columns out of many. By storing columns contiguously, we ensure that when we load a cache line from memory, we pull in relevant data from the exact column we’re working on. This reduces cache line wastage (where half the cache line might be full of data from irrelevant columns) and thereby improves spatial locality—the CPU sees a neat block of data to process in a SIMD fashion.
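To make the contrast concrete, here is a minimal Rust sketch (the `Trade` type and its fields are invented for illustration). Averaging prices over the row layout drags qty, venue, and flag bytes through the cache, while the columnar layout touches only price data:

```rust
// Row-oriented: each cache line holding `Trade` rows mixes prices
// with qty/venue/flags bytes we don't need for a price scan.
struct Trade {
    price: f64,
    qty: u32,
    venue: u16,
    flags: u16,
}

// Column-oriented: `prices` is one contiguous array, so a price scan
// loads cache lines that are 100% useful -- ideal for SIMD.
struct TradeColumns {
    prices: Vec<f64>,
    qtys: Vec<u32>,
    venues: Vec<u16>,
    flags: Vec<u16>,
}

fn average_price_rows(rows: &[Trade]) -> f64 {
    rows.iter().map(|t| t.price).sum::<f64>() / rows.len() as f64
}

fn average_price_columns(cols: &TradeColumns) -> f64 {
    cols.prices.iter().sum::<f64>() / cols.prices.len() as f64
}

fn main() {
    let rows = vec![
        Trade { price: 10.0, qty: 1, venue: 0, flags: 0 },
        Trade { price: 12.0, qty: 2, venue: 1, flags: 0 },
    ];
    let cols = TradeColumns {
        prices: rows.iter().map(|t| t.price).collect(),
        qtys: rows.iter().map(|t| t.qty).collect(),
        venues: rows.iter().map(|t| t.venue).collect(),
        flags: rows.iter().map(|t| t.flags).collect(),
    };
    assert_eq!(average_price_rows(&rows), average_price_columns(&cols));
}
```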
Parquet already delivers a columnar structure on disk, but once loaded into memory, we want something even more optimized for CPU caches. Apache Arrow provides precisely that.
6. Meet Apache Arrow: A CPU-Cache-Friendly Format
Arrow’s Columnar In-Memory Format
Apache Arrow is designed to be a high-performance, in-memory format that aligns with modern CPU pipelines. It stores each column’s data in contiguous arrays (see the sketch after this list), making it simple to:
- Load data into SIMD registers without rearranging.
- Use vectorized instructions for large chunks of data.
- Minimize cache misses since adjacent elements of a column sit right next to each other in memory.
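Here is a minimal sketch of those properties using the Rust `arrow` crate (assumed as a dependency; the array contents are illustrative). The values live in one contiguous buffer, and the `sum` kernel runs a tight loop over it that the compiler can vectorize:

```rust
use arrow::array::Float64Array;
use arrow::compute::sum;

fn main() {
    // An Arrow array stores its values in a single contiguous buffer,
    // plus a validity bitmap for nulls -- exactly the layout SIMD wants.
    let prices = Float64Array::from(vec![Some(10.0), None, Some(12.5), Some(7.25)]);

    // Arrow's aggregate kernels iterate over that contiguous buffer,
    // skipping nulls via the validity bitmap.
    let total = sum(&prices);
    println!("sum = {:?}", total); // Some(29.75)
}
```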
7. From Parquet to Arrow to Flight: The End-to-End Flow
Parquet for On-Disk Storage
Iceberg keeps data in Parquet files for efficient at-rest storage. Parquet’s compression and columnar organization reduce disk footprint and read overhead.
Arrow for In-Memory Execution
When processing queries—aggregation, filtering, or scanning billions of rows—we convert the relevant Parquet data into Arrow’s columnar in-memory structure. That structure ensures we can harness SIMD effectively while keeping data loaded in CPU caches.
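With the Rust `parquet` crate (its `arrow` feature enabled), that conversion looks roughly like this; the file name, column index, and batch size below are illustrative:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("trades.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Column pruning: decode only the first leaf column instead of
    // materializing every column in the file.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let reader = builder
        .with_projection(mask)
        .with_batch_size(8192) // rows per Arrow RecordBatch
        .build()?;

    // Each item is an Arrow RecordBatch: contiguous, cache-friendly
    // columnar arrays ready for vectorized kernels.
    for batch in reader {
        let batch = batch?;
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```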
Arrow Flight for Data Transfer
Arrow Flight provides a high-speed protocol for transferring data between services or nodes that preserves the Arrow format over the wire. This means zero (or minimal) reformatting overhead before (or after) the data hits the network, which can be a game-changer in distributed analytics systems.
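For a flavor of the consumer side, here is a sketch using the Rust `arrow-flight` crate together with `tonic`, `tokio`, and `futures`. The endpoint and ticket contents are invented, exact crate features vary by version, and a real client would decode the FlightData stream into RecordBatches using the crate’s decoding utilities:

```rust
use arrow_flight::flight_service_client::FlightServiceClient;
use arrow_flight::Ticket;
use futures::TryStreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical Flight server address.
    let mut client = FlightServiceClient::connect("http://localhost:50051").await?;

    // A Ticket's contents are opaque and server-defined; here we pretend
    // the server accepts a query string.
    let ticket = Ticket { ticket: "SELECT price FROM trades".into() };

    // do_get streams FlightData messages carrying Arrow record batches in
    // Arrow's IPC format, so no transcoding happens on either side.
    let mut stream = client.do_get(ticket).await?.into_inner();
    while let Some(flight_data) = stream.try_next().await? {
        println!("received {} body bytes", flight_data.data_body.len());
    }
    Ok(())
}
```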
All these components—Parquet, Arrow, and Arrow Flight—form an integrated pipeline:
- Parquet at rest for the disk.
- Load and materialize to Arrow arrays in memory.
- Transfer using Arrow Flight if needed.
- Perform highly efficient analytics that tap into SIMD and CPU caches.
8. Rust: A First-Class Citizen for Data Engineering
Why Rust Ties It All Together
Rust is increasingly favored in data engineering circles for:
- Performance: Rust compiles down to native code with minimal overhead, which is perfect for CPU-intensive tasks.
- Safety: Rust’s ownership model prevents whole classes of memory corruption bugs, which matters enormously for large-scale data processing.
- Ecosystem: Libraries like Apache Arrow have excellent Rust bindings, making it easier to plug into Arrow arrays natively.
Rust’s compile-time guarantees, efficient memory usage, and integration with Arrow can yield blazing-fast analytic engines. You get near-C++ performance with safer concurrency—ideal for building the next generation of Iceberg-based analytic query engines.
Conclusion
At first glance, we might assume SIMD alone solves our performance problems. But as we’ve seen, memory stalls can sabotage even the most sophisticated vector instructions. The key to achieving that “lightning-fast analytics” is ensuring data is fed efficiently to the CPU—through careful attention to cache usage, prefetching, alignment, and linear data access.
Apache Parquet excels at on-disk compression and scans, and Apache Iceberg uses it to manage the data of massive tables. Once in memory, Apache Arrow provides a cache-friendly, columnar layout that lets SIMD be used to full effect. With Rust as the implementation language, we can tie these components together to build a robust, next-generation data platform that’s safe, fast, and designed for the modern CPU architecture.
From Parquet to Arrow Flight, from disk to CPU cache, this integrated workflow reduces wasted cycles and maximizes the throughput of each core—turbocharging analytics for the data-driven world.