Karthic Rao - hackintoshrao

Apache Iceberg

One Interface, Many Backends: The Design of Iceberg Rust's Universal Storage Layer with OpenDAL

Apache Iceberg's core promise is to treat files in a data lake on MinIO AIStor, S3, GCS, HDFS, or your local disk as if they were rows in a high-performance database table, complete with ACID transactions and schema evolution. It's a powerful abstraction. But how does

Apache Iceberg

Fast Distributed Iceberg Writes and Queries with Apache Arrow IPC

In distributed analytical systems, performance relies on two main factors: the efficiency of data movement between processes or network nodes and the efficiency of data processing once it reaches its destination. Apache Arrow's Inter-Process Communication (IPC) framework addresses both challenges. Arrow IPC offers a language-agnostic columnar data format

Rust

Invisible State Machines: Understanding Rust’s impl Future Return Types

Discover how compiler-generated futures work behind the scenes—and why they’re both powerful and perplexing.

Rust

No Extra Boxes, Please: When (and When Not) to Wrap Heap Data

If the compiler doesn’t force you to Box, you probably don’t need one!

Rust

From Scope to Thread: Mastering Closure Variable Captures in Rust

Exploring the why and how of capturing variables inside a closure running in a new thread in Rust

apache-iceberg

Iceberg-Powered Unification: Why Table/Stream Duality Will Redefine ETL in 2025!

Traditional ETL has long revolved around batch processing pipelines—Spark jobs ingest static, bounded datasets from data lakes or warehouses, perform transformations, and write results to another table or file system. Meanwhile, real-time ingestion platforms such as Apache Kafka, Apache Pulsar, and Redpanda have specialized in high-throughput, low-latency pipelines but

apache-iceberg

Accelerating Iceberg Analytics: How Apache Arrow Can Help get the best out of SIMD for Breakneck Speed

Modern analytics engines need to squeeze every drop of performance out of the CPU. We often hear that SIMD (Single Instruction Multiple Data) can accelerate computation by processing multiple data points in a single CPU instruction. And that’s true—provided the processor isn’t left twiddling its thumbs waiting

apache-iceberg

SQL to Relational Algebra to Physical Plans: How Queries Truly Run (Across Databases, Warehouses, and Lakehouses)

This blog post is part of my ongoing series dissecting key concepts from the CMU Databases Courses and presenting them at the Bangalore systems meetup group. Join the Discord channel to be part of the community and stay up to date on these sessions. The current series of sessions particularly covers query

apache-iceberg

Reimagining Query Planners and Metadata Services in the Age of Iceberg Lakehouse Tables

The Apache Iceberg project introduced an open table storage format to store structured data in object storage systems (e.g., AWS S3, MinIO) for analytical purposes. Much like the S3 protocol became the de facto standard for accessing and managing object storage, Iceberg’s catalog API has the potential to

lakehouse

The Need for New Low-Level API for Lakehouse-Centric Compute Engines

In this post, I will argue that today’s query engines, such as Apache Spark, do not effectively interact with lakehouses due to limitations in handling object storage semantics, lack of integration with catalog systems like Iceberg for governance and lineage, and inadequate support for lakehouse-specific features. I believe

Understanding map and filter_map in Rust: Handling Arrays with Option and Result Values

When working with collections in the Rust programming language, especially arrays or vectors, it's common to encounter elements wrapped in Option or Result types. Rust provides powerful iterator methods like map and filter_map to manipulate these arrays efficiently. In this blog post, we'll explore

Simplifying Rust Lifetimes: When Owning Data Reduces Complexity

In the Rust programming language, lifetimes are a powerful feature that ensures memory safety without needing a garbage collector. However, they can sometimes introduce complexity, especially when dealing with borrowed references in structs and functions. This complexity can spread throughout your codebase, making it harder to read and maintain. This post