
Delta Lake: My Learnings and Why You Should Know It

by Omeed Tehrani

Introduction

In the world of big data engineering, Delta Lake has become increasingly important. It's an open-source storage layer that brings reliability to data lakes. If you're working with Apache Spark, building data pipelines, or dealing with large-scale data processing, understanding Delta Lake is crucial.

Let me share what I've learned about it and why it matters.

What Problem Does Delta Lake Solve?

Traditional data lakes have some serious issues:

Problems with Plain Parquet/Data Lakes

  1. No ACID Transactions: Multiple writers can corrupt data
  2. No Schema Enforcement: Bad data can sneak in
  3. Difficult Updates/Deletes: Immutable formats make modifications painful
  4. No Time Travel: Can't easily query historical data
  5. Failed Jobs Leave Partial Data: No atomic commits

Example of a typical failure:

Job starts writing 1000 Parquet files →
Job fails at file 500 →
Now you have 500 orphaned files →
Manual cleanup required

This is a nightmare for production data pipelines.

Enter Delta Lake

Delta Lake solves these problems by adding a transaction log on top of Parquet files. It's essentially a metadata layer that provides:

ACID transactions
Schema enforcement and evolution
Time travel (query data as it was at any point)
Unified batch and streaming
Scalable metadata handling

The Delta Lake Storage Layer

Architecture

Delta Lake consists of:

  1. Data Files: Still stored as Parquet (columnar, compressed)
  2. Transaction Log: JSON files in _delta_log/ directory
  3. Checkpoints: Parquet summaries of the log for fast reads

my_table/
├── _delta_log/
│   ├── 00000000000000000000.json  ← First transaction
│   ├── 00000000000000000001.json  ← Second transaction
│   ├── 00000000000000000002.json  ← Third transaction
│   └── 00000000000000000010.checkpoint.parquet  ← Summary
├── part-00000-xxx.snappy.parquet
├── part-00001-xxx.snappy.parquet
└── part-00002-xxx.snappy.parquet

The Transaction Log (The Magic)

Every operation (write, update, delete, merge) creates a new transaction log entry. This log entry contains:

  • Add: Which files were added
  • Remove: Which files were removed
  • Metadata: Schema, partition info
  • Commit Info: Timestamp, operation type, isolation level

Example transaction log entry:

{
  "add": {
    "path": "part-00000-xxx.parquet",
    "size": 12345,
    "modificationTime": 1609459200000,
    "dataChange": true,
    "stats": "{\"numRecords\":100,\"minValues\":{...}}"
  }
}
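
Because the log is just newline-delimited JSON under _delta_log/, you can poke at it directly. Here's a minimal sketch (local filesystem only, table path is illustrative) that prints what each commit did:

import json
import os

# Illustrative local table path; on S3/ADLS you'd list objects with the cloud SDK
log_dir = "/data/table/_delta_log"

# Each NNN...N.json file is one commit, named by version number
for name in sorted(os.listdir(log_dir)):
    if not name.endswith(".json"):
        continue
    version = int(name.split(".")[0])
    with open(os.path.join(log_dir, name)) as f:
        # One action per line: commitInfo, add, remove, metaData, protocol, ...
        actions = [json.loads(line) for line in f if line.strip()]
    print(f"version {version}: {[next(iter(a)) for a in actions]}")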

How Reads Work

  1. Read latest checkpoint (fast summary of log up to that point)
  2. Read incremental log entries since checkpoint
  3. Reconstruct current state (which files are "live")
  4. Read only the live Parquet files

This is fast because:

  • Checkpoints amortize the cost of reading the entire log
  • Only need to scan metadata, not all data files
  • Can prune files based on statistics in the log
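
Step 3 is really just a replay of the log: every "add" puts a file into the live set, every "remove" takes it back out. A rough illustration (the helper and its input format are made up for clarity, not Delta's actual internals):

def reconstruct_live_files(commits):
    """commits: list of (version, [action_dict, ...]) pairs,
    e.g. parsed from the checkpoint plus the incremental JSON files."""
    live = set()
    for version, actions in sorted(commits):
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    # Only these Parquet files belong to the current snapshot
    return live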

How Writes Work (ACID!)

# Pseudo-code for a Delta write
def write_to_delta(data, table_path):
    # 1. Read current version number from log
    current_version = read_latest_version(table_path)
    
    # 2. Write new Parquet files
    new_files = write_parquet(data, table_path)
    
    # 3. Attempt to commit transaction log
    new_version = current_version + 1
    log_entry = {
        "commitInfo": {...},
        "add": new_files
    }
    
    try:
        # Atomic write! Uses cloud object store semantics
        atomic_write(f"{table_path}/_delta_log/{new_version:020d}.json", log_entry)
        return "SUCCESS"
    except FileAlreadyExistsException:
        # Someone else committed version N already
        return "CONFLICT - retry"

The key insight: The transaction log provides serializable isolation through optimistic concurrency control.
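
In practice the commit attempt sits in a retry loop: if another writer grabbed your version number, you re-read the log, check that their commit doesn't logically conflict with yours, and try the next version. A simplified sketch reusing the hypothetical helpers from the pseudo-code above:

def commit_with_retry(data, table_path, max_attempts=10):
    # Data files are written once; only the cheap log commit is retried
    new_files = write_parquet(data, table_path)
    for _ in range(max_attempts):
        version = read_latest_version(table_path) + 1
        try:
            atomic_write(f"{table_path}/_delta_log/{version:020d}.json",
                         {"commitInfo": {"operation": "WRITE"}, "add": new_files})
            return version
        except FileAlreadyExistsException:
            # A concurrent writer won this version; real Delta also checks
            # whether its changes conflict with ours before retrying
            continue
    raise RuntimeError("too many concurrent commits")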

Storage Layer Breakdown

1. ACID Transactions

Example: Two jobs writing simultaneously

# Job A
df_a.write.format("delta").mode("append").save("/data/table")

# Job B (running at same time)
df_b.write.format("delta").mode("append").save("/data/table")

With plain Parquet: 💥 Data corruption or lost writes

With Delta Lake:

  • Both jobs write their Parquet files
  • Job A commits transaction log version N
  • Job B attempts to commit version N → conflict detected
  • Job B retries with version N+1
  • Both writes succeed safely!

2. Time Travel

Query data as it was at any point in time:

# Current data
df = spark.read.format("delta").load("/data/table")

# Data as of version 5
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/data/table")

# Data as of timestamp
df_historical = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/data/table")

This is implemented by:

  1. Reading the transaction log up to that version/timestamp
  2. Reconstructing which files were "live" at that point
  3. Reading only those files

Use cases:

  • Debugging: "What did the data look like when the bug happened?"
  • Compliance: "Show me data as of quarter-end"
  • Rollback: "Undo the last 3 bad writes"
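
For the rollback case specifically, newer Delta releases expose RESTORE, which adds a new commit that points the table back at an older snapshot (the version number below is illustrative):

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/table")

# Roll back to version 5; this is itself a new commit, so the
# "bad" versions stay in history and remain queryable
delta_table.restoreToVersion(5)

# Equivalent SQL
spark.sql("RESTORE TABLE delta.`/data/table` TO VERSION AS OF 5")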

3. Schema Enforcement and Evolution

Schema enforcement prevents bad data:

# Define schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False)
])

# This works
good_data = [(1, "Alice"), (2, "Bob")]
spark.createDataFrame(good_data, schema).write.format("delta").save("/data/table")

# This fails - schema mismatch!
bad_data = [(1, "Alice", "extra_column")]
spark.createDataFrame(bad_data).write.format("delta").mode("append").save("/data/table")
# AnalysisException: Schema mismatch!

Schema evolution allows controlled changes:

# Add a new column safely
df_with_new_col.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/data/table")

4. Upserts (MERGE operations)

One of Delta's killer features:

from delta.tables import DeltaTable

# Load existing table
delta_table = DeltaTable.forPath(spark, "/data/users")

# New updates
updates = spark.createDataFrame([
    (1, "Alice", "alice_new@email.com"),
    (3, "Charlie", "charlie@email.com")
], ["id", "name", "email"])

# MERGE: Update if exists, insert if new
delta_table.alias("target") \
    .merge(
        updates.alias("source"),
        "target.id = source.id"
    ) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

This is far faster and simpler than the old "read all → filter → union → write all" pattern, because MERGE only rewrites the files that actually contain matching rows.

Limitations of Delta Lake Approach

1. Not a Database

  • No indexes (besides partition pruning)
  • No fast single-row lookups
  • Not meant for OLTP workloads

2. Transaction Log Growth

  • Log grows with every operation
  • Need periodic checkpoints and cleanup
  • Very high-frequency small writes can cause issues

3. Cloud Storage Costs

  • Many small files = many LIST operations
  • Metadata-heavy workloads can be expensive on S3/ADLS

4. Small Files Problem

  • Lots of small writes create lots of small Parquet files
  • Need to run OPTIMIZE regularly to compact

# Compact small files
spark.sql("OPTIMIZE delta.`/data/table`")

# Z-order for better pruning
spark.sql("OPTIMIZE delta.`/data/table` ZORDER BY (date, user_id)")

Real-World Use Cases

1. Streaming + Batch Unified

# Streaming write (Delta sinks require a checkpoint location)
stream = spark.readStream.format("kafka")...
stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/data/checkpoints/table") \
    .start("/data/table")

# Batch read (sees all data including streaming!)
df = spark.read.format("delta").load("/data/table")

2. CDC (Change Data Capture)

# Capture changes from database
changes = spark.read.format("delta").option("readChangeFeed", "true").load("/data/table")
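
The change feed has to be switched on as a table property first, and batch reads of it take a starting version or timestamp (table path and version below are illustrative):

# Enable the change data feed for an existing table
spark.sql("ALTER TABLE delta.`/data/table` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read all inserts/updates/deletes since version 0, with change-metadata columns
changes = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 0) \
    .load("/data/table")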

3. Data Quality Enforcement

# Reject bad data at write time with a CHECK constraint on the table
spark.sql("ALTER TABLE delta.`/data/table` ADD CONSTRAINT valid_age CHECK (age > 0)")

# Appends that violate the constraint now fail
df.write.format("delta").mode("append").save("/data/table")

Why You Should Learn It

  1. Industry Standard: Adopted by Databricks, AWS, Azure, Google Cloud
  2. Simplifies Pipelines: No more manual deduplication or compaction
  3. Enables ML: Time travel for reproducible training datasets
  4. Career Value: Big data engineers need to know this

Getting Started

# Install
pip install delta-spark

# Write
df.write.format("delta").save("/tmp/delta-table")

# Read
df = spark.read.format("delta").load("/tmp/delta-table")

# Time travel
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-12") \
    .load("/tmp/delta-table")

Conclusion

Delta Lake isn't just "Parquet with extras"—it's a fundamental rethinking of how data lakes should work. By adding a transaction log, it brings database-like reliability to the data lake world without sacrificing the scalability and flexibility that made data lakes popular in the first place.

If you're building data pipelines in 2024 and beyond, Delta Lake (or a similar technology like Apache Iceberg or Apache Hudi) is an essential tool in your toolkit.

The future of data engineering is lakehouse: the best of data warehouses (ACID, schema) and data lakes (scalability, flexibility, open formats). And Delta Lake is leading that charge.

