SapotaCorp

CDC into a lakehouse: Change Data Feed, MERGE, and not reprocessing everything

The moment a pipeline has to reflect updates and deletes from a source, not just appends, full reloads stop scaling and append-only logic starts lying. Change data capture done properly means capturing what changed, applying it with MERGE, and propagating it downstream with Change Data Feed instead of recomputing the world every run. Here is how the pieces fit.

Key takeaways

  • Once a source has updates and deletes, append-only ingestion is wrong, not just slow. The table accumulates stale and duplicate rows because nothing supersedes the old version of a record, so correctness, not only cost, forces a move to change data capture.
  • MERGE is how you apply captured changes correctly: update the rows that changed, insert the new ones, and delete the removed ones in a single atomic operation, rather than truncate-and-reload or naive append.
  • Change Data Feed lets a Delta table expose the row-level changes it received, so downstream tables can consume only what changed instead of rescanning the whole upstream table. This is what makes incremental propagation through Silver and Gold practical.
  • Full reloads feel safe because they are simple, but they stop scaling and they erase history. Incremental CDC is more work to set up and far cheaper to run, and it preserves the change history that reload-everything discards and append-everything never records cleanly.

Most pipelines start append-only, and for a while that is fine. Data arrives, you add it to the table, life is simple. Then the source starts sending updates to records you already have, and deletes for records that no longer exist, and the append-only model quietly becomes wrong. The table now holds two versions of the same record with nothing marking which is current, and rows that were deleted at the source live on forever because nothing ever removed them. The numbers stop reconciling, and the usual first reaction is to paper over it with a full reload every run, which works until the table is large enough that reloading it nightly becomes its own problem.

This is the point at which a pipeline has to grow up into change data capture, and on a Delta Lake lakehouse the tools for it are specific and worth understanding properly: MERGE for applying changes correctly, and Change Data Feed for propagating them downstream without rescanning everything. The reason to learn them is not performance for its own sake. It is that once a source has updates and deletes, incremental is the only model that is actually correct.

Append-only stops being correct, not just slow

It is worth being precise about why append-only fails, because the failure is about correctness, not cost, and that changes how seriously you take it.

When a source only ever adds new records, appending them is right. The moment the source can update a record, appending the new version leaves the old version sitting in the table with nothing to say it is superseded, so a query that should return one current row returns two and has no principled way to pick. The moment the source can delete a record, append-only has no mechanism to reflect that at all, so deleted records persist indefinitely. The table is not merely behind; it is wrong, holding duplicates and ghosts.
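
The failure is easy to reproduce in miniature. The sketch below is a toy model in plain Python, not Delta Lake code; the event shape, the `op` values, and the sample balances are all made up for illustration. It appends every event the way an append-only job would and compares the result with what the source actually holds.

```python
# Toy model: a "table" is a list of row dicts. All names and data are hypothetical.
source_events = [
    {"op": "insert", "id": 1, "balance": 100},
    {"op": "insert", "id": 2, "balance": 50},
    {"op": "update", "id": 1, "balance": 80},   # new version of id 1
    {"op": "delete", "id": 2},                  # id 2 removed at the source
]

def append_only(events):
    """Append every non-delete event as a row; deletes have nowhere to go."""
    table = []
    for event in events:
        if event["op"] != "delete":
            table.append({"id": event["id"], "balance": event["balance"]})
    return table

def source_state(events):
    """What the source actually holds after the same events."""
    state = {}
    for event in events:
        if event["op"] == "delete":
            state.pop(event["id"], None)
        else:
            state[event["id"]] = event["balance"]
    return state

table = append_only(source_events)
truth = source_state(source_events)

rows_for_id_1 = [row for row in table if row["id"] == 1]
ghost_ids = {row["id"] for row in table} - set(truth)
table_total = sum(row["balance"] for row in table)
true_total = sum(truth.values())

print(len(rows_for_id_1))       # 2: two versions of id 1, nothing marks the current one
print(ghost_ids)                # {2}: still present although the source deleted it
print(table_total, true_total)  # 230 80: the totals no longer reconcile
```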

People often reach for a full reload at this point: truncate the table, rebuild it from the current state of the source, and you do get a correct snapshot. But it scales badly, because you reprocess the entire source every run regardless of how little changed, and it throws away history, because each reload replaces the previous state with no record of what changed between them. For a small table it is a reasonable shortcut. For anything that grows, it is a cost and a blind spot that compound together.

MERGE applies changes correctly and atomically

The correct way to apply a set of captured changes to a Delta table is MERGE, because it handles all three kinds of change in one atomic operation. Given the changes from the source, MERGE updates the rows that changed to their new values, inserts the rows that are genuinely new, and deletes the rows that were removed, and it does the whole thing transactionally so the table is never left half-updated.

This is what makes MERGE the right primitive for an upsert, the update-or-insert pattern at the heart of CDC. Rather than working out by hand which records are new versus existing, or truncating and reloading to sidestep the question, you describe how to match incoming changes to existing rows and what to do in each case, and Delta applies it as a single consistent step. The atomicity matters in a lakehouse that downstream consumers are reading continuously, because they either see the table before the merge or after it, never in the middle of a partial update. It is the same reason we leaned on careful, transactional relationship fixes rather than in-memory rewrites when deduplicating identities on a credit platform: set-based, atomic operations are how you change a large table without corrupting it mid-flight.
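
To make the match-and-apply idea concrete, here is a minimal sketch of the semantics in plain Python. It is a toy model, not Delta's implementation: in Delta Lake itself this is a `MERGE INTO ... USING ... ON ...` statement with `WHEN MATCHED` and `WHEN NOT MATCHED` clauses. The `merge` function, the `op` values, and the sample rows are assumptions made for illustration, and the sketch assumes the change set carries an explicit delete marker for removed records.

```python
import copy

def merge(target, changes):
    """Apply a batch of changes keyed by id and return a new table.

    target:  dict mapping id -> row dict (the current table state)
    changes: list of dicts with "op" in {"upsert", "delete"}, "id", and "row"
    The work happens on a copy, so a failure part-way through leaves the
    original untouched; that stands in for the transactional commit.
    """
    staged = copy.deepcopy(target)
    for change in changes:
        key = change["id"]
        if change["op"] == "delete":
            staged.pop(key, None)        # matched: remove the row
        elif change["op"] == "upsert":
            staged[key] = change["row"]  # matched: update; not matched: insert
        else:
            raise ValueError(f"unknown op: {change['op']!r}")
    return staged                        # caller swaps in the new state in one step

silver = {1: {"balance": 100}, 2: {"balance": 50}}
batch = [
    {"op": "upsert", "id": 1, "row": {"balance": 80}},  # existing id: update
    {"op": "upsert", "id": 3, "row": {"balance": 20}},  # new id: insert
    {"op": "delete", "id": 2},                          # removed at the source
]
silver = merge(silver, batch)
print(silver)  # {1: {'balance': 80}, 3: {'balance': 20}}

# A bad batch raises before anything is swapped in, so readers never see half of it.
before = {1: {"balance": 100}}
try:
    merge(before, [{"op": "upsert", "id": 1, "row": {"balance": 0}},
                   {"op": "bogus", "id": 9}])
except ValueError:
    pass
print(before)  # {1: {'balance': 100}}
```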

Change Data Feed propagates changes without rescanning

MERGE gets the changes into a table correctly, but in a medallion architecture you usually have to get them further, from Silver into Gold, from one table into the aggregates and reports built on it. The naive way is to rescan the whole upstream table every time anything changes, which recreates the full-reload problem one layer up. Change Data Feed is the feature that avoids it.

When you enable Change Data Feed on a Delta table, the table records the row-level changes made to it from that point on (the inserts, updates, and deletes) and exposes them for downstream consumers to read directly. A downstream table can then consume just the changes since it last ran and apply them, rather than rescanning the entire upstream table to figure out what is different. This is what makes incremental propagation through the layers practical: each layer passes along only what actually changed, so the work at every stage is proportional to the change volume, not to the total size of the data. Without it, "incremental" tends to stop at the first table and everything downstream quietly goes back to reprocessing the world.
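
A small model shows what "consume just the changes since it last ran" means in practice. In Delta Lake the feed is switched on with the `delta.enableChangeDataFeed` table property, and each change row carries a `_change_type` of `insert`, `update_preimage`, `update_postimage`, or `delete` along with the commit version. The class below is a plain-Python stand-in that mirrors those change types; `ChangeTrackedTable`, `commit`, and `changes_since` are hypothetical names, not part of any library.

```python
class ChangeTrackedTable:
    """Toy table that records row-level changes per commit version."""

    def __init__(self):
        self.rows = {}     # id -> value (current state)
        self.version = 0   # last committed version
        self.feed = []     # list of (version, change_type, id, value)

    def commit(self, changes):
        """Apply a batch of (op, id, value) tuples as one new version."""
        self.version += 1
        for op, key, value in changes:
            if op == "delete":
                if key in self.rows:
                    old = self.rows.pop(key)
                    self.feed.append((self.version, "delete", key, old))
            elif key in self.rows:
                old = self.rows[key]
                self.feed.append((self.version, "update_preimage", key, old))
                self.feed.append((self.version, "update_postimage", key, value))
                self.rows[key] = value
            else:
                self.feed.append((self.version, "insert", key, value))
                self.rows[key] = value

    def changes_since(self, last_seen_version):
        """Return only the changes committed after last_seen_version."""
        return [c for c in self.feed if c[0] > last_seen_version]

silver = ChangeTrackedTable()
silver.commit([("upsert", 1, 100), ("upsert", 2, 50)])   # version 1
checkpoint = silver.version                               # downstream has seen v1
silver.commit([("upsert", 1, 80), ("delete", 2, None)])  # version 2

new_changes = silver.changes_since(checkpoint)
for change in new_changes:
    print(change)
# (2, 'update_preimage', 1, 100)
# (2, 'update_postimage', 1, 80)
# (2, 'delete', 2, 50)
```

The downstream job only has to remember the last version it processed. Its work is then bounded by the three change rows above, however many rows the upstream table holds.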

Incremental costs more to build and far less to run

Being honest about the trade-off matters, because incremental CDC is not free. It takes more work to set up than append-only or reload-everything. You have to capture the changes from the source, decide how records are matched, write the MERGE, enable and consume Change Data Feed downstream, and reason about ordering and late-arriving data. Compared to "just reload it," that is real upfront effort.
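
Ordering and late-arriving data are the part of that list most often underestimated, so a sketch may help. The one below assumes each captured change carries a monotonically increasing sequence number from the source (a log position or commit timestamp would play the same role); the `seq` field, the function names, and the tombstone layout are all hypothetical. It reduces a batch to the latest change per key, then skips anything older than what the target already reflects.

```python
def latest_per_key(changes):
    """Keep only the highest-sequence change for each id."""
    latest = {}
    for change in changes:
        current = latest.get(change["id"])
        if current is None or change["seq"] > current["seq"]:
            latest[change["id"]] = change
    return list(latest.values())

def apply_in_order(target, changes):
    """Apply changes, skipping any that are older than what the target holds.

    target: dict mapping id -> {"seq": int, "deleted": bool, "value": ...}
    Deleted keys are kept as tombstones so a late update cannot resurrect them.
    """
    for change in latest_per_key(changes):
        existing = target.get(change["id"])
        if existing is not None and existing["seq"] >= change["seq"]:
            continue  # late arrival: the target already reflects a newer change
        target[change["id"]] = {
            "seq": change["seq"],
            "deleted": change["op"] == "delete",
            "value": change.get("value"),
        }
    return target

silver = {}
apply_in_order(silver, [
    {"id": 1, "seq": 3, "op": "upsert", "value": 80},
    {"id": 1, "seq": 1, "op": "upsert", "value": 100},  # out of order in the batch
    {"id": 2, "seq": 2, "op": "upsert", "value": 50},
])
apply_in_order(silver, [
    {"id": 1, "seq": 2, "op": "upsert", "value": 90},   # late: older than seq 3
    {"id": 2, "seq": 4, "op": "delete"},
])

live = {key: row["value"] for key, row in silver.items() if not row["deleted"]}
print(live)  # {1: 80}
```

Without the per-key reduction, two changes to the same record in one batch would both match the same target row, and the outcome would depend on processing order.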

What you get for that effort is a pipeline whose running cost tracks how much actually changed rather than how much data exists. That is the difference between a job that stays cheap as the table grows and one that gets more expensive every month for doing the same work. You also keep the change history, the record of what moved and when. Reload-everything discards it outright, and append-everything keeps old rows without recording which version superseded which or what was deleted. That history is often valuable in its own right, especially in domains like finance where how a balance got to its current value matters as much as the value. The decision rule is straightforward: while the source is genuinely append-only and small, keep it simple; the moment it has updates and deletes, or it has grown enough that reloads hurt, the upfront cost of incremental CDC pays back quickly and keeps paying.

Putting it together

A working CDC flow on a lakehouse has a recognisable shape. Changes are captured from the source rather than re-read in full. They are applied to the Silver table with MERGE, so updates, inserts, and deletes all land correctly in one atomic step. Change Data Feed is enabled so the layers above can consume just those changes, and the Gold tables and aggregates are updated incrementally from that feed rather than rebuilt from scratch. The result is a pipeline that reflects updates and deletes correctly, costs in proportion to change rather than size, and remembers how the data got to where it is.
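
The last step of that flow, updating a Gold aggregate from the feed rather than rebuilding it, can be sketched as follows. This is a plain-Python illustration under stated assumptions: the change types mirror the ones Change Data Feed emits, while the function name, the tuple layout, and the customer totals are hypothetical.

```python
def apply_feed_to_totals(totals, feed):
    """Update per-customer totals from row-level changes.

    totals: dict mapping customer -> running sum of amount
    feed:   list of (change_type, customer, amount) tuples
    """
    for change_type, customer, amount in feed:
        if change_type in ("insert", "update_postimage"):
            totals[customer] = totals.get(customer, 0) + amount  # new value enters
        elif change_type in ("delete", "update_preimage"):
            totals[customer] = totals.get(customer, 0) - amount  # old value leaves
        else:
            raise ValueError(f"unknown change type: {change_type!r}")
    return totals

gold_totals = {"acme": 150, "globex": 40}
feed = [
    ("update_preimage", "acme", 100),   # old value of an updated order
    ("update_postimage", "acme", 80),   # new value of the same order
    ("delete", "globex", 40),           # order removed at the source
    ("insert", "initech", 25),          # brand new order
]
apply_feed_to_totals(gold_totals, feed)
print(gold_totals)  # {'acme': 130, 'globex': 0, 'initech': 25}
```

This works because a sum can be adjusted by subtracting the old value and adding the new one. Aggregates that cannot be reversed this way, such as a minimum or maximum, usually need the affected groups recomputed from the upstream table, which is still far less work than rebuilding everything.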

The thing to take away is that CDC is not an optimisation you add when a pipeline gets slow. It is the correct model the moment a source can change records that already exist, and reaching for a full reload instead is choosing a snapshot that is right today and a cost curve that is wrong tomorrow.

If your pipeline reloads everything to stay correct

The tell is a job that truncates and rebuilds a growing table every run because it is the only way the numbers come out right. That works until it does not, and it is quietly throwing away the change history along the way. Both problems are solved by moving to proper change data capture.

Sapota's data team builds incremental, CDC-based pipelines as the default for sources that mutate, and it is part of the same reusable framework behind the medallion-on-AWS platform we delivered for a regulated fintech. Getting the incremental model right early is what keeps running costs flat as the data grows.

Reach out via the custom software page with a description of how your sources change and how you currently keep up. If the answer involves reloading everything, there is usually a cheaper, more correct path.
