SapotaCorp

Identity deduplication in a medallion lakehouse: the silent nulls nobody catches

Moving customer deduplication off an on-prem monolith and into a Bronze/Silver/Gold lakehouse sounds like a straight port. It is not. There is no direct JOIN to lean on, the old "keep the oldest ID" cleanup quietly nulls out half your accounts, and running the dedup in memory deadlocks on real data. Here is what we learned rebuilding it for a credit bureau.

Identity deduplication in a medallion lakehouse: the silent nulls nobody catches

Key takeaways

  • On a medallion lakehouse there is no direct JOIN to dedup against the way an on-prem RDBMS gives you. Identity has to be resolved as data flows through Bronze, Silver, and Gold, so the dedup logic lives in the pipeline, not in a query you run once.
  • The classic "keep the oldest customer ID as the surviving record" cleanup is dangerous on migrated data. It can wipe the newest rows from the mapping table and leave the client key null across the account table, and because nothing errors, the damage is silent.
  • Identity for a real person is a composite, not a single field. Resolving on a combination of identifiers (old national ID, new national ID, passport) before merging is what stops one person from splitting into two customers, or two people from collapsing into one.
  • Running deduplication in memory over a full dataset deadlocks and stalls. Splitting the work by file type and pushing the heavy relationship fixes into the database as controlled procedures keeps it from taking the whole pipeline down.

A credit bureau came to us partway through moving its reporting platform off an aging on-prem monolith and onto an AWS lakehouse, and the part that had stalled them was the one that sounds the most boring: deduplicating customers. On paper it is a solved problem. A person is a person; collapse their scattered records into one identity and move on. In practice it was the thing quietly corrupting credit histories, and the reason was that nobody had accepted how different the problem becomes once you leave a traditional database behind.

In the old system, customer information arrived spread across many raw files, attached to whatever identity document happened to be on hand at the time, and the monolith stitched it together with the tool every on-prem engineer reaches for first, a JOIN. On the new platform the data had to flow through a Bronze, Silver, and Gold medallion architecture, and somewhere in that flow every individual had to be resolved down to a single identity. The team had ported the logic faithfully, and it was producing duplicate-value errors on combinations of the new national ID, the old national ID, and the passport number, over and over. The logic was not wrong in the way they were looking for. It was wrong in that it assumed a database that was no longer there.

There is no JOIN to save you

The first thing to internalise is that a medallion lakehouse does not hand you the direct, indexed JOIN an on-prem RDBMS does. You are moving data through layers, Bronze for raw, Silver for cleaned and conformed, Gold for the business-ready tables, and identity has to be resolved as part of that movement rather than as a single query you run against one big table at the end. The dedup is a stage in a pipeline, not a statement you execute.

This matters because the on-prem instinct is to treat deduplication as a lookup-and-merge you do once against the whole set. Lift that instinct into the lakehouse and you get exactly the symptoms this team had: duplicate keys surfacing repeatedly, because the logic keeps trying to resolve identity at the wrong layer and against data that has not been conformed yet. The fix starts with accepting that identity resolution is something the pipeline does on the way through, with the rules applied at the Silver stage where the data has been cleaned enough to compare, not bolted on at the end as if the lake were a relational database wearing a different hat.

The cleanup that deletes your data without telling you

The most damaging thing we found was not the duplicate errors everyone could see. It was a silent one nobody could.

The legacy cleanup rule was the kind that reads as obviously sensible: when you find duplicates of a person, keep the oldest customer ID as the surviving record and retire the rest. On the original system, where the data was tidy and relationships were enforced, that was fine. On the migrated data it was quietly catastrophic. Choosing the oldest ID as the survivor wiped the newest rows out of the mapping table that linked clients to their dimension keys, and the downstream effect was that the client identifier came back null across great swathes of the account table. No exception was thrown. No job failed. The pipeline reported success while leaving accounts with no customer attached to them, which in a credit reporting context means an account whose history belongs to nobody.

This is the failure mode that should keep data engineers up at night, because it is invisible by construction. A loud error gets fixed. A silent one ships, propagates into the Gold layer, and is discovered weeks later when someone asks why a chunk of accounts have no owner. The lesson we took from it is that any cleanup rule which deletes or supersedes records has to be treated as high-risk and validated by what it leaves behind, not by whether it runs without complaint. "It completed" is not "it was correct," and on migrated data the gap between those two is where the damage lives.

A person is a composite key, not a field

The duplicate errors themselves came from treating identity as if a single field could carry it. Real people do not work that way, especially in a market where identity documents have changed over time. The same individual showed up under an old national ID in some files, a new national ID in others, and a passport number in others again, and any rule keyed on one of those alone either split one person into two customers or, worse, merged two different people into one.

Both of those are serious. Splitting a person fragments their credit history so that neither half tells the truth. Merging two people writes one person's debts onto another. The only way through is to stop pretending identity is atomic and resolve it on the combination, looking up against old national ID, new national ID, and passport together before deciding that two records are the same human being. We rewrote the soft-delete cleanup so that the merge decision was made on that three-part composite, performed as a deliberate lookup before any record was retired, rather than on a single field chosen for convenience. The duplicate-value errors that had been recurring for weeks came from the shortcut of a single key; the composite is what actually matches a person to themselves.

Do the heavy lifting in the right place, and not all at once

The last problem was operational rather than logical. Running the deduplication across the full dataset in memory, the obvious way, deadlocked and stalled the pipeline, because the relationship-fixing work over a large dataset is exactly the kind of thing that does not fit in memory and does not parallelise cleanly when everything contends for the same rows.

We solved it on two fronts. First, the heavy relationship corrections, reconfiguring how accounts linked back to clients, were pushed into controlled stored procedures running directly in the managed Postgres database, where that kind of set-based relational fix-up belongs and where it can be done transactionally and observably rather than held in a Spark job's memory. Second, we stopped treating all the data as one undifferentiated pile and split the processing by file type, separating the ordinary settlement files from the bad-debt files into their own flows. That separation was not cosmetic; it cut the volume any single dedup pass had to hold at once and let each flow be tuned for its own shape, which is what stopped the deadlocks. The general principle is that deduplication at scale is not one big operation you run heroically over everything; it is several scoped operations placed where the work actually fits.

What the rebuild left behind

By the end, the bureau had a deduplication stage that resolved identity as part of the medallion flow rather than as a port of an on-prem JOIN, a soft-delete cleanup that merged on the three-part identity composite and was validated by the keys it preserved rather than by whether it ran, and a processing model that pushed relational fix-ups into the database and split the load by file type so it stopped deadlocking. Most importantly, the silent-null failure was gone, not because we found every instance by hand, but because the cleanup logic could no longer produce it.

If there is one thing to carry out of this, it is that migrating deduplication from an on-prem database to a lakehouse is not a port, it is a redesign, and the most dangerous bugs in it are the ones that complete successfully. The visible duplicate errors are annoying but honest. The silent nulls are the ones that quietly corrupt the thing the whole platform exists to get right.

If you are migrating a data platform off a monolith

The dedup story is a specific instance of a general truth we keep meeting: the logic that was correct on an on-prem monolith is often quietly wrong on a lakehouse, because it assumed guarantees the new architecture does not give you for free. The fix is rarely more compute. It is rethinking the logic for a layered, set-based world where the most expensive mistakes do not raise an error.

Sapota's data team has rebuilt this kind of pipeline across credit, fintech, and commerce workloads, and the medallion patterns carry over even though the domain specifics differ. We have written separately about the full medallion-on-AWS build for a regulated fintech, which sits alongside this as the architecture these dedup lessons live inside.

Reach out via the custom software page with a description of the platform you are modernising and where the data is fighting you. The silent failures are usually the ones worth finding first.

Engineering certifications

Sapota engineers hold credentials on Data Engineering. Each badge links to the individual engineer's credly profile.

Browse Data Engineering certs

Need this on your team?

Sapota engineers ship the patterns you read here. Two-week paid trial, direct pricing from $1,800/ engineer/month, no agency markup.

Get a quote
Contact Us Now

Share Your Story

We build trust by delivering what we promise – the first time and every time!

We'd love to hear your vision. Our IT experts will reach out to you during business hours to discuss making it happen.

WHY CHOOSE US

"Collaborate, Elevate, Celebrate where Associates - Create Project Excellence"

SapotaCorp beyond the IT industry standard, we are

  • Certificated
  • Assured quality
  • Extra maintenance

Tell us about your project