SapotaCorp

Data quality gates between Bronze and Silver: where bad data should die

When you ingest financial data from dozens of sources, the spec and the actual files disagree more often than they agree. One source's "balance" is the total, another's is only what is current; one reports aging as a percentage, another as an amount. Let that through and you turn a healthy borrower into a defaulter on paper. The fix is a quality gate that kills bad data before Silver.

Key takeaways

  • The specification and the real files almost never fully agree. Field definitions drift between the spec document and the sample data, so a pipeline that trusts the spec will silently mishandle sources that do not follow it.
  • Ambiguous financial fields are a trap. A "balance" that is the total in one source and only the current portion in another, or an aging bucket reported as a percentage in the spec but an amount in the file, will inflate or deflate a customer's debt if you do not pin down which is which.
  • The right place to stop bad data is the Bronze-to-Silver boundary. A quality gate there blocks records that violate the debt logic before they reach the conformed layer, so errors never get the chance to cascade into Gold and the scoring that sits on top of it.
  • The conflicts that a gate cannot resolve automatically are business decisions, not engineering ones. Packaging them as explicit scenarios and locking the rule with the client before coding is what keeps the gate honest rather than guessing.

A credit bureau we worked with had to take in financial data from a long list of non-bank lenders and turn it into a single coherent view of each borrower. The hard part was not the volume. It was that no two sources meant the same thing by the same words, and the documents describing the data disagreed with the data itself. When that goes unchecked in a credit platform, the consequence is not a cosmetic glitch. It is a borrower who pays on time being reported as a serious defaulter, or a genuine defaulter looking clean, because somewhere upstream a field meant something other than what the pipeline assumed.

The instinct on a project like this is to fix each discrepancy as you find it, source by source, deep in the transformation code. That is how you end up with a pipeline nobody can reason about and bad data already sitting in the Gold layer by the time anyone notices. The better answer is to decide, deliberately, that bad data has a place where it is supposed to die, and to build the gate that kills it there.

The spec and the sample never quite agree

The first reality to make peace with is that the specification document and the actual sample files will not match, and the gap between them is where the bugs live. The spec is what the source said it would send. The sample is what it actually sends, and the two drift, because the spec was written once and the data is produced by a system that has its own history and its own shortcuts.

On this project the drift was not subtle once you looked for it, but it was completely invisible if you trusted the spec. The pipeline had been built to the documents, and the documents described a tidiness the files did not have. So the starting principle we applied was simple and a little uncomfortable: trust the data over the document, and treat every field whose meaning is asserted by the spec but not confirmed by the sample as unverified until proven. A pipeline built on the spec alone is built on a description of the data, not the data.
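
One way to make "unverified until proven" concrete is to compare the documented fields with what a real sample contains before any transformation code is written. The sketch below is illustrative only: the spec format (a plain mapping of field name to expected Python type), the field names, and the status labels are all assumptions, and a type match is a weak proxy that confirms shape, not meaning.

```python
def check_spec_against_sample(spec, sample_rows):
    """Compare a documented spec with rows parsed from a real sample file.

    spec: dict of field name -> expected Python type (hypothetical format).
    sample_rows: list of dicts, one per record in the sample.
    Returns a dict of field name -> status string.
    """
    report = {}
    for field_name, expected_type in spec.items():
        # Collect the non-null values the sample actually contains.
        values = [
            row[field_name]
            for row in sample_rows
            if row.get(field_name) is not None
        ]
        if not values:
            report[field_name] = "unverified: absent from sample"
        elif all(isinstance(v, expected_type) for v in values):
            report[field_name] = "confirmed"
        else:
            report[field_name] = "conflict: type differs from spec"

    # Fields the source sends that the spec never mentions.
    for row in sample_rows:
        for field_name in row:
            if field_name not in spec:
                report.setdefault(field_name, "undocumented: in sample only")
    return report


# Hypothetical spec and sample, for illustration.
spec = {"balance": float, "aging_1_30": float, "due_date": str}
sample = [{"balance": 1200.0, "aging_1_30": "35%", "status_code": "A"}]
report = check_spec_against_sample(spec, sample)
```

Anything that does not come back as confirmed is a candidate for a closer look, and even a confirmed field still needs its meaning checked against the source.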

Ambiguous fields are how debt gets doubled

Two specific ambiguities show exactly how this turns into real harm.

The first was the balance field. Across sources, the statement balance was sometimes the total amount the borrower owed and sometimes only the portion that was current, the part not yet overdue. Those are very different numbers, and nothing in the field name told you which one you were looking at. Read a current balance as a total, or a total as a current balance, and you have just misstated what the person owes, in a system whose entire purpose is to state accurately what people owe. There is no rounding error here; it is a wrong number presented with full confidence.

The second was aging. Debt is bucketed by how overdue it is: one to thirty days, thirty-one to sixty days, and so on. The specification asked for these buckets as percentages, while the actual files returned amounts. A percentage and an amount are not interchangeable, and a pipeline that treats one as the other does not just produce a slightly-off figure. It can move a borrower between aging categories entirely, turning what should be a current, healthy account into a deeply delinquent one, or the reverse. In a credit bureau, that misclassification is the product failing at its core job.
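
A rough heuristic can at least flag which unit a file appears to use. The sketch below assumes the buckets cover the whole balance (including the current bucket), so percentages should sum to about 100 and amounts should sum to about the total balance. The function name, tolerance, and labels are hypothetical, and anything that fits both or neither is left for a human to decide rather than guessed.

```python
import math


def infer_aging_unit(buckets, total_balance, rel_tol=0.01):
    """Guess whether aging buckets hold percentages or amounts.

    Assumes the buckets cover the whole balance, so:
      - percentages sum to roughly 100
      - amounts sum to roughly total_balance
    Returns "percentage", "amount", or "ambiguous".
    """
    bucket_sum = sum(buckets)
    looks_like_pct = math.isclose(bucket_sum, 100.0, rel_tol=rel_tol)
    looks_like_amount = math.isclose(bucket_sum, total_balance, rel_tol=rel_tol)

    if looks_like_pct and not looks_like_amount:
        return "percentage"
    if looks_like_amount and not looks_like_pct:
        return "amount"
    # Both or neither match: do not guess, escalate instead.
    return "ambiguous"


# Hypothetical records for illustration.
unit_a = infer_aging_unit([70.0, 20.0, 10.0], total_balance=5000.0)
unit_b = infer_aging_unit([3500.0, 1000.0, 500.0], total_balance=5000.0)
unit_c = infer_aging_unit([70.0, 20.0, 10.0], total_balance=100.0)
```

The third call is the awkward case: a balance of exactly 100 fits both readings, which is why the heuristic returns "ambiguous" instead of picking one.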

The lesson is that ambiguous financial fields are not a data-cleaning nuisance to be smoothed over. They are the precise mechanism by which a pipeline reports a falsehood about a real person's finances, and they have to be resolved explicitly rather than assumed.

Build the gate at Bronze-to-Silver

The structural decision that makes all of this manageable is choosing where bad data dies, and the answer is the boundary between Bronze and Silver.

Bronze is the raw landing zone; you take what the source gives you, faithfully and unjudged. Silver is the conformed layer where data is supposed to be clean and trustworthy. The transition between them is therefore the natural checkpoint, the last moment before data is treated as good, and it is where a strict data-quality gate belongs. That gate's job is to validate records against the debt logic: that a balance is the kind of balance you expect, that an aging value is in the unit you require, and that the numbers are internally consistent. Records that fail are blocked before they are ever promoted to Silver.
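
A minimal sketch of such a gate follows. It is not the bureau's implementation: the record layout (`balance_kind`, `aging_unit`, `total_balance`, `aging_amounts`) and the three rules are assumptions chosen to mirror the checks described above. Each rule returns a reason string when a record fails, and failing records are quarantined with their reasons instead of being promoted.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    promoted: list = field(default_factory=list)     # records allowed into Silver
    quarantined: list = field(default_factory=list)  # (record, reasons) pairs


def check_balance_kind(rec):
    # Assumption: Silver expects every balance to be a total.
    if rec.get("balance_kind") != "total":
        return "balance is not declared as a total"
    return None


def check_aging_unit(rec):
    # Assumption: Silver expects aging buckets as amounts.
    if rec.get("aging_unit") != "amount":
        return "aging buckets are not amounts"
    return None


def check_consistency(rec):
    # Overdue amounts can never exceed what is owed in total.
    if sum(rec.get("aging_amounts", [])) > rec.get("total_balance", 0):
        return "overdue amounts exceed total balance"
    return None


RULES = (check_balance_kind, check_aging_unit, check_consistency)


def run_gate(bronze_records, rules=RULES):
    """Split Bronze records into promoted and quarantined sets."""
    result = GateResult()
    for rec in bronze_records:
        reasons = [r for r in (rule(rec) for rule in rules) if r is not None]
        if reasons:
            result.quarantined.append((rec, reasons))
        else:
            result.promoted.append(rec)
    return result
```

Keeping the reasons next to each quarantined record matters as much as the blocking itself, because it turns a rejected file into a concrete question to take back to the source.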

The reason the boundary matters so much is what sits downstream. If bad data slips into Silver and then Gold, it does not stay contained; it flows straight into the credit scoring that the bureau's customers rely on, and a wrong balance or a misclassified aging bucket becomes a wrong credit decision. The gate at Bronze-to-Silver is what keeps a single malformed source from cascading into the scoring layer. Catching it there is cheap and contained. Catching it after Gold means unwinding decisions that have already been made on bad numbers.

The conflicts a gate cannot decide on its own

A quality gate can enforce a rule, but it cannot invent the rule, and the most important conflicts on this project were not ones engineering could resolve alone. Whether a given source's balance should be read as total or current, or how to reconcile a spec that wants percentages with files that send amounts, is a business definition, not a coding choice. Guess at it and the gate is just enforcing a guess.

So the discrepancies that could not be resolved mechanically were packaged up as explicit scenarios, each one a concrete description of the conflict and the options for handling it, and taken to the client's project owners to decide before anything went into the core logic. Where a rule could be made deterministic, we made it deterministic: the current balance, for instance, was derived by a fixed formula, taking the total owed and subtracting the overdue portion, so that every record was reduced to the same well-defined figure rather than relying on whichever interpretation a given source happened to use. The split is the important part. The gate automates the rules that can be made objective, and the rules that require a business decision are decided by the business, on the record, before they are coded.
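
The split between agreed rulings and deterministic derivation might look something like the sketch below. The source identifiers and the `BALANCE_RULINGS` table are hypothetical stand-ins for whatever the business signs off; the only rule taken from the project is the derivation itself, current balance equals total owed minus the overdue portion. A source with no recorded ruling is refused, not guessed at.

```python
from decimal import Decimal

# Hypothetical rulings agreed with the business, keyed by source id.
# "total":   the source's balance field is the total owed.
# "current": the source's balance field is already the current portion.
BALANCE_RULINGS = {
    "lender_a": "total",
    "lender_b": "current",
}


def current_balance(source_id, reported_balance, overdue):
    """Reduce a reported balance to the current (not yet overdue) portion.

    Amounts are passed as strings and handled as Decimal to avoid
    floating-point drift in monetary values.
    """
    ruling = BALANCE_RULINGS.get(source_id)
    if ruling is None:
        # No business decision on record: refuse instead of guessing.
        raise LookupError(f"no agreed balance ruling for source {source_id!r}")

    reported = Decimal(reported_balance)
    overdue_amount = Decimal(overdue)

    if ruling == "current":
        return reported

    # Fixed derivation: current = total owed - overdue portion.
    current = reported - overdue_amount
    if current < 0:
        raise ValueError("overdue portion exceeds total owed")
    return current


# Both hypothetical sources reduce to the same well-defined figure.
from_total = current_balance("lender_a", "5000.00", "1500.00")
from_current = current_balance("lender_b", "3500.00", "1500.00")
```

The lookup table is deliberately boring. Its value is that every entry corresponds to a decision someone made on the record, so a new source cannot enter the derivation without one.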

What the gate leaves you with

The bureau ended up with a quality gate at the Bronze-to-Silver boundary that blocked debt-logic violations before they reached the conformed layer, a fixed derivation for the figures that could be made deterministic, and a documented set of agreed rulings for the ambiguities that needed a human decision. The payoff was not only fewer wrong numbers. It was that when a new source misbehaved, it failed loudly at the gate instead of quietly corrupting a credit score three layers downstream.
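
"Failing loudly" can be as simple as refusing to promote a batch when too many of its records are quarantined. The sketch below is an illustration under assumptions: the exception name, the function, and the 5% default threshold are hypothetical, and the right threshold is itself a decision to agree with the business.

```python
class SourceRejected(Exception):
    """Raised when a source batch fails the gate badly enough to halt promotion."""


def enforce_batch_threshold(source_id, total, quarantined, max_reject_ratio=0.05):
    """Return the rejection ratio, or raise if it exceeds the threshold.

    total: number of records in the batch.
    quarantined: number of records the gate blocked.
    max_reject_ratio: hypothetical default, to be agreed per source.
    """
    if total <= 0:
        raise ValueError("batch must contain at least one record")

    ratio = quarantined / total
    if ratio > max_reject_ratio:
        # Stop the whole batch so the failure is visible at the gate,
        # not discovered later in a downstream score.
        raise SourceRejected(
            f"{source_id}: {quarantined} of {total} records failed the gate"
        )
    return ratio
```

A handful of bad records in an otherwise clean batch are quarantined and the remainder promoted; a batch that fails wholesale usually means the source has changed its meaning, and that is worth stopping for.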

The broader point generalises well beyond credit data. In any pipeline that aggregates from many sources, the sources will disagree, the spec will lie a little, and the only question is whether the disagreement is caught at a gate you designed or discovered in an output someone trusted. Decide where bad data dies, and make sure it dies before the layer everything else is built on.

If your pipeline trusts data it should be checking

The pattern of "the numbers are subtly wrong and we cannot tell why" is almost always a missing or misplaced quality gate. The data is not lying randomly; it is disagreeing with an assumption baked into the pipeline, and the fix is to make that assumption an explicit, enforced check at the right boundary rather than a hope.

Sapota's data team builds these validation layers into the medallion pipelines we ship, and the same discipline underpins the wider medallion-on-AWS platform we built for a regulated fintech. The gate is cheap to add early and expensive to retrofit once bad data has reached Gold.

Reach out via the custom software page with a description of the sources you are aggregating and where the numbers stop making sense. The disagreement is usually findable, and usually fixable at a boundary you control.
