A credit bureau we worked with had built its ingestion the way most platforms start: one fixed file template, loaded in batches, parsed by code written specifically for that shape. It worked right up until the second partner. Every new lender sent files with a different number of columns, in a different order, in a different format, and the team responded the only way the architecture allowed: by writing more parsing code for each one. The codebase kept growing, the system kept getting harder to change, and it was heading toward a load measured in millions of API calls that the original design was never going to survive. The brief, when they reached us, was effectively "make ingestion stop being a code change every time."
There are three distinct problems tangled together here, and they are worth separating, because each has its own fix and the team had been treating them as one big mess. There is the schema problem, that every partner's file is shaped differently. There is the scale problem, that the volume had outgrown the original compute. And there is the operability problem, that when something broke, nobody could tell what. Pull them apart and each becomes tractable.
Stop encoding the schema in the code
The root cause of the bloat was that the structure of each partner's file lived in the source code. A new format meant a new parser, which meant a deploy, which meant the codebase grew linearly with the number of partners and every onboarding was an engineering task. That is the pattern to break first, because everything else is harder while it persists.
The fix is to move the schema out of the code and into configuration. We built a dynamic template engine, an ingestion flow on top of the existing Glue and Airflow tooling, in which the column structure of each partner's file is described in a mapping configuration rather than hard-coded. Onboarding a new partner becomes a matter of editing a mapping rule that defines which incoming column means what, without touching the parsing source at all. The engine reads the mapping and adapts; the code stays the same. This is the same principle as preferring configuration over customization anywhere else: the thing that varies per partner belongs in data you can change safely, not in code you have to redeploy and retest every time a lender joins.
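A minimal sketch of the idea in Python, not the engine itself. The partner names, column names, and the `PARTNER_MAPPINGS` structure are hypothetical, and in practice the mapping would live in a config file or table rather than in source. The point is that two differently shaped files pass through the same code and come out in one canonical shape:

```python
import csv
import io

# Hypothetical per-partner mappings: incoming header -> canonical field name.
# Shown inline for brevity; the idea is that this is data, not parser code.
PARTNER_MAPPINGS = {
    "lender_a": {
        "acct_no": "account_id",
        "bal": "statement_balance",
        "dt": "statement_date",
    },
    "lender_b": {
        "StatementDate": "statement_date",
        "AccountNumber": "account_id",
        "Balance": "statement_balance",
    },
}


def load_rows(partner, raw_text):
    """Parse a partner's CSV text and rename its columns to the canonical schema."""
    mapping = PARTNER_MAPPINGS[partner]
    reader = csv.DictReader(io.StringIO(raw_text))
    return [
        {canonical: row[incoming] for incoming, canonical in mapping.items()}
        for row in reader
    ]


file_a = "acct_no,bal,dt\n123,50.00,2024-01-31\n"
file_b = "StatementDate,AccountNumber,Balance\n2024-01-31,123,50.00\n"

rows_a = load_rows("lender_a", file_a)
rows_b = load_rows("lender_b", file_b)
```

Adding a third lender in this sketch means adding one more entry to the mapping; `load_rows` does not change.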
The immediate payoff is that the codebase stops growing with the partner list. The deeper payoff is that onboarding moves from the engineering team's backlog to a configuration task, which is where the bottleneck disappears.
Validate structure, because drift is silent
Making ingestion config-driven introduces a risk you have to handle deliberately, which is that a flexible loader can be too trusting. If a partner shifts a column, adds one, or reorders their file, a loader that simply reads positions will happily map data into the wrong field, and like most ingestion failures it does this silently. The values are all present and all plausible; they are just in the wrong places, and nothing throws.
So the mapping configuration is not only a convenience for onboarding, it is the thing you validate against. The engine checks the incoming file's structure against the expected mapping before it trusts the contents, so a column that has drifted out of place is caught as a structural mismatch rather than loaded into the wrong field. The general rule, the same one that applies to deduplication and to financial fields, is that the dangerous ingestion errors are the ones that do not raise an exception, and the only defence is to assert what you expect rather than assume the file complies.
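As an illustration, a structural check of this kind can be sketched in a few lines of Python. The function name `check_structure` and the column names are hypothetical, and this version assumes the file has a header row to compare against the expected column list:

```python
def check_structure(expected_columns, header):
    """Compare a file's header to the expected columns.

    Returns a list of human-readable problems; an empty list means the
    structure matches and the contents can be loaded.
    """
    expected = list(expected_columns)
    actual = list(header)
    problems = []

    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    duplicated = sorted({c for c in actual if actual.count(c) > 1})

    if missing:
        problems.append(f"missing columns: {missing}")
    if unexpected:
        problems.append(f"unexpected columns: {unexpected}")
    if duplicated:
        problems.append(f"duplicated columns: {duplicated}")
    # Same set of columns but a different order: the silent-drift case
    # for any loader that reads by position.
    if not problems and actual != expected:
        problems.append(f"column order changed: expected {expected}, got {actual}")
    return problems


expected = ["account_id", "statement_balance", "statement_date"]
ok = check_structure(expected, ["account_id", "statement_balance", "statement_date"])
reordered = check_structure(expected, ["account_id", "statement_date", "statement_balance"])
drifted = check_structure(expected, ["account_id", "statement_balance", "region"])
```

A file is loaded only when the returned list is empty; anything else is rejected as a structural mismatch before a single value is mapped.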
Make the failures readable, or your operators pay for it
The third problem was the one that quietly burned the most hours. The file processing ran in the background, on scheduled jobs orchestrated through the pipeline, and when a file failed the old system returned a generic quality-check index, a code that told the operations team something had gone wrong without telling them what, where, or in which row or column. So every failure became a manual investigation, an engineer digging through data by hand to find the one malformed line, while the daily credit report waited.
The fix was not more logging, it was better-aimed logging. We turned the opaque index codes into descriptions that actually named the problem, the specific row and column that violated the rule, so the message itself told the operator where to look. Then we connected the pipeline's logs to real-time monitoring through Grafana and CloudWatch, and wired up email alerting through AWS SES so that when a file failed, the operations team received a message naming the offending row and column immediately, rather than discovering the failure later and starting an investigation. The difference between "DQ index 47" and "row 1,204, column statement_balance, expected amount got percentage" is the difference between an afternoon of debugging and a two-minute fix, and at the volume this platform ran, that difference compounded every single day.
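To make the contrast concrete, here is a hedged Python sketch of a check that reports the row and column instead of an index code. `RowError` and `check_amounts` are hypothetical names, and the rule (a value must parse as a decimal amount) is an assumption chosen to match the example message above. Delivery through SES or CloudWatch is left out; the sketch only shows building the message an alert would carry:

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass
class RowError:
    """One rule violation, carrying enough context to find it in the file."""
    row: int
    column: str
    expected: str
    got: str

    def describe(self):
        return (
            f"row {self.row:,}, column {self.column}: "
            f"expected {self.expected}, got {self.got!r}"
        )


def check_amounts(rows, column):
    """Return a RowError for every value in `column` that is not a decimal amount."""
    errors = []
    for number, row in enumerate(rows, start=1):
        value = row[column]
        try:
            Decimal(value)
        except InvalidOperation:
            errors.append(RowError(number, column, "amount", value))
    return errors


rows = [
    {"statement_balance": "10.50"},
    {"statement_balance": "12%"},
]
errors = check_amounts(rows, "statement_balance")
messages = [e.describe() for e in errors]
```

Each string in `messages` is what would go into the alert body, so the operator sees the location of the bad value without opening the file.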
Scale is an architecture decision, not a slider
Underneath all of this was a load the original design could not carry. The platform was heading toward something on the order of eight million API calls, over files large enough that pure serverless functions would time out or send the infrastructure bill through the roof under that kind of concurrency. Serverless is the right default for spiky, modest workloads, and the wrong default for sustained heavy processing of large files at high volume.
So part of the work was re-architecting the compute rather than tuning it: moving the heavy processing off pure serverless functions and onto container- and instance-based compute that could handle the throughput smoothly and predictably. This is the unglamorous half of scaling: recognising when you have left the envelope a given service is good at and moving to one built for the load, instead of pushing the original choice past where it works and paying for it in timeouts and cost spikes. The lesson is that real scale usually forces an architecture change, not just a configuration change, and the sooner you accept that, the cheaper the transition is.
What the rebuild left behind
The bureau came out of it with ingestion that onboarded a new partner format by editing a mapping rule rather than shipping code, structural validation that caught column drift instead of silently misloading it, failure alerts that named the exact row and column and reached the operations team in real time, and compute sized for the actual load rather than the load it started with. Onboarding stopped being an engineering project, debugging stopped being an archaeology dig, and the platform stopped threatening to fall over at volume.
The thread running through these fixes is the same one that runs through most data engineering at scale: push the things that vary into configuration, validate what you assume instead of trusting it, and make your failures tell you where they happened. None of those are exotic. They are just the difference between a pipeline that grows with you and one that you fight a little harder every time a new partner shows up.
If your ingestion grows with every new source
The signature of this problem is a codebase that gets bigger every time you onboard a partner, and an operations team that dreads file failures because the errors do not say anything. Both are fixable, and the fix is architectural: configuration-driven schemas, structural validation, readable alerting, and compute matched to the real load.
Sapota's data team builds ingestion this way as a matter of course, and it is part of the same reusable framework behind the medallion-on-AWS platform we delivered for a regulated fintech. Getting the ingestion layer right early is what lets the platform absorb new sources without absorbing new pain.
Reach out via the custom software page with a description of how your sources arrive and where onboarding or failures are slowing you down. The fix is usually moving the variation out of the code.








