SapotaCorp

DataWeave patterns and performance: transforms that scale

A Customer 360 API that joins three banking back-ends looks trivial in a demo and falls over in production. The difference is rarely syntax — it's the handful of DataWeave patterns that decide whether your transform runs in milliseconds or times out at scale. Here is what actually matters when the record count climbs into the millions.

Key takeaways

  • The single biggest DataWeave performance mistake is filtering one array inside a map over another, which is O(n²); building a groupBy index once turns every lookup into O(1) and is the difference between a 320ms run and a timeout at 10k×50k records.
  • Collapse multiple passes (sum, count, avg, max) into a single reduce with an accumulator object; four traversals become one, which at production volumes removes roughly three quarters of the traversal work with little loss of readability.
  • Streaming with deferred=true keeps large files off the heap, but it only survives sequential operations like map and filter; orderBy, groupBy, and distinctBy force full materialization and silently break the stream.
  • Treat default, ?. null-safe selectors, and try-wrapped coercion as mandatory on any field sourced from an external system — null and missing fields are the normal case in real back-ends, not the exception.

We were building a Customer 360 API for a financial-services client — the kind of thing that reads great on a slide. One endpoint, one clean JSON payload for the mobile app, stitched together from three back-ends: a core-banking system that owned the customer master, a wealth platform that owned portfolios, and a separate mortgage system that owned loans. Join them on the customer identifier, shape the result, ship it.

In the demo, with a dozen sample records, it was instant. The first time we pointed it at a realistic data volume — tens of thousands of customers against a comparable number of accounts — the flow timed out. Not crashed, not erred, just sat there past 120 seconds and gave up. The DataWeave was correct. It produced the right answer for small inputs every single time. It was also quietly O(n²), and nobody had noticed because correctness and performance are different problems that happen to live in the same .dwl file.

That gap — between a transform that works and a transform that scales — is what this is about. The syntax of DataWeave you can learn in an afternoon. The patterns that keep it fast under load are the ones you only learn by getting burned, so here are the ones that earned their place in our codebase.

The join is where it all goes wrong

The naive way to merge two arrays by key is the way everyone writes it first, because it reads like the problem statement: for each customer, find their accounts.

payload.customers map ((cust) -> {
  cif: cust.cif,
  accounts: payload.accounts filter ($.cif == cust.cif)
})

This is a nested loop. Every customer triggers a full scan of the accounts array. At 10,000 customers and 50,000 accounts that is five hundred million comparisons, and DataWeave will dutifully attempt all of them. The fix is to pay the indexing cost once, up front, and turn the inner scan into a hash lookup.

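%dw 2.0
output application/json
// Build the index once: an object keyed by cif, each value an array of accounts.
// Assumes cif is a String; a numeric value inside [] is read as a positional index.
var accountByCif = payload.accounts groupBy $.cif
---
payload.customers map ((cust) -> {
  cif: cust.cif,
  // Customers with no accounts are absent from the index, hence the default.
  accounts: accountByCif[cust.cif] default []
})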

groupBy runs once at O(m), and after that accountByCif[cust.cif] is an O(1) selector. The whole pipeline drops from O(n × m) to O(n + m). On our actual data the numbers were not subtle. At 1k × 5k the nested version took 1.8 seconds and the indexed version took 35 milliseconds, roughly a 50× gap. At 10k × 50k the nested version timed out past two minutes while the indexed version finished in 320 milliseconds, a gap of at least 375×. At 100k × 500k the nested version ran the worker out of memory; the indexed one did it in just over three seconds. Same output, two to three orders of magnitude apart once the volume is realistic.

This is the first thing I look for in any DataWeave review now. The moment I see a filter whose predicate references the outer lambda's variable, I know there is a groupBy that should have happened earlier.
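To make the shape of the index concrete, here is a minimal self-contained sketch of the same pattern. The inline customers and accounts arrays, and the field names other than cif, are hypothetical sample data standing in for the real back-end payloads.

```dataweave
%dw 2.0
output application/json

// Hypothetical sample data standing in for the two back-end payloads.
var customers = [
  { cif: "C001", name: "Ana" },
  { cif: "C002", name: "Binh" },
  { cif: "C003", name: "Chau" }
]
var accounts = [
  { cif: "C001", accountNo: "A-1", balance: 120.50 },
  { cif: "C002", accountNo: "A-2", balance: 75 },
  { cif: "C001", accountNo: "A-3", balance: 10 }
]

// One pass over accounts: an object keyed by cif,
// where each value is the array of accounts sharing that cif.
var accountsByCif = accounts groupBy ((acct) -> acct.cif)
---
customers map ((cust) -> {
  cif: cust.cif,
  name: cust.name,
  // C003 has no accounts, so the lookup yields null and default supplies [].
  accounts: accountsByCif[cust.cif] default []
})
```

The index is declared in the header so it is built before the map starts, and the map body contains no filter at all. That absence is the thing to check for in review.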

Walk the array once, not four times

The second habit that costs you is computing several aggregates as if each one were free. You want a total, a count, an average, and a max, so you write exactly that:

{
  total: sum(payload.items.amount),
  count: sizeOf(payload.items),
  avg: sum(payload.items.amount) / sizeOf(payload.items),
  max: max(payload.items.amount)
}

That walks the array four times: twice for the sum, because avg calls it again, plus the sizeOf and the max. On a short list it is irrelevant. On the nightly batch, where this client processed several million customers a day, every redundant pass multiplies straight into the total runtime. A transform that wastes 50ms per record across five million records costs 250,000 seconds, roughly seventy hours of cumulative processing time you did not need to spend.

The answer is one reduce that carries an accumulator object and gathers everything in a single traversal:

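%dw 2.0
output application/json
// Single pass: the accumulator object carries every aggregate at once.
// Seeding max with 0 assumes amounts are never negative.
var stat =
  payload.items reduce ((item, acc = {total: 0, count: 0, max: 0}) -> {
    total: acc.total + item.amount,
    count: acc.count + 1,
    max:   if (item.amount > acc.max) item.amount else acc.max
  })
---
{
  total: stat.total,
  count: stat.count,
  // Guard the empty case before dividing.
  avg: if (stat.count == 0) 0 else stat.total / stat.count,
  max: stat.max
}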

Note the count == 0 guard on the average. Skip it and an empty list divides by zero, which gives you Infinity, which JSON cannot represent, so it serializes as null and your downstream consumer gets a surprise it will report as a bug three weeks later. The single-pass version is not just faster; it forces you to think about the empty case where the multi-pass version let you ignore it.

Streaming buys you headroom, but it is fragile

When the input stops being an API response and starts being a file (a multi-hundred-megabyte CSV dump, a large SWIFT batch), materializing the whole thing into the heap is how you get an out-of-memory worker. The way out is streaming, and it has to be turned on at both ends. The reader declares it in the MIME type, with the streaming=true reader property, and the transform declares it in the output directive:

%dw 2.0
output application/json deferred=true
---
payload map ((row) -> {
  cif: row.CIF,
  balance: row.BALANCE as Number
})

deferred=true is the instruction that tells DataWeave not to build the output array in memory but to push records straight to the writer as they flow through. The catch — and this is the part that bites people — is that streaming only survives sequential operations. map, filter, and pluck keep the stream intact. The moment you introduce orderBy, groupBy, distinctBy, or a reduce whose accumulator depends on the whole input, DataWeave has to see every record before it can emit the first one, so it materializes the lot and your streaming is gone — silently, with no warning. If you genuinely need a global sort or group over a file too large to fit in memory, that is a signal to break the work into stages or push the operation down to a database, not to fight DataWeave about it.
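As a rough sketch of a transform that stays on the stream-friendly side of that line, the version below chains only filter and map. It assumes the payload arrives as a streamed CSV, and the STATUS column is a hypothetical addition for illustration.

```dataweave
%dw 2.0
output application/json deferred=true
---
// filter and map each look at one row at a time, so nothing here
// needs the whole file before the first record can be written.
payload
  filter ((row) -> row.STATUS == "ACTIVE")   // STATUS is a hypothetical column
  map ((row) -> {
    cif: row.CIF,
    balance: row.BALANCE as Number
  })
// Appending an orderBy to this chain would need every row before emitting any.
```

A quick way to check whether a change has broken the stream is to run it against a file larger than the worker's heap in a test environment, since there is no warning to rely on.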

Lazy by default, eager when you least expect it

DataWeave is lazy in a lot of places. A var is not computed until something reads it; an expression does not run until the output needs it. That is usually a gift, but it has two sharp edges. The first is that certain operations are eager and will quietly force a full materialization — sizeOf has to count the whole array, and orderBy/groupBy/distinctBy need all the data in hand. So a guard like if (sizeOf(payload.items) > 0) walks the entire array just to ask whether it is non-empty. if (!isEmpty(payload.items)) short-circuits at the first element instead.
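The two guards can be put side by side in a small sketch. The inline items array is hypothetical sample data standing in for payload.items.

```dataweave
%dw 2.0
output application/json

// Hypothetical inline data standing in for payload.items.
var items = [
  { sku: "A", amount: 10 },
  { sku: "B", amount: 25 }
]
---
{
  // Only asks whether there is at least one element.
  hasItems: !isEmpty(items),
  // Same answer, but sizeOf has to produce a full count first.
  hasItemsViaCount: sizeOf(items) > 0
}
```

Both fields evaluate to true here. The difference only shows up in cost, and only when the array is large or lazily produced.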

The second edge is the mirror image: a var that nothing consumes never runs at all, while an expensive expression written inline in three places is evaluated three times. If an expensive transform feeds several output fields, compute it once into a var that the output actually references, and stop guessing. When a script crosses 100 milliseconds, open the Profiler in the DataWeave Playground and let it tell you where the time goes rather than reasoning about laziness in your head.
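A minimal sketch of the compute-once habit, using hypothetical sample data and field names: the filter is named in the header and every output field reads from that one name instead of repeating the expression.

```dataweave
%dw 2.0
output application/json

// Hypothetical sample data.
var items = [
  { sku: "A", amount: 10, status: "ACTIVE" },
  { sku: "B", amount: 25, status: "CLOSED" },
  { sku: "C", amount: 40, status: "ACTIVE" }
]

// The expensive step, written once and given a name.
var activeItems = items filter ((item) -> item.status == "ACTIVE")
---
{
  activeCount: sizeOf(activeItems),
  // default [] covers the case where no item is active and the selector yields null.
  activeTotal: sum(activeItems.amount default []),
  activeSkus: activeItems map ((item) -> item.sku)
}
```

Whether the runtime would have reused an inline expression is exactly the kind of thing not worth guessing about; naming it removes the question.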

The defensive habits that keep production quiet

Performance is most of the story, but the transforms that survive contact with real data also assume the data is hostile. In our integration the core-banking system returned null for phone numbers, the wealth platform omitted the field entirely, and corporate customers came back with a null address that would throw Cannot dereference field 'city' from null the instant you reached into it. Three small operators handle nearly all of this. default supplies a fallback, but only when the value is null or missing — an empty string or a 0 passes straight through, which trips up anyone expecting JavaScript-style falsy behavior. The ?. null-safe selector returns null instead of throwing when the left side is null, and chaining it with default gives you two layers of safety on a deep path like payload.t24Customer?.address?.line1 default "". And any coercion of external input — someString as Number on a CSV column — belongs inside a try from dw::Runtime, because one stray non-numeric value will otherwise take down the whole record.
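The default and try habits can be sketched together as follows. The rows, the column names, and the toNumberOr helper are hypothetical; try is the function from dw::Runtime, and the sketch relies only on the success and result fields of the value it returns.

```dataweave
%dw 2.0
import try from dw::Runtime
output application/json

// Hypothetical rows as they might arrive from a CSV export.
var rows = [
  { CIF: "C001", BALANCE: "120.50", PHONE: null },
  { CIF: "C002", BALANCE: "N/A" }
]

// Hypothetical helper: coerce to Number, or return the fallback when
// the coercion fails or produces null.
fun toNumberOr(value, fallback: Number) = do {
  var attempt = try(() -> value as Number)
  ---
  if (attempt.success) (attempt.result default fallback) else fallback
}
---
rows map ((row) -> {
  cif: row.CIF,
  // "N/A" cannot be coerced, so this row falls back to 0 instead of failing.
  balance: toNumberOr(row.BALANCE, 0),
  // Null in the first row, missing in the second: default covers both.
  phone: row.PHONE default ""
})
```

Falling back to 0 is a design choice made here for brevity. For a balance field it may be safer to emit null, or to route the row to an error output, so that a bad value is not mistaken for a real zero.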

A few more things earned permanent spots on our pre-merge checklist. Coerce a value once into a var rather than re-coercing it on every reference. Prefer contains "ERROR" over matches /.*ERROR.*/, because a leading .* invites catastrophic backtracking in the underlying Java regex engine. Pull magic thresholds out into named vars or, better, Mule properties so they can be tuned without a code change. And mask PII — identity numbers, phone numbers, account numbers — before anything hits a log, which for a regulated bank is a compliance requirement, not a nicety.
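Two of those checklist items fit in a short sketch: the substring test written with contains, and a masking helper applied before a value is logged. The log lines, the account number, and the maskTail helper are hypothetical, and keeping the last four characters is an assumed masking policy rather than a recommendation.

```dataweave
%dw 2.0
output application/json

// Hypothetical log lines.
var lines = [
  "2024-01-01 ERROR core-banking timeout",
  "2024-01-01 INFO request completed"
]

// Hypothetical helper: hide everything except the last four characters.
fun maskTail(value: String): String =
  if (sizeOf(value) <= 4) "****"
  else "****" ++ value[-4 to -1]
---
{
  // Plain substring test, no regex engine involved.
  errors: lines filter ((line) -> line contains "ERROR"),
  // Mask before the value goes anywhere near a logger.
  maskedAccount: maskTail("1234567890")
}
```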

The thread running through all of it is the same: a DataWeave transform that is correct on a handful of records tells you almost nothing about how it behaves on a million. Write the join as an index, walk each array once, keep the stream alive through sequential operations, and treat every external field as if it might be null — and the version that demoes well and the version that runs at three in the morning finally become the same version.

