A Dataverse row changes. Five downstream systems need to know: an ERP that tracks financials, a search service that indexes the record, a Power BI dataset that feeds executive dashboards, a notification queue that messages field reps on mobile, and a data lake for analytics retention.
The junior version of this is a single async plugin that calls all five from the same post-operation step. It works in development. In production, it fails the first time any one of the five has a bad afternoon - the plugin step errors, retries ten times, fills the System Jobs queue, and operators get paged.
The pattern we ship at enterprise scale is different: the plugin's only job is to publish a message to an Azure Service Bus topic. Five independent consumers subscribe to that topic and handle their respective destinations. Each consumer retries, dead-letters, and is monitored independently. The plugin never calls out to anything but Service Bus.
Here is the architecture, the code, the failure semantics, and the six months of real-production experience that shaped it.
The architecture
A Topic (not a Queue) because the same event feeds multiple subscribers. Subscriptions filter so each consumer only sees events it cares about. Dead-letter per subscription, not global - the ERP and search services can fail independently without cross-contaminating.
Stage 1: the plugin
The plugin is small on purpose. Every line is a potential failure point; the less it does, the more reliable it is.
Three key decisions:
- Payload is metadata, not the full row. The message includes the entity name, ID, and which attributes changed. Consumers fetch the full row if they need it. Messages stay small (sub-1KB), which Service Bus charges less for and which avoids the edge case of payloads exceeding message size limits.
- CorrelationId from the plugin context. Every log entry and every downstream message carries the same correlation ID, making cross-system tracing possible with one query.
- Subject field for filtering. Subscribers filter on Subject LIKE 'account.%' or Subject = 'order.Update'. Subject-based filtering is cheap server-side; body-based filtering is more expensive.
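The three decisions above can be sketched together. This is an illustrative Python model of the message the plugin publishes (the real plugin is C# running in the Dataverse sandbox, and the field names here are assumptions, not the exact production schema):

```python
import json
import uuid

def build_change_message(entity, message_name, record_id,
                         changed_attributes, correlation_id):
    """Build the metadata-only event envelope: entity, ID, and changed
    attribute names -- never the full row. Illustrative sketch only."""
    body = {
        "entity": entity,                  # e.g. "account"
        "id": record_id,                   # Dataverse row GUID
        "changed": changed_attributes,     # attribute names, not values
        "correlationId": correlation_id,   # from the plugin execution context
    }
    return {
        "message_id": str(uuid.uuid4()),           # used by duplicate detection
        "subject": f"{entity}.{message_name}",     # e.g. "account.Update"
        "correlation_id": correlation_id,
        "body": json.dumps(body),
    }

msg = build_change_message("account", "Update",
                           "00000000-0000-0000-0000-000000000001",
                           ["name", "revenue"], "corr-123")
assert msg["subject"] == "account.Update"
assert len(msg["body"].encode()) < 1024   # metadata-only payloads stay under 1 KB
```

The Subject is derived mechanically from entity and message name, which is what makes the cheap server-side `LIKE` filtering in the subscriptions possible.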
Stage 2: Service Bus Topic configuration
Topic-level settings:
- Max message size: 1MB (default). Our payloads are ~500 bytes, nowhere near the limit.
- Message time-to-live: 7 days. Long enough to survive a weekend outage, short enough that stale messages don't haunt the system.
- Duplicate detection: enabled with 10-minute window. If the same MessageId arrives twice within 10 minutes, the second is dropped. This guards against plugin retries causing duplicate downstream processing.
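Conceptually, duplicate detection behaves like the following simplified in-memory sketch. The real mechanism lives inside Service Bus and keys on MessageId; this only illustrates the window semantics:

```python
import time

class DuplicateDetector:
    """Simplified model of Service Bus duplicate detection: a second
    message with the same MessageId inside the window is dropped."""

    def __init__(self, window_seconds=600.0):
        self.window = window_seconds
        self._seen = {}   # message_id -> first-seen timestamp

    def accept(self, message_id, now=None):
        now = time.time() if now is None else now
        # Forget entries older than the detection window.
        self._seen = {m: t for m, t in self._seen.items()
                      if now - t < self.window}
        if message_id in self._seen:
            return False            # duplicate within the window: dropped
        self._seen[message_id] = now
        return True

d = DuplicateDetector(window_seconds=600)
assert d.accept("msg-1", now=0)        # first delivery accepted
assert not d.accept("msg-1", now=120)  # plugin retry 2 minutes later: dropped
assert d.accept("msg-1", now=700)      # outside the window: treated as new
```

The third assertion is exactly why the window length matters: a retry that lands outside the window is processed again, which is what the later tuning note about extending 1 minute to 10 minutes addresses.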
Subscription-level settings per subscriber:
- Filter rule: SQL-like expression on message properties (Subject, custom headers).
- Max delivery count: 5 (same reasoning as in the simpler Service Bus pattern).
- Lock duration: tuned to each consumer's processing time. The ERP consumer, which makes a remote call taking up to 30 seconds, has a 60-second lock. The search consumer, which is purely in-Azure and takes 1-2 seconds, has a 30-second lock.
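As a hedged sketch, the per-subscription settings above map to infrastructure-as-code along these lines (Bicep; the resource names, filter expression, and parent topic symbol `eventsTopic` are illustrative assumptions, not the production template):

```bicep
resource erpSub 'Microsoft.ServiceBus/namespaces/topics/subscriptions@2021-11-01' = {
  parent: eventsTopic        // existing topic resource (assumed)
  name: 'erp'
  properties: {
    maxDeliveryCount: 5      // dead-letter after 5 failed deliveries
    lockDuration: 'PT1M'     // ERP call can take up to 30s; 60-second lock
  }
}

resource erpFilter 'Microsoft.ServiceBus/namespaces/topics/subscriptions/rules@2021-11-01' = {
  parent: erpSub
  name: 'erp-events'
  properties: {
    filterType: 'SqlFilter'
    sqlFilter: {
      sqlExpression: 'Subject LIKE \'account.%\' OR Subject LIKE \'order.%\''
    }
  }
}
```

The search subscription would be identical apart from its filter and a shorter `PT30S` lock.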
Stage 3: the consumers
Each subscriber is an Azure Function with a Service Bus Topic subscription trigger. They share a common frame but implement different downstream logic.
Common frame:
The idempotency store is either Cosmos DB with TTL or Redis. Every successfully processed MessageId is recorded. The TTL matches the message time-to-live on the Topic plus a safety margin, so the store never forgets a message that could still be redelivered.
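The frame every consumer shares can be sketched like this (Python for brevity; an in-memory dict stands in for Cosmos DB or Redis, and `handle`/`process` are illustrative names, not the production code):

```python
import time

class IdempotencyStore:
    """Stand-in for Cosmos DB (with TTL) or Redis: records processed MessageIds."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._processed = {}   # message_id -> processed-at timestamp

    def already_processed(self, message_id, now=None):
        now = time.time() if now is None else now
        t = self._processed.get(message_id)
        return t is not None and now - t < self.ttl

    def mark_processed(self, message_id, now=None):
        self._processed[message_id] = time.time() if now is None else now

def handle(message, store, process):
    """Common consumer frame: skip duplicates, run the consumer-specific
    core logic, then record success. `process` may raise, in which case
    the message is abandoned and redelivered by Service Bus."""
    if store.already_processed(message["message_id"]):
        return "skipped"       # redelivery of an already-handled message
    process(message)           # ERP call, index update, refresh trigger, ...
    store.mark_processed(message["message_id"])
    return "processed"

store = IdempotencyStore(ttl_seconds=8 * 24 * 3600)  # topic TTL (7 days) + margin
calls = []
msg = {"message_id": "m-1", "body": "{}"}
assert handle(msg, store, calls.append) == "processed"
assert handle(msg, store, calls.append) == "skipped"  # duplicate does no work
assert len(calls) == 1
```

Marking the message processed only after `process` succeeds is the key ordering: a crash between the two leaves the message unmarked, so the retry does the work again rather than losing it.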
Core logic per consumer is where the differences live:
- ERP consumer: fetches the full Dataverse row via Web API (using a managed identity), maps to the ERP's schema, calls the ERP API with an idempotency key derived from MessageId.
- Search consumer: fetches the row, calls the search service's indexing API.
- Power BI consumer: triggers a dataset partition refresh; uses the correlation ID to tag the refresh operation.
- Notification consumer: looks up which users to notify based on the row's owner and policy, sends via the notification service.
- Data lake consumer: appends a compressed JSON record to partitioned ADLS Gen2 path by date.
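For the data lake consumer, the partitioned path is mechanical. A sketch, assuming a Hive-style `year=/month=/day=` layout (the exact prefix and layout are illustrative, not the production convention):

```python
from datetime import datetime, timezone

def lake_path(entity, event_time, message_id):
    """Build the date-partitioned ADLS Gen2 path for one event record.
    Illustrative layout; the real consumer also gzips the JSON body."""
    d = event_time.astimezone(timezone.utc)
    return (f"events/{entity}/year={d:%Y}/month={d:%m}/day={d:%d}/"
            f"{message_id}.json.gz")

p = lake_path("account", datetime(2024, 5, 12, 9, 30, tzinfo=timezone.utc), "m-42")
assert p == "events/account/year=2024/month=05/day=12/m-42.json.gz"
```

Keying the file name on MessageId makes the append idempotent for free: a redelivered message overwrites the same blob instead of duplicating the record.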
Observability
The chain is not useful in production without observability. Our setup:
- Application Insights per Function App, with shared workspace so queries span functions.
- Every log entry includes CorrelationId and MessageId, attached via an ILogger.BeginScope logging scope in each consumer.
- Custom metrics: ProcessedMessages, FailedMessages, DeadLetteredMessages, ProcessingDuration.
- Dashboards (Azure Monitor workbooks): end-to-end latency per event type (from Dataverse change timestamp to final destination write), dead-letter queue depth per consumer, and error rate per consumer per hour.
- Alerts: DLQ depth > 10 for any consumer, processing duration p95 > SLA for any consumer.
The end-to-end trace for a single event is a KQL query across Application Insights joining by CorrelationId. One row change in Dataverse, traced through plugin → Service Bus publish → each consumer → each downstream call. When something fails, we know exactly where in the chain it failed.
Six months in production: the numbers
A client with:
- 50,000 Dataverse events per day
- 5 consumers averaging 3 seconds processing each
- Bursts to 500 events/minute during peak hours
Current state:
- P50 end-to-end latency (event → all consumers complete): 4.2 seconds
- P95 end-to-end latency: 11 seconds
- Dead-letter rate: 0.003% (one in 33,000 messages)
- Consumer error rate: 0.1-0.2% (mostly transient, auto-retried)
- Monthly Azure cost (Service Bus + Function Apps + Application Insights): ~$320
What we tuned after launch
Ratio of Function App instances to queue depth. The Consumption plan auto-scales, but the defaults were conservative. We adjusted maxConcurrentCalls per function to keep each instance near its processing capacity before the platform spins up more instances.
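maxConcurrentCalls lives in host.json. A hedged sketch for the v5 Service Bus extension (the value 16 is illustrative; with the older in-process v4.x extension the setting sits under messageHandlerOptions instead):

```json
{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "maxConcurrentCalls": 16,
      "prefetchCount": 0
    }
  }
}
```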
Filter rules on subscriptions. Initially each consumer received every event and filtered in code. Moving the filter to the subscription level (server-side) cut consumer costs by ~60% because consumers no longer woke up for irrelevant events.
Deduplication window. Initial 1-minute window was too short; plugin retries occasionally pushed duplicates more than a minute apart. Extended to 10 minutes after measuring actual retry intervals.
When not to ship this
This pattern is overkill for:
- Projects with one or two downstream consumers - a direct Power Automate flow is simpler.
- Low-volume scenarios (< 1000 events/day) - the operational complexity outweighs the throughput benefit.
- Teams without Azure experience - maintaining Service Bus, Function Apps, and Application Insights is real ops burden. Pick a simpler pattern until the operational muscle exists.
For the projects where it fits (enterprise-scale, multi-consumer, event-critical), this is the architecture we reach for first. It survives load, it surfaces failures cleanly, and when something goes wrong, the fix is almost always in a known location with clear visibility.