Building a Medallion Lakehouse on AWS Glue and Iceberg (No Databricks)

We built a production medallion lakehouse for a regulated credit bureau using only AWS native services: Glue, Iceberg, MWAA, RDS, and Athena. Here is the full end to end stack, why each layer exists, and the honest tradeoffs of skipping Databricks.

Published Jun 28, 2026

Building a Medallion Lakehouse on AWS Glue and Iceberg (No Databricks)

Key takeaways

A medallion lakehouse on AWS does not need Databricks: Glue Spark, Iceberg tables, MWAA, RDS, and Athena cover ingestion, transformation, orchestration, and serving on their own.
Apache Iceberg is the load bearing choice because it gives you schema evolution, ACID writes, and snapshot time travel, which together make SCD Type 4 auditing and safe reprocessing practical.
Each layer has one job: Bronze standardizes raw files into Iceberg, Silver cleanses and runs quality gates, and Gold models a dimensional star schema for consumption.
RDS PostgreSQL serves curated Gold data to reporting tools while Athena handles serverless ad hoc queries directly on S3, so you do not pay warehouse compute for exploration.
Going AWS native trades the polished Databricks developer experience for lower cost, no extra vendor, and full control, at the price of wiring observability and the catalog together yourself.

When a regulated credit bureau asked us to modernize its credit-reporting platform, the obvious default was Databricks on top of S3. We went a different way. The whole thing runs as an aws medallion architecture built only from AWS native services: AWS Glue for Spark ETL, Apache Iceberg for table storage, Amazon MWAA for orchestration, Amazon RDS PostgreSQL as the serving warehouse, and Amazon Athena for serverless query. No third party lakehouse engine, no extra vendor contract, no per DBU billing. This post walks the real stack end to end, explains what each piece actually does, and is honest about what you give up by not buying Databricks.

The shape of the system

Data enters through AWS Transfer Family, a managed SFTP endpoint. External contributors drop files (CSV, TXT, XLSX) into a folder structure keyed by contributor, and Transfer Family writes them straight into an Amazon S3 landing zone. From there a chain of Glue Spark jobs promotes the data through the medallion layers, MWAA decides when each job runs and what happens on failure, and the finished dimensional model lands in RDS PostgreSQL for the reporting engine. Athena sits beside all of it for direct SQL against S3, and CloudWatch plus Grafana Cloud watch the whole pipeline.

In plain text the flow looks like this:

SFTP (Transfer Family) -> S3 landing -> Glue (Bronze, Iceberg)
   -> Glue (Silver, cleanse + Glue DQ) -> Glue (Gold, star schema)
   -> Glue (Gold to RDS PostgreSQL)        -> reporting tools
S3 (all Iceberg layers) -> Athena (serverless ad hoc SQL)
MWAA orchestrates every Glue job; CloudWatch + Grafana observe everything.

The landing zone is deliberately dumb. Files sit there as an exact byte for byte copy of what the contributor sent, with no transformation at all. That immutable copy is your audit anchor and your replay button: when a downstream job has a bug, you reprocess from landing rather than asking contributors to resend. A head object check stops the same filename from being ingested twice in one window, and processed files move into dated archive, rejected, and duplicate prefixes so every file has a traceable end state.

Why Iceberg carries the whole design

Every persistent table from Bronze onward is an Apache Iceberg table, and that choice is what makes the rest of the architecture work rather than being an afterthought. Three properties matter here.

Iceberg gives you schema evolution without rewriting data. Credit contributors do not agree on a fixed file format, and new columns appear over time, so Bronze writes with write.spark.accept-any-schema enabled and absorbs the drift instead of breaking. It gives you ACID writes, so a Glue job that fails halfway does not leave a half written table that the next reader trips over: a commit is all or nothing. And it gives you snapshot based time travel, meaning every write produces a new immutable snapshot of the table while old snapshots remain queryable.

That last property is not a nice to have for a credit bureau, it is the audit engine. The Gold SCD pipeline reads Iceberg's changelog view across a range of snapshots, computes what changed between versions, and writes full row history and field level audit tables off the back of it. Without snapshot semantics in the storage layer you would be hand rolling change data capture; with Iceberg it falls out of the table format itself. We go deep on that in SCD Type 4 with Iceberg snapshots.

Bronze: standardize, do not clean

The Bronze layer is the first stage of persistent, query optimized storage. Glue Spark reads the raw landing files (using the Spark Excel library and positional mapping for the messier spreadsheet contributors), converts them into Iceberg tables, and enriches each record with system metadata: a UUID, an ingestion timestamp, and the source file name. Tables are partitioned by ingestion_date so date ranged queries only scan what they need.

The discipline here is that Bronze does format standardization and almost nothing else. There is light cleansing, stripping zero width "ghost" characters and normalizing obvious date and number formatting, but no business validation, no deduplication of records, no conforming to a master schema. Bronze is meant to be a raw but structured historical record of everything that ever arrived. If you start enforcing business rules at this layer you lose the ability to see what the source truly sent, which for a regulated dataset is exactly the thing you need to keep.

Silver: cleanse, conform, and gate on quality

The Silver layer is where data earns the right to be trusted. A Glue job reads Bronze incrementally (only records where ingestion_date is past the last successful checkpoint, which keeps compute cost down), then runs a transformation framework that standardizes nulls, casts decimals and integers, parses multiple date formats, normalizes phone numbers and emails, and patches missing fields. Patching is layered: forward patching pulls an account's last known value from the existing Silver table, internal patching uses Spark window functions to fill gaps inside the current batch, and conditional logic infers values from operation and status codes.

The part that makes Silver a real gate is AWS Glue Data Quality. After transformation but before anything is written, the batch runs through a metadata driven DQ engine. Rules live in a control CSV, get compiled into Glue's DQDL at runtime, and are split by contributor type, since buy now pay later data and traditional credit data have different shapes. Rules are either Critical or Warning. A Critical failure hard rejects the row, which is routed to a dq/error prefix for the operations team and never reaches Silver. A Warning is ingested but logged. The interesting rules are temporal: by unioning the incoming batch with the latest Silver partition and applying a LAG window over statement date, the engine can check month over month behavior, for example that a credit limit did not silently change or that arrears did not jump more than two months in a cycle. That hybrid baseline approach gets its own writeup in baseline data quality.

Gold: the dimensional star schema

The Gold layer turns the clean, normalized Silver tables into a star schema built for reading. It is a set of dimensions (client, sponsor, account, contributor, and an SSM company lookup), a bridge table linking sponsors to accounts many to many, and a fact_statement table holding monthly balances, arrears, and transaction counts. Each entity follows the same extract, transform, load, process pattern, assigns integer surrogate keys in place of natural keys, and merges into the permanent Gold table with an Iceberg upsert.

Two pieces of Gold deserve a callout. The first is master data management. The same person shows up across contributors with different identifiers, an old IC here, a passport there, so the job uses a graph style clustering pass to group related records into a single golden record, then applies survivorship rules to pick a surviving row and backfill its gaps. We cover that in MDM golden records. The second is the SCD Type 4 audit described earlier, which runs as a post processor over Gold's Iceberg snapshots to maintain history and field audit tables separately from the current state.

Serving: RDS for reports, Athena for exploration

Gold lives in Iceberg on S3, but the reporting engine wants a relational database, so a final Glue job loads Gold into Amazon RDS PostgreSQL. It is incremental on an updated_at watermark, enforces a 24 month retention window so RDS stays small and fast, and runs CREATE INDEX and ANALYZE over JDBC after each load so query plans stay healthy. RDS is the warehouse that dashboards and the report engine hit directly.

Amazon Athena serves a different need. It runs serverless SQL straight against the Iceberg tables in S3, with no data movement and no warehouse to keep running. Analysts and engineers use it for ad hoc questions, and the cost model is pay per query on bytes scanned, which the Parquet columnar layout and partition pruning keep low. The split is deliberate: RDS for predictable, indexed, repeated reporting workloads, Athena for cheap exploration where you do not want to pay for standing compute. Older S3 data tiers down to S3 Glacier automatically through lifecycle policies, so infrequently accessed history costs a fraction of standard storage while staying durable and retrievable.

Orchestration and observability

Amazon MWAA, which is managed Apache Airflow, is the conductor. It schedules the Glue jobs, enforces the dependency order (you cannot build the account dimension before the client dimension exists, and Gold cannot run before Silver), and handles retries with exponential backoff for transient failures. Every job writes to its own audit table (file, Silver, Gold, and RDS ingestion histories), and checkpoints only advance after a successful load, so a failed run safely re runs from where it stopped rather than reprocessing or skipping data. The MWAA setup is its own topic in orchestrating it with MWAA.

Observability is two layered. CloudWatch collects the raw metrics and logs (job duration, records processed, DPU usage, failure counts, S3 errors) and fires alarms into SNS, which fans out to PagerDuty and Slack for critical alerts and email for warnings. Grafana Cloud sits on top as the single pane: unified dashboards for MWAA and Glue execution metrics, Glue DQ pass and fail scores, and RDS and Athena utilization, so an operator can see pipeline health and data health in one place instead of clicking through console pages.

The Databricks tradeoff, honestly

Choosing AWS native over Databricks is a real tradeoff, not a free win. What you give up is the developer experience: Databricks notebooks, its managed Spark with autoscaling and Photon, Unity Catalog governance, and a single integrated platform where most of this wiring is already done. With Glue you assemble those pieces yourself. The Glue Data Catalog is your metastore, you build the DQ framework as a Glue job rather than using a packaged feature, and stitching CloudWatch to Grafana is work you own.

What you get back is meaningful for this kind of client. There is no additional vendor and no DBU markup on top of the EC2 and S3 you already pay AWS for, which matters under a fixed budget. Everything stays inside the AWS account boundary, IAM, and VPC, which is easier to defend to a regulator than a second platform holding the data. And because Iceberg is an open table format, the data is not locked to any engine: Athena, Spark, or anything else that speaks Iceberg can read it tomorrow. For a regulated bureau that values cost control and a single security perimeter over notebook ergonomics, native was the right call. For a team that lives in notebooks and wants autoscaling Spark out of the box, Databricks would still be the faster path.

Takeaway

A medallion lakehouse does not require a dedicated lakehouse vendor. Glue Spark, Apache Iceberg, MWAA, RDS, and Athena cover the full path from SFTP ingestion to a served star schema, and Iceberg's schema evolution, ACID writes, and snapshot time travel are what make the harder parts (quality gating, MDM, and SCD Type 4 auditing) tractable. The cost is that you wire the platform together yourself instead of buying it assembled. For the right requirements, lower cost, one vendor, one security boundary, and open data, that is a trade worth making.

If you are weighing an AWS native lakehouse against Databricks, or you already have one and want it to behave under audit, we have built this end to end. See what we do at /service and get in touch at /contact.

Daniel Duong

Salesforce + AI Engineer

Share Your Story

We build trust by delivering what we promise – the first time and every time!

We'd love to hear your vision. Our IT experts will reach out to you during business hours to discuss making it happen.

WHY CHOOSE US

"Collaborate, Elevate, Celebrate where Associates - Create Project Excellence"

SapotaCorp beyond the IT industry standard, we are

Certificated
Assured quality
Extra maintenance

Building a Medallion Lakehouse on AWS Glue and Iceberg (No Databricks)

Key takeaways

The shape of the system

Why Iceberg carries the whole design

Bronze: standardize, do not clean

Silver: cleanse, conform, and gate on quality

Gold: the dimensional star schema

Serving: RDS for reports, Athena for exploration

Orchestration and observability

The Databricks tradeoff, honestly

Takeaway

Daniel Duong

Need this on your team?

Share Your Story

Contact Us

Email

WhatsApp

Office

WHY CHOOSE US

Tell us about your project

Contacts

Company

Services

contacts

Building a Medallion Lakehouse on AWS Glue and Iceberg (No Databricks)

Key takeaways

The shape of the system

Why Iceberg carries the whole design

Bronze: standardize, do not clean

Silver: cleanse, conform, and gate on quality

Gold: the dimensional star schema

Serving: RDS for reports, Athena for exploration

Orchestration and observability

The Databricks tradeoff, honestly

Takeaway

Daniel Duong

Need this on your team?

More from Data Engineering

The Delta Lake performance gotchas nobody warns you about

CDC into a lakehouse: Change Data Feed, MERGE, and not reprocessing everything

Delta Lake schema enforcement and evolution: the guardrail and the trap

Ingesting every partner's file format without rewriting code each time

Data quality gates between Bronze and Silver: where bad data should die

Identity dedup in a medallion lakehouse: the nulls no one catches

Share Your Story

Contact Us

Email

WhatsApp

Office

WHY CHOOSE US

Tell us about your project

contacts