Medallion Architecture: Best Practices for Managing Bronze, Silver, and Gold Levels

[Figure: Centralized layer model]

Many organizations use Medallion Architecture to organize data logically within a data lakehouse. This approach processes incoming data through distinct stages, often called "layers." The most common structure uses Bronze, Silver, and Gold tiers, which is where the name "Medallion Architecture" comes from.

While the 3-layer design remains popular, discussions regarding the scope, purpose, and best practices for each layer are ongoing. Additionally, a gap often exists between theoretical concepts and real-world implementation. Here's a breakdown of these data architecture layers with practical considerations:


Data Platform Strategy: A Foundation for Layering

The first and most important consideration for your data architecture's layering is how your data platform is used. A centralized, shared platform will likely be structured differently from a federated, multi-platform setup used by various domains.

Layering can also vary depending on platform alignment: source-system alignment or consuming-side alignment. Source-system-aligned platforms are generally easier to standardize in terms of layering and structure because data usage on the source side is more predictable.


Exploring Each Layer: Theory and Practice

Let's delve into each layer, outlining its goals and practical observations from real-world applications.

Landing Zone (Optional):
Purpose: A temporary storage location for data gathered from various sources before it's transferred to the Bronze layer. This layer becomes particularly relevant when extracting data from challenging source systems, such as those of external clients or SaaS vendors.

Implementation Considerations:
The design of a landing zone can vary significantly. It's often a simple Blob storage account, but in some cases, it might be integrated directly with data lake services like a container, bucket, or designated folder for data ingestion. The data housed here can be quite diverse, encompassing file formats like CSV, JSON, XML, Parquet, Delta, and more.

[Figure: Centralized layer model]

Bronze Layer: Raw and Un-validated Data

Purpose: A repository for storing data in its original, un-validated state, without requiring predefined schemas. Data arrives through full loads or delta loads.

Key characteristics include:

o    Maintains the raw state of the data source in the structure “as-is”.
o    Data is immutable (read-only).
o    Managed using interval partitioned tables, for example, using a YYYYMMDD or date-time folder structure.
o    Retains the full (unprocessed) history of each dataset in an efficient storage format, for example, Parquet or Delta.
o    For transactional data: This can be appended incrementally and grow over time.
o    Provides the ability to recreate any state of a given data system.
o    Can be any combination of streaming and batch transactions.
o    May include extra metadata, such as schema information, source file names, or recording the time data was processed.



Format Selection: A common question is whether to choose Delta or Parquet. While Delta offers faster speeds, its benefits are less pronounced if your data is already versioned or historized using a folder structure. In such cases, maintaining a transaction log or applying versioning isn't essential, so Parquet is perfectly acceptable for Bronze data. However, Delta can be chosen for consistency across layers.

Business User Access: It's uncommon for raw data in the Bronze layer to be directly used for business user queries or ad-hoc analyses due to its complexity and lack of inherent structure. Securing this layer with numerous small tables can also be challenging. Therefore, Bronze serves primarily as a staging layer and a source for other layers, accessed mainly by technical accounts.
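To make the Bronze pattern concrete, here is a minimal sketch of landing raw extracts as append-only Parquet, partitioned by load date and enriched with lineage metadata (source file name and processing timestamp). The paths, format, and column names are illustrative assumptions, not a prescribed layout.

```python
from datetime import datetime, timezone

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

# Illustrative paths; adjust to your lake's container and folder conventions.
landing_path = "/lake/landing/sales/orders/*.csv"
bronze_path = "/lake/bronze/sales/orders"

load_date = datetime.now(timezone.utc).strftime("%Y%m%d")

raw_df = (
    spark.read
    .option("header", "true")                            # keep data "as-is"; no schema enforcement
    .csv(landing_path)
    .withColumn("_source_file", F.input_file_name())     # lineage metadata
    .withColumn("_processed_at", F.current_timestamp())  # processing timestamp
    .withColumn("_load_date", F.lit(load_date))
)

# Append-only write, partitioned by load date (YYYYMMDD folder structure).
(
    raw_df.write
    .mode("append")
    .partitionBy("_load_date")
    .parquet(bronze_path)
)
```

Because each run only appends a new load-date partition, earlier states of the source can be reconstructed by reading the relevant partitions, in line with the characteristics listed above.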

Beyond Parquet and Delta: Exploring Apache Iceberg

Parquet and Delta Lake are popular data storage options for Medallion Architecture. While these are excellent choices, Apache Iceberg is emerging as a strong contender.

Here's why you might consider it:

o    Scalability: Iceberg excels at handling massive datasets efficiently, making it ideal for large-scale data lakes.
o   ACID Compliance: It offers ACID (Atomicity, Consistency, Isolation, Durability) guarantees, ensuring data integrity in concurrent write operations.
o   Schema Evolution: Iceberg gracefully handles schema changes, allowing you to evolve your data model over time without compromising data integrity.
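As a hedged sketch of what adopting Iceberg could look like with Spark SQL, the snippet below creates an Iceberg table and then evolves its schema. It assumes a Spark session already configured with an Iceberg catalog named `lakehouse`; the table and column names are invented for illustration.

```python
from pyspark.sql import SparkSession

# Assumes the session is configured with an Iceberg catalog named "lakehouse"
# (e.g. spark.sql.catalog.lakehouse = org.apache.iceberg.spark.SparkCatalog).
spark = SparkSession.builder.appName("iceberg_demo").getOrCreate()

# Create an Iceberg table; writes are ACID and safe under concurrent writers.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.silver.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(18, 2),
        order_ts    TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))
""")

# Schema evolution: adding a column is a metadata-only change in Iceberg,
# so existing data files do not need to be rewritten.
spark.sql("ALTER TABLE lakehouse.silver.orders ADD COLUMN currency STRING")
```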



Silver Layer: Refined and Enriched Data

Purpose: Provides a refined structure over ingested data. It represents a validated, enriched version of the data that can be trusted for downstream operational and analytical workloads.

Key characteristics include:

o    It utilizes data quality rules for data validation and processing.
o    It typically contains only functional data, so technical or irrelevant data from Bronze is filtered out.
o    Historization is usually applied by merging all data. Data is processed using slowly changing dimensions (SCD), either type 2 or type 4. This involves adding columns such as start date, end date, and a current-record flag (see the sketch after this list).
o    Data is stored in an efficient storage format; preferably Delta, alternatively Parquet.
o    Uses versioning for rolling back processing errors.
o    Handles missing data and standardizes null or empty fields.
o    Data is usually enriched with reference and/or master data.
o    Data is often clustered around specific subject areas.
o   Data may still be source-system aligned and organized, meaning it has not been integrated with data from other domains yet.
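To make the historization bullet concrete, here is a minimal sketch of an SCD Type 2 merge using the Delta Lake Python API. The table paths, key columns, and the `start_date`/`end_date`/`is_current` column names are illustrative assumptions, and a complete implementation would also insert the new version of changed rows (typically via a staged union), which is omitted here for brevity.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver_scd2").getOrCreate()

silver = DeltaTable.forPath(spark, "/lake/silver/sales/customers")
updates = spark.read.parquet("/lake/bronze/sales/customers/_load_date=20240601")

(
    silver.alias("s")
    .merge(
        updates.alias("u"),
        "s.customer_id = u.customer_id AND s.is_current = true",
    )
    # Close off the current row when a tracked attribute changed.
    .whenMatchedUpdate(
        condition="s.customer_name <> u.customer_name",
        set={"is_current": "false", "end_date": "current_date()"},
    )
    # Insert rows for customers not seen before.
    .whenNotMatchedInsert(
        values={
            "customer_id": "u.customer_id",
            "customer_name": "u.customer_name",
            "start_date": "current_date()",
            "end_date": "null",
            "is_current": "true",
        }
    )
    .execute()
)
```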


Considerations for Silver as a Transient Layer:

Some propose Silver as a transient storage layer, allowing outdated data to be deleted or storage to be provisioned on the fly. The decision depends on your needs. If you don't plan to use data in its original context for operational reporting or analytics, then Silver can be temporary.

However, if you require historical data for operational reporting and analytics, establish Silver as a permanent layer.

Silver Layer Design Optimization:

Denormalized Data Model: When querying Silver data, consider using a denormalized data model. This approach eliminates the need for extensive joins and aligns better with the distributed columnar storage architecture. However, there's no need to abandon established data modeling practices like the 3rd normal form or Data Vault if they suit your needs.
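As an illustration of the denormalized approach, the sketch below resolves the reference-data joins once at write time and persists a wide Silver table, so downstream queries avoid repeated joins. The table paths and join keys are assumptions made for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver_denormalize").getOrCreate()

orders = spark.read.format("delta").load("/lake/silver/sales/orders")
customers = spark.read.format("delta").load("/lake/silver/sales/customers")
products = spark.read.format("delta").load("/lake/silver/sales/products")

# Resolve the joins once and persist a wide, query-friendly table;
# redundancy is accepted in exchange for simpler downstream queries.
orders_wide = (
    orders
    .join(customers, "customer_id", "left")
    .join(products, "product_id", "left")
)

(
    orders_wide.write
    .format("delta")
    .mode("overwrite")
    .save("/lake/silver/sales/orders_wide")
)
```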

Regarding historization, the Delta file format already versions Parquet files, and you have a historical record in your Bronze layer for reloading data if necessary. For automation and adaptability, a metadata-driven strategy is recommended for managing tables and notebooks. Data redundancy in a lake environment is less expensive than the computational processing and data joining required to eliminate it. Unless you're dealing with significant daily schema changes, the complexity of a data vault might not be necessary.
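A minimal sketch of the metadata-driven idea: table definitions live in a simple configuration structure, and one generic routine processes every entry. The configuration schema, paths, and the `process_table` helper are hypothetical; in practice this metadata often lives in a control table or configuration file rather than in code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata_driven").getOrCreate()

# Hypothetical control metadata; in practice this often lives in a control
# table, configuration file, or catalog rather than being hard-coded.
TABLES = [
    {"name": "customers", "bronze": "/lake/bronze/sales/customers",
     "silver": "/lake/silver/sales/customers", "keys": ["customer_id"]},
    {"name": "orders", "bronze": "/lake/bronze/sales/orders",
     "silver": "/lake/silver/sales/orders", "keys": ["order_id"]},
]

def process_table(cfg: dict) -> None:
    """Generic Bronze-to-Silver load driven purely by metadata."""
    df = spark.read.parquet(cfg["bronze"])
    deduped = df.dropDuplicates(cfg["keys"])   # simple data quality step
    deduped.write.format("delta").mode("overwrite").save(cfg["silver"])

for table_cfg in TABLES:
    process_table(table_cfg)
```

Adding a new table then becomes a configuration change rather than a new notebook, which is what makes this approach adaptable to schema and scope changes.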

Silver Layer Integration Considerations:

Data Integration: The question of whether to integrate data between applications and source systems in the Silver layer arises frequently. For easier management and isolation of concerns, it's generally recommended to keep things separate if possible. This suggests that if you're using Silver for operational reporting or analytics, avoid prematurely merging and integrating data from source systems. Doing so could lead to unnecessary connection points between applications. Users interested only in data from a single source might be linked to other systems if the data is first combined in a harmonized layer before being provided.

Consequently, these data users are more likely to experience potential impacts from other systems. If you're striving for an isolated design, the integration or combination of data from different sources should be moved up to the Gold layer. The same reasoning applies to aligning your Lakehouses with the source-system side of your architecture. If you plan to build data products and are particular about data ownership, advise engineers against prematurely cross-joining data from applications in other domains.

Data Enrichment: When considering enrichments like calculations in the Silver layer, your goal plays a key role. If operational reporting requires enrichments, you can start enriching data in Silver. This might lead to some additional alignment work when merging data later in the Gold stage. While this may require additional effort, the flexibility it offers can be worthwhile.

Gold Layer: Project-Specific, Integrated Data

Purpose: The Gold layer houses data structured in "project-specific" databases, readily available for consumption. This integration of data from various sources may result in a shift in data ownership. Key characteristics of the Gold layer include:

o    Gold tables represent data transformed for consumption or specific use cases.
o    Data is stored in an efficient storage format, preferably Delta.
o    Gold uses versioning for rolling back processing errors.
o  Historization is applied only for the relevant use cases or consumers. So, Gold can be a selection or aggregation of data found in Silver.
o    Complex business rules are applied in Gold. It utilizes post-processing activities, calculations, enrichments, and use-case-specific optimizations.
o    Data is highly governed and well-documented.
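To illustrate, a Gold table is often a use-case-specific selection or aggregation of Silver data with business rules applied. In the sketch below, the cancelled-order rule, the table paths, and the column names are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gold_build").getOrCreate()

orders = spark.read.format("delta").load("/lake/silver/sales/orders_wide")

# Apply a business rule (exclude cancelled orders) and aggregate
# into a consumption-ready, use-case-specific Gold table.
monthly_revenue = (
    orders
    .filter(F.col("status") != "CANCELLED")
    .withColumn("order_month", F.date_trunc("month", F.col("order_ts")))
    .groupBy("order_month", "country")
    .agg(F.sum("amount").alias("revenue"))
)

(
    monthly_revenue.write
    .format("delta")
    .mode("overwrite")
    .save("/lake/gold/sales/monthly_revenue")
)
```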

[Figure: Centralized layer model]

Gold Layer Complexity: The Gold layer is often the most intricate layer due to its design being dependent on the overall architecture's scope. In a basic setup where your Lakehouses are solely source-system aligned, data in the Gold layer represents "data product" data, making it general and user-friendly for distribution to numerous domains. Following this distribution, the data is expected to be housed in another platform, possibly another Lakehouse.

Gold Layer Variations Based on Architecture Scope

Depending on the breadth of your architecture, your Gold layer could be a combination of data product data and use-case-specific data. In such scenarios, the Gold layer caters to the exact needs of analytical consumers. The data is modeled with a shape and structure tailored specifically to the use case at hand. This approach fosters a principle in which domains work only on internal data that isn't directly distributed to other domains. Therefore, some tables may be flexible and adjustable, while other tables with a formal status are consumed by other entities.

For broader Lakehouse architectures encompassing both provider and consumer ends, anticipate additional layers, often referred to as workspace or presentation layers. In this configuration, the data in Gold is more universal, integrated, and ready for various use cases. These workspace or presentation layers then contain subsets of data, resembling traditional data modeling within data warehousing. Essentially, Gold serves as a universal integration layer from which data marts or subsets can be filled.

Gold Layer for Data Dissemination and Governance:

Certain organizations utilize workspace or presentation layers to disseminate data to other platforms or teams. They achieve this by selecting and/or pre-filtering data for specific scenarios. Some organizations implement tokenization, which involves substituting sensitive data with randomized strings. Others employ additional services for data anonymization. This data can be seen as behaving similarly to "data product" data.
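As a rough sketch of the tokenization idea, the snippet below replaces a sensitive column with a salted hash before the data is shared downstream. A real tokenization service would typically use a reversible, vault-based mapping rather than hashing, and the paths, column names, and salt handling here are purely illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tokenize_export").getOrCreate()

customers = spark.read.format("delta").load("/lake/gold/sales/customers")

# Replace the sensitive column with a salted SHA-256 digest so the exported
# dataset no longer carries the raw value. A dedicated tokenization service
# would keep a lookup vault to make tokens reversible when needed.
salt = F.lit("example-salt")  # illustrative only; manage real salts as secrets
tokenized = (
    customers
    .withColumn("email_token", F.sha2(F.concat(F.col("email"), salt), 256))
    .drop("email")
)

tokenized.write.format("delta").mode("overwrite").save("/lake/share/customers_tokenized")
```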

The Gold layer often stands out for its adherence to enterprise-wide standards. While enterprise data modeling can be complex and time-consuming, many organizations still utilize it to ensure data reusability and standardization. Here are some approaches to consider for achieving standardization in the Gold layer:

Enterprise Data Model: The enterprise data model provides an abstract framework for organizing data within teams' Lakehouses. The enterprise model typically outlines specific data elements and requires teams to follow specific reference values or include master identifiers when sharing datasets with other teams.

 

Centralized Data Harmonization: An alternative approach involves gathering, harmonizing, and distributing data in a separate Lakehouse architecture. From there, data can be consumed downstream by other domains, similar to a master data management approach.

Domain-Specific Ownership: Another option is to mandate teams to take responsibility for specific harmonized entities. Other teams must then conform by consistently using these entities.

In most cases, standardization and harmonization primarily occur within the Gold layer.

[Figure: Centralized layer model]

Technology Considerations for the Gold Layer

Choosing a database or service for the Gold layer is a complex decision. It requires careful consideration of various factors and involves making compromises. Beyond data structure, you'll need to assess your requirements in terms of:

-  Consistency: How strictly does your data need to be consistent across reads?
-  Availability: How crucial is it for your data to be constantly accessible?
-  Caching: Can caching strategies improve query performance?
-  Timeliness: How quickly do you need data updates to reflect in the Gold layer?
-  Indexing: How can indexing strategies optimize specific queries?


There is no single service that excels in all of these aspects simultaneously. A typical Lakehouse architecture often comprises a blend of diverse technology services, such as:

-  Serverless SQL services for ad-hoc queries.
-  Columnar stores for fast reporting.
-  Relational databases for more complex queries.
-  Time-series stores for IoT and stream analysis.


By understanding the theoretical foundations and practical considerations of each layer, you can effectively implement Medallion Architecture for your data lakehouse. Remember, the "best practices" can vary depending on your specific data platform strategy and the overall scope of your Lakehouse architecture. Choose the approaches and technologies that best align with your unique needs and goals for optimal data management.


Dot Labs is an IT outsourcing firm that offers a range of services, including software development, quality assurance, and data analytics. With a team of skilled professionals, Dot Labs offers nearshoring services to companies in North America, providing cost savings while ensuring effective communication and collaboration.

Visit our website: www.dotlabs.ai, for more information on how Dot Labs can help your business with its IT outsourcing needs.

