Data Warehouse Series - Implementation of Data Warehouse Layering#
Content organized from:
Significance of Data Warehouse Layering#
- Clear data structure: each layer has a well-defined scope, making tables easier to locate and understand when used.
- Data lineage tracking: the business table we ultimately expose can be used directly, but it is built from many sources; when a source table has a problem, we want to locate it quickly and accurately and understand its downstream impact.
- Reduced redundant development: standardizing the layers and building common intermediate-layer data greatly reduces repeated computation.
- Simplifying complex problems: breaking a complex task into multiple steps, where each layer handles a single step, makes the work simpler and easier to understand. It also helps maintain data accuracy; when a data issue arises, we only fix the problematic step instead of rebuilding all the data.
- Shielding anomalies in the raw data.
- Shielding business changes, so that a change on the business side does not force a full re-integration of the data.
Four-Layer Model#
ODS Layer#
ODS (Operational Data Store)
This is the layer closest to the data source. Data from the source is extracted, cleaned, and transferred here; in other words, it lands in this layer after ETL. Generally, ODS data keeps the same structure as the source system, mainly to simplify subsequent processing. In terms of granularity, the ODS layer is the finest-grained.
Tables in the ODS layer typically include two types:
- One for storing the current data that needs to be loaded.
- One for storing processed historical data.
Historical data is generally retained for 3-6 months before being cleared to save space. This varies by project, however: if the source system's data volume is small, history can be kept longer, or even in full.
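The retention policy above can be sketched as a small partition-pruning helper. This is a minimal illustration, assuming date-partitioned ODS history tables; the 180-day window and the function name are invented for the example:

```python
from datetime import date, timedelta

# Illustrative retention window for ODS history (the 3-6 month policy);
# tune per project and data volume.
RETENTION_DAYS = 180

def partitions_to_drop(partitions, today, retention_days=RETENTION_DAYS):
    """Return the partition dates that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in partitions if p < cutoff]

parts = [date(2023, 1, 1), date(2023, 6, 1), date(2023, 8, 1)]
print(partitions_to_drop(parts, today=date(2023, 9, 1)))
# → [datetime.date(2023, 1, 1)]
```

In a real warehouse the returned list would drive `ALTER TABLE ... DROP PARTITION` statements (or a backup job) rather than a print.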
Data needs to undergo the following processes before being loaded into this layer:
- Denoising (e.g., removing bank card records that deviate significantly from normal levels).
- Deduplication (e.g., a person's name appears in both bank account data and public-security population data, but only one copy should be kept).
- Data cleansing (e.g., if a person's bank card is used fraudulently, with two transactions ten minutes apart in China and Japan, that is dirty data).
- Business extraction.
- Unit standardization.
- Field reduction (e.g., fields that support front-end system operation but are not needed for data mining).
- Business discrimination.
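Several of the steps above (deduplication, denoising, field reduction) can be sketched together in one small pass over the raw records. This is a minimal illustration, not a fixed schema: the record layout, `max_amount` threshold, and function name are all assumptions:

```python
def clean_records(records, keep_fields, max_amount):
    """Sketch of ODS pre-load cleaning: dedup, denoise, reduce fields."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen:           # deduplication: keep one copy per key
            continue
        if rec["amount"] > max_amount:  # denoising: drop extreme outliers
            continue
        seen.add(rec["id"])
        # field reduction: keep only the columns downstream jobs need
        cleaned.append({k: rec[k] for k in keep_fields})
    return cleaned

raw = [
    {"id": 1, "amount": 100, "ui_state": "open"},
    {"id": 1, "amount": 100, "ui_state": "open"},    # duplicate row
    {"id": 2, "amount": 10**9, "ui_state": "open"},  # implausible outlier
]
print(clean_records(raw, keep_fields=("id", "amount"), max_amount=10**6))
# → [{'id': 1, 'amount': 100}]
```

In practice each step would be its own stage in the ETL pipeline (often SQL or Spark jobs) so that failures can be located per step, as the layering rationale above suggests.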
DW#
Data Warehouse Layer (DW), the main body of the data warehouse.
Typically, various data models are established based on themes, such as star or snowflake structures.
For methods related to dimensional modeling and star, snowflake models, please refer to: A Brief Discussion on Data Modeling Methods in Data Warehouse Construction
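As a toy illustration of the star structure mentioned above, the sketch below joins a fact table to a single dimension table. The table and field names are invented for the example:

```python
# Dimension table: product_id -> descriptive attributes.
dim_product = {
    "p1": {"category": "books"},
    "p2": {"category": "toys"},
}

# Fact table: one row per sale, keyed to the dimension by product_id.
fact_sales = [
    {"product_id": "p1", "amount": 30},
    {"product_id": "p2", "amount": 20},
    {"product_id": "p1", "amount": 50},
]

# Joining fact rows to the dimension resolves each sale's category,
# which is exactly what a star-schema query does.
joined = [
    {"category": dim_product[f["product_id"]]["category"], "amount": f["amount"]}
    for f in fact_sales
]
print(joined)
```

A snowflake structure would further normalize `dim_product` into sub-dimensions (e.g., a separate category table), trading join cost for less redundancy.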
DM Layer#
Data Mart Layer (DM), also known as DWS (Data Warehouse Service) or Theme Layer, stores lightly aggregated data.
Typically divided according to business needs into traffic, orders, users, etc., generating wide tables with many fields for subsequent business queries, OLAP analysis, data distribution, etc.
In terms of granularity, this layer holds lightly summarized data; no detail rows remain. In terms of time span, it usually covers only part of the DW layer's history, mainly to satisfy analysis needs: from an analysis perspective, users typically only examine recent data (e.g., the last three years). In terms of breadth, it still covers all business data.
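The "light aggregation" into a per-subject wide table can be sketched as follows. This assumes order-level detail rows rolled up by user; the field names are illustrative:

```python
from collections import defaultdict

def build_user_wide_table(orders):
    """Roll order-level detail up into a per-user summary (DM/DWS style)."""
    agg = defaultdict(lambda: {"order_cnt": 0, "total_amount": 0})
    for o in orders:
        row = agg[o["user_id"]]
        row["order_cnt"] += 1            # detail rows are summarized away
        row["total_amount"] += o["amount"]
    return dict(agg)

orders = [
    {"user_id": "u1", "amount": 10},
    {"user_id": "u1", "amount": 5},
    {"user_id": "u2", "amount": 7},
]
print(build_user_wide_table(orders))
# → {'u1': {'order_cnt': 2, 'total_amount': 15}, 'u2': {'order_cnt': 1, 'total_amount': 7}}
```

A real DM wide table would carry many more columns (traffic, orders, user attributes side by side) so that downstream queries and OLAP analysis avoid re-joining the detail data.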
APP Layer#
Data Product Layer (APP), this layer provides result data for data products.
In terms of data granularity, it consists of highly summarized data. In terms of breadth, it does not necessarily cover all business data. In extreme cases, a model can be built for each report at the APP layer to support it, achieving the goal of exchanging space for time.
The application layer is driven by business needs, computed from the statistics of the three layers below it, and can either serve query displays directly or be exported into MySQL for use. The report data we often refer to, and those big wide tables, generally live here.
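The per-report, space-for-time idea can be sketched as pre-computing one result row per report from a DM summary. This is a hypothetical illustration; the input shape and field names follow the earlier per-user wide-table example and are not a fixed design:

```python
# Input: the kind of per-user summary the DM layer would hold.
dm_user_summary = {
    "u1": {"order_cnt": 2, "total_amount": 15},
    "u2": {"order_cnt": 1, "total_amount": 7},
}

# APP-layer result table for one report: fully pre-aggregated, so the
# report queries a single row instead of recomputing (space for time).
report_row = {
    "active_users": len(dm_user_summary),
    "revenue": sum(r["total_amount"] for r in dm_user_summary.values()),
}
print(report_row)
# → {'active_users': 2, 'revenue': 22}
```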
A metadata management system is also needed, one that provides convenient metadata maintenance and query operations. This layer mainly serves data products and data analysis: the data is typically stored in systems such as ES or MySQL for online use, and may also live in Hive or Druid for data analysis and data mining.
Three-Layer Model#
Similar to the four-layer model, but with more logic placed in the DW layer.
Buffer Data Cache Layer#
This layer stores the raw data provided by the interface party, with table structures kept basically consistent with the source data. Retention time depends on data volume and project circumstances; if the volume is large, only recent data may be kept online while historical data is backed up elsewhere. The purpose of this layer is data staging and backup.