Data Warehouse Series - Basic Concept Organization

Data Warehouse Series - Basic Concept Organization#

Content organized from:

Basic Concepts of Data Warehouse

I. Data Warehouse and Data Mart#

Data Warehouse (Data Warehouse)#

A Data Warehouse is a theory of information system data storage that emphasizes using certain special data storage methods to make the data contained particularly beneficial for analysis and processing, thereby generating valuable information for decision-making.
A data warehouse is a subject-oriented, integrated, non-updatable (stable), time-variant (different times) data collection used to support decision support systems (DDS: Decision Support System, mainly report systems) in business management.
The data stored in a data warehouse must include time attributes, having the characteristic that once stored, it will not change over time. However, the data in the warehouse may continuously grow and change over time.

Data Mart (Data Market)#

A small-scale data warehouse at the department or workgroup level.

Comparison#

	Data Warehouse	Data Mart
Data Source	Legacy systems, OLTP systems, external data	Data Warehouse
Scope	Enterprise-level	Departmental or workgroup level
Subject	Enterprise subject	Departmental or specific analysis subject
Data Granularity	Finest granularity	Coarser granularity
Data Structure	Normalized structure (third normal form)	Star schema, snowflake schema, or a mix of both
Historical Data	Large amounts of historical data	Moderate amounts of historical data
Optimization	Handling massive data, data exploration	Easy access and analysis, fast queries
Indexing	Highly indexed	Highly indexed

II. Dimensions and Measures#

Dimension:
Refers to the perspective from which data is viewed; it is usually an attribute of the data record, such as time, location, etc. Generally, it is a set of discrete values, such as each independent date in the time dimension or each independent product in the product dimension. Therefore, during statistics, records with the same dimension value can be aggregated together, and then aggregation functions can be applied for summation, averaging, deduplication counting, and other aggregation calculations.

Cardinality of Dimension:
Refers to the number of different values that the dimension appears in the dataset. For example, "Country" is a dimension; if there are 200 different values, then the cardinality of this dimension is 200. Typically, the cardinality of a dimension ranges from dozens to tens of thousands, with some dimensions like "User ID" exceeding millions or even tens of millions. Dimensions with cardinality exceeding one million are usually referred to as Ultra High Cardinality Dimensions (UHC).

Measure:
A calculated value based on the data; it is generally a continuous value, such as total sales, number of different users, etc.

For example:
In an SQL query, the attributes in the GroupBy clause are usually dimensions, while the calculated values are measures. For instance:

SELECT
department,
site_id,
sum(price) AS total_sales,
count(DISTINCT seller_id) AS sellers
FROM kylin_sales
GROUP BY department, site_id

In the above query, department and site_id are dimensions, while sum(price) and count(distinct seller_id) are measures.

III. Basic Concepts of Cube#

Given a data model, we can combine all dimensions on it. For N dimensions, there are a total of 2^n combinations.

For a combination of dimensions, aggregate the measures and save the result as a materialized view, called a Cuboid.
The Cuboids of all dimension combinations as a whole are referred to as a Cube. In simple terms, a Cube is a collection of many materialized views aggregated by dimensions.
A Cube Segment refers to the Cube data calculated for a specific segment of the source data. Typically, the amount of data in a data warehouse grows over time, and Cube Segments are also constructed in chronological order.

IV. Hierarchy, Level, and Member#

A Cube is like a coordinate system, where each Hierarchy represents a coordinate axis, and each Member in the Hierarchy represents a value on that axis. The following diagram illustrates the internal structure of a Dimension using the time dimension as an example:

Dimensions are used to describe the characteristics of things from different angles, and generally, dimensions will have multiple levels, each Level containing some common or unique attributes (Attribute), which can be illustrated in the following diagram:

Taking the time dimension as an example, the time dimension generally includes several Levels such as year, quarter, month, and day. Each Level typically has common attributes like ID, NAME, and DESCRIPTION. These common attributes apply not only to the time dimension but also to various other types of dimensions. The ID is generally considered a surrogate primary key (Agent), used solely as a uniqueness marker and serves as a proxy for relationships in the multidimensional model, having no business significance; NAME is generally the business primary key (Business), which restricts uniqueness at the business level and is often used as a join key during data loading; DESCRIPTION records detailed descriptive information, and we often choose to use DESCRIPTION to express specific meanings during multidimensional display and analysis.

The structure of the above dimension is not directly applicable to OLAP; OLAP requires hierarchical top-down drilling or bottom-up aggregation. Therefore, each dimension must have a Hierarchy, with at least one default, though multiple can exist, as shown in the diagram below:

With a Hierarchy, the Levels within the dimension have a top-down tree structure relationship, meaning each upper-level Member contains zero or more lower-level Members, which are the branches of the tree. It is important to note that the root node of each Hierarchy tree is generally set to the summary of all Members (Total). When the dimension is not used in OLAP, the default display is the summary node of that dimension, which is the aggregation of all data in that dimension (or the dimension is not used for segmentation). Each layer in the Hierarchy contains several Members (Member). For example, if we create a time dimension spanning from 2006 to 2015, the highest-level node would have only one Total member, encompassing all ten years, while the year level would include ten members: 2006, 2007…2015. Each year would then contain four quarter members, and each quarter would contain three month members… This seems logical, and we can perform some OLAP operations based on the Hierarchy.

V. Data Cube#

A Data Cube is a technique commonly used for data analysis and indexing; it can establish multidimensional indexes on raw data, and analyzing data through the Cube can significantly speed up query efficiency.

The Data Cube is merely a figurative term for a multidimensional model. The cube itself is three-dimensional, but the multidimensional model is not limited to three dimensions and can combine more dimensions. On one hand, this is for easier explanation and description, and on the other hand, it provides space for mental imaging and imagination; furthermore, it distinguishes itself from traditional relational databases' two-dimensional tables.

Basic Operations#

Drill-down: Changes between different levels of the dimension, moving from a higher level to a lower level, or breaking down summary data into more detailed data. For example, drilling down on total sales data for the second quarter of 2010 to view consumption data for April, May, and June of 2010, as shown in the image; of course, one can also drill down to view sales data for cities like Hangzhou, Wenzhou, etc., in Zhejiang Province.
Roll-up: The reverse operation of drilling down, aggregating from fine-grained data to higher levels, such as summarizing sales data for Jiangsu Province, Shanghai City, and Zhejiang Province to view sales data for the Jiangsu-Zhejiang-Shanghai region, as shown in the image.
Slice: Selecting specific values in the dimension for analysis, such as only selecting sales data for electronic products or data for the second quarter of 2010.
Dice: Selecting specific ranges of data or certain specific values in the dimension for analysis, such as selecting sales data from the first quarter of 2010 to the second quarter of 2010, or sales data for electronic products and daily necessities.
Pivot: Swapping the positions of dimensions, similar to transposing rows and columns in a two-dimensional table, as shown in the image by rotating the product dimension and geographic dimension.

VI. Fact Table and Dimension Table#

Fact Table (FactTable)#

A Fact Table is a table that stores factual records, containing specific elements of each event and what happened, such as system logs, sales records, etc.
The Fact Table does not store actual content; it is a collection of primary keys, with these IDs corresponding to a record in the Dimension Table.
Records in the Fact Table are continuously growing dynamically, so its size is usually much larger than other tables.
Operational events occurring in the real world produce measurable values stored in the Fact Table. From the lowest granularity level, each row in the Fact Table corresponds to a measure event, and vice versa.

Dimension Table (DimensionTable)#

A Dimension Table, sometimes referred to as a Lookup Table, is descriptive information about the elements of events in the Fact Table.
It stores the attribute values of dimensions and can be associated with the Fact Table; it essentially extracts and standardizes frequently repeated attributes from the Fact Table into a single table for management.
Common Dimension Tables include: date tables (storing attributes corresponding to weeks, months, quarters, etc.), location tables (including attributes like country, province/state, city, etc.).
Each Dimension Table contains a single primary key column. The primary key of the Dimension Table can serve as a foreign key for any associated Fact Table, and the descriptive environment of the Dimension Table rows should completely correspond to the Fact Table rows. Dimension Tables are typically wide, flat non-normalized tables containing a large number of low-granularity text attributes.

VII. OLAP and OLTP#

OLAP (Online Analytical Processing)#

Online Analytical Processing is a collection of various operations aimed at analysis based on a multidimensional model of a data warehouse, capable of flexibly providing Roll-up, Drill-down, and Pivot analysis operations. It is a method for presenting integrated decision-making information, commonly used in decision support systems, business intelligence, or data warehouses.
Its function is to facilitate large-scale data analysis and statistical calculations, providing references and support for decision-making.
OLAP requires a large amount of historical data as a foundation, combined with time point differences, to perform complex analyses on multidimensional and aggregated information.
OLAP requires users to have subjective definitions of information needs, thus achieving better system efficiency.

The concept of OLAP has both broad and narrow interpretations in practical applications. The broad understanding aligns with its literal meaning, referring to any analysis processing that does not update data. However, more often, OLAP is understood in its narrow sense, specifically related to multidimensional analysis based on Cube calculations.

Types#

MOLAP (Multidimensional), Multidimensional OLAP#

MOLAP physically stores the multidimensional data used for OLAP analysis as multidimensional arrays, forming a "cube" structure. The attribute values of dimensions are mapped as subscript values or ranges of subscripts in the multidimensional array, while summary data is stored as values in the cells of the array.
This structure, once highly optimized, can maximize query performance.
Due to the new storage structure implemented at the physical level, it is also referred to as Physical OLAP; ROLAP primarily achieves this through software tools or middleware, still using relational database storage structures, thus called Virtual OLAP.
The advantage of MOLAP lies in its high data operation efficiency due to multidimensional preprocessing, while its main drawback is that data updates have a certain delay.

ROLAP (Relational), Relational OLAP#

ROLAP represents OLAP implementations based on relational databases. It centers around relational databases and uses relational structures to represent and store multidimensional data. ROLAP divides the multidimensional structure of the multidimensional database into two types of tables: one is the Fact Table, used to store data and dimension keywords; the other is the Dimension Table, which uses at least one table for each dimension to store descriptions of the dimension's hierarchy, member categories, etc.
The Dimension Table and Fact Table are linked through primary and foreign keys, forming a "star schema".
For dimensions with complex hierarchies, multiple tables can be used to describe them to avoid excessive storage space occupied by redundant data. This extension of the star schema is called a "snowflake schema".
The RDBMS used as a ROLAP storage system also undergoes corresponding optimizations for OLAP, such as parallel storage, parallel queries, parallel data management, cost-based query optimization, bitmap indexing, and SQL OLAP extensions (cube, rollup), etc.

HOLAP (Hybrid) Hybrid OLAP#

Represents OLAP implementations based on hybrid data organization (Hybrid OLAP), allowing users to choose which models to adopt ROLAP and which to adopt MOLAP based on their business needs.

Comparative Analysis#

Generally, infrequent or flexibly defined analyses are performed using ROLAP, while commonly used, standardized models adopt MOLAP. For example, detailed data is retained in the relational database's Fact Table, while aggregated data is stored in the Cube. Aggregation in ROLAP takes more time than in MOLAP, but query efficiency is higher than ROLAP but lower than MOLAP.

OLTP (Online Transaction Processing)#

Characterized by small data volumes, frequent DML operations, and many parallel transactions, but generally very short.
Online Transaction Processing focuses more on basic, daily transaction processing, including data insertion, deletion, modification, and querying.

Differences Between OLAP and OLTP#

Similar to the distinction between databases and data warehouses:

The design goal of database systems is transaction processing. Database systems are designed for recording updates and transaction processing, with data access characteristics based on primary keys, numerous atomic, isolated small transactions, concurrency, and recoverability being key attributes. Maximum transaction throughput is a critical metric, so database designs reflect these needs.
The design goal of data warehouses is decision support. Historical, summarized, and aggregated data is much more important than raw records. Query loads mainly focus on ad-hoc queries and complex queries involving joins, aggregations, etc. Compared to database systems, query throughput and response time are much more important than transaction processing throughput.

Data Processing Type	OLTP	OLAP
Target Audience	Business developers	Analytical decision-makers
Functionality	Daily transaction processing	Analysis-oriented decision-making
Data Model	Relational model	Multidimensional model
Data Volume	Few or dozens of records	Millions to tens of millions of records
Operation Types	Query, insert, update, delete	Primarily querying

VIII. Metadata#

Metadata is data about data. When people describe phenomena in the real world, abstract information is generated, which can be considered metadata. Metadata primarily describes the contextual information of data. In simple terms, if the content of each book in a library is data, then finding the index for each book is metadata. Metadata can be divided into three categories based on the objects they describe: Technical Metadata, Business Metadata, and Management Metadata:

Technical Metadata: Technical metadata describes concepts, relationships, and rules related to the technical field in data systems, mainly including descriptions of data structure and data processing characteristics, covering all data processing links such as data source interfaces, data warehouse and data mart storage, ETL, OLAP, data encapsulation, and front-end presentation.
Business Metadata: Business metadata describes concepts, relationships, and rules related to the business field in data systems, mainly including business terminology, information classification, metric definitions, and business rules.
Management Metadata: Management metadata describes concepts, relationships, and rules related to the management field in data systems, mainly including personnel roles, job responsibilities, and management processes.

For specific reference, see: Let's Talk About Metadata Management Systems in Data Warehouses