Yige

Yige

Build

Data Warehouse Series - Data Modeling Methods

Data Warehouse Series - Data Modeling Methods#

Content organized from:

  1. Data Modeling Methods in Data Warehousing
  2. Methods for Building Data Warehouses

Basic Concepts of Data Models#

A data model is a tool and method for abstractly describing the real world. It represents the interrelationships of affairs in the real world through abstract entities and the relationships between them. Here, the data model abstracts the relationships between entities, expressing specific business relationships in actual business through the definition and description of these relationships, divided into the following levels:
image.png

From the diagram above, we can easily see that in the entire data warehouse modeling process, we generally go through four processes:

  • Business modeling, generating business models, mainly solving the decomposition and programming at the business level.

  • Domain modeling, generating domain models, mainly abstracting the business model to create a domain conceptual model.

  • Logical modeling, generating logical models, mainly logicalizing the conceptual entities of the domain model and the relationships between entities at the database level.

  • Physical modeling, generating physical models, mainly addressing specific technical issues such as the physical realization of the logical model for different relational databases and performance.

Normalization Modeling#

In the design of data warehouse models, the third normal form is generally adopted, which has a strict mathematical definition. From its expressed meaning, a relation that conforms to the third normal form must meet the following three conditions:

  • Each attribute value is unique and unambiguous;
  • Each non-primary attribute must fully depend on the entire primary key, not just part of it;
  • Each non-primary attribute cannot depend on attributes in other relations, as such attributes should belong to other relations.

Dimensional Modeling#

The simplest description is to build a data warehouse and data mart according to fact tables and dimension tables, such as Star Schema, Snowflake Model, etc.

Star Schema#

The architecture in the above diagram is a typical star schema. The star schema is a denormalized structure where each dimension of the multidimensional dataset is directly connected to the fact table, with no gradual dimensions, resulting in some data redundancy. For example, in the geographic dimension table, there are two records for country A, province B, city C and country A, province B, city D, meaning that the information for country A and province B is stored twice, resulting in redundancy.

Characteristics

  • There is only one fact table, and each dimension has a separate table.

  • Each tuple in the fact table is a foreign key pointing to the primary key of the dimension table.

  • The columns of each dimension table are all attributes that make up that dimension.

  • The fact table and dimension tables are associated through primary and foreign keys, with no associations between dimension tables, similar to many stars revolving around a central star, hence the name star schema.

Advantages

  • The star schema is the simplest and most commonly used model.

  • Since the star schema has only one large table, it is more suitable for big data processing compared to other models.

  • Other models can be transformed into a star schema through certain conversions.

Snowflake Schema#

When one or more dimension tables are not directly connected to the fact table but are connected through other dimension tables, the diagram resembles multiple snowflakes connected together, hence the name snowflake schema. The snowflake schema further hierarchizes the dimension tables of the star schema, where the original dimension tables may be expanded into smaller fact tables, forming some local "hierarchical" areas, and these decomposed tables are connected to the main dimension table rather than the fact table.
image.png

As shown in the diagram above, the geographic dimension table is further decomposed into country, province, city, etc. Its advantage is: by minimizing data storage and combining smaller dimension tables, it improves query performance and eliminates data redundancy.

Data warehouses are often more suitable for building underlying Hive tables using the star schema, enhancing query efficiency through significant redundancy. The star schema is quite friendly to OLAP analysis engines, which is particularly evident in Kylin. The snowflake schema is very common in relational databases like MySQL and Oracle, especially in e-commerce database tables.

Fact Constellation Schema#

The constellation schema is a more complex model that includes multiple fact tables, with dimension tables being shared among them. This pattern can be seen as a collection of star schemas, hence it is called galaxy schema or fact constellation.
image.png

Summary#

Advantages of Dimensional Modeling:
Dimensional modeling is very intuitive, closely centered around the business model, and can intuitively reflect business problems within the business model. It does not require special abstract processing to complete dimensional modeling.

Disadvantages of Dimensional Modeling:

  • A large amount of data preprocessing is required before constructing the star schema, leading to significant data processing work.

  • When business changes necessitate redefining dimensions, it often requires reprocessing of dimension data, which can lead to substantial data redundancy during these processing stages.

  • Relying solely on dimensional modeling cannot guarantee the consistency and accuracy of data sources, and at the underlying level of the data warehouse, it is not particularly suitable for dimensional modeling methods.

The field of dimensional modeling is mainly applicable to the data mart layer, and its greatest role is actually to address performance issues in data warehouse modeling. Dimensional modeling struggles to provide a complete abstract method for describing the complex relationships between real business entities.

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.