Wednesday, March 28, 2012

Lecture 17 & 18 Dimensional Modeling

Data cleansing and data profiling take around 80% of data analysis process, playing an extremely important role. The first thing we need to do is to look at the schema and try to understand what information the schema and attributes tell you. We can come up a graph to visualize the relationship among attributes. A few things we can check before analyzing data, are investigation of each attribute on each table, business rules that govern every attribute, such as referential/data integrity, combination of attributes, and carnality. Most common situations that we need to profile are data from different tables in different format because of merger and acquisition or different information system.

We will create system catalog in the data warehouse. Each database has a list of tables. Each table has a list of attributes. Further, Each attribute has a list of constraints. For constraints, it's a good practice to define a default value for each constraint. Otherwise, once you want to merge multiple tables with null value for the attributes associated with constraints, we will need to turn off the constraint to make data merge. The common case would be people forget to turn on the constraints, which might be used to follow data integrity, again, causing a mess of data.

Numerous vendors alleged that they provide MDM (Master Data Management) tools to help companies extract valuable information from the Big data in the information system. Things that companies usually not know is that MDM is largely a human efforts, that is, what we are looking for is not about tools but services. Profiling is a large part of the MDM, extracting, transforming, and loading data.

The core idea of dimensional modeling is star schema with a few factors, primarily including fact and dimensions. From the graph below from Laynetworks, Facts is like the heart of the whole data and all dimension tables connect to fact, so the referential relationship is one to many between dimension and fact. The size of dimension tables are not always fixed, depending on different types and characteristics. Only time dimension table has stable table size since it only relates to the granularity of fact. The maximum size of fact table is product table times customer table times time dimension table, which is the best case that every customer buys every product every time unit.

Source: http://www.laynetworks.com/Diamension-Modeling5.htm


No comments:

Post a Comment