Monday, March 19, 2012

Lecture 15 & 16 BI infrastructure


OLAP is Online Analytical Processing, which most BI infrastructure uses to look at trends such as purchasing patterns by running aggregate queries over large amounts of data and analyzing the results.
OLTP is Online Transaction Processing, which operational databases use for day-to-day work. Each query handles a single transaction and touches only a small amount of data.
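
To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns, and the sample rows are made up for illustration and are not from the lecture. The OLTP-style query fetches one order by its key, while the OLAP-style query aggregates over every row to show a monthly trend.

```python
import sqlite3


def build_sales_db():
    """Create an in-memory database with a small, made-up sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "order_id INTEGER PRIMARY KEY, month TEXT, product TEXT, amount REAL)"
    )
    rows = [
        (1, "2012-01", "widget", 10.0),
        (2, "2012-01", "gadget", 25.0),
        (3, "2012-02", "widget", 15.0),
        (4, "2012-02", "widget", 5.0),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn


def oltp_lookup(conn, order_id):
    """OLTP style: read a single order by its primary key."""
    return conn.execute(
        "SELECT order_id, month, product, amount FROM sales WHERE order_id = ?",
        (order_id,),
    ).fetchone()


def olap_monthly_totals(conn):
    """OLAP style: aggregate across all rows to see the trend per month."""
    return conn.execute(
        "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
    ).fetchall()


if __name__ == "__main__":
    conn = build_sales_db()
    print(oltp_lookup(conn, 3))
    print(olap_monthly_totals(conn))
```

In a real setup the two queries would usually run against different systems: the lookup against the operational database, the aggregate against the warehouse or a data mart.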

There are two primary approaches to BI infrastructure. The first is the data warehouse, which was first brought up by Bill Inmon. The concept is to start from one big enterprise-wide system and eventually divide it into small departmental data marts. All supply chain systems are pulled into the data warehouse. The other is the data mart approach, which was proposed by Ralph Kimball. The idea is to start from small data marts that come together into a big enterprise system. The two actually produce the same result, just by different routes. Nowadays, Kimball's approach is more common.


Big data is generated every day, but dirty data is everywhere, so data profiling becomes a very important step in making good decisions. The fundamental assumption of data profiling is that the data is not in the form you want to use, so we need to profile it and extract the useful information with a data quality analysis tool. There are a lot of methods for cleansing data. The regular process is ETL (Extraction, Transformation, Loading). The common problem with data cleansing is latency. Nowadays, people try to shorten the latency to a few hours to get close to real time, but that still isn't the real "real-time". Furthermore, the closer to real time the model needs to be, the higher the price you need to pay, so the capability may be there but it is too expensive.
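
As a rough illustration of profiling followed by ETL, here is a minimal sketch in plain Python. Everything in it is an assumption for the example: the `customer` and `amount` fields, the sample rows, and the cleansing rules (trim whitespace, lower-case names, drop rows with a missing value). Real data quality tools do far more than this.

```python
def profile(rows, columns):
    """Profile each column: count missing values and distinct normalized values."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [str(v).strip().lower() for v in values
                   if v is not None and str(v).strip() != ""]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def extract():
    """Extraction: stand-in for reading rows from a source system (made-up data)."""
    return [
        {"customer": " Alice ", "amount": "10.50"},
        {"customer": "BOB", "amount": ""},
        {"customer": "alice", "amount": "4.50"},
        {"customer": None, "amount": "7"},
    ]


def transform(rows):
    """Transformation: normalize names, parse amounts, drop incomplete rows."""
    clean = []
    for row in rows:
        name = (row.get("customer") or "").strip().lower()
        amount = (row.get("amount") or "").strip()
        if not name or not amount:
            continue  # dirty row: a required field is missing
        clean.append({"customer": name, "amount": float(amount)})
    return clean


def load(rows, warehouse):
    """Loading: append the cleaned rows to the target (a list stands in here)."""
    warehouse.extend(rows)
    return len(rows)


if __name__ == "__main__":
    source = extract()
    print(profile(source, ["customer", "amount"]))
    warehouse = []
    print(load(transform(source), warehouse), "rows loaded")
```

The profile step runs before any cleansing on purpose: it tells you how dirty the source is, which is what decides which transformation rules are worth writing.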

Some companies create an ODS (operational data store) between the source systems and the data warehouse, and do the profiling and cleansing there by reorganizing the data to fit the form of the data in the data warehouse.
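
A minimal sketch of that reorganizing step, assuming a hypothetical source system whose field names and date format differ from the warehouse's. `FIELD_MAP`, the field names, and the US-style date format are all invented for the example.

```python
from datetime import datetime

# Hypothetical mapping from source field names to warehouse column names.
FIELD_MAP = {"cust_nm": "customer", "ord_dt": "order_date", "amt": "amount"}


def stage_in_ods(source_rows):
    """Reshape source rows into the warehouse's form before they are loaded."""
    staged = []
    for row in source_rows:
        # Keep only the mapped fields and rename them to warehouse columns.
        renamed = {FIELD_MAP[k]: v for k, v in row.items() if k in FIELD_MAP}
        # Assumed source date format MM/DD/YYYY, converted to ISO 8601.
        renamed["order_date"] = (
            datetime.strptime(renamed["order_date"], "%m/%d/%Y").date().isoformat()
        )
        renamed["amount"] = float(renamed["amount"])
        staged.append(renamed)
    return staged


if __name__ == "__main__":
    rows = [{"cust_nm": "alice", "ord_dt": "03/19/2012", "amt": "12.5", "extra": "x"}]
    print(stage_in_ods(rows))
```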
