AYFG is a gym with several branch nationwide. Provided ER diagram, we are going to develop a dimensional model step by step to answer the business questions and address business problems.
The first idea is star schema. Figure 1 below is the conceptual schema of the business. We can see there are several primary entities, including members, memberships, salesinvoices, merchandises, etc. An interesting thing we noticed is that there is only merchandises sales has quantity attribute. The reason of this is based on the assumption that sales quantities of other products, such as memberships and oneday pass, are defined as one for every record.
Wednesday, March 28, 2012
Lecture 17 & 18 Dimensional Modeling
Data cleansing and data profiling take around 80% of data analysis process, playing an extremely important role. The first thing we need to do is to look at the schema and try to understand what information the schema and attributes tell you. We can come up a graph to visualize the relationship among attributes. A few things we can check before analyzing data, are investigation of each attribute on each table, business rules that govern every attribute, such as referential/data integrity, combination of attributes, and carnality. Most common situations that we need to profile are data from different tables in different format because of merger and acquisition or different information system.
We will create system catalog in the data warehouse. Each database has a list of tables. Each table has a list of attributes. Further, Each attribute has a list of constraints. For constraints, it's a good practice to define a default value for each constraint. Otherwise, once you want to merge multiple tables with null value for the attributes associated with constraints, we will need to turn off the constraint to make data merge. The common case would be people forget to turn on the constraints, which might be used to follow data integrity, again, causing a mess of data.
We will create system catalog in the data warehouse. Each database has a list of tables. Each table has a list of attributes. Further, Each attribute has a list of constraints. For constraints, it's a good practice to define a default value for each constraint. Otherwise, once you want to merge multiple tables with null value for the attributes associated with constraints, we will need to turn off the constraint to make data merge. The common case would be people forget to turn on the constraints, which might be used to follow data integrity, again, causing a mess of data.
Monday, March 19, 2012
Lecture 15 & 16 BI infrastructure
OLAP is Online analytical processing, which is used in most BI infrastructure to look at the trend of purchase, aggregating query data/large amount of data and analyzing them.
OLTP is Online Transaction Processing, which is regularly used in operational database, analyzing single query data and small amount data.
There are two primary BI infrastructure. First is data warehouses, which was first brought up by Bill Inmon. The concept is starting from a big enterprise system, eventually divide into small division data marts. All supply chain systems are pulled into data warehouse. The other one is data marts, which was proposed by Ralph Kimball. The idea is starting from small data marts and come together to a big enterprise system. The two actually generate the same result, but different approaches. In the modern age, Kimball is more common.
Sunday, March 4, 2012
Lecture 13 & 14 Network Analysis
Before Analyzing network, we need to have to know basic structural properties. Centrality measures are used to examine how close the relationship in the network. In the centrality, we use different measures to how the network connection.
- Degree Centrality: counting the number of link, i.e., how many people can a particular person like directly. In- and out- link are counted as the same in the degree centrality. If every node in a network has both in- and out-link, we call that fully connected network. Clique means that every node connects to every other node in a network. If a network is not "clique", there must have some bridges between unconnected directly nodes.
- Between Centrality: How likely is a node to be in the direct route between two nodes
- Closeness Centrality: the distance of a node to link to all other nodes in a network
- Eigenvector centrality: How well is a person's network overall. It bases on the influence of nodes to assign score of each node.
Subscribe to:
Comments (Atom)