Saturday, April 21, 2012

Lecture 25 & 26 Visualization and social media monetization

Key takeaways from the video "Hans Rosling shows the best stats you've ever seen"

  • Observe data along different dimensions and track trends/moving patterns over time; record the moving paths to understand past trends and project future patterns (use the past pattern to make projections)
  • Visualize data relative to a reasonable measure, such as population
  • Show graphs with both aggregate data and split data
    • It is dangerous to rely only on averages, since the details can differ greatly from the average; averages can also be skewed by outliers and misrepresent the truth.
  • Clearly show patterns over time and across regions
  • Link data with the internet to draw the whole picture and tell a story about the data
  • Explain the behavior of the data - not just how the data is moving but why
  • The volume of data makes a difference - "big data" needs to be large enough to be representative
  • How data is accessed and cleaned is important
  • Present data in a visually appealing way
  • Relevant data is not always inside or about a company; external data, such as industry trends, is also important for predicting the future.
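The point about averages hiding the truth can be illustrated with a minimal sketch (the income figures here are hypothetical): a single outlier drags the mean far from the typical value, while the median stays representative.

```python
# Illustrative sketch: one outlier can pull the mean far from the
# typical value, while the median remains close to it.
from statistics import mean, median

incomes = [30_000, 32_000, 35_000, 31_000, 33_000]
print(mean(incomes), median(incomes))    # both near the typical income

incomes_with_outlier = incomes + [5_000_000]
print(mean(incomes_with_outlier))        # dragged up by the outlier
print(median(incomes_with_outlier))      # still near the typical value
```

This is why showing both aggregate and split data matters: the aggregate alone can misrepresent the underlying distribution.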

Takeaway from GOMC Competition

From a company's perspective, campaign performance is the most important thing. It is therefore critical to identify the goal of the campaign and estimate its costs and benefits in order to measure the overall ROI of the whole campaign. When comparing campaign performance, compare not only against the pre-campaign period but also against the post-campaign period, to show the campaign's contribution. If conversions during the campaign are much higher than in both the pre-campaign and post-campaign periods, the campaign did increase conversions, helping the organization achieve its goal.

For the pre-campaign comparison, we should compare the campaign period not only with the previous month but also with the same month in the prior year. Comparing against the same month in a different year lets us capture seasonality that is invisible when we compare only months within the same year: we learn whether the increase in conversions is due to the campaign or just to a hot season. On the other hand, comparing against the months just prior to the campaign gives us the overall scale of visits and conversions, so we can tell whether a year-over-year increase is actually attributable to the campaign rather than simply to business growth.
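The two comparisons above can be sketched as a small calculation; the monthly conversion counts below are hypothetical.

```python
# Hypothetical monthly conversion counts, keyed by (year, month).
conversions = {
    (2011, 3): 120, (2011, 4): 150,   # same months, previous year
    (2012, 3): 130,                   # month just before the campaign
    (2012, 4): 210,                   # campaign month
}

campaign = conversions[(2012, 4)]

# Month-over-month lift: gives the overall scale, but can be
# confounded by seasonality.
mom_lift = campaign / conversions[(2012, 3)] - 1

# Year-over-year lift for the same month: controls for seasonality.
yoy_lift = campaign / conversions[(2011, 4)] - 1

print(f"MoM lift: {mom_lift:.0%}, YoY lift: {yoy_lift:.0%}")
```

A large lift on both comparisons is the strongest evidence that the campaign, not the season or business growth, drove the increase.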

Friday, April 20, 2012

Lecture 23 & 24 Dimensional modeling and balanced scorecard


  • Dimensional modeling: miscellaneous details
    • Slowly changing dimension
      • Purpose: handle attribute values that change over time (e.g. brand, department, …) but only once in a while, not frequently
      • Type 0: nothing is ever going to change (not typical)
      • Type 1: directly update the attribute; simple, but history is completely lost
      • Type 2: preserve history by adding rows
        • Every time an attribute's value changes, add a new tuple with a new surrogate key, plus start and end dates on the new tuple
      • Type 3: add a new column/attribute, e.g. "new territory", and rename the previous one "old territory"
    • Role-playing dimension
    • Junk dimension
      • Groups miscellaneous low-cardinality flags and indicators together into one dimension
    • Semi-additive: can be summed across some dimensions but not others, e.g. a bank account balance can be summed across accounts but not over time
    • Non-additive: measures such as averages, minimums, and maximums can't be added together across any dimension
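The Type 2 mechanics above can be sketched in a few lines; the table and attribute names here are hypothetical.

```python
# Minimal sketch of a Type 2 slowly changing dimension: when an
# attribute changes, close out the current row and insert a new row
# with a new surrogate key and a fresh effective-date range.
from datetime import date

dim_product = [
    {"sk": 1, "product_id": "P100", "department": "Sports",
     "start": date(2010, 1, 1), "end": None},   # end=None marks the current row
]

def scd2_update(dim, product_id, new_department, change_date):
    new_sk = max(row["sk"] for row in dim) + 1
    for row in dim:
        if row["product_id"] == product_id and row["end"] is None:
            row["end"] = change_date             # expire the current row
    dim.append({"sk": new_sk, "product_id": product_id,
                "department": new_department,
                "start": change_date, "end": None})

scd2_update(dim_product, "P100", "Outdoor", date(2012, 4, 1))
# The dimension now holds both the historical and the current row.
```

Facts recorded before the change keep pointing at the old surrogate key, which is how history is preserved.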

Wednesday, March 28, 2012

Lecture 19 & 20 Develop dimensional modeling for AYFG case

AYFG is a gym with several branches nationwide. Given the ER diagram, we are going to develop a dimensional model step by step to answer the business questions and address the business problems.

The first idea is a star schema. Figure 1 below is the conceptual schema of the business. We can see several primary entities, including members, memberships, sales invoices, merchandise, etc. An interesting thing we noticed is that only merchandise sales have a quantity attribute. The reason is the assumption that the sales quantity of every other product, such as memberships and one-day passes, is one for every record.
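A star schema of this kind can be sketched as a fact table whose rows reference dimension rows by surrogate key; the member, product, and sales data below are hypothetical illustrations, not the actual AYFG schema.

```python
# Hypothetical star-schema sketch: fact rows carry foreign keys into
# the dimensions. Merchandise sales have a real quantity; membership
# sales default to quantity 1, per the assumption above.
dim_member = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
dim_product = {10: {"type": "merchandise", "name": "Water bottle"},
               11: {"type": "membership", "name": "Annual membership"}}

fact_sales = [
    {"member_key": 1, "product_key": 10, "date_key": 20120421,
     "quantity": 3, "amount": 15.00},
    {"member_key": 2, "product_key": 11, "date_key": 20120421,
     "quantity": 1, "amount": 500.00},   # memberships: quantity is always 1
]

# A typical star-schema query: total revenue per product type.
revenue = {}
for row in fact_sales:
    ptype = dim_product[row["product_key"]]["type"]
    revenue[ptype] = revenue.get(ptype, 0) + row["amount"]
print(revenue)
```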

Lecture 17 & 18 Dimensional Modeling

Data cleansing and data profiling take around 80% of the data analysis process, playing an extremely important role. The first thing we need to do is look at the schema and try to understand what the schema and attributes tell you. We can draw a graph to visualize the relationships among attributes. A few things we can check before analyzing the data: each attribute in each table, the business rules that govern every attribute (such as referential/data integrity), combinations of attributes, and cardinality. The most common situation we need to profile is data arriving from different tables in different formats because of a merger and acquisition or different information systems.
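A first profiling pass often boils down to counting nulls and distinct values (cardinality) per attribute; the toy table below is hypothetical.

```python
# Quick profiling pass over a table (list of dicts): for each
# attribute, count nulls and distinct non-null values (cardinality)
# before doing any deeper analysis.
rows = [
    {"member_id": 1, "state": "PA", "plan": "gold"},
    {"member_id": 2, "state": "PA", "plan": None},
    {"member_id": 3, "state": None, "plan": "gold"},
]

def profile(table):
    report = {}
    for col in table[0]:
        values = [r[col] for r in table]
        non_null = [v for v in values if v is not None]
        report[col] = {"nulls": values.count(None),
                       "distinct": len(set(non_null))}
    return report

print(profile(rows))
```

Surprises in such a report - an attribute with unexpected nulls, or a supposed key with fewer distinct values than rows - point straight at integrity-rule violations worth investigating.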

We will create a system catalog in the data warehouse. Each database has a list of tables, each table has a list of attributes, and each attribute has a list of constraints. It is good practice to define a default value for each constraint. Otherwise, to merge multiple tables containing null values in constrained attributes, we would need to turn off the constraint to make the merge work. The common case is that people forget to turn the constraints - which are there to enforce data integrity - back on, again causing a mess in the data.
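The default-value practice can be sketched as a pre-merge fill step; the attribute names and default values below are hypothetical.

```python
# Sketch: before merging tables, fill nulls in constrained attributes
# with each constraint's default value, so NOT NULL constraints can
# stay enabled throughout the merge.
defaults = {"department": "UNKNOWN", "region": "UNASSIGNED"}

table_a = [{"emp_id": 1, "department": "Sales", "region": None}]
table_b = [{"emp_id": 2, "department": None, "region": "East"}]

def apply_defaults(rows, defaults):
    return [{k: (defaults.get(k) if v is None else v)
             for k, v in row.items()} for row in rows]

merged = apply_defaults(table_a, defaults) + apply_defaults(table_b, defaults)
print(merged)
```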

Monday, March 19, 2012

Lecture 15 & 16 BI infrastructure


OLAP (Online Analytical Processing) is used in most BI infrastructures to look at trends, such as purchase trends, by aggregating and analyzing large amounts of query data.
OLTP (Online Transaction Processing) is regularly used in operational databases, handling single queries over small amounts of data.
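The contrast can be sketched with a tiny (hypothetical) orders table: an OLTP workload touches one record by key, while an OLAP workload scans and aggregates many records to surface a trend.

```python
# OLTP vs. OLAP sketch on a toy orders table.
from collections import defaultdict

orders = [
    {"order_id": 1, "customer": "Alice", "month": "2012-03", "amount": 40},
    {"order_id": 2, "customer": "Bob",   "month": "2012-03", "amount": 25},
    {"order_id": 3, "customer": "Alice", "month": "2012-04", "amount": 60},
]

# OLTP-style: fetch a single transaction by its key.
order = next(o for o in orders if o["order_id"] == 2)

# OLAP-style: aggregate over the whole table to see a monthly trend.
monthly_total = defaultdict(int)
for o in orders:
    monthly_total[o["month"]] += o["amount"]

print(order["amount"], dict(monthly_total))
```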

There are two primary BI infrastructure approaches. The first is the data warehouse, first proposed by Bill Inmon: start from one big enterprise system and eventually divide it into small divisional data marts; all supply-chain systems are pulled into the data warehouse. The other is the data mart approach, proposed by Ralph Kimball: start from small data marts and bring them together into a big enterprise system. The two actually generate the same result through different approaches. In the modern age, Kimball's approach is more common.

Sunday, March 4, 2012

Lecture 13 & 14 Network Analysis

Before analyzing a network, we need to know its basic structural properties. Centrality measures are used to examine how closely connected the nodes in a network are; different centrality measures capture different aspects of network connection.

  • Degree centrality: count the number of links, i.e., how many people a particular person links to directly. In-links and out-links count the same in degree centrality. If every node in a network connects to every other node, we call it a fully connected network, or clique. If a network is not a clique, there must be bridges between nodes that are not directly connected.
  • Betweenness centrality: how likely a node is to lie on the shortest route between two other nodes
  • Closeness centrality: how close a node is to all other nodes in the network, based on shortest-path distances
  • Eigenvector centrality: how well connected a person's network is overall; it assigns each node a score based on the influence of the nodes it connects to.
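Two of these measures can be computed by hand on a tiny undirected graph; the four-node graph below is a made-up example, and closeness uses the standard (n-1)/(sum of distances) definition.

```python
# Degree and closeness centrality on a tiny undirected graph,
# computed from an adjacency list with a breadth-first search.
from collections import deque

graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

def degree_centrality(g, node):
    # Number of direct links from the node.
    return len(g[node])

def closeness_centrality(g, node):
    # BFS for shortest-path distances from `node` to every other node.
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Standard definition: (n - 1) / sum of distances to all others.
    return (len(g) - 1) / sum(d for n, d in dist.items() if n != node)

print(degree_centrality(graph, "B"))      # B has the most direct links
print(closeness_centrality(graph, "B"))   # B can reach every node in 1 hop
```

Node B acts as the bridge to D here: removing it would disconnect the network, which is exactly what a high betweenness score would flag.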