Monday, March 26, 2012

STAR Schema (AYEG Case) and Microsoft BI

OLAP VS OLTP



OLTP and OLAP are complementary technologies. You can't live without OLTP: it runs your business day by day. So getting strategic information straight from OLTP is usually the first "quick and dirty" approach, but it can become limiting later.

This post explores key differences between two technologies.

OLTP stands for On Line Transaction Processing and is a data modeling approach typically used to facilitate and manage everyday business applications. Most of the applications you see and use are OLTP based.

OLAP stands for On Line Analytic Processing and is an approach to answering multi-dimensional queries. OLAP was conceived for Management Information Systems and Decision Support Systems but is still widely underused: every day I see too many people trying to build business intelligence directly from OLTP data!

Star schemas are not normalized. You do not update and delete tuples the way you do in a normal database; the schema facilitates queries rather than other operations. When you are only selecting, there are no index updates, which makes heavy use of indexes easier.


GRANULARITY - A data warehouse typically stores data at different levels of granularity or summarization, depending on the data requirements of the business. If an enterprise needs data to assist strategic planning, then only highly summarized data is required. The lower the level of granularity of data required by the enterprise, the higher the number of resources (specifically data storage) required to build the data warehouse. The different levels of summarization, in order of increasing granularity, are listed below (a small SQL sketch of rolling detailed data up into a summary table follows the list):


  • Current operational data
  • Historical operational data
  • Aggregated data
  • Metadata
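For example, here is a minimal SQL sketch of rolling fine-grained operational rows up into a highly summarized table; the table and column names (sales_fact, sale_date, product_id, quantity, sale_amount) are assumptions for illustration, not from the case.

-- Hypothetical detailed fact rows rolled up to a coarser (monthly) summary.
CREATE TABLE monthly_sales_summary AS
SELECT product_id,
       EXTRACT(YEAR FROM sale_date)  AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(quantity)                 AS total_quantity,
       SUM(sale_amount)              AS total_amount
FROM sales_fact
GROUP BY product_id, EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date);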
AYEG

Revenue Sources:
Pro Shop and juice bar
Membership and Day Passes
Problem: How to generate more revenue? Which set of customers should the gym target? Have the customer patterns changed over time?
BI is all about how to maximize revenue based on internal or external data.
Now organize corporate events, where customers pay fixed fees.
What are the thoughts about that?

Need to have a look at the OLTP data. Need to look at the trends and patterns. A good point to start with. Need to relate this to The Data Warehouse Toolkit by Ralph Kimball. Can the above questions be answered using the OLTP DB? Yes... but we need to do some massaging of the data over time.

For some questions you need the historical data. Even if you have multiple sources of data, you need to consolidate it into a single data warehouse.

A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse which is usually oriented to a specific business line or team.

This can also be done by Virtual Integration. This could be a step towards complete integration.

With a virtual data warehouse, you don't have to go through all the hard work of integrating data, rummaging around old files and trying to understand what some programmer back in the 1960s had in mind. You don't need to take up space with replicated data. What a lovely idea. You just string some files together and access them as if they were a data warehouse. 
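A minimal sketch of the idea using plain database views; every table name here is hypothetical. The "virtual warehouse" is just a view strung over the source tables, and a data mart is a narrower view on top of it.

-- Virtual integration: a view strings two hypothetical regional sales tables together.
CREATE VIEW all_sales AS
SELECT sale_date, store_id, product_id, amount FROM east_region_sales
UNION ALL
SELECT sale_date, store_id, product_id, amount FROM west_region_sales;

-- A data mart oriented to one business line, as a subset view over the "warehouse".
CREATE VIEW apparel_mart AS
SELECT sale_date, store_id, product_id, amount
FROM all_sales
WHERE product_id IN (SELECT product_id FROM product WHERE category = 'Apparel');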



Apache Cassandra is an open source distributed database management system that can be used together with Hadoop.

A dimension is a structure that categorizes data in order to enable users to answer business questions. Commonly used dimensions are customers, products, and time. For example, each sales channel of a clothing retailer might gather and store data regarding sales and returns of its clothing assortment. The retail chain management can build a data warehouse to analyze the sales of its products across all stores over time and help answer questions such as:

What is the effect of promoting one product on the sale of a related product that is not promoted?
What are the sales of a product before and after a promotion?
How does a promotion affect the various distribution channels?
The data in the retailer's data warehouse system has two important components: dimensions and facts. The dimensions are products, customers, promotions, channels, and time. One approach for identifying your dimensions is to review your reference tables, such as a product table that contains everything about a product, or a promotion table containing all information about promotions. The facts are sales (units sold) and profits. A data warehouse contains facts about the sales of each product on a daily basis.

A typical relational implementation for such a data warehouse is a star schema. The fact information is stored in the so-called fact table, whereas the dimensional information is stored in the so-called dimension tables. In our example, each sales transaction record is uniquely defined by its customer, product, sales channel, promotion, and day (time).
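As a hedged sketch (the fact and dimension names below, sales_fact, product_dim, promotion_dim, time_dim, are assumptions rather than a fixed schema), the "sales before and after a promotion" question becomes a star join that constrains the time and promotion dimensions and sums the fact:

-- Star join: units sold per product and month, split by promotion.
SELECT p.product_name,
       pr.promo_name,
       t.calendar_month,
       SUM(f.units_sold) AS units_sold
FROM sales_fact f
JOIN product_dim   p  ON f.product_key = p.product_key
JOIN promotion_dim pr ON f.promo_key   = pr.promo_key
JOIN time_dim      t  ON f.time_key    = t.time_key
WHERE t.calendar_year = 2012
GROUP BY p.product_name, pr.promo_name, t.calendar_month
ORDER BY p.product_name, t.calendar_month;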


BI Rules Learnt in today's Class :
  • Facts are quantitative attributes that can be added across every dimension !!!
  • Items are not quantitative. Quantity and $ amount are quantitative.
  • Dimension tables are short and fat.
  • Fact tables are tall (exhaustive) and skinny (very few attributes).
  • Surrogate keys represent the dimensions.


Assumption: Every gym has a local database. Think beyond what is already there in a regular database.

Dimensions : 

Time : Day, Hour
Space : Location where the sale is happening. Location of the Gym. Gym Dimension
Membership : Type
Items : Things that you buy, Products

Facts:

Sales : $ sales amount
Total # of members

Quantity has Member, Time and Item dimensions.
What do you do about memberships? Use a separate schema for memberships; here, only include retail sales.
Attributes for membership: type of membership, membership ID.

Importance of Surrogate Keys
Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key. It is useful because the natural primary key (i.e. Customer Number in the Customer table) can change, and this makes updates more difficult.
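Putting the AYEG pieces together, here is a minimal sketch of the retail-sales star schema with surrogate keys. Every table and column name is an assumption for illustration (the membership schema would be modeled separately, as noted above).

-- Dimension tables: short and fat, each keyed by a surrogate key
-- (populated from a sequence or identity column, depending on the DBMS).
CREATE TABLE time_dim (
    time_key    INTEGER PRIMARY KEY,   -- surrogate key
    full_date   DATE,
    day_of_week VARCHAR(10),
    hour_of_day INTEGER
);

CREATE TABLE gym_dim (
    gym_key     INTEGER PRIMARY KEY,   -- surrogate key
    gym_id      VARCHAR(20),           -- natural key from the gym's local database
    city        VARCHAR(50),
    state       VARCHAR(20)
);

CREATE TABLE item_dim (
    item_key    INTEGER PRIMARY KEY,   -- surrogate key
    item_id     VARCHAR(20),           -- natural key
    item_name   VARCHAR(100),
    category    VARCHAR(50)            -- e.g. pro shop vs. juice bar
);

-- Fact table: tall and skinny, holding only foreign keys and additive facts.
CREATE TABLE retail_sales_fact (
    time_key     INTEGER REFERENCES time_dim (time_key),
    gym_key      INTEGER REFERENCES gym_dim (gym_key),
    item_key     INTEGER REFERENCES item_dim (item_key),
    quantity     INTEGER,              -- additive fact
    sales_amount DECIMAL(10,2)         -- additive fact ($ amount)
);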

Slice and Dice
The ability to move between different combinations of dimensions when viewing data with an OLAP browser.
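Using the hypothetical tables sketched above, slicing and dicing is just regrouping the same fact by different dimension combinations:

-- Slice: total $ sales by item category.
SELECT i.category, SUM(f.sales_amount) AS total_sales
FROM retail_sales_fact f
JOIN item_dim i ON f.item_key = i.item_key
GROUP BY i.category;

-- Dice: the same fact regrouped by gym city and day of week.
SELECT g.city, t.day_of_week, SUM(f.sales_amount) AS total_sales
FROM retail_sales_fact f
JOIN gym_dim  g ON f.gym_key  = g.gym_key
JOIN time_dim t ON f.time_key = t.time_key
GROUP BY g.city, t.day_of_week;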

Staging -> Operational Data Store -> Data Warehouse
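A rough sketch of that flow in SQL, with assumed table names (a real pipeline would use an ETL tool rather than two inserts): raw rows land in staging, get cleansed into the operational data store, and are then keyed against the dimensions when loaded into the warehouse fact table.

-- Staging -> ODS: apply basic cleansing rules while copying.
INSERT INTO ods_sales (sale_date, gym_id, item_id, quantity, sales_amount)
SELECT sale_date, gym_id, item_id, quantity, sales_amount
FROM staging_sales
WHERE sales_amount IS NOT NULL
  AND quantity > 0;

-- ODS -> warehouse: look up surrogate keys and load the fact table.
INSERT INTO retail_sales_fact (time_key, gym_key, item_key, quantity, sales_amount)
SELECT t.time_key, g.gym_key, i.item_key, o.quantity, o.sales_amount
FROM ods_sales o
JOIN time_dim t ON t.full_date = o.sale_date
JOIN gym_dim  g ON g.gym_id    = o.gym_id
JOIN item_dim i ON i.item_id   = o.item_id;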






Wednesday, March 21, 2012

Dimensional Modelling



Consider a general departmental store. This is in third normal form. Dimensional modeling (DM) is the name of a set of techniques and concepts used in data warehouse design. It is considered to be different from entity-relationship modeling (ER). Dimensional modeling does not necessarily involve a relational database. The same modeling approach, at the logical level, can be used for any physical form, such as a multidimensional database or even flat files. According to data warehousing consultant Ralph Kimball, DM is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance. According to him, although transaction-oriented ER is very useful for transaction capture, it should be avoided for end-user delivery.




CUSTOMERS (CID, CName, Address, Phone,...,...)




PRODUCT (PId, PName, Brand, Category, Size ,...,...)




ORDERS(OId, Date, Total, Cash/Credit,...,...)




ORDERID(OId, ProdID, ProductQty, Line#, Amount)




This is the OLTP database: inserting, deleting, updating and querying data for reporting. Now you need to do more analysis. Need to find seasonality, demand, patterns.




You can do that using the OLTP database, but it will be a big problem because of the joins and aggregate queries involved. For example, an aggregate query over the schema above (assuming the customer id CID is among the elided columns of ORDERS):

SELECT o.CID, d.ProdID, SUM(d.ProductQty) AS total_qty
FROM ORDERS o JOIN ORDERID d ON o.OId = d.OId
GROUP BY o.CID, d.ProdID;

and then slice and dice it. But it is hard to do this because of the nature of the data.

Indexing speeds up retrieval, but the overhead is costly. When you have too many indexes (dense or sparse), the system might slow down: insert and delete queries get slower because every transactional update also requires the indexes to be updated.






Benefits of the dimensional modeling are following:
Understandability - Compared to the normalized model, the dimensional model is easier to understand and more intuitive. In dimensional models, information is grouped into coherent business categories or dimensions, making it easier to read and interpret. Simplicity also allows software to navigate databases efficiently. In normalized models, data is divided into many discrete entities and even a simple business process might result in dozens of tables joined together in a complex way.
Query performance - Dimensional models are more denormalized and optimized for data querying, while normalized models seek to eliminate data redundancies and are optimized for transaction loading and updating. The predictable framework of a dimensional model allows the database to make strong assumptions about the data that aid in performance. Each dimension is an equivalent entry point into the fact table, and this symmetrical structure allows effective handling of complex queries. Query optimization for star join databases is simple, predictable, and controllable.
Extensibility - Dimensional models are extensible and easily accommodate unexpected new data. Existing tables can be changed in place either by simply adding new data rows into the table or executing SQL alter table commands. No queries or other applications that sit on top of the Warehouse need to be reprogrammed to accommodate changes. Old queries and applications continue to run without yielding different results. But in normalized models each modification should be considered carefully, because of the complex dependencies between database tables.


In a star schema you have denormalized tables.








FACTS are additive and can be added across all their dimensions.
DIMENSIONS


What kinds of questions are you going to ask? Sales in terms of customers / geography / products. Central to sales is the $ amount: what is the $ amount for the order? Think of the fact as the quantitative attribute. Most data warehouses are built around quantitative attributes.


Customer and Product become dimensions. There are one-to-many relations between the dimensions and the fact table. The finer the granularity of the fact, the more detailed the analysis I can do.


Granularity is usually mentioned in the context of dimensional data structures (i.e., facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity and vice versa. Another way to look at it is that the higher the granularity of a fact table, the more rows it will have.

Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.
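One common way to handle a slowly changing dimension (often called a Type 2 SCD, a term not used in the notes above) is to keep history by versioning dimension rows with effective dates. The sketch below uses a hypothetical customer dimension; all names and values are illustrative.

-- Hypothetical Type 2 slowly changing dimension: history kept via effective dates.
CREATE TABLE customer_dim (
    customer_key   INTEGER PRIMARY KEY,   -- surrogate key
    customer_id    VARCHAR(20),           -- natural key
    city           VARCHAR(50),           -- slowly changing attribute
    effective_from DATE,
    effective_to   DATE,                  -- NULL while the row is current
    is_current     CHAR(1)
);

-- When customer C100 moves, close the current row and insert a new one.
UPDATE customer_dim
SET effective_to = DATE '2012-03-26', is_current = 'N'
WHERE customer_id = 'C100' AND is_current = 'Y';

INSERT INTO customer_dim
    (customer_key, customer_id, city, effective_from, effective_to, is_current)
VALUES
    (1001, 'C100', 'Austin', DATE '2012-03-26', NULL, 'Y');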




Facts are analyzed based on dimensions.

Monday, March 19, 2012

Data Quality, Profiling, Cleansing, MDM and Data Forensics





Accurate data does not come for free. It requires careful attention to the design of systems, constant monitoring of data collection, and aggressive actions to correct problems that generate or propagate inaccurate data.
        -Jack Olson, Data Quality: The Accuracy Dimension



A data warehouse is a database with a particular data structure that allows relatively quick and easy execution of complex queries over large amounts of data.


Data put into the warehouse must be clean. Companies spend roughly 80% of the effort on cleaning and profiling. Most companies have problems with the quality of their data.


Data Profiling -> Uncovering the Business Rules from Data Given.


Problems with Profiling:
Lack of documentation of the underlying business rules (see below).


Approach


Study the schema properly, then decide the test strategy and answer the following questions (an illustrative SQL sketch follows this list):

  • How many distinct values are present
  • Primary key capability of  a field
  • Format of a particular field
  • Business Rules - 1-> Male 2-> Female OR M-> Male F-> Female
  • Checks on various fields 
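An illustrative set of profiling queries along those lines, against a hypothetical EMPLOYEE table (none of these names come from the class):

-- Distinct values / primary key capability of a field.
SELECT COUNT(*) AS row_count, COUNT(DISTINCT emp_id) AS distinct_emp_ids
FROM employee;

-- Frequency of coded values, to uncover rules like 1 -> Male, 2 -> Female.
SELECT gender, COUNT(*) AS occurrences
FROM employee
GROUP BY gender;

-- Format check on a field: email addresses that are missing or do not look like emails.
SELECT emp_id, email
FROM employee
WHERE email IS NULL OR email NOT LIKE '%@%';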





Checks on individual data fields :
Tests : 
(1) Primary Key and Foreign Key Tests

  • Missing primary keys / primary key violations
  • Uniqueness across the table
  • Email address should have a NON NULL value



(2) Combinations of Attributes





  • Referential integrity
  • Relates to cardinality
  • An employee works in just one department and has just one Dept ID assigned to him; a department can have zero to many employees. The combination-of-attributes check will point this out. There should be no missing values for a table.
  • A department can have no employees. Need to go back to the documentation and check the cardinality of the tables.
  • Business rules defined on salary depending on the role: the minimum salary depends on the level.
  • The salary field should lie within the minimum and maximum ranges (a sketch of such checks follows this list).
  • Restrictions related to the department number. Need to have a framework in mind: individual attributes, and combinations of attributes across tables. The problem is that most companies do not have documentation; you need to take the schema and reverse engineer it. This makes profiling complex.
  • Business rules are embedded in the stored procedures, so look at what the procedures are telling you.
  • Individual attribute tests
  • Aggregate columns
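A hedged sketch of such cross-attribute checks, using hypothetical EMPLOYEE, DEPT and SALARY_BAND tables:

-- Referential integrity / cardinality: employees pointing at a non-existent department.
SELECT e.emp_id, e.dept_id
FROM employee e
LEFT JOIN dept d ON e.dept_id = d.dept_id
WHERE d.dept_id IS NULL;

-- Business rule on salary: the value must fall inside the range defined for the level.
SELECT e.emp_id, e.salary, b.min_salary, b.max_salary
FROM employee e
JOIN salary_band b ON e.job_level = b.job_level
WHERE e.salary < b.min_salary OR e.salary > b.max_salary;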



Reverse engineer the ER model and check for the business rules that apply.
The checks are all related to SQL. Difference between Unique and Distinct.
Reference to the Ataccama DQ tool.
Write your own programs to do the tests, or use a data profiling tool to run them.
Business rules that govern individual attributes.
Range check on the salary field.


Why Do Errors Creep In?

  • Data comes from multiple sources, and some of those sources might not have the constraints implemented.
  • Constraints can be disabled (e.g. in Oracle) while bringing in data from other tables.
  • When you start you have high quality data, but as time passes the quality deteriorates.



Where can you find the business rules of a company?

  • Most relational DBs have a system catalog which holds the constraints.
  • The system catalog has a tremendous amount of business information.
  • Application layer programs: many companies embed business rules in application layer programs.
  • E.g. an employee can lead only one project; referential integrity can be used to enforce this.
  • Users can also be a very important source, since they may be able to give information about the business rules.





This is called Data Forensics:

  • If a company insists on investigating every aspect of business intelligence reporting errors, then we are talking about data forensics. It is mandatory to be aware of reporting mistakes that have happened, but this approach puts a very big focus on every detail of each generated error (who made the mistake, who signed off, who defined the process, whether everything was done under procedure, what the error impact is, and so on).








Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:

  • Find out whether existing data can easily be used for other purposes
  • Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category
  • Give metrics on data quality, including whether the data conforms to particular standards or patterns
  • Assess the risk involved in integrating data for new applications, including the challenges of joins
  • Assess whether metadata accurately describes the actual values in the source database
  • Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
  • Have an enterprise view of all data, for uses such as master data management where key data is needed, or data governance for improving data quality.



What is Master Data Management ? (Interview Question)

  • It's not a tool but a process as a whole. It's a costly effort; it takes a while, and you cannot see the results (dashboards) very soon. That takes time. MDM happens behind the scenes.
  • In computing, master data management (MDM) comprises a set of processes and tools that consistently defines and manages the master data of an organization (which may include reference data). MDM has the objective of providing processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing such data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of this information.







Data Quality

  • How well Data matches business rules
  • Integrity / Reliability




Aspects of data quality include: 

  • Accuracy
  • Completeness
  • Update status
  • Relevance
  • Consistency across data sources
  • Reliability
  • Appropriate presentation
  • Accessibility
According to Gartner, the Magic Quadrant aims to provide a qualitative analysis into a market and its direction, maturity and participants, thus possibly enabling a company to be a stronger competitor for that market. 

Most companies make money by offering services and not by licensing. Ask about the different stages in MDM, the services being provided around MDM, training employees on the MDM process, and the road map / tools / functionalities of MDM.