Accurate data does not come for free. It requires careful attention to the design of systems, constant monitoring of data collection, and aggressive actions to correct problems that generate or propagate inaccurate data.
- Jack Olson, Data Quality: The Accuracy Dimension
A data warehouse is a database with a specialized data structure that allows relatively quick and easy execution of complex queries over large amounts of data.
Data put into a warehouse must be clean. Companies spend about 80% of the effort on cleaning and profiling, and most companies have problems with the quality of their data.
Data Profiling -> uncovering the business rules from the data you are given.
Problems with profiling:
- Lack of documentation (most companies do not document their business rules)
Approach
Study the schema properly, then decide the test strategy and answer the following questions:
- How many distinct values are present in a field?
- Does a field have primary-key capability?
- What is the format of a particular field?
- What business rules apply? E.g. 1 -> Male, 2 -> Female, OR M -> Male, F -> Female
- What checks apply to the various fields?
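The questions above can be answered directly with SQL. Here is a minimal sketch in Python using the stdlib sqlite3 module; the table, column names, and e-mail pattern are illustrative assumptions, not from the original notes.

```python
import re
import sqlite3

# Hypothetical sample table; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, gender TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "M", "a@x.com"), (2, "F", "b@x.com"), (3, "M", None), (3, "X", "bad-email")],
)

# How many distinct values does a field have?
distinct_genders = conn.execute(
    "SELECT COUNT(DISTINCT gender) FROM employee"
).fetchone()[0]
print("distinct gender values:", distinct_genders)  # 3 -> 'X' violates the M/F rule

# Primary-key capability: emp_id qualifies only if it is unique and never NULL.
total, distinct_ids, nulls = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT emp_id), SUM(emp_id IS NULL) FROM employee"
).fetchone()
print("emp_id can be a primary key:", total == distinct_ids and nulls == 0)

# Format check on a field, using a simple (assumed) e-mail pattern.
pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
bad = [e for (e,) in conn.execute("SELECT email FROM employee WHERE email IS NOT NULL")
       if not pattern.match(e)]
print("badly formatted emails:", bad)
```

The duplicated emp_id 3 is what disqualifies the field as a primary key; the profiling query surfaces this without any constraint having been declared.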
Checks on individual data fields:
Tests:
(1) Primary Key and Foreign Key Tests
- Missing primary keys / primary-key violations
- Uniqueness across the table
- Email address should have a non-NULL value
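These primary-key tests can be sketched as plain queries against a staging table loaded without constraints (which is exactly when violations slip in). Table and data are illustrative assumptions.

```python
import sqlite3

# Illustrative staging table loaded without constraints, so violations can exist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(10, "x@a.com"), (11, None), (11, "y@a.com"), (None, "z@a.com")],
)

# Missing primary keys: rows where the key column is NULL.
missing = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE cust_id IS NULL"
).fetchone()[0]

# Primary-key violations: key values that occur more than once.
dupes = conn.execute(
    "SELECT cust_id FROM customer WHERE cust_id IS NOT NULL "
    "GROUP BY cust_id HAVING COUNT(*) > 1"
).fetchall()

# The non-NULL rule on the email field.
null_emails = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE email IS NULL"
).fetchone()[0]

print(missing, dupes, null_emails)  # 1 [(11,)] 1
```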
(2) Combinations of Attributes
- Referential integrity
- Relates to cardinality
- An employee works in just one department and has just one dept ID assigned to him. A department can have zero to many employees, so the combination-of-attributes check will point this out. There should be no missing values for a table.
- A department can have no employees. You need to go back to the documentation and check the cardinality of the tables.
- Business rules defined on salary depend on the role; the minimum salary depends on the level.
- The salary field must be within the minimum and maximum ranges.
- Restrictions related to the department number. You need to have a framework in mind: individual attributes, and combinations of attributes across tables. The problem is that most companies do not have documentation, so you have to take the schema and reverse engineer the rules. This makes profiling complex.
- Business rules are embedded in stored procedures, so look at what the procedures are telling you.
- Individual Attribute Tests
- Aggregate Columns
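The employee/department cardinality checks above can be sketched with two outer joins. Again, a minimal illustration with assumed table names, loaded without foreign-key enforcement, as staging data often is.

```python
import sqlite3

# Hypothetical dept/employee tables; FK not enforced, so orphans can exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dept (dept_id INTEGER, name TEXT);
CREATE TABLE employee (emp_id INTEGER, dept_id INTEGER);
INSERT INTO dept VALUES (1, 'Sales'), (2, 'R&D'), (3, 'Legal');
INSERT INTO employee VALUES (100, 1), (101, 1), (102, 9);  -- dept 9 does not exist
""")

# Referential-integrity test: employees pointing at a non-existent department.
orphans = conn.execute("""
    SELECT e.emp_id FROM employee e
    LEFT JOIN dept d ON e.dept_id = d.dept_id
    WHERE d.dept_id IS NULL
""").fetchall()
print("orphan employees:", orphans)  # [(102,)]

# Cardinality test: a department may legitimately have zero employees.
empty_depts = conn.execute("""
    SELECT d.name FROM dept d
    LEFT JOIN employee e ON e.dept_id = d.dept_id
    WHERE e.emp_id IS NULL
""").fetchall()
print("departments with no employees:", empty_depts)
```

Whether the empty departments are an error or not is exactly what the documentation (or the users) must tell you; the query only surfaces the facts.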
Reverse engineer the ER model and check for the business rules that apply.
These checks all relate to SQL. Note the difference between UNIQUE and DISTINCT: UNIQUE is a constraint that prevents duplicate values from being stored, while DISTINCT is a query keyword that removes duplicate rows from a result set.
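The UNIQUE vs. DISTINCT distinction can be demonstrated in a few lines (table names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# UNIQUE is a constraint: the database rejects duplicate values at insert time.
conn.execute("CREATE TABLE login (username TEXT UNIQUE)")
conn.execute("INSERT INTO login VALUES ('alice')")
constraint_fired = False
try:
    conn.execute("INSERT INTO login VALUES ('alice')")
except sqlite3.IntegrityError:
    constraint_fired = True
print("UNIQUE constraint fired:", constraint_fired)  # True

# DISTINCT is a query keyword: it removes duplicate rows from a result set
# without changing the stored data.
conn.execute("CREATE TABLE clicks (username TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?)", [("alice",), ("alice",), ("bob",)])
rows = conn.execute("SELECT DISTINCT username FROM clicks ORDER BY username").fetchall()
print(rows)  # [('alice',), ('bob',)]
```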
The Ataccama DQ tool was mentioned as a reference. You can write your own programs to do these tests, or use a data profiling tool to automate them.
Business rules can also govern an individual attribute, e.g. a range check on the salary field.
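A range check per level can be sketched as follows; the level names and salary bands are assumed figures for illustration, not rules from the original notes.

```python
import sqlite3

# Illustrative business rule: the allowed salary range depends on the level.
RANGES = {"junior": (30000, 60000), "senior": (60000, 120000)}  # assumed figures

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, level TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "junior", 45000), (2, "senior", 50000), (3, "senior", 90000)],
)

# Flag every row whose salary falls outside the range for its level.
violations = []
for emp_id, level, salary in conn.execute("SELECT emp_id, level, salary FROM employee"):
    lo, hi = RANGES[level]
    if not (lo <= salary <= hi):
        violations.append((emp_id, level, salary))
print("out-of-range salaries:", violations)  # [(2, 'senior', 50000)]
```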
Why do errors creep in?
- Data comes from multiple sources, and some of those sources may not have the constraints implemented.
- Constraints can be temporarily disabled in Oracle while bringing in data from other tables, letting bad rows slip in.
- When you start, you have high-quality data, but as time passes the quality deteriorates.
Where can you find the business rules of a company?
- Most relational databases have a catalog that stores constraints.
- The system catalog contains a tremendous amount of business information.
- Application-layer programs: many companies embed business rules in the application layer.
- E.g. an employee can lead only one project; referential integrity can be used to enforce this.
- Users can also be a very valuable source of information about business rules.
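Reading the catalog is the cheapest of these sources. A minimal sketch using SQLite's catalog (`sqlite_master` and `PRAGMA foreign_key_list`); the schema is an assumed example, and other databases expose the same information through `information_schema` or their own dictionary views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dept (dept_id INTEGER PRIMARY KEY);
CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,
    dept_id INTEGER NOT NULL REFERENCES dept(dept_id),
    salary INTEGER CHECK (salary >= 0)
);
""")

# The catalog stores each table's declared constraints as part of its DDL.
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'employee'"
).fetchone()[0]
print(ddl)

# Foreign-key relationships can also be read back in structured form.
fks = conn.execute("PRAGMA foreign_key_list(employee)").fetchall()
print(fks)  # one row describing employee.dept_id -> dept(dept_id)
```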
This is called Data Forensics:
- If a company insists on investigating every aspect of every business-intelligence reporting error, we are talking about data forensics. It is necessary to be aware of the reporting mistakes that happened, but too strong a focus on every detail of each generated error (who made the mistake, who signed off, who defined the process, whether everything followed procedure, what the error's impact is, etc.) can be counterproductive.
Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:
- Find out whether existing data can easily be used for other purposes
- Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category
- Give metrics on data quality, including whether the data conforms to particular standards or patterns
- Assess the risk involved in integrating data for new applications, including the challenges of joins
- Assess whether metadata accurately describes the actual values in the source database
- Understand data challenges early in any data-intensive project, so that late-project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
- Have an enterprise view of all data, for uses such as master data management where key data is needed, or data governance for improving data quality.
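The "collecting statistics" part of this definition can be sketched as a tiny per-column profiler; the table, column names, and data are illustrative assumptions.

```python
import sqlite3

# A minimal column profiler: for each column, collect row count, null count,
# distinct count, and min/max.
def profile(conn, table):
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    stats = {}
    for col in cols:
        total, nulls, distinct, lo, hi = conn.execute(
            f"SELECT COUNT(*), SUM({col} IS NULL), "
            f"COUNT(DISTINCT {col}), MIN({col}), MAX({col}) FROM {table}"
        ).fetchone()
        stats[col] = {"rows": total, "nulls": nulls, "distinct": distinct,
                      "min": lo, "max": hi}
    return stats

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 9.5), (2, None), (3, 120.0), (3, 9.5)])
stats = profile(conn, "orders")
for col, s in stats.items():
    print(col, s)
```

Even this small report answers the questions from the approach section above: distinct counts hint at primary-key capability, null counts flag completeness problems, and min/max support range checks.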
What is Master Data Management ? (Interview Question)
- It is not a tool but a process as a whole. It is a costly effort; it takes a while, and you cannot see the results (dashboards) very soon. MDM happens behind the scenes.
- In computing, master data management (MDM) comprises a set of processes and tools that consistently defines and manages the master data of an organization (which may include reference data). MDM has the objective of providing processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing such data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of this information.
Data Quality
- How well Data matches business rules
- Integrity / Reliability
Aspects of data quality include:
- Accuracy
- Completeness
- Update status
- Relevance
- Consistency across data sources
- Reliability
- Appropriate presentation
- Accessibility
Most companies make money by offering services, not by licensing. Ask about the different stages in MDM, the services being provided around MDM, training employees on the MDM process, and the road map, tools, and functionalities of MDM.