GeekInterview.com
   Home |  Tech FAQ  |   Interview Questions |  Placement Papers |  Tech Articles |  Learn |  Freelance Projects |  Online Testing |  Geeks Talk |  Job Postings |  Knowledge Base | Site Search |  Add/Ask Question

  GeekInterview.com  >  Interview Questions  >  Data Warehousing  >  ETL

 Print  |  
Question:  Lets suppose we have some 10,000 odd records in source system and when load them into target how do we ensure that all 10,000 records that are loaded to target doesn't contain any garbage values.

How do we test it. We can't check every record as number of records are huge.




May 05, 2008 06:58:46 #8
 jryan999   Member Since: April 2008    Total Comments: 11 

RE: Lets suppose we have some 10,000 odd records in source system and when load them into target how do we ensure that all 10,000 records that are loaded to target doesn't contain any garbage values. How do we test it. We can't check every record as numbe
 
Data Quality checks come in a number of forms:-

1. For FACT table rows, is there a valid lookup against each of the Dimensions

2. For FACT or DIMENSION rows, for each value:-
  • Is it Null when it shouldn’t be
  • Is the Data Type correct (eg. Number, Date)
  • Is the range of values or format correct
  • Is the row valid with relation to all the other source system business rules?
There is no magic way of checking the integrity of data.

You could simply count the number of rows in and out again and assume it’s all OK, but for a fact table (at the very minimum) you’ll need to cope with failed Dimension lookups (typically from late arriving Dimension rows).

Classic solution is, include a Dimension Key Zero and Minus One (Null and Invalid) in your Dimension Table. Null columns are set to the Zero key, and a lookup failure to the Minus One. You may need to store and re-cycle rows with failed lookups and treat these as updates – so if the missing Dimension row appears, the data is corrected.

Otherwise, you’ve no option. If the incoming data is from an unreliable source, you’ll need to check it’s validity or accept the warehouse includes wrong results.

If the warehouse includes a high percentage of incorrect or misleading values – what’s the point of having it ?
     

 

Back To Question