GeekInterview.com
   Home |  Tech FAQ  |   Interview Questions |  Placement Papers |  Tech Articles |  Learn |  Freelance Projects |  Online Testing |  Geeks Talk |  Job Postings |  Knowledge Base | Site Search |  Add/Ask Question

GeekInterview.com  >  Interview Questions  >  Data Warehousing  >  ETL
Go To First  |  Previous Question  |  Next Question 
 ETL  |  Question 39 of 58    Print  
Lets suppose we have some 10,000 odd records in source system and when load them into target how do we ensure that all 10,000 records that are loaded to target doesn't contain any garbage values.

How do we test it. We can't check every record as number of records are huge.

  
Total Answers and Comments: 8 Last Update: May 01, 2008     Asked by: Ram 
  
 Sponsored Links



 
 Best Rated Answer
Submitted by: jryan999
 
Data Quality checks come in a number of forms:-

1. For FACT table rows, is there a valid lookup against each of the Dimensions

2. For FACT or DIMENSION rows, for each value:-
  • Is it Null when it shouldn’t be
  • Is the Data Type correct (eg. Number, Date)
  • Is the range of values or format correct
  • Is the row valid with relation to all the other source system business rules?
There is no magic way of checking the integrity of data.

You could simply count the number of rows in and out again and assume it’s all OK, but for a fact table (at the very minimum) you’ll need to cope with failed Dimension lookups (typically from late arriving Dimension rows).

Classic solution is, include a Dimension Key Zero and Minus One (Null and Invalid) in your Dimension Table. Null columns are set to the Zero key, and a lookup failure to the Minus One. You may need to store and re-cycle rows with failed lookups and treat these as updates – so if the missing Dimension row appears, the data is corrected.

Otherwise, you’ve no option. If the incoming data is from an unreliable source, you’ll need to check it’s validity or accept the warehouse includes wrong results.

If the warehouse includes a high percentage of incorrect or misleading values – what’s the point of having it ?

Above answer was rated as good by the following members:
njvijay
August 17, 2006 17:22:40   #1  
2000reddy Member Since: August 2006   Contribution: 1    

RE: Lets suppose we have some 10,000 odd records in so...
dddf
 
Is this answer useful? Yes | No
September 12, 2006 08:14:34   #2  
jyothy        

RE: Lets suppose we have some 10,000 odd records in so...
Select count(*) From from both source table and Target table and compare the result.
 
Is this answer useful? Yes | No
December 21, 2006 14:36:39   #3  
prav2000 Member Since: December 2006   Contribution: 5    

RE: Lets suppose we have some 10,000 odd records in so...
You should have proper tesing conditions in your ETL jobs for validating all the important columns before they are loaded into the target. Always have proper rejects to capture records containing garbage values.
 
Is this answer useful? Yes | No
February 15, 2007 07:28:48   #4  
RufusA        

RE: Lets suppose we have some 10,000 odd records in so...
To do this, you must profile the data at the source to know the domain of all the values, get the actual number of rows in the source, get the types of the data in the source. After it is loaded into the target, this process can be repeated i.e. checking the data values with respect to range, type, etc and also checking the actual number of rows inserted. If the result before and after match, then we are OK. This process is automated typically in ETL tools.
 
Is this answer useful? Yes | No
May 07, 2007 09:32:53   #5  
Maharishi        

RE: Lets suppose we have some 10,000 odd records in so...
This is the base line for Error and Reject Handling. use IS_date, IS_number such function to check the data what we are loading following standard format. Don't forget to identify null values and existance of special characters in the data.
 
Is this answer useful? Yes | No
August 23, 2007 02:31:07   #6  
abhishek        

RE: Lets suppose we have some 10,000 odd records in so...
Go into workflow monitor after showing the status succeed click right button go into the property and you can see there no of source row and success target rows and rejected rows

Abhishek

 
Is this answer useful? Yes | No
April 22, 2008 10:22:31   #7  
masatjai Member Since: April 2008   Contribution: 1    

RE: Lets suppose we have some 10,000 odd records in source system and when load them into target how do we ensure that all 10,000 records that are loaded to target doesn't contain any garbage values. How do we test it. We can't check every record as numbe
1.check the number of records on source and the Target.
Select count(*) from source
Select count(*) from Target

2.consider a column of data type numeric say A on source and the Target
select sum(A) from source group by some key
select sum(A) from Target group by some key.
paste these two results in excel and make the difference of these ,they should all have zeros as the result.

Thanks
masatjai

 
Is this answer useful? Yes | No
May 01, 2008 06:58:46   #8  
jryan999 Member Since: April 2008   Contribution: 12    

RE: Lets suppose we have some 10,000 odd records in source system and when load them into target how do we ensure that all 10,000 records that are loaded to target doesn't contain any garbage values. How do we test it. We can't check every record as numbe
Data Quality checks come in a number of forms:-

1. For FACT table rows, is there a valid lookup against each of the Dimensions

2. For FACT or DIMENSION rows, for each value:-
  • Is it Null when it shouldn’t be
  • Is the Data Type correct (eg. Number, Date)
  • Is the range of values or format correct
  • Is the row valid with relation to all the other source system business rules?
There is no magic way of checking the integrity of data.

You could simply count the number of rows in and out again and assume it’s all OK, but for a fact table (at the very minimum) you’ll need to cope with failed Dimension lookups (typically from late arriving Dimension rows).

Classic solution is, include a Dimension Key Zero and Minus One (Null and Invalid) in your Dimension Table. Null columns are set to the Zero key, and a lookup failure to the Minus One. You may need to store and re-cycle rows with failed lookups and treat these as updates – so if the missing Dimension row appears, the data is corrected.

Otherwise, you’ve no option. If the incoming data is from an unreliable source, you’ll need to check it’s validity or accept the warehouse includes wrong results.

If the warehouse includes a high percentage of incorrect or misleading values – what’s the point of having it ?

 
Is this answer useful? Yes | NoAnswer is useful 1   Answer is not useful 0Overall Rating: +1    


 
Go To Top


 Sponsored Links



 
Sponsored Links

 
Sponsored Links

 




About Us  |   Privacy Policy  |   Terms and Conditions  |   Contact  |   Site Map  |   Add Question  |   Propose Category  |   RSS Feeds  |   Articles Sitemap  |   Site Updates  |   Add Resource

Copyright © 2005 - 2008 GeekInterview.com. All Rights Reserved
Page copy protected against web site content infringement by Copyscape