Lets suppose we have some 10,000 odd records in source system and when load them into target how do we ensure that all 10,000 records that are loaded to target doesn't contain any garbage values.
How do we test it. We can't check every record as number of records are huge.
1. For FACT table rows, is there a valid lookup against each of the Dimensions
2. For FACT or DIMENSION rows, for each value:-
Is it Null when it shouldn’t be
Is the Data Type correct (eg. Number, Date)
Is the range of values or format correct
Is the row valid with relation to all the other source system business rules?
There is no magic way of checking the integrity of data.
You could simply count the number of rows in and out again and assume it’s all OK, but for a fact table (at the very minimum) you’ll need to cope with failed Dimension lookups (typically from late arriving Dimension rows).
Classic solution is, include a Dimension Key Zero and Minus One (Null and Invalid) in your Dimension Table. Null columns are set to the Zero key, and a lookup failure to the Minus One. You may need to store and re-cycle rows with failed lookups and treat these as updates – so if the missing Dimension row appears, the data is corrected.
Otherwise, you’ve no option. If the incoming data is from an unreliable source, you’ll need to check it’s validity or accept the warehouse includes wrong results.
If the warehouse includes a high percentage of incorrect or misleading values – what’s the point of having it ?
Above answer was rated as good by the following members: njvijay
RE: Lets suppose we have some 10,000 odd records in so...
You should have proper tesing conditions in your ETL jobs for validating all the important columns before they are loaded into the target. Always have proper rejects to capture records containing garbage values.
RE: Lets suppose we have some 10,000 odd records in so...
To do this, you must profile the data at the source to know the domain of all the values, get the actual number of rows in the source, get the types of the data in the source. After it is loaded into the target, this process can be repeated i.e. checking the data values with respect to range, type, etc and also checking the actual number of rows inserted. If the result before and after match, then we are OK. This process is automated typically in ETL tools.
RE: Lets suppose we have some 10,000 odd records in so...
This is the base line for Error and Reject Handling. use IS_date, IS_number such function to check the data what we are loading following standard format. Don't forget to identify null values and existance of special characters in the data.
RE: Lets suppose we have some 10,000 odd records in so...
Go into workflow monitor after showing the status succeed click right button go into the property and you can see there no of source row and success target rows and rejected rows
RE: Lets suppose we have some 10,000 odd records in source system and when load them into target how do we ensure that all 10,000 records that are loaded to target doesn't contain any garbage values. How do we test it. We can't check every record as numbe
1.check the number of records on source and the Target. Select count(*) from source Select count(*) from Target
2.consider a column of data type numeric say A on source and the Target select sum(A) from source group by some key select sum(A) from Target group by some key. paste these two results in excel and make the difference of these ,they should all have zeros as the result.
RE: Lets suppose we have some 10,000 odd records in source system and when load them into target how do we ensure that all 10,000 records that are loaded to target doesn't contain any garbage values. How do we test it. We can't check every record as numbe
Data Quality checks come in a number of forms:-
1. For FACT table rows, is there a valid lookup against each of the Dimensions
2. For FACT or DIMENSION rows, for each value:-
Is it Null when it shouldn’t be
Is the Data Type correct (eg. Number, Date)
Is the range of values or format correct
Is the row valid with relation to all the other source system business rules?
There is no magic way of checking the integrity of data.
You could simply count the number of rows in and out again and assume it’s all OK, but for a fact table (at the very minimum) you’ll need to cope with failed Dimension lookups (typically from late arriving Dimension rows).
Classic solution is, include a Dimension Key Zero and Minus One (Null and Invalid) in your Dimension Table. Null columns are set to the Zero key, and a lookup failure to the Minus One. You may need to store and re-cycle rows with failed lookups and treat these as updates – so if the missing Dimension row appears, the data is corrected.
Otherwise, you’ve no option. If the incoming data is from an unreliable source, you’ll need to check it’s validity or accept the warehouse includes wrong results.
If the warehouse includes a high percentage of incorrect or misleading values – what’s the point of having it ?