What is data skew? How can you eliminate data skew when using Partition by Key?

Questions by prabhupurna   answers by prabhupurna


donbuckme

  • Sep 22nd, 2006
 

Skew means the data is not evenly distributed across partitions. An uneven distribution degrades overall performance because the CPUs handling small partitions sit idle while they wait for the partitions with larger volumes to finish.

Ab Initio's Partition by Key distributes records by hashing the key. Choosing a key with many distinct values increases the chance that the data is partitioned evenly.
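To illustrate the point about distinct values, here is a minimal Python sketch that simulates hash partitioning. It is not Ab Initio code: the SHA-256 based hash, the `partition_sizes` helper, and the sample data are all assumptions made up for the illustration, and the real component may use a different hash function.

```python
import hashlib
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition when hashing on the key.

    Assumption: a SHA-256 based hash stands in for whatever hash function
    the real partitioner uses.
    """
    counts = Counter()
    for key in keys:
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        counts[int.from_bytes(digest[:8], "big") % num_partitions] += 1
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical data: 10,000 records spread over 4 partitions.
# Key with only 2 distinct values: 9,000 records share one value.
low_cardinality = ["US" if i % 10 else "CA" for i in range(10_000)]
# Key with 10,000 distinct values.
high_cardinality = list(range(10_000))

low_sizes = partition_sizes(low_cardinality, 4)
high_sizes = partition_sizes(high_cardinality, 4)

print("2 distinct key values:     ", low_sizes)
print("10,000 distinct key values:", high_sizes)
```

With the low-cardinality key, every record with the same value hashes to the same partition, so at most two partitions receive any data and one of them holds at least 9,000 records. With the high-cardinality key the records spread out much more evenly.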

dontbuckme

  • Sep 22nd, 2006
 

Skew means the data is not evenly partitioned. PBE uses a hash function, so choosing the key that has the most distinct values may help the data be distributed evenly.


hsevna

  • May 15th, 2008
 

The skew of a data or flow partition is the amount by which its size deviates from the average partition size, expressed as a percentage of the largest partition:

Skew of data = (partition size - average partition size) * 100 / (size of largest partition)
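As a worked example of this formula, the following Python sketch computes the skew of each partition from a list of partition sizes. The function name `partition_skew` and the sample sizes are hypothetical, chosen only for illustration.

```python
def partition_skew(sizes):
    """Return the skew (in percent) of each partition.

    skew = (partition size - average partition size) * 100 / largest partition size
    """
    if not sizes:
        raise ValueError("at least one partition size is required")
    average = sum(sizes) / len(sizes)
    largest = max(sizes)
    if largest == 0:
        # All partitions are empty, so there is nothing to be skewed.
        return [0.0 for _ in sizes]
    return [(size - average) * 100 / largest for size in sizes]

# Hypothetical partition sizes: average is 200, largest is 400.
sizes = [400, 100, 100, 200]
skews = partition_skew(sizes)
print(skews)  # [50.0, -25.0, -25.0, 0.0]
```

A positive value means the partition is larger than average, a negative value means it is smaller, and values near zero for every partition indicate an even distribution.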

