What is data skew? How can you eliminate data skew while using partition by key?

prabhupurna
Sep 16th, 2006

donbuckme

Sep 22nd, 2006

Skew means the data is not evenly distributed across partitions. An uneven distribution degrades overall performance, because the CPUs handling the smaller partitions sit idle while they wait for the partitions with larger volumes to finish.
Ab Initio's partition by key distributes records by hashing the key. Choosing the key that has the most distinct values increases the chances of the data being partitioned evenly.
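
To illustrate the point about distinct values, here is a rough Python sketch rather than Ab Initio code. It assumes a CRC32 hash taken modulo the number of partitions, which is only a stand-in for whatever hash the product actually uses, and the record fields (customer_id, country) are made up for the example.

```python
import zlib
from collections import Counter


def partition_by_key(records, key, num_partitions):
    """Count how many records land in each partition when hashing `key`.

    CRC32 is an assumed stand-in hash, not Ab Initio's real function.
    """
    counts = Counter()
    for record in records:
        key_bytes = str(record[key]).encode("utf-8")
        counts[zlib.crc32(key_bytes) % num_partitions] += 1
    return [counts[p] for p in range(num_partitions)]


# Hypothetical data: 10,000 records, 90% of them with country "US".
records = [
    {"customer_id": i, "country": "US" if i % 10 else "CA"}
    for i in range(10_000)
]

# Only two distinct values: at most two of the four partitions get any data.
by_country = partition_by_key(records, "country", 4)

# 10,000 distinct values: records spread across all partitions.
by_customer = partition_by_key(records, "customer_id", 4)

print("by country:    ", by_country)
print("by customer_id:", by_customer)
```

With a two-valued key, every record with the same value hashes to the same partition, so the largest partition holds at least 90% of the data no matter which hash is used. The high-cardinality key gives the hash many more values to spread around.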

dontbuckme

Sep 22nd, 2006

Skew means the data is not evenly partitioned. PBE uses a hash function, so choosing the key that has the most distinct values may help distribute the data more evenly.

hsevna

May 15th, 2008

The skew of a data or flow partition is the amount by which its size deviates from the average partition size, expressed as a percentage of the largest partition:

Skew of data = (partition size - average partition size) * 100 / (size of largest partition)
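
As a worked example of the formula above, here is a small Python sketch. The function name partition_skew and the sample sizes are made up for illustration; the sizes could be record counts or bytes, as long as the unit is the same for every partition.

```python
def partition_skew(sizes):
    """Return the skew (in percent) of each partition.

    skew = (partition size - average partition size) * 100 / largest size
    """
    if not sizes:
        raise ValueError("at least one partition size is required")
    largest = max(sizes)
    if largest == 0:
        # All partitions are empty, so nothing deviates from the average.
        return [0.0 for _ in sizes]
    average = sum(sizes) / len(sizes)
    return [(size - average) * 100 / largest for size in sizes]


# Hypothetical 4-way layout: three partitions of 100 records, one of 500.
sizes = [100, 100, 100, 500]
print(partition_skew(sizes))  # [-20.0, -20.0, -20.0, 60.0]
```

Here the average is 200 and the largest partition is 500, so the big partition has a skew of 60% and the others have -20%. A perfectly even layout gives 0% for every partition.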