RE: what is data skew? how can you eliminate data skew...
Skew basically means the data is not evenly distributed across partitions. An uneven distribution degrades the performance of the overall execution, because the CPUs that finish early sit idle while they wait for the partitions with larger volumes to finish.
Ab Initio partitions by hashing the key. Choosing a key that has many distinct values will increase the chances of the data being partitioned evenly.
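To see why the number of distinct key values matters, here is a small Python sketch (not Ab Initio code). It assumes a generic hash-then-modulo scheme, with `zlib.crc32` standing in for whatever hash function the product really uses; the function name `partition_counts` and the sample keys are made up for illustration.

```python
import zlib

def partition_counts(keys, n_partitions):
    """Count how many records land in each partition under hash-by-key."""
    counts = [0] * n_partitions
    for key in keys:
        # Assumption: crc32 is only a stand-in for the real hash function.
        p = zlib.crc32(str(key).encode("utf-8")) % n_partitions
        counts[p] += 1
    return counts

# Low-cardinality key: 3 distinct values can fill at most 3 of the 8 partitions.
by_region = partition_counts(["east", "west", "north"] * 1000, 8)

# High-cardinality key: 3000 distinct ids give the hash far more values to spread.
by_customer = partition_counts([f"cust-{i}" for i in range(3000)], 8)

print("by region:  ", by_region)
print("by customer:", by_customer)
```

With the region key at least five of the eight partitions stay empty, because every record with the same key value always goes to the same partition. A key with many distinct values cannot guarantee an even split either, but it gives the hash a chance to spread the records.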
RE: what is data skew? how can you eliminate data skew while I am using partition by key?
The skew of a data or flow partition is the amount by which its size deviates from the average partition size, expressed as a percentage of the size of the largest partition:
Skew of data = (partition size - avg. partition size) * 100 / (size of largest partition)
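A quick worked example of that formula in Python. The function name `partition_skew` and the partition sizes are invented for illustration, and returning 0 for every partition when all partitions are empty is my own assumption, since the formula is undefined in that case.

```python
def partition_skew(sizes):
    """Return the skew percentage of each partition, per the formula above."""
    if not sizes:
        return []
    largest = max(sizes)
    if largest == 0:
        # Assumption: treat all-empty partitions as having zero skew.
        return [0.0 for _ in sizes]
    average = sum(sizes) / len(sizes)
    return [(size - average) * 100 / largest for size in sizes]

# Four partitions, one of them much larger than the rest.
sizes = [100, 100, 100, 500]
for size, skew in zip(sizes, partition_skew(sizes)):
    print(f"size={size:4d}  skew={skew:6.1f}%")
```

Here the average size is 200 and the largest is 500, so the three small partitions come out at -20% and the large one at 60%. Perfectly even partitions would all show 0%.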