Which partitioning method should we use for the Aggregator stage in parallel jobs?

Questions by izack   answers by izack


By default, this stage uses the Auto partitioning mode. The best partitioning method depends on the operating mode of this stage and of the preceding stage. If the Aggregator is operating in sequential mode, it first collects the data using the default Auto collection method before writing it to the file. If the Aggregator is in parallel mode, we can select any partitioning type from the drop-down list on the Partitioning tab. Generally, Auto or Hash can be used.



Thanks

Srinivas


I think the above answer is a little misleading. Most of the time you'll be using the Aggregator stage in parallel mode. If you use the Auto partitioning mode, nothing guarantees that rows with the same values in the grouping key columns will land in the same partition. Each partition then aggregates only its own share of a group, so the result will not be useful for this aggregation.

1) Identify the grouping keys you want to aggregate on.
2) In a stage prior to the Aggregator, do a hash partition on the grouping keys. This ensures that all rows with the same group key values land in the same partition.
3) The result of the aggregation will now be correct.
4) I even think the Entire partitioning method can be useful, but it will have slightly higher overhead compared to hash partitioning.
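The effect of steps 1 to 3 can be sketched outside DataStage. The Python below is only an illustration, not DataStage code: the function names, the `dept`/`amt` columns, and the use of CRC32 as the hash are all assumptions made for the example, and round robin stands in for any partitioning that ignores the grouping keys.

```python
import zlib
from collections import defaultdict

def hash_partition(rows, key, n):
    # Hypothetical helper: rows with the same key value always get the same partition index.
    parts = [[] for _ in range(n)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % n
        parts[idx].append(row)
    return parts

def round_robin_partition(rows, n):
    # Stand-in for a key-agnostic partitioner: rows are dealt out in turn.
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def aggregate_in_parallel(parts, key, value):
    # Each partition is summed on its own, then the outputs are concatenated,
    # mimicking an aggregation that runs independently on every node.
    out = []
    for part in parts:
        totals = defaultdict(int)
        for row in part:
            totals[row[key]] += row[value]
        out.extend(sorted(totals.items()))
    return out

rows = [
    {"dept": "A", "amt": 10},
    {"dept": "A", "amt": 7},
    {"dept": "B", "amt": 5},
    {"dept": "B", "amt": 1},
]

hashed = aggregate_in_parallel(hash_partition(rows, "dept", 2), "dept", "amt")
spread = aggregate_in_parallel(round_robin_partition(rows, 2), "dept", "amt")

print(sorted(hashed))  # one row per group
print(spread)          # each group appears once per partition it was split across
```

With hash partitioning on the grouping key, every group is summed in exactly one place. With the key-agnostic split, each group shows up as several partial sums that would still need a second aggregation.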

Hope that helps....

Thanks
Harish


It's always preferable to use a Sort stage before the Aggregator stage.
So, based on the aggregation logic, we should sort the incoming data, using hash partitioning on the keys.

Then we can use Same partitioning on the Aggregator stage.

This is the most commonly used approach.
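To see why sorting on the grouping keys matters once the data is already in the right partition, here is a rough Python analogy, not DataStage code. It assumes a single partition held as a list of dicts with hypothetical `dept` and `amt` columns, and uses `itertools.groupby`, which, like a sort-based aggregation, only merges rows that are adjacent.

```python
from itertools import groupby
from operator import itemgetter

def streaming_aggregate(partition, key, value):
    # Sums runs of adjacent rows that share the same key value.
    get_key = itemgetter(key)
    return [(k, sum(r[value] for r in grp)) for k, grp in groupby(partition, key=get_key)]

def sorted_aggregate(partition, key, value):
    # Sort on the grouping key first, so each group forms one contiguous run.
    ordered = sorted(partition, key=itemgetter(key))
    return streaming_aggregate(ordered, key, value)

partition = [
    {"dept": "B", "amt": 5},
    {"dept": "A", "amt": 10},
    {"dept": "B", "amt": 1},
    {"dept": "A", "amt": 7},
]

unsorted_result = streaming_aggregate(partition, "dept", "amt")
sorted_result = sorted_aggregate(partition, "dept", "amt")

print(unsorted_result)  # groups are fragmented because equal keys are not adjacent
print(sorted_result)    # one total per group
```

Because the sort happens inside the partition, the partitioning chosen upstream is left untouched, which is the idea behind keeping Same partitioning on the aggregating step.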


yassine

  • Jul 12th, 2017
 

Hello Harish, I would like to ask you a question: how can I choose the appropriate partitioning for each stage and job, and how can I analyse the situation?
Thank you


Anjaneyulu Pagadala

  • Mar 15th, 2018
 

Hash partitioning with in-link sorting on the grouping keys gives better performance and correct results when the stage runs in parallel mode. Auto partitioning will give correct results if the previous stage did no sorting, or sorted on only one of the grouping keys.

