Process 1TB Data and get Max age for each gender group

I have 1 TB of records in the format below:
I want to fetch the maximum age for each gender group using only the Reformat component. How can I achieve this?


Shalini Sharma

  • Oct 5th, 2016

Use a Rollup with gender as the key and compute max(age) into the output age attribute. Use in-memory sorting in the Rollup. The Rollup's in-memory requirement depends on its expected output, not its expected input. Because there are only two rows of output here (one per gender), in-memory sorting will give faster results. When the output is large, use a Sort component before the Rollup instead.
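To illustrate why memory use follows the output size rather than the input size, here is a minimal Python sketch of a keyed aggregation. It is not Ab Initio code; the function name `rollup_max_age` and the `(gender, age)` tuple layout are assumptions made for illustration, since the record format is not shown in the question.

```python
from typing import Dict, Iterable, Tuple


def rollup_max_age(records: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Stream (gender, age) pairs and keep one running maximum per gender.

    Only one entry per distinct gender is held in memory, no matter how
    many input records are read.
    """
    max_age: Dict[str, int] = {}
    for gender, age in records:
        if gender not in max_age or age > max_age[gender]:
            max_age[gender] = age
    return max_age


# Hypothetical sample data; the real input would be streamed from the file.
sample = [("M", 34), ("F", 41), ("M", 58), ("F", 29)]
result = rollup_max_age(sample)
```

Because `records` can be any iterable, the input never has to be loaded or sorted as a whole; the dictionary stays at two entries for two genders.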


shreya gupta

  • Oct 22nd, 2016

Here is your solution, step by step:

1) Configure the input file.

2) Add a Sort component and sort on age in descending order. Then add a Reformat that appends a new column populated by next_in_sequence(). This gives each record a serial number that follows the sort order, from the highest age to the lowest.

3) The record with the highest age is now first, and the record with the lowest age is last.

4) Apply a Filter by Expression and keep the record whose serial number is 1.

5) This is your record with the max age.
P.S. You can also achieve this with Sort + Dedup Sorted. Let me know if you require that.
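The Sort + Dedup variant mentioned in the P.S. might look roughly like the Python sketch below. Keeping only serial number 1 after sorting on age alone yields a single record, so this sketch assumes gender is added as the leading sort key and the first record of each gender is kept. The field names `gender` and `age` and the function name are hypothetical.

```python
from typing import Dict, List

_UNSET = object()  # sentinel meaning "no previous key seen yet"


def max_age_by_sort_dedup(records: List[Dict]) -> List[Dict]:
    """Sort by gender, then age descending, and keep the first record per gender."""
    ordered = sorted(records, key=lambda r: (r["gender"], -r["age"]))
    first_per_gender: List[Dict] = []
    previous_gender = _UNSET
    for rec in ordered:
        # A change of key marks the start of a new group; its first
        # record carries the highest age because of the sort order.
        if rec["gender"] != previous_gender:
            first_per_gender.append(rec)
            previous_gender = rec["gender"]
    return first_per_gender
```

Unlike the running-maximum approach, this needs a full sort of the input, which is the expensive part at 1 TB.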



  • Dec 7th, 2016

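Here is the flow of components:
Input file > Partition by Round-robin (to process the 1 TB file in parallel) > Rollup {key: gender} taking max(age) > Gather > Rollup {key: gender} taking max(age) > output file.
NOTE: We cannot use Partition by Key here, because gender has only two distinct values and keying on it would send all records to at most two partitions.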
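The two-stage flow above can be sketched in Python as follows. This is only an illustration of the partition, local rollup, gather, final rollup pattern, not Ab Initio code; the helper names, the `(gender, age)` tuple layout, and the partition count are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]  # (gender, age), a hypothetical layout


def partition_round_robin(records: Iterable[Record], n: int) -> List[List[Record]]:
    """Deal records out to n partitions in turn, so the load is spread evenly."""
    partitions: List[List[Record]] = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        partitions[i % n].append(rec)
    return partitions


def rollup_max(records: Iterable[Record]) -> Dict[str, int]:
    """Keep the maximum age seen for each gender."""
    out: Dict[str, int] = {}
    for gender, age in records:
        if gender not in out or age > out[gender]:
            out[gender] = age
    return out


def two_stage_max(records: Iterable[Record], n: int = 4) -> Dict[str, int]:
    # Stage 1: each partition computes its own per-gender maximum.
    partials = [rollup_max(p) for p in partition_round_robin(records, n)]
    # Gather: merge the small partial results into one stream.
    gathered = [(g, a) for partial in partials for g, a in partial.items()]
    # Stage 2: the maximum of the partial maxima is the overall maximum.
    return rollup_max(gathered)
```

The second rollup only sees at most two records per partition, so the final serial step is cheap regardless of input size.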


Step 1: Use output_index in the Reformat to separate male and female records into two flows.
Step 2: Sort each flow by age in descending order.
Step 3: Apply a Filter by Expression where next_in_sequence() is 1.
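As a rough Python analogy of these three steps (not Ab Initio code), the sketch below routes each record to one of two flows, sorts each flow by age descending, and keeps the first record. The gender codes "M" and "F", the field names, and the function names are assumptions, since the record format is not shown.

```python
from typing import Dict, List


def output_index(rec: Dict) -> int:
    """Pick the flow for a record: 0 for male, 1 for everything else (assumed female)."""
    return 0 if rec["gender"] == "M" else 1


def max_age_per_flow(records: List[Dict]) -> List[Dict]:
    # Step 1: split the input into two flows by gender.
    flows: List[List[Dict]] = [[], []]
    for rec in records:
        flows[output_index(rec)].append(rec)

    results: List[Dict] = []
    for flow in flows:
        # Step 2: sort the flow by age, highest first.
        ordered = sorted(flow, key=lambda r: r["age"], reverse=True)
        # Step 3: keep only the first record, i.e. sequence number 1.
        if ordered:
            results.append(ordered[0])
    return results
```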



  • May 26th, 2021

Input file --> Reformat (as asked), adding output_indexes to separate the flows --> Sort on age (descending) --> FBE where next_in_sequence() == 1 --> Concatenate/Gather both output_indexes flows --> output file



  • Jun 24th, 2021

Hi Mahesh, will this solution work if the input data file is an MFS file and we are supposed to run this in parallel?

