Join on partitioned flow

Suppose I have two files with fields file1(A,B,C) and file2(A,B,D). If we partition both files on key A using Partition by Key and pass the output to a Join component whose join key is (A,B), will the join work or not, and WHY?

Questions by abinitio17


Puneet123

  • Jul 1st, 2008
 

The Partition component divides the data into different partitions depending on the key. The Join component expects the data to arrive as an ordered flow if "Input must be sorted" is checked.
In this case the join will not fail, but it will not give the correct output.
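To illustrate the "runs but gives wrong output" point, here is a minimal Python sketch of a generic sort-merge join. It is not Ab Initio's implementation; the function name `merge_join` and the sample records are made up for illustration. The sketch assumes both inputs are sorted on the join key, and shows that unsorted input raises no error but silently drops matches.

```python
def merge_join(left, right, key):
    """Inner join of two lists of dicts, ASSUMING both are sorted on key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1          # left record has no partner yet, move on
        elif lk > rk:
            j += 1          # right record has no partner yet, move on
        else:
            # Find the run of right records sharing this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left record in the run with every right record in the run.
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out

# Hypothetical sample data: file1(A,B,C) and file2(A,B,D).
file1 = [{"A": 2, "B": "x", "C": "c1"},
         {"A": 1, "B": "y", "C": "c2"},
         {"A": 3, "B": "z", "C": "c3"}]
file2 = [{"A": 1, "B": "y", "D": "d1"},
         {"A": 2, "B": "x", "D": "d2"},
         {"A": 3, "B": "z", "D": "d3"}]

join_key = lambda rec: (rec["A"], rec["B"])

# file1 is not sorted on (A,B): no error is raised, but a match is lost.
unsorted_result = merge_join(file1, file2, join_key)

# Sorting both inputs on the join key first recovers every match.
sorted_result = merge_join(sorted(file1, key=join_key),
                           sorted(file2, key=join_key),
                           join_key)

print(len(unsorted_result), len(sorted_result))  # prints "2 3"
```

With the unsorted input the (1, "y") pair is skipped, which is the kind of silent data loss described above.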
 


The key is always important in the Join component; otherwise you may not get the desired result. In Ab Initio everything is key-based: if the key is wrong, everything can go wrong, yet the graph will still run successfully. Sometimes you may not get any result at all.


I believe the Join component expects the data to arrive as an ordered flow if you check "Input must be sorted", so the input to the Join must be an ordered set of data.
If it is, I believe the join results would be as expected.

Could anyone please comment if you think the expected output still won't be produced, and if so, why?


Subhra Dhar

  • Mar 25th, 2009
 

I do not think the join output would be correct. The partition key fields for the two input streams should be the same as the join key fields in the Join component; otherwise the data from stream 1 would be partitioned differently from the data from stream 2, and the Join component won't find all matches.


vss34

  • Jul 16th, 2009
 

The partition key and join key do NOT have to be exactly the same. In order to join properly, you just have to make sure the records being compared are in the same partition.

So if the partition key is broader than the join key, meaning its fields are a subset of the join key's fields (which is the case here, since the partition key is just field A and the join key is A and B), then the join will work fine as long as you sort the data after the partition or make it an in-memory join. For example, all records in both datasets with a value of 1 for field A will be placed in the same partition regardless of the value of field B. So records with (A,B) = (1,X), where X is any value present in both datasets, will join up correctly since they will be in the same partition.

If the partition key is narrower than the join key (for example, the partition key is A and B, and the join key is just A), then the join will most likely not work correctly, since you cannot guarantee that the hashing algorithm of Partition by Key will place the matching records in the same partition.
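The two cases above can be checked with a small Python simulation. This is only a sketch under stated assumptions: `toy_hash` is a deliberately simple stand-in (sum of the key values) for whatever hash Partition by Key really uses, the per-partition join is an in-memory hash join, and all function names and sample records are hypothetical.

```python
from collections import defaultdict

def toy_hash(values):
    # Toy stand-in for a real hash function, chosen so results can be traced by hand.
    return sum(values)

def partition_by_key(records, key_fields, n_parts):
    # Send each record to partition toy_hash(key) mod n_parts.
    parts = [[] for _ in range(n_parts)]
    for rec in records:
        parts[toy_hash(rec[f] for f in key_fields) % n_parts].append(rec)
    return parts

def hash_join(left, right, key_fields):
    # In-memory inner join: index the right side, then probe with the left side.
    index = defaultdict(list)
    for r in right:
        index[tuple(r[f] for f in key_fields)].append(r)
    return [(l, r)
            for l in left
            for r in index.get(tuple(l[f] for f in key_fields), [])]

def partitioned_join(left, right, partition_fields, join_fields, n_parts=2):
    # Partition both inputs the same way, then join partition i with partition i only.
    lparts = partition_by_key(left, partition_fields, n_parts)
    rparts = partition_by_key(right, partition_fields, n_parts)
    out = []
    for lp, rp in zip(lparts, rparts):
        out.extend(hash_join(lp, rp, join_fields))
    return out

# Hypothetical sample data: file1(A,B,C) and file2(A,B,D).
file1 = [{"A": 1, "B": 1, "C": "c1"},
         {"A": 1, "B": 2, "C": "c2"},
         {"A": 2, "B": 1, "C": "c3"}]
file2 = [{"A": 1, "B": 1, "D": "d1"},
         {"A": 1, "B": 2, "D": "d2"},
         {"A": 2, "B": 2, "D": "d3"}]

# Case 1: partition on A, join on (A,B). Matches the unpartitioned join.
full_ab = hash_join(file1, file2, ["A", "B"])
part_a_join_ab = partitioned_join(file1, file2, ["A"], ["A", "B"])

# Case 2: partition on (A,B), join on A. Matching records get separated.
full_a = hash_join(file1, file2, ["A"])
part_ab_join_a = partitioned_join(file1, file2, ["A", "B"], ["A"])

print(len(full_ab), len(part_a_join_ab))   # prints "2 2"
print(len(full_a), len(part_ab_join_a))    # prints "5 2"
```

In case 1 the partitioned join finds the same 2 pairs as the unpartitioned join. In case 2 the unpartitioned join on A finds 5 pairs, but partitioning on (A,B) leaves only 2, because records sharing A but differing in B can land in different partitions.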


Abhishek

  • Feb 5th, 2013
 

Yes, this is going to work fine provided you do it as an in-memory join. Let me explain why. Whenever you use field A as the partition key, records with the same value of A in both files will definitely go into the same partition. For example, say the values in my key field are 2, 3 and 4 in both files. Now say that, by the hash value calculation, 2 and 3 go to partition 1 and 4 goes to partition 2. Since the WHOLE RECORD is available in that particular partition, the join will work just fine.

If the partitioning had been done on keys A,B, the performance of the graph would have been better.

This scenario would have failed if you had given B,A as the key in the join instead of A,B.

Hope this helps.

