If I have two files with fields file1(A,B,C) and file2(A,B,D), and we partition both files on key A using Partition by Key and pass the output to a Join component whose join key is (A,B), will the join work or not, and WHY?
The Partition component divides the data into different partitions depending on the key. The Join component expects the data to be in an ordered flow if "Input must be sorted" is checked. In this case the join will not fail, but it will not give the correct output.
The key is always important in the Join component; otherwise you may not get the desired result. In Ab Initio everything is key based: if the key is wrong, everything can go wrong, yet the graph will still run successfully. Sometimes you may not get any result at all.
I believe the Join component expects the data to be in an ordered flow if "Input must be sorted" is checked, so the input to the Join has to be an ordered set of data. If it is, I believe the join results would be as expected.
Could anyone please comment if they think the expected output would not be produced this way, and if so, why?
I do not think the join output would be correct. The partition key fields for the two input streams should be the same as the join key fields in the Join component; otherwise the data from stream 1 would be partitioned differently from the data from stream 2, and the Join component won't find all the matches.
The partition key and join key do NOT have to be exactly the same. To join properly, you just have to make sure the records being compared are in the same partition.
So if the partition key is broader than the join key (which it is in this case, since the partition key is just field A and the join key is A and B), then the join will work fine as long as you sort the data after the partition or make it an in-memory join. For example, all records in both datasets with a value of 1 for field A will be placed in the same partition regardless of the value of field B. So records with (A, B) = (1, X), where X is any value present in both datasets, will join up correctly since they will be in the same partition.
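To make this concrete, here is a small Python simulation of the idea. It is only a sketch, not Ab Initio code: `partition_by_key`, `inner_join`, and `partitioned_join` are hypothetical helpers, Python's built-in `hash` stands in for whatever hashing Partition by Key really uses, and a dictionary lookup plays the role of an in-memory join.

```python
from collections import defaultdict

def partition_by_key(records, key_fields, n_partitions):
    """Hash-partition a list of dict records on key_fields (stand-in for Partition by Key)."""
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        parts[hash(key) % n_partitions].append(rec)
    return parts

def inner_join(left, right, key_fields):
    """In-memory inner join of two record lists on key_fields."""
    index = defaultdict(list)
    for r in right:
        index[tuple(r[f] for f in key_fields)].append(r)
    out = []
    for l in left:
        for r in index.get(tuple(l[f] for f in key_fields), []):
            out.append({**l, **r})
    return out

def partitioned_join(left, right, partition_fields, join_fields, n_partitions=4):
    """Partition both inputs the same way, then join partition i with partition i only."""
    lparts = partition_by_key(left, partition_fields, n_partitions)
    rparts = partition_by_key(right, partition_fields, n_partitions)
    out = []
    for lp, rp in zip(lparts, rparts):
        out.extend(inner_join(lp, rp, join_fields))
    return out

def canon(rows):
    """Order-independent form of a result set, for comparison."""
    return sorted(tuple(sorted(r.items())) for r in rows)

# Made-up sample data: file1(A,B,C) and file2(A,B,D)
file1 = [{"A": a, "B": b, "C": f"c{a}{b}"} for a in range(1, 6) for b in "XYZ"]
file2 = [{"A": a, "B": b, "D": f"d{a}{b}"} for a in range(1, 6) for b in "XY"]

expected = inner_join(file1, file2, ["A", "B"])             # join with no partitioning
actual = partitioned_join(file1, file2, ["A"], ["A", "B"])  # partition on A, join on (A,B)
matches = canon(expected) == canon(actual)
print(matches, len(actual))
```

Two records that agree on (A,B) necessarily agree on A, so they always land in the same partition and the partitioned result matches the unpartitioned one.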
If the partition key is narrower than the join key (for example, the partition key is A and B and the join key is just A), then the join will most likely not work correctly, since you cannot guarantee that the hashing algorithm of Partition by Key will place the matching records in the same partition. Two records with the same A but different B values can hash to different partitions and will then never be compared.
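The failure case can be sketched the same way in Python. Again this is an illustration under assumptions, not Ab Initio behaviour: `toy_partition` is a hypothetical stand-in that uses the sum of the integer key values modulo the partition count instead of a real hash, chosen only so the outcome is reproducible, and the sample records are made up.

```python
def toy_partition(records, key_fields, n_partitions):
    """Deterministic stand-in for a hash partitioner: sum of integer key values mod n."""
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        parts[sum(rec[f] for f in key_fields) % n_partitions].append(rec)
    return parts

def join_on(left, right, key_fields):
    """Nested-loop inner join; returns (C, D) pairs for matching records."""
    return [(l["C"], r["D"]) for l in left for r in right
            if all(l[f] == r[f] for f in key_fields)]

def partitioned_join(left, right, partition_fields, join_fields, n_partitions=2):
    """Partition both inputs identically, then join matching partition numbers only."""
    out = []
    for lp, rp in zip(toy_partition(left, partition_fields, n_partitions),
                      toy_partition(right, partition_fields, n_partitions)):
        out.extend(join_on(lp, rp, join_fields))
    return out

# Same A values on both sides, but different B values
file1 = [{"A": 1, "B": 0, "C": "c1"}, {"A": 2, "B": 0, "C": "c2"}]
file2 = [{"A": 1, "B": 1, "D": "d1"}, {"A": 2, "B": 2, "D": "d2"}]

unpartitioned = join_on(file1, file2, ["A"])                      # the answer we want
part_on_ab = partitioned_join(file1, file2, ["A", "B"], ["A"])    # partition on (A,B), join on A
part_on_a = partitioned_join(file1, file2, ["A"], ["A"])          # partition on A, join on A

print(unpartitioned)  # [('c1', 'd1'), ('c2', 'd2')]
print(part_on_ab)     # [('c2', 'd2')]
print(sorted(part_on_a) == unpartitioned)  # True
```

With the partition key (A,B), the file1 record with A=1 goes to partition 1 while the file2 record with A=1 goes to partition 0, so that match is silently lost even though nothing fails.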