Manoj Seemar

Nov 2nd, 2007

Merge Joins
Sort merge joins can be used to join rows from two independent sources. Hash joins generally perform better than sort merge joins. On the other hand, sort merge joins can perform better than hash joins if both of the following conditions exist:
The row sources are sorted already.
A sort operation does not have to be done.
However, if a sort merge join involves choosing a slower access method (an index scan as opposed to a full table scan), then the benefit of using a sort merge might be lost.
Sort merge joins are useful when the join condition between two tables is an inequality condition such as <, <=, >, or >= (but not a nonequality such as <>). Sort merge joins perform better than nested loop joins for large data sets. You cannot use hash joins unless there is an equality condition.
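
To illustrate why sorted inputs help with an inequality condition, here is a minimal Python sketch of a join on left key < right key. The function name less_than_join and its parameters are hypothetical, chosen for illustration; this is not how Oracle implements the operation internally.

```python
def less_than_join(left, right, key_l, key_r):
    """Return every (l, r) pair where key_l(l) < key_r(r)."""
    lrows = sorted(left, key=key_l)
    rrows = sorted(right, key=key_r)
    out = []
    start = 0  # index of the first right row that can match the current left row
    for l in lrows:
        # Skip right rows whose key is not greater than this left key.
        # Left keys are ascending, so this pointer never has to move back.
        while start < len(rrows) and key_r(rrows[start]) <= key_l(l):
            start += 1
        # Every remaining right row has a strictly greater key, so it matches.
        out.extend((l, r) for r in rrows[start:])
    return out
```

Because both inputs are ordered, the matches for each left row form one contiguous run of the right input. A hash table only answers "is this key equal?", which is why it cannot serve this kind of condition.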
In a merge join, there is no concept of a driving table. The join consists of two steps:
Sort join operation: Both the inputs are sorted on the join key.
Merge join operation: The sorted lists are merged together.
If the input is already sorted by the join column, then a sort join operation is not performed for that row source.
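
The two steps above can be sketched in Python for an equijoin. This is a simplified in-memory illustration, not Oracle's implementation; sort_merge_join and the left_sorted / right_sorted flags are hypothetical names that stand in for "this row source is already sorted, so skip its sort join step".

```python
def sort_merge_join(left, right, left_key, right_key,
                    left_sorted=False, right_sorted=False):
    """Equijoin two row lists by sorting both on the join key, then merging."""
    # Step 1 (sort join): sort each input unless it already arrives sorted.
    lrows = list(left) if left_sorted else sorted(left, key=left_key)
    rrows = list(right) if right_sorted else sorted(right, key=right_key)

    # Step 2 (merge join): walk both sorted lists in step.
    out = []
    i = j = 0
    while i < len(lrows) and j < len(rrows):
        lk, rk = left_key(lrows[i]), right_key(rrows[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(rrows) and right_key(rrows[j_end]) == lk:
                j_end += 1
            # Pair every left row carrying this key with the whole run.
            while i < len(lrows) and left_key(lrows[i]) == lk:
                for r in rrows[j:j_end]:
                    out.append((lrows[i], r))
                i += 1
            j = j_end
    return out
```

Neither input plays the role of a driving table here: both pointers advance according to the key comparison alone.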
The optimizer can choose a sort merge join over a hash join for joining large amounts of data if any of the following conditions are true:
The join condition between two tables is not an equi-join.
OPTIMIZER_MODE is set to RULE.
HASH_JOIN_ENABLED is false.
Because of sorts already required by other operations, the optimizer finds it is cheaper to use a sort merge than a hash join.
The optimizer thinks that the cost of a hash join is higher, based on the settings of HASH_AREA_SIZE and SORT_AREA_SIZE.

To advise the optimizer to use a sort merge join, apply the USE_MERGE hint. You might also need to give hints to force an access path.
There are situations where it is better to override the optimizer with the USE_MERGE hint. For example, the optimizer can choose a full scan on a table and avoid a sort operation in a query. However, there is an increased cost because a large table is accessed through an index and single block reads, as opposed to faster access through a full table scan.

Hash Joins
Hash joins are used for joining large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows.
This method is best used when the smaller table fits in available memory. The cost is then limited to a single read pass over the data for the two tables.
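
As a rough illustration of the build and probe phases when everything fits in memory, consider the Python sketch below. The function hash_join and its parameters are hypothetical, and a Python dict stands in for the in-memory hash table; the real optimizer picks the build side from statistics rather than from a simple length check.

```python
from collections import defaultdict

def hash_join(a, b, key_a, key_b):
    """Equijoin two row lists; results are always (row_from_a, row_from_b)."""
    a, b = list(a), list(b)
    # Build on the smaller input, probe with the larger one.
    if len(a) <= len(b):
        build, probe, key_build, key_probe, swapped = a, b, key_a, key_b, False
    else:
        build, probe, key_build, key_probe, swapped = b, a, key_b, key_a, True

    # Build phase: one pass over the smaller input.
    table = defaultdict(list)
    for row in build:
        table[key_build(row)].append(row)

    # Probe phase: one pass over the larger input.
    out = []
    for row in probe:
        for match in table.get(key_probe(row), ()):
            out.append((row, match) if swapped else (match, row))
    return out
```

Each input is read exactly once, which matches the single read pass described above.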
However, if the hash table grows too big to fit into memory, then the optimizer breaks it up into different partitions. As the partitions exceed allocated memory, parts are written to temporary segments on disk. Larger temporary extent sizes lead to improved I/O when writing the partitions to disk; the recommended temporary extent is about 1 MB. Temporary extent size is specified by INITIAL and NEXT for permanent tablespaces and by UNIFORM SIZE for temporary tablespaces.
After the hash table is complete, the following processes occur:
The second, larger table is scanned.
It is broken up into partitions like the smaller table.
The partitions are written to disk.
When the hash table build is complete, it is possible that an entire hash table partition is resident in memory. Then, you do not need to build the corresponding partition for the second (larger) table. When that table is scanned, rows that hash to the resident hash table partition can be joined and returned immediately.
Each hash table partition is then read into memory, and the following processes occur:
The corresponding partition for the second table is scanned.
The hash table is probed to return the joined rows.
This process is repeated for the rest of the partitions. The cost can increase to two read passes over the data and one write pass over the data.
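
The partitioned variant can be sketched along the same lines. In this simplified Python illustration, plain lists stand in for the partitions written to temporary segments, the partition count is fixed up front, and partitioned_hash_join and num_partitions are hypothetical names; the optimization of keeping a resident partition in memory is left out.

```python
from collections import defaultdict

def partitioned_hash_join(small, large, key_small, key_large, num_partitions=4):
    """Equijoin by splitting both inputs into matching hash partitions."""
    # Phase 1: partition both inputs with the same hash function, so rows
    # that can join always land in partitions with the same index.
    small_parts = [[] for _ in range(num_partitions)]
    large_parts = [[] for _ in range(num_partitions)]
    for row in small:
        small_parts[hash(key_small(row)) % num_partitions].append(row)
    for row in large:
        large_parts[hash(key_large(row)) % num_partitions].append(row)

    # Phase 2: join one pair of partitions at a time, so only one
    # partition's hash table has to be held in memory.
    out = []
    for s_part, l_part in zip(small_parts, large_parts):
        table = defaultdict(list)
        for row in s_part:
            table[key_small(row)].append(row)
        for row in l_part:
            for match in table.get(key_large(row), ()):
                out.append((match, row))
    return out
```

Every row is read once to partition it, written out once, and read again when its partition is joined, which corresponds to the two read passes and one write pass mentioned above.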
If the hash table does not fit in memory, parts of it may need to be swapped in and out, depending on the rows retrieved from the second table. Performance for this scenario can be extremely poor.

The optimizer uses a hash join to join two tables if they are joined using an equijoin and if either of the following conditions is true:
A large amount of data needs to be joined.
A large fraction of the table needs to be joined.

SELECT o.customer_id, l.unit_price * l.quantity
FROM orders o, order_items l
WHERE l.order_id = o.order_id;

Apply the USE_HASH hint to advise the optimizer to use a hash join when joining two tables together. If you are having trouble getting the optimizer to use hash joins, investigate the values for the HASH_AREA_SIZE and HASH_JOIN_ENABLED parameters.

Difference Between Hash Join & Merge Join

Merge Join :

Oracle performs a join between two sets of row data using the merge
join algorithm. The inputs are two separate sets of row data, and the
output is the result of the join. Oracle reads rows from both inputs in
an alternating fashion and merges matching rows to generate the output.
Both inputs are sorted on the join column.

Hash Join :

Oracle performs a join between two sets of row data using the hash
join algorithm. The inputs and output are the same as for a merge join.
Oracle reads all rows from the second input and builds a hash structure
(like a hash table in Java) before reading rows from the first input one
at a time. For each row from the first input, the hash structure is
probed, and matching rows generate output.

Interview Candidate
Jul 9th, 2005

SQL


mvg_mca

Mar 24th, 2010

A merge join basically sorts all relevant rows in the first table by the join key, and also sorts the relevant rows in the second table by the join key, and then merges these sorted rows.

A hash join (ideally) takes the smaller table (or row source), iterates over its rows, applies a hash function to the join columns, and stores the result. After it has finished, it iterates over the other table and applies the same hash function to the join columns. It then searches the previously built hash values and, if they match, returns the row.