Now that we've got some data, it would be nice to connect data together from multiple tables. The way this is done is through one of three Join nodes. Before we get into that, it is important to discuss all of the different Join types.
It is easy to think of Joins only in the context of
FROM a INNER JOIN b ON a.Column = b.Column. However, there are more cases where data needs to be joined in some way. As a complete list, here are all of the possible join types on the SQL server:
- Cross Join
- Inner Equi-Join
- Left/Right Equi-Join
- Full Equi-Join
- Left/Right Semi Equi-Join
- Left/Right Anti Semi Equi-Join
- Inner Nonequi-Join
- Left/Right Nonequi-Join
- Full Nonequi-Join
- Left/Right Semi Nonequi-Join
- Left/Right Anti Semi Nonequi-Join
Now as you can tell, there is a pattern to each of these Join types, so let's talk about the basics that form the patterns.
The Cross Join feels like it stands alone from the others, but it is actually the absence of any of the Join flags. For the math geeks out there, the Cross Join is basically the Cartesian Product of the two inputs. For everyone else, this just means that every record on the left side is joined with every record on the right side. For large inputs, this will result in exponentially large outputs.
This refers to the pattern that we already know from JOIN clauses in the SQL statement. An Inner join will only return rows if both sides have at least one row that matches; a left join will always return the left row, whether the right row exists or not; etc.
Equi-Join vs Nonequi-Join
It may not be apparent, because few people write queries this way, but SQL does allow you to do a Join to find where the records do not match instead of where they do match. The following query is non-sensical logically, but is allowed and executed by the SQL server:
select * from Sales.SalesOrderHeader soh inner join Sales.SalesOrderDetail sod where soh.SalesOrderID != sod.SalesOrderID.
An Equi-Join is any join where at least one of the clauses is an
=, whereas a Nonequi-Join is a join where none of the clauses is an
=. If there are multiple clauses and at least one of them is an
=, then the server can perform any of the Join types quicker based on the equality clause(s), and then do a residual filter based on the remaining clauses.
Semi & Anti-Semi
The Semi and Anti-Semi Join patterns are based on the
IN clause. They are called Semi joins, because they are half-joins. The result set doesn't care about any of the actual data on one side of the Join, only about the existence (or lack thereof, for Anti-) of data on one side of the join. The following would use a Semi join:
select * from Sales.SalesOrderHeader soh where exists (select * from Sales.SalesOrderDetail sod where sod.SalesOrderID = soh.SalesOrderID). Notice that we don't care about the actual data in
SalesOrderDetail, we only care that data exists for the given
The Nested Loops node is the simplest and most obvious of the Join nodes. As you can see in the pseudo-code below for an Inner Join, all it does is iterate the right sub-tree for each record from the left sub-tree. For large data-sets, especially on the left side, it is by far the least efficient way to merge data. However, there is no set up, so for many cases this is a very effective way to join data, especially if the right side is primarily an Index seek based on the parameter from the left tree.
Nested Loops is also the only way to execute a Nonequi-Join. Both Merge Join and Hash Match are based on the fundamental basis of equality, and so they cannot execute without a basis for matching records.
One thing to notice about this node is that there generally is no filtering or comparison done inside this node. Instead, relevant parameters are passed down from this node to the right sub-tree, and any filtering will be done either in the data retrieval node or in the Filter node. Only in the most unusual cases will the Nested Loops node contain a Predicate to filter records after the join has occurred.
This will be the most common Join operator seen in execution plans, as it is the easiest to use and as long as Indexing is done properly, it works well in most cases.
IEnumerable<DataRow> NestedLoops(IEnumerable<DataRow> left, Function<IEnumerable<DataRow>> right)
foreach (var leftRow in left)
foreach (var rightRow in right(leftRow.ColumnA, leftRow.ColumnB))
yield return BuildCombinedRow(leftRow, rightRow);
The Merge Join is very powerful for dealing with large sorted data sets, but it can be difficult to explain in code. Here is an animation that illustrates how a Merge Join operates. There are two iterators, one for each sub-tree. The iterator starts moving forward on each, evaluating at each step the order of the join values for each side at the current row. If one side is before the other, then that iterator is moved forward until is equal or greater to the value of the other iterator. Depending on if the Join operation is an INNER, LEFT/RIGHT OUTER, or FULL OUTER JOIN, then rows that don't match may be returned or ignored.
The Merge Join requires that both sides are sorted according to the column(s) being joined. It is most efficient if both sides have an index on the column(s); otherwise, the unsorted data will have to be sorted before being used by the Merge Join. However, for large data sets, the time required to process the join can be reduced to a linear pass through both data sets.
The Hash Match is arguably the most powerful of the three Join nodes, being able to complete all Join operations. However, it also requires the most storage space, whether in memory or on disk, because it must keep all of the records from one sub-tree before executing the other sub-tree.
The two pieces of jargon you need to know in order to read the properties of a Hash Match are the build key and the probe key. As you might guess, the build key is used to build the hash table, and the probe key is used to probe (or query) against the hash table. All versions of the Hash Match will report both of these keys in the properties of the Hash Match node.
In-Memory Hash Join
This is the simple version of the Hash Match. In this version, the operator does exactly what you would expect from a basic hash table setup: it reads all of the records from the build sub-tree and builds a hash table using the appropriate columns; then for each record in the probe data set, the key is used to query the hash table and operate based on the results.
The limitation of the In-Memory Hash Join is that hash table must fit into the memory allocated by the server. If the data from the build side exceeds the allocation, then the server spills data to the disk, which will increase the time it takes to execute the probe stage significantly.
Here is an example pseudo-code for the Inner Join using the In-Memory Hash Join.
IEnumerable<DataRow> HashMatch(IEnumerable<DataRow> build, IEnumerable<DataRow> probe)
var hashTable = new HashTable();
foreach (var buildRow in build)
foreach (var probeRow in probe)
yield return BuildCombinedRow(hashTable[probeRow.Key], probeRow);