Tuesday, October 4, 2011

T-SQL Tuesday #023: Flip Side of the JOIN

T-SQL TuesdayUnbelievable… It’s been almost 5 months since I last posted something here. I had a lot going on since that last post: A daughter graduating from college, a son graduating from high school, a few family vacation trips, moving my son into college, and between all of that, I was juggling an overwhelming amount of work from 3 demanding clients (and still am, quite frankly).

But I’m back now, ready to finally re-JOIN the SQL blogging world once again.

And that is very apt, because this post is part of the October T-SQL Tuesday on the subject of JOINs, hosted by Stuart Ainsworth.

So, let’s not waste any time… Let’s plunge in…

We are all aware of the various JOINs that are available to us in the T-SQL syntax: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. But what I wanted to talk about today are two other kinds of JOINs: The (LEFT or RIGHT) Semi JOIN and the (LEFT or RIGHT) Anti Semi JOIN.

These are not available to us directly in the syntactical sense, but they are employed in a Query Plan when you use certain types of queries.

A LEFT Semi JOIN returns rows from the left side that have at least one matching row on the right side. At first glance, this seems very much like a regular INNER JOIN, except for one thing. With a LEFT Semi JOIN, the rows from the left table will be returned at most once. Even if the table on the right side contains hundreds of matches for the row on the left side, only one copy of the left-hand row will be returned.

For example, let’s find Handlebar Products (SubCategory=4) in the Products table that appeared on at least one Order. If we use a regular INNER JOIN…

select p.ProductID
from Production.Product p
join Sales.SalesOrderDetail sod on p.ProductID=sod.ProductID
where ProductSubcategoryID=4
/*
ProductID
---------
808
808
808
. . .
809
809
809
. . .
947
947
947
(1531 row(s) affected)
*/
…We get tons of duplicate Product IDs. The (actual) execution plan for the above query looks like so:



Note that the data flow arrow from the Products table showed that 8 rows were processed, and the total rows that were requested by the Nested Loops operator from the SalesOrderDetail Index Seek for those 8 product rows were 1531.

In order to eliminate all the duplicates, we have to introduce a DISTINCT to the query:

select distinct p.ProductID
from Production.Product p
join Sales.SalesOrderDetail sod on p.ProductID=sod.ProductID
where ProductSubcategoryID=4
/*
ProductID
---------
808
809
810
811
813
946
947
*/
Interestingly enough, if we look at the actual execution plan for that query, you would probably expect to find the same plan as before except with an Aggregate operator at the top of the tree to eliminate the duplicates and shrink the 1531 rows down to 7, but instead, this is what we find:



Note that the Nested Loops operator is a LEFT Semi JOIN. Somehow the optimizer is “smart enough” to realize (because of our INNER JOIN and the DISTINCT and our result set only consisting of columns from one side of the JOIN) that it would be more expensive to do the full INNER JOIN and then eliminate the duplicates, and so it employed a LEFT Semi JOIN. This is much more efficient because the Nested Loops operator only needs to request a single row from the SalesOrderDetail Index Seek operator for each Product processed… If a single row exists in SalesOrderDetails, then it can release the Product row to the Select operator for output. If no row exists, then it tosses out the Product row and moves on. If you hover over the data flow arrow coming out of the Index Seek, you’ll see that only 7 rows were passed along (because one of the 8 Product rows did not have a match).

I don’t know about you, but the hairs on the back of my neck stand up whenever I see a DISTINCT in a query. This same solution could be employed (more clearly in my opinion) via three other types of queries, all of which (by nature) involve a Semi JOIN… an INTERSECT query or an EXISTS query or an IN query:

select ProductID
from Production.Product
where ProductSubcategoryID=4
intersect
select
ProductID
from Sales.SalesOrderDetail

select ProductID
from Production.Product p
where ProductSubcategoryID=4
and exists (select *
from Sales.SalesOrderDetail
where ProductID=p.ProductID)

select ProductID
from Production.Product
where ProductSubcategoryID=4
and ProductID in (select ProductID
from Sales.SalesOrderDetail)
All three of the above queries produce the exact same plan, which is the very efficient Semi JOIN plan that we just examined.

It’s really a matter of style as to which approach that you use. I prefer EXISTS or IN. The INTERSECT operator is kind of cool, but it is very limiting. For example, let’s say we wanted the result set to include the Name of the Product as well. In the EXISTS and IN queries (and for that matter in our original INNER JOIN/DISTINCT query), we simply add the Name column to the SELECT clause and we’re done… And the query plan would remain unchanged except for the fact that an extra column will come from the Product table.

But with the INTERSECT query, we have to introduce the Name column to both sides of the INTERSECT, meaning that we have to add an extra JOIN to get the Name:

select ProductID
,Name
from Production.Product
where ProductSubcategoryID=4
intersect
select
sod.ProductID
,p.Name
from Sales.SalesOrderDetail sod
join Production.Product p on sod.ProductID=p.ProductID
/*
ProductID Name
--------- ----------------------
808 LL Mountain Handlebars
809 ML Mountain Handlebars
810 HL Mountain Handlebars
811 LL Road Handlebars
813 HL Road Handlebars
946 LL Touring Handlebars
947 HL Touring Handlebars
*/
And the execution plan now involves a lot more work:



We could also try to re-work the INTERSECT query using a CTE or a derived table to just get the ProductID’s and then JOIN the result to the Products table to get the Name column like so…

with ProductsInOrders as
(
select ProductID
from Production.Product
where ProductSubcategoryID=4
intersect
select ProductID
from Sales.SalesOrderDetail
)
select pio.ProductID
,p.Name
from ProductsInOrders pio
join Production.Product p on pio.ProductID=p.ProductID
/*
ProductID Name
--------- ----------------------
808 LL Mountain Handlebars
809 ML Mountain Handlebars
810 HL Mountain Handlebars
811 LL Road Handlebars
813 HL Road Handlebars
946 LL Touring Handlebars
947 HL Touring Handlebars
*/
…But we still can’t get around the fact that we have to access the Products table multiple times. In fact, the execution plan for the above query is quite amusing when you look at it:



For each Product row acquired in the Clustered Index Scan, it does a SEEK into the same Clustered Index to get the Name! What a waste of resources.

Now on to Anti Semi JOINs…

A LEFT Anti Semi JOIN returns rows from the left side that have no matching rows on the right side… It’s the exact opposite of the Semi JOIN. As you can probably guess, this kind of JOIN is employed when you execute a query using EXCEPT or NOT EXISTS or NOT IN:

select ProductID
from Production.Product
where ProductSubcategoryID=4
except
select
ProductID
from Sales.SalesOrderDetail
/*
ProductID
---------
812
*/

select ProductID
from Production.Product p
where ProductSubcategoryID=4
and not exists (select *
from Sales.SalesOrderDetail
where ProductID=p.ProductID)
/*
ProductID
---------
812
*/

select ProductID
from Production.Product
where ProductSubcategoryID=4
and ProductID not in (select ProductID
from Sales.SalesOrderDetail)
/*
ProductID
---------
812
*/
All three of the above queries produce the exact same execution plan using a LEFT Anti Semi JOIN:



So, for each of the 8 Product rows, the Nested Loops operator requests a row from the SalesOrderDetail table. If one exists, then it tosses the Product row aside and moves on. If one does not exist, then it releases the Product row up to the Select operator for output.

The EXCEPT operator has the same limitations as was described for the INTERSECT operator and therefore is not as useful as the NOT EXISTS or NOT IN types of queries.

One important note about NOT IN. It is only equivalent to the NOT EXISTS query if the column being checked is non-nullable. If the ProductID in Sales.SalesOrderDetail allowed NULLs, then the NOT IN query plan would look like this:



There’s a lot of logic employed in the plan to handle the fact that there may be NULLs in SalesOrderDetail. We can return back to our more simplified query, however, by adding a WHERE IS NOT NULL predicate to our IN subquery:

select ProductID
from Production.Product
where ProductSubcategoryID=4
and ProductID not in (select ProductID
from Sales.SalesOrderDetail
where ProductID is not null)
So if you prefer the NOT IN style over the NOT EXISTS style of querying, it’s a good idea to get in the habit of including a WHERE IS NOT NULL predicate to the subquery.

By the way, many people in the past have tried to emulate the Anti Semi JOIN behavior by doing a LEFT JOIN and adding a WHERE IS NULL to the query to only find rows that have no match on the right side, like so:

select p.ProductID
from Production.Product p
left join Sales.SalesOrderDetail sod on p.ProductID=sod.ProductID
where ProductSubcategoryID=4
and sod.ProductID is null
But this actually produces a plan that LEFT JOINs everything and then employs a Filter to only allow the IS NULL non-matches, creating a lot of unnecessary work and a much more inefficient query than the true Anti Semi JOIN:



I’ve seen people in the past saying that the LEFT JOIN/IS NULL approach is faster than the NOT EXISTS approach, but frankly, I can’t see it. If anyone has an example to offer, I’d certainly like to take a look.

6 comments:

  1. I love that we both touched on the OUTER + IS NULL inefficiency. :)
    http://sqlblog.com/blogs/rob_farley/archive/2011/10/04/joins-without-join.aspx

    ReplyDelete
  2. I can confirm that the left join + where is null trick is slower. I do not see a reason why this cannot be rewritten into a left anti semi join.

    ReplyDelete
  3. Great post, thanks. However, you write "One important note about NOT IN. It is only equivalent to the NOT EXISTS query if the column being checked is non-nullable." Would it be better to say "the column contains no NULL markers"? Nullable columns that have yet to contain any NULL markers would be OK? Yes, risky, but ...

    ReplyDelete
  4. Hi Anonymous...

    No, I believe the original statement I wrote is correct. The query plan that I illustrated in the blog was the result of me just changing the ProductID column to a NULLable column. I didn't actually INSERT any new rows into the table, so every single ProductID was filled in with something.

    --Brad

    ReplyDelete