Friday, April 29, 2011

The Index Tuning Detective

A Shocking Revelation!Part of tuning queries is being a detective in figuring out what indexes need to be created… and figuring out what indexes may need to be tweaked a bit.

So to all you Nancy Drews and Hardy Boys out there: Get your magnifying glass and let’s unravel some mysteries!

Our first adventure will be The Case Of The Missing Indexes.

This has been talked about before in other blogs, because SQL Server already provides some tools to find missing indexes, but hopefully I’ll go just a little bit farther with the concept.

But it’s the second adventure, The Case Of The Key Lookup Killer, that I’m looking forward to sharing with you.

It may help you uncover some shocking revelations about your queries.

Anyway, please read on…



The Case of the Missing Indexes

Consider the following query in AdventureWorks:

select SalesOrderID,OrderDate 
from Sales.SalesOrderHeader
where PurchaseOrderNumber is not null
When we look at the Estimated Execution Plan for this query in SQL2008, we get a helpful hint about a missing index. It says that we can cut down the cost of the query by 94.2762% if we were to add the index that it suggests. And we can even right-click on the plan and choose Missing Index Details and it will provide a code window with the code necessary to create that index:

SQL2008 Showing Missing Index

/*
Missing Index Details from SQLQuery6.sql-BRADPC\SQL08.AdventureWorks (BRADPC\Brad (53))
The Query Processor estimates that implementing the following index
could improve the query cost by 94.2762%.
*/

/*
USE [AdventureWorks]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [Sales].[SalesOrderHeader] ([PurchaseOrderNumber])
INCLUDE ([SalesOrderID],[OrderDate])
GO
*/
Of course, the index is not a perfect suggestion because it suggests we INCLUDE the SalesOrderID column, which is ridiculous because SalesOrderID is the Clustered Index Key, so it would be part of the index automatically anyway.

But never mind that… This is still pretty cool stuff.

And it’s not limited to SQL2008 either. Yes, SQL2008 will provide the helpful hint when you look at the Estimated Plan, but SQL2005 still has the same information behind the scenes. If you look at the Execution Plan XML (by right-clicking on the Plan and choosing Show Execution Plan XML), you will find the Missing Index information buried in there… usually towards the top, but not always… there can potentially be many Missing Index sections in the XML if the plan is for a multiple-statement batch (particularly a batch with IF conditions).

/*
<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML ... xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
...
<MissingIndexes>
<MissingIndexGroup Impact="94.2762">
<MissingIndex Database="[AdventureWorks]" Schema="[Sales]" Table="[SalesOrderHeader]">
<ColumnGroup Usage="INEQUALITY">
<Column Name="[PurchaseOrderNumber]" ColumnId="9" />
</ColumnGroup>
<ColumnGroup Usage="INCLUDE">
<Column Name="[SalesOrderID]" ColumnId="1" />
<Column Name="[OrderDate]" ColumnId="3" />
</ColumnGroup>
</MissingIndex>
</MissingIndexGroup>
</MissingIndexes>
...
</ShowPlanXML>
*/
This is great if you happen to be looking at a specific plan, but what if you want to know about missing indexes across many plans?

There is a collection of DMV’s that report missing indexes for queries that have been executed on the server. The following is a query used by many DBA’s in order to find indexes that they can potentially add in order to speed up their queries:

select index_advantage=user_seeks*avg_total_user_cost*(avg_user_impact*0.01)
,migs.last_user_seek
,TableName=mid.statement
,mid.equality_columns
,mid.inequality_columns
,mid.included_columns
,migs.unique_compiles
,migs.user_seeks
,migs.avg_total_user_cost
,migs.avg_user_impact
from sys.dm_db_missing_index_group_stats migs with (nolock)
join
sys.dm_db_missing_index_groups mig with (nolock)
on migs.group_handle=mig.index_group_handle
join sys.dm_db_missing_index_details mid with (nolock)
on mig.index_handle=mid.index_handle
order by index_advantage desc
/*
index_advantage last_user_seek TableName
---------------- ----------------------- -------------------------------------------
3.23713751033778 2011-04-29 08:58:11.960 [AdventureWorks].[Sales].[SalesOrderHeader]

equality_columns inequality_columns included_columns
---------------- -------------------- ---------------------------
NULL [PurchaseOrderNumber [SalesOrderID], [OrderDate]

unique_compiles user_seeks avg_total_user_cost avg_user_impact
--------------- ---------- ------------------- ---------------
1 6 0.572255959259259 94.28
*/
You can see our familiar Impact figure of 94.28% in the last column, but since the DMV’s record the number of seeks that could have potentially been performed (based on the number of times the query was executed on the server), this query calculates a theoretical “advantage” figure using that information, and orders the data in descending order of that “advantage”.

Again, this is really cool that this information is recorded and we can effortlessly find out suggestions for indexes in our databases.

But these indexes shouldn’t be created blindly… they should be created after some consideration of how they will impact the system. If you know your database backwards and forwards, you probably have a good idea of what kind of queries would benefit from the creation of these indexes. But if you’re a consultant coming in cold, you don’t really know much about the queries in the system.

However, all is not lost!

We can look directly inside the query cache (via the sys.dm_exec_cached_plans DMV), hunting for queries that are running right now (or have run recently), and pick out those that have missing index suggestions in their Execution XML data. This can be used in concert with the Missing Index DMV query above to gain more knowledge about the queries that will benefit.

I put together the following query to do just that. The comments within the code should help you to figure out what is going on. Use the XML snippet shown earlier to follow how the code hunts for what it wants.

Note that I only wanted to pay attention to cached queries that had been used at least 5 times. I also wanted to limit the number of columns making up the key to 5 and the number of INCLUDEd columns to 5. I also only wanted to pay attention to Missing Indexes that have an Impact of at least 50%. You can fiddle with these quantities in the WHERE clause to match your needs.

The query provides the Key Column list and the INCLUDE Column List and the command you can use to actually create the index if you wish. It also provides a hyperlink to the code of the batch or procedure that would benefit from the index so you can see the query involved. There is also a hyperlink to the Execution Plan so you can investigate that as well.

The query is ORDERed BY the Impact figure in descending order, so the ones with the biggest benefit will be at the top.

with xmlnamespaces 
(
default 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'
)
select Impact
,TableName=IxDB+'.'+IxSchema+'.'+IxTable
,KeyCols
,IncludeCols
,IndexCommand
,usecounts
,size_in_bytes
,objtype
,BatchCode
,QueryPlan=qp.query_plan
from sys.dm_exec_cached_plans qs
cross apply
--Get the Query Text
sys.dm_exec_sql_text(qs.plan_handle) qt
cross apply
--Get the Query Plan
sys.dm_exec_query_plan(qs.plan_handle) qp
cross apply
--Get the Code for the Batch in Hyperlink Form
(select BatchCode
=(select [processing-instruction(q)]=':'+nchar(13)+qt.text+nchar(13)
for xml path(''),type)
) F_Code
cross apply
--Find the Missing Indexes Group Nodes in the Plan
qp.query_plan.nodes('//MissingIndexes/MissingIndexGroup') F_GrpNodes(GrpNode)
cross
apply
--Pull out the Impact Figure
(select Impact=GrpNode.value('(./@Impact)','float')) F_Impact
cross apply
--Get the Missing Index Nodes from the Group
GrpNode.nodes('(./MissingIndex)') F_IxNodes(IxNode)
cross
apply
--Pull out the Database,Schema,Table of the Missing Index
(select IxDB=IxNode.value('(./@Database)','sysname')
,IxSchema=IxNode.value('(./@Schema)','sysname')
,IxTable=IxNode.value('(./@Table)','sysname')
) F_IxInfo
cross apply
--How many INCLUDE columns are there;
--And how many EQUALITY/INEQUALITY columns are there?
(select NumIncludes
=IxNode.value('count(./ColumnGroup[@Usage="INCLUDE"]/Column)','int')
,NumKeys
=IxNode.value('count(./ColumnGroup[@Usage!="INCLUDE"]/Column)','int')
) F_NumIncl
cross apply
--Pull out the Key Columns and the Include Columns from the various Column Groups
(select EqCols=max(case when Usage='EQUALITY' then ColList end)
,InEqCols=max(case when Usage='INEQUALITY' then ColList end)
,IncludeCols=max(case when Usage='INCLUDE' then ColList end)
from IxNode.nodes('(./ColumnGroup)') F_ColGrp(ColGrpNode)
cross apply
--Pull out the Usage of the Group? (EQUALITY of INEQUALITY or INCLUDE)
(select Usage=ColGrpNode.value('(./@Usage)','varchar(20)')) F_Usage
cross apply
--Get a comma-delimited list of the Column Names in the Group
(select ColList=stuff((select ','+ColNode.value('(./@Name)','sysname')
from ColGrpNode.nodes('(./Column)') F_ColNodes(ColNode)
for xml path(''))
,1,1,'')
) F_ColList
) F_ColGrps
cross apply
--Put together the Equality and InEquality Columns
(select KeyCols=isnull(EqCols,'')
+case
when EqCols is not null and InEqCols is not null
then ','
else ''
end
+isnull(InEqCols,'')
) F_KeyCols
cross apply
--Construct a CREATE INDEX command
(select IndexCommand='create index <InsertNameHere> on '
+IxDB+'.'+IxSchema+'.'+IxTable+' ('
+KeyCols+')'
+isnull(' include ('+IncludeCols+')','')) F_Cmd
where qs.cacheobjtype='Compiled Plan'
and usecounts>=5 --Only interested in those plans used at least 5 times
and NumKeys<=5 --Limit to the #columns we're willing to have in the index
and NumIncludes<=5 --Limit to the #columns we're willing to have in the INCLUDE list
and Impact>=50 --Only indexes that will have a 50% impact
order by Impact desc
Here is the output (without the hyperlinks) for the AdventureWorks query we were discussing:

/*
Impact TableName
------- -------------------------------------------
94.2762 [AdventureWorks].[Sales].[SalesOrderHeader]

KeyCols IncludeCols
--------------------- --------------------------
[PurchaseOrderNumber] [SalesOrderID],[OrderDate]

IndexCommand
----------------------------------------------------------------------------
create index <InsertNameHere> on [AdventureWorks].[Sales].[SalesOrderHeader]
([PurchaseOrderNumber]) include ([SalesOrderID],[OrderDate])

usecounts size_in_bytes objtype
--------- ------------- -------
9 40960 Adhoc
*/



The Case of the Key Lookup Killer

Now on to the part of this article I’m really excited about, because it will help you possibly tweak existing indexes to help improve your queries. This is something that you cannot get from any DMV’s.

Consider the following query:

select h.SalesOrderID
,h.CustomerID
,h.OrderDate
,d.LineTotal
from Sales.SalesOrderHeader h
join Sales.SalesOrderDetail d on h.SalesOrderID=d.SalesOrderID
where h.CustomerID in (117,119,126,196,236,435)
and d.ProductID=942
This is what its query plan looks like (click on the image to see a larger view):

Query with Key Lookups

Note that it is able to use the SalesOrderDetail’s index on ProductID to easily find the rows for ProductID 942. However, for each of those rows, it has to do a Key Lookup into the Clustered Index in order to get the columns for the LineTotal, which is a computed column based on the columns OrderQty, UnitPrice, and UnitPriceDiscount. The LineTotal is calculated by the Compute Scalar operator.

Similarly, the query makes use of SalesOrderHeader’s index on CustomerID to find the rows for the desired CustomerID’s. But again, it has to do Key Lookups to get the OrderDate.

What if we could eliminate those Key Lookups? If we were to INCLUDE the OrderQty and UnitPrice and UnitPriceDiscount columns in the ProductID index and INCLUDE the OrderDate column in the CustomerID index, then the query would be covered completely by those indexes. The query plan would then look like this:

Query Covered by Indexes

The cost of this query is 0.0128915, which is a 92% improvement over the Key Lookup query, which had a cost of 0.163265.

I put together a query to look in the cache for queries containing Key Lookups and then pull out the columns that we could potentially INCLUDE in an index in order to improve performance.

Here’s the section of the Execution Plan XML that involves the Sales.SalesOrderDetail Index Seek and its Key Lookup… I abbreviated it so that only the relevant portions are displayed:

/*
<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML ... xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
...
<RelOp ... PhysicalOp="Nested Loops" ...>
...
<NestedLoops ...>
...
<RelOp ... LogicalOp="Index Seek" ...>
...
<IndexScan ...>
...
<Object Database="[AdventureWorks]"
Schema="[Sales]"
Table="[SalesOrderDetail]"
Index="[IX_SalesOrderDetail_ProductID]"
Alias="[d]" ... />
</IndexScan>
</RelOp>
<RelOp ... LogicalOp="Compute Scalar" ...>
<ComputeScalar>
...
<RelOp ... LogicalOp="Clustered Index Seek" ...>
<OutputList>
<ColumnReference ... Column="OrderQty" />
<ColumnReference ... Column="UnitPrice" />
<ColumnReference ... Column="UnitPriceDiscount" />
</OutputList>
<IndexScan Lookup="true" ...>
...
<Object Database="[AdventureWorks]"
Schema="[Sales]"
Table="[SalesOrderDetail]"
Index="[PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID]"
Alias="[d]" ... />
</IndexScan>
</RelOp>
</ComputeScalar>
</RelOp>
</NestedLoops>
</RelOp>
...
</ShowPlanXML>
*/
You can see the Nested Loops RelOp Node has a node down its hierarchy for the Index Seek into the IX_SalesOrderDetail_ProductID index, and it also has a node down its hierarchy for the Clustered Index Seek.

Here’s what my query does, step by step…

It finds Clustered Index Seek RelOp Nodes with an Index Scan Node with the Lookup attribute equal to True. Those are the Key Lookups. For each one found, it saves the Database, Schema, Table, and (possible) Alias Name. It also gets the list of columns in the OutputList node and also counts how many columns there are.

Then it goes up the hierarchy, looking for a Nested Loops RelOp Node. It may be the immediate RelOp Node, or it could be two RelOp Nodes up the hierarchy. In this case, because of the Compute Scalar, we have to go up two RelOp Nodes.

Once it finds the appropriate Nested Loops RelOp Node, it goes down its hierarchy looking for any Index Seek or Index Scan RelOp Nodes.

For each of those found, it saves the Database, Schema, Table, and (possible) Alias Name. And it gets the name of the Index that was involved in the Seek or Scan.

So now we have Lookup Data, and we have to see if any of the Index Data we found (there could be more than one) match in terms of Database and Schema and Table and (possible) Alias. Once the match is found, we can output it.

It’s a bit complicated, but with a little study in reading the above steps and in reading the comments in the code, you can hopefully figure out what’s going on.

Like the Missing Index query, I only look for cached plans that have been used at least 5 times, and I’m only interested in an INCLUDE list of no more than 5 columns. The output includes the usual hyperlinks to the code and to the plan.

I order the query’s output by TableName and IndexName so you can see similar suggestions clustered together. Note also that the columns in the suggested INCLUDE list are in alphabetical order.

Here is what my query suggested for our AdventureWorks Key Lookup query:

/*
TableName IndexName
------------------------------------------- --------------------------------
[AdventureWorks].[Sales].[SalesOrderDetail] [IX_SalesOrderDetail_ProductID]
[AdventureWorks].[Sales].[SalesOrderHeader] [IX_SalesOrderHeader_CustomerID]

TableAliasInQuery ColumnsToInclude usecounts size_in_bytes objtype
----------------- ------------------------------------ --------- ------------- -------
[d] OrderQty,UnitPrice,UnitPriceDiscount 6 196608 Adhoc
[h] OrderDate 6 196608 Adhoc
*/
So it is suggesting that the SalesOrderDetail index called IX_SalesOrderDetail_ProductID should INCLUDE the columns OrderQty, UnitPrice, and UnitPriceDiscount. And it is making a similar suggestion to INCLUDE the OrderDate column in the IX_SalesOrderHeader_CustomerID index.

Again, as with any kind of tuning exercise, you should not make these changes blindly, but consider the possible implications before going forward.

The code for this query for INCLUDE Column Suggestions is below. I hope you found it as useful as I have. It actually helped me to find an error in a client’s stored procedure… In suggesting a column to INCLUDE, I could see by looking at the query that it was actually using an incorrect WHERE predicate, and so we were able to fix it before it caused a bug. I’ve also used its suggestions to improve some queries considerably.

So here’s the query… Please let me know if it’s been useful to you… Enjoy!

with xmlnamespaces 
(
default 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'
)
select TableName=IxDB+'.'+IxSchema+'.'+IxTable
,IndexName=IxIndex
,TableAliasInQuery=isnull(IxAlias,IxTable)
,ColumnsToInclude=ColList
,usecounts
,size_in_bytes
,objtype
,BatchCode
,QueryPlan=qp.query_plan
from sys.dm_exec_cached_plans qs
cross apply
--Get the Query Text
sys.dm_exec_sql_text(qs.plan_handle) qt
cross apply
--Get the Query Plan
sys.dm_exec_query_plan(qs.plan_handle) qp
cross apply
--Get the Code for the Batch in Hyperlink Form
(select BatchCode
=(select [processing-instruction(q)]=':'+nchar(13)+qt.text+nchar(13)
for xml path(''),type)
) F_Code
cross apply
--Find the Key Lookups in the Plan
qp.query_plan.nodes
(
'//RelOp[@LogicalOp="Clustered Index Seek"]/IndexScan[@Lookup=1]'
) F_Lookup(LookupNode)
cross
apply
--Get the Database,Schema,Table of the Lookup
--Also get the Alias (if it exists) in case the table
-- is used more than once in the query
(select LookupDB=LookupNode.value('(./Object[1]/@Database)','sysname')
,LookupSchema=LookupNode.value('(./Object[1]/@Schema)','sysname')
,LookupTable=LookupNode.value('(./Object[1]/@Table)','sysname')
,LookupAlias=isnull(LookupNode.value('(./Object[1]/@Alias)','sysname'),'')
,ColumnCount=LookupNode.value('count(../OutputList[1]/ColumnReference)','int')
) F_LookupInfo
cross apply
--Get the Output Columns
(select stuff(
(select ','+ColName
from LookupNode.nodes('(../OutputList[1]/ColumnReference)') F_Col(ColNode)
cross apply
(select ColName=ColNode.value('(./@Column)','sysname')) F_ColInfo
order by ColName
for xml path(''),type).value('(./text())[1]','varchar(max)')
,1,1,'')
)
F_ColList(ColList)
outer
apply
--Get the Parent RelOp Node, hoping that it is a Nested Loops operator.
--Use OUTER APPLY because we may not find it
LookupNode.nodes
(
'(./../../..[@PhysicalOp="Nested Loops"])'
) F_ParentLoop(ParentLoopNode)
outer
apply
--Get the GrandParent RelOp Node, hoping that it is a Nested Loops operator.
--Use OUTER APPLY because we may not find it
LookupNode.nodes
(
'(./../../../../..[@PhysicalOp="Nested Loops"])'
) F_GrandParentLoop(GrandParentLoopNode)
cross
apply
--Get the Nested Loop Node... Could be the Parent or the GrandParent
(select LoopNode=isnull(ParentLoopNode.query('.')
,GrandParentLoopNode.query('.'))
) F_LoopNode
cross apply
--Now that we (hopefully) have a Nested Loops Node, let's find a descendant
--of that node that is an Index Seek or Index Scan and acquire its Object Information
LoopNode.nodes
(
'//RelOp[@LogicalOp="Index Scan" or @LogicalOp="Index Seek"]
/IndexScan[1]/Object[1]'
) F_SeekNode(SeekObjNode)
cross
apply
--Get the Database,Schema,Table and Index of the Index Seek/Scan
--Also get the Alias (if it exists) so we can match it up with
-- the Lookup Table
(select IxDB=SeekObjNode.value('(./@Database)','sysname')
,IxSchema=SeekObjNode.value('(./@Schema)','sysname')
,IxTable=SeekObjNode.value('(./@Table)','sysname')
,IxAlias=isnull(SeekObjNode.value('(./@Alias)','sysname'),'')
,IxIndex=SeekObjNode.value('(./@Index)','sysname')
) F_SeekInfo
where qs.cacheobjtype='Compiled Plan'
and usecounts>=5 --Only interested in those plans used at least 5 times
and LookupDB=IxDB --( Lookup and IndexSeek/Scan )
and LookupSchema=IxSchema --( Database,Schema,Table, )
and LookupTable=IxTable --( and [possible] Alias )
and LookupAlias=IxAlias --( must match )
and ColumnCount<=5 --Limit to the #columns we're willing to INCLUDE
order by TableName
,IndexName
,ColumnsToInclude

Tuesday, April 12, 2011

T-SQL Tuesday #017: APPLY: It Slices! It Dices! It Does It All!

T-SQL TuesdayThis blog entry is participating in T-SQL Tuesday #017, hosted this month by Matt Velic.

You are invited to visit his blog and join the party and read more blogs participating in this month’s theme: APPLY Knowledge.

I’ve been a fan of the APPLY operator since the beginning. That’s why it’s incorporated into the name of my blog. And I've blogged about its power about a dozen times.

I don’t know how anyone lived without it before SQL2005.

Practically everyone knows that you can use it to invoke table-valued functions. That is its most obvious usage and it’s about the only way you’ll see APPLY demonstrated in 97% of the books on SQL Server (if they even mention it at all).

APPLY is a cool cat, baby!But, as you can see by the twinkle in APPLY’s eyes at the left (even behind the cool shades), he’s got other things up his sleeve.

(Okay, the goofy pointed hat and the goatee are a little much, but hey, if you want great artwork, you’re in the wrong place… Michael J. Swart or Kendra Little are the masters of illustration).

APPLY is capable of soooooo much more than just invoking TVF’s. It is incredibly versatile. It seems like it can do anything!

Let’s take a look at what you can do!



Call Table-Valued Functions!

Yawn. Okay, so this is the most common way people use APPLY. Here’s a quick demo… For each Vista credit card expiring in Jun2008, let’s get the Contact information for that card using a built-in function in AdventureWorks.

select f.FirstName
,f.LastName
,f.JobTitle
,f.ContactType
,cc.CardNumber
from Sales.CreditCard cc
join Sales.ContactCreditCard ccc on cc.CreditCardID=ccc.CreditCardID
cross apply dbo.ufnGetContactInformation(ccc.ContactID) f
where cc.ExpYear=2008
and cc.ExpMonth=6
and cc.CardType='Vista'
/*
FirstName LastName JobTitle ContactType CardNumber
--------- -------- ---------------- ------------- --------------
Peggy Justice Owner Store Contact 11119759315644
John McClane Purchasing Agent Store Contact 11119490672347
Laura Cai NULL Consumer 11112813884091
Natalie Gonzales NULL Consumer 11114369564985
Jarrod Sara NULL Consumer 11116045498593
Katherine Bailey NULL Consumer 11119100149656
Stephanie Gray NULL Consumer 11112324154556
Shawna Sharma NULL Consumer 11116413893110
Mindy Rai NULL Consumer 11115163407997
Jackson Jai NULL Consumer 11112011871602
And so on... (74 rows total)
*/
That’s very convenient, but kind of boring in the grand scheme of things. So let’s move on.

Execute SubQueries!

Why even bother to create a function when you can just create a table on the fly via a correlated subquery with APPLY?

You are only limited by your imagination.

Here’s an example…

For each store with a main office in Wisconsin, let’s look at the top 3 products (and their dollar amounts) that they bought in terms of dollars.

select c.CustomerID
,s.Name
,f.ProductID
,ProductName=p.Name
,f.PurchaseAmt
from Sales.Customer c
join Sales.Store s on c.CustomerID=s.CustomerID
join Sales.CustomerAddress ca on c.CustomerID=ca.CustomerID
join Person.Address a on ca.AddressID=a.AddressID
join Person.StateProvince sp on a.StateProvinceID=sp.StateProvinceID
cross apply (select top 3 ProductID
,PurchaseAmt=sum(LineTotal)
from Sales.SalesOrderHeader soh
join Sales.SalesOrderDetail sod on soh.SalesOrderID=sod.SalesOrderID
where CustomerID=c.CustomerID
group by ProductID
order by sum(LineTotal) desc) f
join Production.Product p on f.ProductID=p.ProductID
where ca.AddressTypeID=3 --MainOffice
and sp.StateProvinceCode='WI'
/*
CustomerID Name ProductID ProductName PurchaseAmt
---------- --------------------------- --------- -------------------------- ------------
418 Good Bike Shop 795 Road-250 Black, 52 30367.350000
418 Good Bike Shop 794 Road-250 Black, 48 24136.807500
418 Good Bike Shop 792 Road-250 Red, 58 23508.517500
453 Unique Bikes 773 Mountain-100 Silver, 44 16319.952000
453 Unique Bikes 771 Mountain-100 Silver, 38 12239.964000
453 Unique Bikes 772 Mountain-100 Silver, 42 12239.964000
543 Friendly Neighborhood Bikes 782 Mountain-200 Black, 38 9638.958000
543 Friendly Neighborhood Bikes 868 Women's Mountain Shorts, M 671.904000
543 Friendly Neighborhood Bikes 869 Women's Mountain Shorts, L 335.952000
606 Little Bicycle Supply Shop 717 HL Road Frame - Red, 62 1717.800000
606 Little Bicycle Supply Shop 838 HL Road Frame - Black, 44 858.900000
606 Little Bicycle Supply Shop 738 LL Road Frame - Black, 52 809.328000
*/
That’s pretty slick, huh?

Shred XML!

Using the .nodes() function, coupled with the .value() and .query() functions, we can use APPLY to do some cool tricks with XML.

For the first 10 JobCandidates, let’s pull information out of the Resume column, which is of type XML. We’ll get their Name and the schools (there might be more than one) that they attended, listing them in order of their graduation date.:

with xmlnamespaces 
(
'http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/Resume' as ns
)
select JobCandidateID
,Name
,Education=stuff(EduList,1,2,'')
from HumanResources.JobCandidate
cross apply
Resume.nodes('/ns:Resume') F_ResumeNode(ResumeNode)
cross apply

ResumeNode
.nodes('(./ns:Name)') F_NameNode(NameNode)
cross apply

(select Name=NameNode.value('(./ns:Name.First[1])','nvarchar(50)')
+' '
+NameNode.value('(./ns:Name.Last[1])','nvarchar(50)')
) F_Name
cross apply
(select EduList=ResumeNode.query('for $p in (./ns:Education)
order by $p/ns:Edu.EndDate
return concat("; ",string($p/ns:Edu.School))'
).value('.','nvarchar(200)')
) F_Edu
where JobCandidateID<=10
/*
JobCandidateID Name Education
-------------- -------------------- ----------------------------------------------------
1 Shai Bassli Midwest State University
2 Max Benson Evergreen High School ; Everglades State College
3 Krishna Sunkammurali Western University
4 Stephen Jiang Louisiana Business College of New Orleans
5 Thierry D'Hers Université d'Aix-Marseille
6 Christian Kleinerman Lycée technique Émile Zola ; Université de Perpignan
7 Lionel Penuchot Université de Lyon
8 Peng Wu Western University
9 Shengda Yang Evergreen High School ; Soutern State College
10 Tai Yee Midwest State University
*/
As Miley Cyrus would say: That’s really cool.

Introduce New Columns!

This is probably the best use of APPLY because it makes code so much more clear.

Consider the following query, which groups the 2002 Sales by Month. That’s done by the DATEADD/DATEDIFF logic, but it has to be repeated in the GROUP BY and the SELECT and the ORDER BY:

select Mth=datename(month
,dateadd(month
,datediff(month,'19000101',OrderDate)
,'19000101'))
,Total=sum(TotalDue)
from Sales.SalesOrderHeader
where OrderDate>='20020101'
and OrderDate<'20030101'
group by dateadd(month
,datediff(month,'19000101',OrderDate)
,'19000101')
order by dateadd(month
,datediff(month,'19000101',OrderDate)
,'19000101')
/*
Mth Total
--------- ------------
January 1605782.1915
February 3130823.0378
March 2643081.0798
April 1905833.9088
May 3758329.2949
June 2546121.9618
July 3781879.0708
August 5433609.3426
September 4242717.7166
October 2854206.7518
November 4427598.0006
December 3545522.738
*/
I don’t know about you, but I hate all that repetition, and it looks a little busy. So APPLY to the rescue:

select Mth=datename(month,FirstDayOfMth)
,Total=sum(TotalDue)
from Sales.SalesOrderHeader
cross apply
(
select FirstDayOfMth=dateadd(month
,datediff(month,'19000101',OrderDate)
,'19000101')
) F_Mth
where OrderDate>='20020101'
and OrderDate<'20030101'
group by FirstDayOfMth
order by FirstDayOfMth
/*
Mth Total
--------- ------------
January 1605782.1915
February 3130823.0378
March 2643081.0798
April 1905833.9088
May 3758329.2949
June 2546121.9618
July 3781879.0708
August 5433609.3426
September 4242717.7166
October 2854206.7518
November 4427598.0006
December 3545522.738
*/
Now isn’t that much clearer as to what’s going on? And it costs nothing at all! The query plans of both of the queries above are exactly the same!

Perform Complicated Calculations!

This is the part of APPLY that I really love. Let’s look at an example.

Let’s say that you have a table of comma-delimited lists of one or more integers:

create table #t
(
ID int identity(1,1)
,ListOfNums varchar(50)
)
insert #t
values ('279,37,972,15,175')
,('17,72')
,('672,52,19,23')
,('153,798,266,52,29')
,('77,349,14')
select * from #t
/*
ID ListOfNums
-- -----------------
1 279,37,972,15,175
2 17,72
3 672,52,19,23
4 153,798,266,52,29
5 77,349,14
*/
Your job: Pull out only the rows that have the 4th number in the list less than 50 and sort the output by the 3rd number in the list.

Easy, right? Ha ha ha ha ha ha ha hee hee hee hee hee hee ho ho ho ho haw haw giggle chuckle guffaw!

Before the APPLY operator, SQL2000 folks would have to resort to something ludicrous like this in order to accomplish this task:

select ID
,ListOfNums
from #t
where substring(ListOfNums+',,,,',charindex(',',ListOfNums+',,,,',
charindex(',',ListOfNums+',,,,',charindex(',',ListOfNums+',,,,')+1)+1)+1,
(charindex(',',ListOfNums+',,,,',charindex(',',ListOfNums+',,,,',
charindex(',',ListOfNums+',,,,',charindex(',',ListOfNums+',,,,')+1)+1)+1)-
charindex(',',ListOfNums+',,,,',charindex(',',ListOfNums+',,,,',
charindex(',',ListOfNums+',,,,')+1)+1))-1)
< 50
order by substring(ListOfNums+',,,,',charindex(',',ListOfNums+',,,,',
charindex(',',ListOfNums+',,,,')+1)+1,(charindex(',',ListOfNums+',,,,',
charindex(',',ListOfNums+',,,,',charindex(',',ListOfNums+',,,,')+1)+1)-
charindex(',',ListOfNums+',,,,',charindex(',',ListOfNums+',,,,')+1))-1)
/*
ID ListOfNums
-- -------------
2 17,72
5 77,349,14
3 672,52,19,23
1 279,37,972,15
*/
But now, through the magic of APPLY, you can have a much clearer query:

select ID
,ListOfNums
from #t
cross apply (select WorkString=ListOfNums+',,,,') F_Str
cross apply (select p1=charindex(',',WorkString)) F_P1
cross apply (select p2=charindex(',',WorkString,p1+1)) F_P2
cross apply (select p3=charindex(',',WorkString,p2+1)) F_P3
cross apply (select p4=charindex(',',WorkString,p3+1)) F_P4
cross apply (select Num3=convert(int,substring(WorkString,p2+1,p3-p2-1))
,Num4=convert(int,substring(WorkString,p3+1,p4-p3-1))) F_Nums
where Num4<50
order by Num3
/*
ID ListOfNums
-- -------------
2 17,72
5 77,349,14
3 672,52,19,23
1 279,37,972,15
*/
See how I used APPLY to write a little program of a sort? First, I added commas to the end of the column to account for possibly missing numbers in the list. Then I calculated the position of the first comma in that string (p1). Then I calculated the position of the second comma (p2), and that can only be done by using the p1 position I calculated in the previous step. I continue on getting the position of the third and fourth comma. And now that I have those, I can pull out Num3 (from between the second and third comma) and Num4 (from between the third and fourth comma). And I can now use those values in my WHERE and ORDER BY clause.

And the best part? NO COST! The above two queries are exactly the same as far as the optimizer is concerned. All those CROSS APPLYs are glommed together into a Compute Scalar operator, essentially coming up with really complicated expressions like you see in the first query. Take a look at the query plan yourself and you’ll see.

Replace the UNPIVOT operator!

Throw the UNPIVOT operator out the window… The optimizer really translates it into an APPLY operator under the hood anyway… and you can have control over NULLs and differing datatypes.

Look at the following example, which is a function that accepts a CustomerID and spits out information on the customer in a vertical fashion (note that there can be multiple contacts for a customer… this just spits out the first one… thus the ROW_NUMBER() logic):

create function VerticalMainOfficeData
(
@CustomerID int
)
returns table
as
return
with
BaseData as
(
select SeqNo=row_number() over (order by ContactType)
,Name,AddressLine1,AddressLine2,City,StateProvinceName
,PostalCode,CountryRegionName
,ContactType,ContactName,Phone,EmailAddress
,YearOpened,NumberEmployees,Specialty
,SquareFeet,Brands,AnnualSales
from AdventureWorks.Sales.vStoreWithDemographics
cross apply (select ContactName=isnull(Title+' ','')
+FirstName+' '
+isnull(MiddleName+' ','')
+LastName
+isnull(' '+Suffix,'')) F_Name
where CustomerID=@CustomerID
and AddressType='Main Office'
)
select Property,Value
from BaseData
cross apply
(values ('NAME AND ADDRESS:','')
,(' Name',Name)
,(' Address',AddressLine1)
,(' ',AddressLine2)
,(' City',City)
,(' State/Province',StateProvinceName)
,(' Postal Code',PostalCode)
,(' Country/Region',CountryRegionName)
,('','')
,('CONTACT:','')
,(' Type',ContactType)
,(' Name',ContactName)
,(' Phone',Phone)
,(' EmailAddress',EmailAddress)
,('','')
,('DEMOGRAPHIC INFO:','')
,(' Year Opened',str(YearOpened,4))
,(' Number of Employees',convert(varchar(10),NumberEmployees))
,(' Specialty',Specialty)
,(' Square Feet',convert(varchar(10),SquareFeet))
,(' Brands',Brands)
,(' Annual Sales','$'+convert(varchar(20),AnnualSales,1))) P(Property,Value)
where SeqNo=1
and Value is not null
Watch it in action:

select * from VerticalMainOfficeData(34)
/*
Property Value
--------------------- -----------------------------
NAME AND ADDRESS:
Name Cycles Wholesaler & Mfg.
Address Science Park South, Birchwood
Stanford House
City Warrington
State/Province England
Postal Code WA3 7BH
Country/Region United Kingdom

CONTACT:
Type Owner
Name Ms. Barbara J. German
Phone 1 (11) 500 555-0181
EmailAddress barbara4@adventure-works.com

DEMOGRAPHIC INFO:
Year Opened 1999
Number of Employees 15
Specialty Touring
Square Feet 21000
Brands 4+
Annual Sales $800,000.00
*/
Can UNPIVOT do that? Not in a million years. It’s toast!

Make JOINs Extinct!
`
Okay, this is really kind of a joke, but really, think about it… What is a JOIN? For each row in the first table, I want to JOIN it somehow with a row or rows in the second table. That sounds like an APPLY type of thing, doesn’t it? Well it is!

Look at the following traditional JOIN query, which finds all the Accessories that are Yellow, Blue, White:

select SubCategoryName=s.Name 
,p.ProductID
,ProductName=p.Name
,p.Color
from Production.ProductSubCategory s
join Production.Product p on s.ProductSubcategoryID=p.ProductSubcategoryID
where s.ProductCategoryID=3 --Accessories
and p.Color in ('Yellow','Blue','White')
order by SubCategoryName
,p.ProductID
/*
SubCategoryName ProductID ProductName Color
--------------- --------- ------------------------------- ------
Jerseys 881 Short-Sleeve Classic Jersey, S Yellow
Jerseys 882 Short-Sleeve Classic Jersey, M Yellow
Jerseys 883 Short-Sleeve Classic Jersey, L Yellow
Jerseys 884 Short-Sleeve Classic Jersey, XL Yellow
Socks 709 Mountain Bike Socks, M White
Socks 710 Mountain Bike Socks, L White
Socks 874 Racing Socks, M White
Socks 875 Racing Socks, L White
Vests 864 Classic Vest, S Blue
Vests 865 Classic Vest, M Blue
Vests 866 Classic Vest, L Blue
*/
You can replace that JOIN with a CROSS APPLY:

select SubCategoryName=s.Name 
,p.ProductID
,ProductName=p.Name
,p.Color
from Production.ProductSubCategory s
cross apply (select *
from Production.Product
where ProductSubcategoryID=s.ProductSubcategoryID) p
where s.ProductCategoryID=3 --Accessories
and p.Color in ('Yellow','Blue','White')
order by SubCategoryName
,p.ProductID
/*
SubCategoryName ProductID ProductName Color
--------------- --------- ------------------------------- ------
Jerseys 881 Short-Sleeve Classic Jersey, S Yellow
Jerseys 882 Short-Sleeve Classic Jersey, M Yellow
Jerseys 883 Short-Sleeve Classic Jersey, L Yellow
Jerseys 884 Short-Sleeve Classic Jersey, XL Yellow
Socks 709 Mountain Bike Socks, M White
Socks 710 Mountain Bike Socks, L White
Socks 874 Racing Socks, M White
Socks 875 Racing Socks, L White
Vests 864 Classic Vest, S Blue
Vests 865 Classic Vest, M Blue
Vests 866 Classic Vest, L Blue
*/
The query plans for both of those are exactly the same!

And LEFT JOINs can be replaced by OUTER APPLYs!

Today UNPIVOTs and JOINs… Tomorrow the world! Bwu hu hu ha ha ha ha haaaaaaa (Diabolical laughter).

Do it ALL!

For my final example, I’ll do ALL of the above (except for the JOIN replacement, which was just kind of a joke/trick anyway).

In doing the examples above, my query cache got populated with the text and plans of the queries I executed. We will look in the cache for the CROSS APPLY(TOP 3) query that was in the Execute SubQueries! section above, shred its query plan, looking for the operators, figure out their percentage cost, and list them in descending order of that cost. For Scans and Seeks and Joins, we will show the table, column and/or index used. And it will be presented in a vertical manner.

Note that the challenge here is finding the cost of each operator… it is not stored in the plan. Each operator has a Total Subtree Cost, but that is the cost of the operator itself PLUS the Subtree Costs of each of its immediate children operators. So for each operator, I had to find its children, total up their Subtree Costs and subtract that from the Subtree Cost of the operator to get the Individual Cost of the operator. This is done in the CROSS APPLY of the OperatorCosts CTE.

Hopefully the comments are self-explanatory:

with xmlnamespaces 
(
default 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'
)
,
OperatorData as
(
select ParentNodeID
,NodeID
,OperatorDesc
,ScanSchema,ScanTable,ScanIndex
,LoopSchema,LoopTable,LoopColumn
,HashSchema,HashTable,HashColumn
,SubTreeCost
from sys.dm_exec_query_stats qs
cross apply
--Get the Query Text
sys.dm_exec_sql_text(qs.sql_handle) qt
cross apply
--Get the Query Plan
sys.dm_exec_query_plan(qs.plan_handle) qp
cross apply
--Get the RelOp nodes from the Plan
qp.query_plan.nodes('//RelOp') F_RelNodes(RelNode)
cross apply
--Pull out the various attributes from the RelOp Node
--And also extract the ParentNodeID of the operator
(select ParentNodeID=RelNode.value('(../../@NodeId)','int')
,NodeID=RelNode.value('(./@NodeId)','int')
,LogicalOp=RelNode.value('(./@LogicalOp)','varchar(50)')
,PhysicalOp=RelNode.value('(./@PhysicalOp)','varchar(50)')
,SubTreeCost=RelNode.value('(./@EstimatedTotalSubtreeCost)','float')
) F_OpInfo
cross apply
--Make a nice description out of the Operator
(select OperatorDesc=case
when LogicalOp=PhysicalOp
then PhysicalOp
else PhysicalOp+' ('+LogicalOp+')'
end
) F_OpDesc
outer apply
--Get child nodes having to do with a Scan/Seek
--Note that OUTER APPLY is used since there may not
--be any child nodes of this type
RelNode.nodes('(./IndexScan[1]/Object[1])')
F_ScanNode
(ScanNode)
outer apply
--And pull out their Table/Index information
(select ScanSchema=ScanNode.value('(./@Schema)','varchar(50)')
,ScanTable=ScanNode.value('(./@Table)','varchar(50)')
,ScanIndex=ScanNode.value('(./@Index)','varchar(100)')
) F_ScanInfo
outer apply
--Get child nodes having to do with Nested Loops
--Note that OUTER APPLY is used since there may not
--be any child nodes of this type
RelNode.nodes('(./NestedLoops[1]/OuterReferences[1]/ColumnReference[1])')
F_LoopNode
(LoopNode)
outer apply
--And pull out their Table/Column information
(select LoopSchema=LoopNode.value('(./@Schema)','varchar(50)')
,LoopTable=LoopNode.value('(./@Table)','varchar(50)')
,LoopColumn=LoopNode.value('(./@Column)','varchar(50)')
) F_LoopInfo
outer apply
--Get child nodes having to do with Hash Joins
--Note that OUTER APPLY is used since there may not
--be any child nodes of this type
RelNode.nodes('(./Hash[1]/HashKeysBuild[1]/ColumnReference[1])')
F_HashNode
(HashNode)
outer apply
--And pull out their Table/Column information
(select HashSchema=HashNode.value('(./@Schema)','varchar(50)')
,HashTable=HashNode.value('(./@Table)','varchar(50)')
,HashColumn=HashNode.value('(./@Column)','varchar(50)')
) F_HashInfo
where qt.text like '%select top 3 ProductID%'
and qt.text not like '%with xmlnamespaces%' --Exclude this query
)
,
OperatorCosts as
(
--Calculate the Individual Costs by subtracting each Operator's
--SubTreeCost minus its immediate children's SubTreeCosts
select NodeID
,OperatorDesc
,ScanSchema,ScanTable,ScanIndex
,LoopSchema,LoopTable,LoopColumn
,HashSchema,HashTable,HashColumn
,OperatorCost=convert(numeric(16,8),SubTreeCost-ChildrenSubTreeCost)
from OperatorData o
cross apply
--Calculate the sum of the SubTreeCosts of the immediate children
(select ChildrenSubTreeCost=isnull(sum(SubTreeCost),0)
from OperatorData
where ParentNodeID=o.NodeID) F_ChildCost
)
,
CostPercents as
(
--Calculate the CostPercent using a window function
select NodeID
,OperatorDesc
,ScanSchema,ScanTable,ScanIndex
,LoopSchema,LoopTable,LoopColumn
,HashSchema,HashTable,HashColumn
,CostPercent=convert(numeric(5,1),100*OperatorCost/sum(OperatorCost) over ())
from OperatorCosts
)
select Information
from CostPercents
cross apply
--UNPIVOT the information into a vertical presentation
(values ('NodeID '+convert(varchar(5),NodeID)
+' ('+convert(varchar(10),CostPercent)+'%):')
,(' '+OperatorDesc)
,(' Table: '+ScanSchema+'.'+ScanTable)
,(' Index: '+ScanIndex)
,(' Table: '+LoopSchema+'.'+LoopTable)
,(' Column: '+LoopColumn)
,(' Table: '+HashSchema+'.'+HashTable)
,(' Column: '+HashColumn)) P(Information)
where Information is not null --Eliminate NULL rows
order by CostPercent desc
/*
Information
----------------------------------------------------------------
NodeID 13 (19.6%):
Sort (TopN Sort)
NodeID 15 (19.6%):
Sort
NodeID 9 (18.2%):
Clustered Index Scan
Table: [Sales].[CustomerAddress]
Index: [PK_CustomerAddress_CustomerID_AddressID]
NodeID 11 (14.5%):
Clustered Index Seek
Table: [Sales].[Store]
Index: [PK_Store_CustomerID]
NodeID 20 (8.7%):
Clustered Index Seek
Table: [Sales].[SalesOrderDetail]
Index: [PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID]
NodeID 5 (5.8%):
Hash Match (Inner Join)
Table: [Person].[Address]
Column: AddressID
NodeID 12 (4.9%):
Clustered Index Seek
Table: [Sales].[Customer]
Index: [PK_Customer_CustomerID]
NodeID 18 (4.7%):
Index Seek
Table: [Sales].[SalesOrderHeader]
Index: [IX_SalesOrderHeader_CustomerID]
NodeID 30 (2.3%):
Clustered Index Seek
Table: [Production].[Product]
Index: [PK_Product_ProductID]
NodeID 7 (0.6%):
Index Seek
Table: [Person].[StateProvince]
Index: [AK_StateProvince_StateProvinceCode_CountryRegionCode]
NodeID 8 (0.6%):
Index Seek
Table: [Person].[Address]
Index: [IX_Address_StateProvinceID]
NodeID 6 (0.2%):
Nested Loops (Inner Join)
Table: [Person].[StateProvince]
Column: StateProvinceID
NodeID 3 (0.2%):
Nested Loops (Inner Join)
Table: [Sales].[CustomerAddress]
Column: CustomerID
NodeID 0 (0.0%):
Nested Loops (Inner Join)
Table: [Sales].[SalesOrderDetail]
Column: ProductID
NodeID 1 (0.0%):
Nested Loops (Inner Join)
Table: [Sales].[Customer]
Column: CustomerID
NodeID 2 (0.0%):
Nested Loops (Inner Join)
Table: [Sales].[CustomerAddress]
Column: CustomerID
NodeID 19 (0.0%):
Compute Scalar
NodeID 16 (0.0%):
Compute Scalar
NodeID 17 (0.0%):
Nested Loops (Inner Join)
Table: [Sales].[SalesOrderHeader]
Column: SalesOrderID
NodeID 14 (0.0%):
Stream Aggregate (Aggregate)
*/
Is that, like, waaaaay cool, or what?!

I hope I’ve convinced you how powerful the APPLY operator can be. I couldn’t live without it.