Sunday, August 23, 2009

ALL, ANY, and SOME: The Three Stooges

Okay, I know there’s a customer in the Northwind database who placed an order on July 18, 1996, but I don’t know who it is. I don’t know how many orders the customer has placed… he may have placed a dozen for all I know… but that doesn’t matter. What I do know for sure is that this customer placed ALL his orders on 1996-07-18… every single one of ‘em.

So let’s try the following approach to find him:

select CustomerID
from Orders
group by CustomerID
having min(OrderDate)=max(OrderDate)
and min(OrderDate)='1996-07-18'
/*
CustomerID
----------
CENTC
*/
Well that’s nice, but a CustomerID really doesn’t mean anything to me. I really want to know the Company’s Name. So let’s take the above query and JOIN it with the Customers table:

select o.CustomerID
,c.CompanyName
from (select CustomerID
from Orders
group by CustomerID
having min(OrderDate)=max(OrderDate)
and min(OrderDate)='1996-07-18') o
join Customers c on o.CustomerID=c.CustomerID
/*
CustomerID CompanyName
---------- ----------------------------------------
CENTC Centro comercial Moctezuma
*/
Yep, that’s the guy! Centro commercial Moctezuma.

But wait a second… What was it that I had said again? I said, “this customer placed ALL his orders on 1996-07-18.”



Hey, isn’t there an ALL operator in SQL that hardly anyone ever uses? We could put together a query using that, and the query will be much more intuitive and easier to read! Hmmm… I wonder why this cool ALL operator isn’t used much… it seems to be pretty useful in this particular situation.

So let’s put together the following query, using a correlated sub-query, to return customers who have ALL their orders placed on 1996-07-18:

select CustomerID,CompanyName
from Customers
where '1996-07-18' = all (select OrderDate
from Orders
where CustomerID=Customers.CustomerID)
/*
CustomerID CompanyName
---------- ----------------------------------------
CENTC Centro comercial Moctezuma
FISSA FISSA Fabrica Inter. Salchichas S.A.
PARIS Paris spécialités
*/
Now ho-old on thar’! Our very first query only returned one customer. Who the heck are these other two guys?

Wait… let’s check their orders:

select OrderID,OrderDate
from Orders
where CustomerID in ('FISSA','PARIS')
/*
OrderID OrderDate
----------- -----------------------
NOTHING! NADA! ZIP! NO ORDERS FOR THESE GUYS!
DO NOT PASS GO! DO NOT COLLECT $200!
*/
Huh? What’s going on here? I thought the query with the ALL operator was quite specific and quite straightforward. Why did these jokers with no orders come out in the result?

Just to experiment further, instead of equals ALL, what about other comparison operators… greater than ALL, less than ALL, and not equal to ALL?

FISSA and PARIS appear in the result set of every one of those queries too!

Jeez, these two guys really get around… They’re like magicians. All their orders are EQUAL to 1996-07-18, and all their orders are GREATER than 1996-07-18, and all their orders are LESS then 1996-07-18, and all their orders are NOT EQUAL to 1996-07-18.

That just sounds… strange.

So what gives? Why does ALL give these weird results?

Well, to explain the behavior of the ALL operator, we must also look at the operators ANY and SOME. ANY and SOME are interchangeable and exactly the same in every respect, so I will just address ANY for the rest of this article. (Sorry ‘bout that, SOME).

To gain an understanding of the behavior of ALL and ANY, try the following experiment. Get a small box, and label it Box#1. Put a nickel (5 cents), a dime (10 cents), and a quarter (25 cents) inside. Give it to a family member and ask him/her the following questions:

Question 1: “Is twenty-five cents worth more than ANY of the coins in Box#1?”
Question 2: “Is twenty-five cents worth more than ALL (EACH) of the coins in Box#1?”
Question 3: “Is twenty-five cents worth the same as ANY of the coins in Box#1?”
Question 4: “Is twenty-five cents worth the same as ALL (EACH) of the coins in Box#1?”

Your family member might think this is strange, and (s)he might pause a bit at the awkwardly-constructed questions, but (s)he should be able to answer the questions accurately (with the answers yes, no, yes, and no).

Now take a completely EMPTY box and label it Box#2 and give it to the same person and ask those same 4 questions again (substituting the word “Box#1” with “Box#2”).

What kind of answers did you get? Did they look confused? Did they get angry? Even voilent, perhaps? I did this experiment with my family last week and they’ve been avoiding me ever since.

Try asking those same questions to T-SQL. Here’s what you get:

;with BoxOf3Coins(CoinValue) as (select  5 union all
select 10 union all
select 25)
,
EmptyBox as (select CoinValue
from BoxOf3Coins
where rand()<0)
select 'Is twenty-five cents worth more than ANY of the coins in Box#1?'
,case when 25 > any (select CoinValue from BoxOf3Coins)
then 'Yes' else 'No' end
union
all
select 'Is twenty-five cents worth more than ALL (EACH) of the coins in Box#1?'
,case when 25 > all (select CoinValue from BoxOf3Coins)
then 'Yes' else 'No' end
union
all
select 'Is twenty-five cents worth the same as ANY of the coins in Box#1?'
,case when 25 = any (select CoinValue from BoxOf3Coins)
then 'Yes' else 'No' end
union
all
select 'Is twenty-five cents worth the same as ALL (EACH) of the coins in Box#1?'
,case when 25 = all (select CoinValue from BoxOf3Coins)
then 'Yes' else 'No' end
union
all
select '----------',''
union all
select 'Is twenty-five cents worth more than ANY of the coins in Box#2?'
,case when 25 > any (select CoinValue from EmptyBox)
then 'Yes' else 'No' end
union
all
select 'Is twenty-five cents worth more than ALL (EACH) of the coins in Box#2?'
,case when 25 > all (select CoinValue from EmptyBox)
then 'Yes' else 'No' end
union
all
select 'Is twenty-five cents worth the same as ANY of the coins in Box#2?'
,case when 25 = any (select CoinValue from EmptyBox)
then 'Yes' else 'No' end
union
all
select 'Is twenty-five cents worth the same as ALL (EACH) of the coins in Box#2?'
,case when 25 = all (select CoinValue from EmptyBox)
then 'Yes' else 'No' end
/*
Is twenty-five cents worth more than ANY of the coins in Box#1? Yes
Is twenty-five cents worth more than ALL (EACH) of the coins in Box#1? No
Is twenty-five cents worth the same as ANY of the coins in Box#1? Yes
Is twenty-five cents worth the same as ALL (EACH) of the coins in Box#1? No
----------
Is twenty-five cents worth more than ANY of the coins in Box#2? No
Is twenty-five cents worth more than ALL (EACH) of the coins in Box#2? Yes
Is twenty-five cents worth the same as ANY of the coins in Box#2? No
Is twenty-five cents worth the same as ALL (EACH) of the coins in Box#2? Yes
*/
No hesitation at all. Not even on the empty box questions. But the answers still seem strange.

To understand why T-SQL answers the questions this way, let’s dig a little deeper.

Look at the first question:

“Is twenty-five cents worth more than ANY of the coins in Box#1?”

As I hinted at before, that’s kind of an awkward-sounding question. In proper English, we would flip the question around, like so:

”Are there ANY coins in Box#1 that are worth less than twenty-five cents?”

Doesn’t that sound more natural? And it’s easy to answer. If you look in the box and see a dime or a nickel or a penny, you can immediately answer yes.

Apparently the second version of the question sounds more natural to T-SQL too, because that’s generally what it does with an ANY query. It internally translates the WHERE clause from this:

where 25 > any (select CoinValue from Box)
into this:

where exists (select * from Box where CoinValue<25)
Now let’s look at the second question of our experiment:

”Is twenty-five cents worth more than ALL (EACH) of the coins in Box#1?”

Again, it’s awkward-sounding. Instead you would more naturally phrase the question like this:

”Is EVERY coin in Box#1 worth less than twenty-five cents?”

Now how do you go about answering that? Do you actually look at every single coin? No, you don’t. Once you see a coin worth twenty-five cents or more, then you stop, and you can confidently answer no. If you can’t find a coin worth twenty-five cents or more, then you can answer yes.

Your internal thought process goes something like this:

”Well, let’s see… Are they any coins in Box#1 worth twenty-five cents or more? If so, then the answer to your question is no. If not, then the answer to your question is yes.

That’s exactly what T-SQL does with an ALL query. It translates it into an ANY query (using the opposite comparison operator) and takes the NOT of it. In other words, it translates this:

where 25 > all (select CoinValue from Box)
into this:

where not (25 <= any (select CoinValue from Box))
And then, remember it translates the ANY query into an EXISTS query:

where not exists (select * from Box where CoinValue>=25)
Now you can answer the empty box question:

Question: ”Is EVERY coin in this empty box worth less than twenty-five cents?”

Answer: ”Well, let’s see… Are there any coins in the box worth twenty-five cents or more? Nope, there aren’t. So the answer to your question must be yes.

Even though we’ve come to the answer of yes logically, it still just feels weird, doesn’t it?

But at least we’ve solved our mystery of the July 18, 1996 orders. We now know that T-SQL translated this ALL query:

select CustomerID,CompanyName
from Customers
where '1996-07-18' = all (select OrderDate
from Orders
where CustomerID=Customers.CustomerID)
into this ANY query:

select CustomerID,CompanyName
from Customers
where not '1996-07-18' <> any (select OrderDate
from Orders
where CustomerID=Customers.CustomerID)
and then in turn translated that query into an EXISTS query:

select CustomerID,CompanyName
from Customers
where not exists (select *
from Orders
where CustomerID=Customers.CustomerID
and OrderDate<>'1996-07-18')
All three of the above query plans are identical... you can see the plan below. A Left Anti Semi Join combines (JOINs) two tables and only outputs rows from the left-hand side (LEFT SEMI) if there are no matching rows (ANTI) on the right-hand side. In our example, the left side was Customers and the right side was Orders for the Customer with an OrderDate not equal to 1996-07-18. In other words, a WHERE NOT EXIST query.



So, in conclusion, I think we can see now why ALL, ANY, and SOME are very rarely used. They are awkward-sounding and kind of backwards as far as natural English conversation is concerned, and they may not give the results you expect when working with a sub-query that returns the empty set.

They’re strange, eccentric, goofy, and kooky. I certainly wouldn’t hire them for any serious T-SQL work. Nyuk-nyuk-nyuk.

Thursday, August 20, 2009

Implicit (and Therefore Perhaps Unexpected) Conversions

I was surprised this morning when I saw a colleague post a message in the MSDN T-SQL Forum where he passed a numeric column as the first argument to the RIGHT() function. In other words, something like this:

declare @t table (IntegerColumn int)
insert @t select 12345
union all select 987
union all select 23
select LastDigit=right(IntegerColumn,1) from @t
/*
LastDigit
---------
5
7
3
*/
Huh? I didn’t know you could do that!

If you look at Books Online for the RIGHT() function, it states the following about the first argument:

"character_expression … is an expression of character or binary data. character_expression can be a constant, variable, or column. character_expression can be of any data type, except text or ntext, that can be implicitly converted to varchar or nvarchar. Otherwise, use the CAST function to explicitly convert character_expression."

Hmmm... So it specifically states that it “can be of any data type” and will be “implicitly converted to varchar”.

Intrigued, I wondered if other string functions did the same thing.

The same information regarding “implicit conversion” is not in the documentation for several of the other string functions; however, with a couple exceptions, they all perform the implicit conversion, as you can see in my findings below (using SQL2005):

select left(12345,2)            --Returns '12'
select left($273.56,4) --Returns '273.'
select right(987.510,4) --Returns '.510'
select right(1.23e10,5) --Returns 'e+010'
select substring(12345,3,2) --ERROR: Data Type INT is invalid for argument 1

select len(12345) --Returns 5
select len(1e10) --Returns 6 (Because 1e10 converts to '1e+010')

select lower(12345) --Returns '12345'
select lower(1.23e10) --Returns '1.23e+010'
select upper(1.23e10) --Returns '1.23E+010'
select ltrim(12345) --Returns '12345'
select rtrim(12345) --Returns '12345'

select ascii(12345) --Returns 49, the ASCII value of '1'
select unicode(12345) --Returns 49, the UNICODE value of '1'

select replace(12345,2,9) --Returns '19345'
select stuff(12345,3,1,9) --Returns '12945'

select replicate(23,3) --Returns '232323'
select reverse(123.45) --Returns '54.321'

select charindex(4,543210) --ERROR: Data Type INT is invalid for argument 1
select charindex('4',543210) --Returns 2
select patindex('[0-9]%',12345) --ERROR: Data Type INT is invalid for argument 2

select soundex(12345) --Returns '0000'
select difference(12345,98765) --Returns 4, meaning least possible SOUNDEX difference

select quotename(12345) --Returns '[12345]'
Just for kicks, what about datetime?:

declare @dt datetime, @dt2 datetime
set
@dt='2009-08-15 11:15:00'
set @dt2='2009-01-31'

select left(@dt,6) --Returns 'Aug 15'
select right(@dt,9) --Returns '9 11:15AM'
select substring(@dt,5,2) --ERROR: Data Type DATETIME is invalid for argument 1

select len(@dt) --Returns 19
select lower(@dt) --Returns 'aug 15 2009 11:15am'
select upper(@dt) --Returns 'AUG 15 2009 11:15AM'
select ltrim(@dt) --Returns 'Aug 15 2009 11:15AM'
select rtrim(@dt) --Returns 'Aug 15 2009 11:15AM'

select ascii(@dt) --Returns 65, the ASCII value of 'A'
select unicode(@dt) --Returns 65, the UNICODE value of 'A'

select replace(@dt,15,99) --Returns 'Aug 99 2009 11:99AM'
select stuff(@dt,5,2,@dt2) --Returns 'Aug Jan 31 12:00AM 2009 11:15AM'

select replicate(@dt,2) --Returns 'Aug 15 2009 11:15AMAug15 2009 11:15AM'
select reverse(@dt) --Returns 'MA51:11 9002 51 guA'

select charindex('5',@dt) --Returns 6
select patindex('A%',@dt) --ERROR: Data Type DATETIME is invalid for argument 2

select soundex(@dt) --Returns 'A200'
select soundex(@dt2) --Returns 'J500'
select difference(@dt,@dt2) --Returns 2

select quotename(@dt) --Returns '[Aug 15 2009 11:15AM]'
Is this a good thing, or a bad thing? I haven’t decided yet, though I’m leaning more towards “bad thing”. I thought T-SQL was a little more strongly-typed than this. Anyway, keep this knowledge in the back of your head and try not to accidentally use a numeric or datetime column in a string function in your coding.

Wednesday, August 12, 2009

The Art of Regressing Linearly

Let’s use our Wayback Machine to travel back in time to AdventureWorks headquarters in July of 2004. The company just closed the books for the month of June 2004 and the middle managers are barking at their subordinates because their superiors have been hounding them to come up with some sales projections. The minions of the company are working doggedly and panting from exhaustion. (Okay, I guess I’ve run the canine references into the ground… I suppose it was the WayBack Machine reference that put a picture of Mr Peabody into my mind… sorry ‘bout that).

Anyway, AdventureWorks management wants to know, given their sales statistics history, what sales figures they can expect in 6 months and 12 months and 18 months from now. They also want to know when they can expect to reach $10 million per month in sales. How about $11 million per month?

With no qualms about changing history, we interfere and offer our help.

Let’s see what kind of sales statistics we have up through June of 2004:

select Mth=OrderDate-day(OrderDate)+1
,Sales=sum(TotalDue)
from Sales.SalesOrderHeader
where OrderDate<'2004-07-01'
group by OrderDate-day(OrderDate)+1
order by Mth
/*
Mth Sales
----------------------- ---------------------
2001-07-01 00:00:00.000 1172359.4289
2001-08-01 00:00:00.000 2605514.9809
2001-09-01 00:00:00.000 2073058.5385
2001-10-01 00:00:00.000 1688963.2744
. . .
2004-03-01 00:00:00.000 5272786.8106
2004-04-01 00:00:00.000 4722890.7352
2004-05-01 00:00:00.000 6518825.2262
2004-06-01 00:00:00.000 6728034.9873
*/
If we were to put the data into Excel and graph those points and have Excel add a trend line, this is what we get:



There are lots of up months and down months, but, in general, the trend is certainly up (always a good thing), with sales somewhere in the $2 million per month range back in 2001 and in the $6 million per month range by Jun2004.

So how can we make sales projections using SQL?

Like Excel did in generating the trend line, we will have to mathematically calculate the same thing using Linear Regression. You may recall from Algebra class that a line has the formula of y=mx+b, where m is the slope and b is the y-intercept. Given a collection of (x,y) values, the following Linear Regression formulas will calculate the slope and y-intercept values for us:




That looks like complicated stuff, but it’s actually very easy to implement.

Our Sales data has a datetime field (the 1st day of a month) and a Sales figure for that month. Those are our theoretical x and y values, but we can’t really use a datetime for our x value… we have to have an actual numeric value of some kind. So let’s just assume our first chronological month just has a value of 1, and the next month is 2, and so on.

For simplicity’s sake, let’s take our query we used above and put our sales statistics into a temporary file called #MonthlySales:

select Mth=OrderDate-day(OrderDate)+1
,Sales=sum(TotalDue)
into #MonthlySales
from Sales.SalesOrderHeader
where OrderDate<'2004-07-01'
group by OrderDate-day(OrderDate)+1
And now we can get our (x,y) values from that as shown below. Note how the x value was calculated using DATEDIFF and a scalar sub-query to get the month number. This is done just to make it more general without having to make assumptions about the underlying data.

select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
order by X
/*
X Y
------ ---------------------
1 1172359.4289
2 2605514.9809
3 2073058.5385
4 1688963.2744
. . .
33 5272786.8106
34 4722890.7352
35 6518825.2262
36 6728034.9873
*/
From that, it’s easy to calculate all those nasty sums (Σ’s) needed by the slope and y-intercept formulas:

;with XYData as
(
select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
)
select SumX=sum(X)
,SumY=sum(Y)
,SumXX=sum(X*X)
,SumXY=sum(X*Y)
,N=count(*)
from XYData
/*
SumX SumY SumXX SumXY N
------ ---------------- -------- ---------------- ------
666 140651405.9023 16206 3050875537.8704 36
*/
The actual values don’t mean anything to us, but that doesn’t matter because we will just use these to plug into our slope and y-intercept formulas below. Note that I use two CROSS APPLYs below because the value of the y-intercept requires that we calculate the value of the slope first. (Those of you who know me already know that I’m a big-time CROSS APPLY fan… click here for a look at previous blog entries about this frequently-misunderstood, underused operator).

Also note that I introduced 1.0 into the Slope and Intercept calculations in order to force decimal/numeric math to come into play as opposed to possible integer math. In other words, if both our x and y values were integers, then all the SUM figures would be integers and therefore when we do division, we could end up with unexpected integer results. For example, SQL will calculate 500/251 as being equal to 1, but it will calculate 1.0*500/251 as 1.992031. We want that more accurate figure.

;with XYData as
(
select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
)
,
CalcSums as
(
select SumX=sum(X)
,SumY=sum(Y)
,SumXX=sum(X*X)
,SumXY=sum(X*Y)
,N=count(*)
from XYData
)
select Slope
,Intercept
from CalcSums
cross apply (select Slope=1.0*(N*SumXY-SumX*SumY)/(N*SumXX-SumX*SumX)) CalcSlope
cross apply (select Intercept=1.0*(SumY-Slope*SumX)/N) CalcIntercept
/*
Slope Intercept
------------------------ -----------------
115527.5492092277992277 1769723.83691539
*/
Again, these figures appear pretty meaningless, but the Slope figure does tell us that we generally increase our monthly sales by about $115,528 per month, and that our sales at Month Number Zero was about $1.77 million.

Anyway, we now have the values to use in our y=mx+b formula for our trend line. So we can use that formula to make sales projections. The management wanted to know what sales figures we can expect (given the current trend) in Dec2004, Jun2005, and Dec2005.

So we will add another CTE (called ProjMonths) onto the existing query we have to introduce those months, and then finally calculate the projected sales for those months. Note how we use a CROSS APPLY to incorporate the Slope and Intercept values we calculated, then we use another CROSS APPLY to convert our Projected Months into appropriate x values, and then we use yet another CROSS APPLY to plug those x values into the y=mx+b (or Y=Slope*X+Intercept) formula:

;with XYData as
(
select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
)
,
CalcSums as
(
select SumX=sum(X)
,SumY=sum(Y)
,SumXX=sum(X*X)
,SumXY=sum(X*Y)
,N=count(*)
from XYData
)
,
SlopeIntercept as
(
select Slope
,Intercept
from CalcSums
cross apply (select Slope=1.0*(N*SumXY-SumX*SumY)/(N*SumXX-SumX*SumX)) CalcSlope
cross apply (select Intercept=1.0*(SumY-Slope*SumX)/N) CalcIntercept
)
,
ProjMonths as
(
select ProjMth=cast('2004-12-01' as datetime)
union all
select '2005-06-01'
union all
select '2005-12-01'
)
select ProjMth
,ProjSales
from ProjMonths
cross apply (select Slope,Intercept from SlopeIntercept) IncludeSlopeIntercept
cross apply (select ProjX=1+datediff(month
,(select min(Mth) from #MonthlySales)
,
ProjMth)) CalcProjX
cross apply (select ProjSales=ProjX*Slope+Intercept) CalcProjSales
/*
ProjMth ProjSales
----------------------- ------------------------
2004-12-01 00:00:00.000 6621880.90370296
2005-06-01 00:00:00.000 7315046.19895832
2005-12-01 00:00:00.000 8008211.49421369
*/
So it looks like we can expect about $8 million in sales by Dec2005.

Management also wanted to know when we can expect to reach $10 million and $11 million per month. So instead of taking an x value and plugging it into y=mx+b to get a y value, we need to do the opposite… we will take a y value and plug it into x=(y-b)/m to get an x value.

So we will use that in our query below. Blindly plugging in a y value of 10,000,000 would give us an x value of (10000000 - 1769723.83691539) / 115527.5492092277992277 = 71.24. That indicates that we’ll reach the $10 million mark about 24% of the way into Month Number 71. But our “month number” must be an integer, so we use the FLOOR() function to make it an actual integer ordinal number, and then we use the final CROSS APPLY to figure out which actual month that translates into:

;with XYData as
(
select X=1+datediff(month,(select min(Mth) from #MonthlySales),Mth)
,Y=Sales
from #MonthlySales
)
,
CalcSums as
(
select SumX=sum(X)
,SumY=sum(Y)
,SumXX=sum(X*X)
,SumXY=sum(X*Y)
,N=count(*)
from XYData
)
,
SlopeIntercept as
(
select Slope
,Intercept
from CalcSums
cross apply (select Slope=1.0*(N*SumXY-SumX*SumY)/(N*SumXX-SumX*SumX)) CalcSlope
cross apply (select Intercept=1.0*(SumY-Slope*SumX)/N) CalcIntercept
)
,
ProjSales as
(
select ProjSales=10000000
union all
select 11000000
)
select ProjSales
,ProjMth
from ProjSales
cross apply (select Slope,Intercept from SlopeIntercept) IncludeSlopeIntercept
cross apply (select ProjX=floor((ProjSales-Intercept)/Slope)) CalcProjX
cross apply (select ProjMth=dateadd(month
,ProjX
,(select min(Mth) from #MonthlySales))) CalcProjMth
/*
ProjSales ProjMth
----------- -----------------------
10000000 2007-06-01 00:00:00.000
11000000 2008-02-01 00:00:00.000
*/
Now we can tell management that they will have to wait until Jun2007 to reach $10 million per month in sales and it will take another 8 months to reach $11 million in Feb2008.

So it looks like our time-travelling adventure yielded some good. We were able to give the management of AdventureWorks what they wanted. I’m sure they will use the information to make the company even stronger… perhaps because of our help, they’ll soon become the Amazon.com of bicycle sales. Many people will buy bicycles who hadn’t bought before, and…

Oops… perhaps we did change history. Perhaps we set something in motion that could cause a time paradox of some kind… perhaps something that … I don’t know… might make me just wink out of existence! Naaahhh… that’s impossible… it’s ridiculous! After all, if that happened, then how could I write this blog? How could I type this very sentence? How could I th