Normally I write in German, but I hope that maybe some Microsoft Engineer might read it and put the problem on their schedule. So, I apologise to my German readers for writing in poor English…
When the SQL Server optimizer internally transforms a SQL Statement to an other, it normally results in a better execution plan. When You write for example a subquery the optimizer tries to make a join instead. This is fine as long a the result set is the same. But now we discovered a situation when this is not true: If you use a non-deterministic function in a query, the optimizer use the same query plan as if the function were deterministic. Let us examine this simple example:
We have a table with 5 rows: the key consists of the tupel (ID1,ID2). We add a new column to add a GUID later.
| ID1 | ID2 | GUID |
|---|---|---|
| 1 | 1 | NULL |
| 1 | 2 | NULL |
| 2 | 1 | NULL |
| 2 | 2 | NULL |
| 3 | 1 | NULL |
For each different value of ID1 we want to add a GUID with the non-deterministic function NewID():
ID1
, NEWID() AS [NewId]
FROM Table1
GROUP BY ID1
This results in something like:
| ID1 | NewId |
|---|---|
| 1 | B30AD595-926B-40F6-A815-D8871C81CE89 |
| 2 | 6842624D-DC40-4A3F-A2EB-A8178814D12E |
| 3 | 12E9F691-7F8B-41F8-829B-8CED5E26C127 |
Now, we want to use the above query to assign the new values to the column GUID. To make the execution plan easier to understand, I used a SELECT, not an UPDATE. But the result is the same:
t.ID1
, sub.NewId
FROM (SELECT
ID1
, NEWID() AS [NewId]
FROM Table1
GROUP BY ID1) AS sub
JOIN Table1 AS t
ON t.ID1=sub.ID1
It should result in something like this:
| ID1 | NewId |
|---|---|
| 1 | B30AD595-926B-40F6-A815-D8871C81CE89 |
| 1 | B30AD595-926B-40F6-A815-D8871C81CE89 |
| 2 | 6842624D-DC40-4A3F-A2EB-A8178814D12E |
| 2 | 6842624D-DC40-4A3F-A2EB-A8178814D12E |
| 3 | 12E9F691-7F8B-41F8-829B-8CED5E26C127 |
But this is not the case. The result is:
| ID1 | NewId |
|---|---|
| 1 | B30AD595-926B-40F6-A815-D8871C81CE89 |
| 1 | 4EC62D6E-8ABC-4563-8A81-2DB8E655D3CA |
| 2 | 6842624D-DC40-4A3F-A2EB-A8178814D12E |
| 2 | 73B7386E-688D-48D2-A585-C66C6911EFC3 |
| 3 | 12E9F691-7F8B-41F8-829B-8CED5E26C127 |
Do you see the difference? The reason for this is the wrong execution plan:
|–Merge Join(Inner Join, MERGE:([tempdb].[dbo].[Table1].[ID1])=([t].[ID1]), RESIDUAL:([tempdb].[dbo].[Table1].[ID1]=[tempdb].[dbo].[Table1].[ID1] as [t].[ID1]))
|–Sort(DISTINCT ORDER BY:([tempdb].[dbo].[Table1].[ID1] ASC))
| |–Table Scan(OBJECT:([tempdb].[dbo].[Table1]))
|–Sort(ORDER BY:([t].[ID1] ASC))
|–Table Scan(OBJECT:([tempdb].[dbo].[Table1] AS [t]))
The optimizer choose to do the join first and to execute the function for each row. This is not correct.
So the general rule I recommend: Do not use non-deterministic functions in complex SQL queries. The result depends on the chosen query plan. Today the result may be correct, but if the row numbers change or you add an index the result may get incorrect! This is not as it should be.
Unfortunately I was not able to convince the German support that this is a bug. They send me a workaround like this:
SELECT @Rows = Count(*) FROM Table1;
SELECT
t.ID1
, sub.NewId
FROM (SELECT TOP(@Rows)
ID1
, NEWID() AS [NewId]
FROM Table1
GROUP BY ID1
ORDER BY NewId) AS sub
JOIN Table1 AS t
ON t.ID1=sub.ID1
The performance is very poor and I would recommend using a temporary table to store the new GUIDs instead, if you have a big row number. This seems to perform a little better.
Update (21.5.2008): Encouraged by Christoph I send the problem to Craig Freedman. He answered the very next day, that he presented the problem to the developer responsible for this code for his comments on that question. Here is the response of the developer Craig forwarded to me:
In general, SQL Server does not guarantee the timing of execution of scalar operators. For non-deterministic scalars (built-in and user-defined), that means the timing semantics (number of times executed and when) is not defined. In addition, it may change from one plan to another, or from one release to the next.
Craig added:
Unfortunately, I am not aware of (and was unable to identify) a better workaround than simply storing the GUIDs in a temp table before joining with the original table.
(My Thanks to Craig! I am very impressed that he ask the responsible developer. And thanks to the unknown developer.)
Update (11.6.2008): Itzik Ben-Gan did it! Microsoft confirmed to him this is a bug.
Currently I read the very good book "T-SQL Querying" written by Itzik. In the very first chapter he described that the optimizer can make shurtcuts in the physical execution plan if the result set is the same like using the logical execution plan. I asked him about his opinion about this problem and he made Microsoft to accept that this is a bug. He wrote:
I got back a response from Microsoft saying it is a bug, and apparently a regression from SQL Server 2000.
If you think Microsoft should fix this bug, then "vote" for this bug, please. The more people vote, the more likely MS will fix it.
Update (19.6.2008): Thanks for your support. The MS Developer Jim answered to this bug:
Thankyou for this bug report. One of the most interesting discussions around.
This hits to the very heart of the issue – is optimization allowed to change a program's semantics? Ie: if a program yields certain answers, but runs slowly, is it legitimate for a Query Optimizer make that program run faster, yet also change the results given?
Before shouting "NO!" (my own personal inclination too
, consider: the good news is that, in 99% of cases, the answers ARE the same. So Query Optimization is a clear win. The bad news is that, if the query contains side-effecting code, then different plans CAN indeed yield different results. And NEWID() is one such side-effecting (non-deterministic) 'function' that exposes the difference. [Actually, if you experiment, you can devise others – for example, short-circuit evaluation of AND clauses: make the second clause throw an arithmetic divide-by-zero – different optimizations may execute that second clause BEFORE the first clause] This reflects Craig's explanation, elsewhere in this thread, that SqlServer does not guarantee when scalar operators are executed.
So, we have a choice: if we want to guarantee a certain behavior in the presence of non-deterministic (side-effecting) code – so that results of JOINs, for example, follow the semantics of a nested-loop execution – then we can use appropriate OPTIONs to force that behavior – as UC points out. But the resulting code will run slow – that's the cost of, in effect, hobbling the Query Optimizer.
All that said, we are moving the Query Optimizer in the direction of "as expected" behavior for NEWID() – trading off performance for "results as expected".
[…]
Anyhow, this bug is now assigned to the QO Dev team for a deeper look.
I presume "QO Dev team" is the "Query Optimizer Development team".
Update (27.7.2008): In the last issue of the SQL Server Magazine Newsletter Itzik Ben-Gan wrote some bad news about this bug:
I posted the bug on Microsoft Connect (FeedbackID=350485), and after consideration, Microsoft decided to close the item and mark it as “Won’t Fix”. The reasoning behind the decision not to fix the bug is that in the vast majority of the cases, the optimization aspects that lead to the bug yield better performance without sacrificing the correctness of the query, and if you fall into one of the unusual cases where the correctness of the query is compromised, you can consider alternatives (e.g., physically materialize the data along with the NEWID values in a table).
This was quite unexpected for me. But I think there is nothing we can do about it. Thats a pity…



Hallo,
müssen es unbedingt GUIDs sein?
Alternativ ginge es sonst auch so:
(
SELECT
ID1
,row_number() over(ORDER BY (SELECT 0)) Grp
FROM #t
GROUP BY
ID1
)
SELECT
t.ID1
,cte.Grp
FROM cte
INNER JOIN #t t
ON
t.ID1 = cte.ID1
Bei dem Änderungsvorschlag von MS finde ich es bedenklich, das @Rows als bigint deklariert wird, aber count() int als Rückgabetypen liefert und dieses count() überhaupt durchgeführt wird. Stattdessen liesse sich der maximale int-wert von 2147483647 (oder der von bigint) benutzen:
t.ID1
,sub.newid
FROM
(
SELECT top(2147483647)
ID1
,newid() newid
FROM #t
GROUP BY
ID1
ORDER BY
newid
) sub
INNER JOIN #t t
ON
t.ID1 = sub.ID1
"I hope that maybe some Microsoft Engineer might read it and put the problem on their schedule" -> das kann ich mir nicht vorstellen, aber vielleicht hat Craig Freedman Lust, sich damit auseinander zu setzen: http://blogs.msdn.com/craigfr/
Vg
Christoph Ingenhaag
Hallo,
Deine Idee mit der ROW_NUMBER() gefällt mir gut. Sie dürfte sogar sehr performant sein, wenn der Index passt. Leider sind die Kollegen in dem Projekt ziemlich auf GUIDs festgelegt, das weiss ich noch von anderen Beratungen. Ich gestehe: deswegen habe ich gar nicht versucht ihnen das in dem konkreten Fall auszureden. Außerdem war ich ziemlich sicher, dass das ratz-fatz als Bug anerkannt wird und machte gleich einen Incident dazu auf. Andere weniger offensichtliche Probleme wurden meiste recht bald anerkannt, was mir leider falsche Hoffnung machte. Aber das Problem greift sehr tief in den Optimizer ein. Deswegen waren sie möglicherweise auf dem Dampfer: Was nicht sein darf, das ist auch nicht.
Die Idee mit Craig Freedman greife ich auf, Danke. Ich wollte mir sowieso überlegen, wo ich das am besten ins Rennen schicke…
Die Sache mit dem COUNT war mir noch gar nicht aufgefallen! Aber stimmt, du hast recht. Du hast einen guten Blick.
PS: Damit Dein SQL so hübsch angezeigt wird, wie Du ihn gemacht hast, habe ich sie mittels "<code lang="sql"></code>" geklammert. Das ist auch grundsätzlich für alle in Kommentaren erlaubt.