Dynamics AX: JOINS in SQL

Which method (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL) is best to select values present in one table but missing in another one?
This:

view source print ?
1.SELECT l.*
2.FROM    t_left l
3.LEFT JOIN
4.        t_right r
5.ON      r.value = l.value
6.WHERE   r.value IS NULL
, this:
view source print ?
1.SELECT l.*
2.FROM    t_left l
3.WHERE   l.value NOT IN
4.        (
5.        SELECT value
6.        FROM    t_right r
7.        )
or this:
view source print ?
1.SELECT l.*
2.FROM    t_left l
3.WHERE   NOT EXISTS
4.        (
5.        SELECT NULL
6.        FROM    t_right r
7.        WHERE   r.value = l.value
8.        )

Differences between the methods

These methods are quite different.

First of all, LEFT JOIN / IS NULL and NOT EXISTS are semantically equivalent, while NOT IN is not. These method differ in how they handle NULL values in t_right
LEFT JOIN is guaranteed to return every row from t_left, and then filtering is applied to the values returned from t_right. If for some row in t_left there is no corresponding row in t_right (which means no row with that exact value is present in t_right), the row from t_left will be returned once, and the NULL values will be substituted instead of t_right's actual values.
Since NULL values can never satisfy an equality JOIN condition, the NULL values returned by the query are guaranteed to be substituted by the LEFT JOIN, not fetched out of the actual t_right's row. This means that LEFT JOIN / IS NULL is guaranteed to return at most one row from t_left, and these row's value is not equal to one of those in t_right.

The same holds for NOT EXISTS. Since it's a predicate, not a JOIN condition, the rows from t_left can only be returned at most once too. EXISTS always returns TRUE or FALSE and it will return TRUE as soon as it finds only a single matching row in t_right, or FALSE, if it find none.

NOT EXISTS, therefore, will return TRUE only if no row satisfying the equality condition is found in t_right (same as for LEFT JOIN / IS NULL).

Note that NULL values do not safisfy the equality conditions, so both LEFT JOIN / IS NULL and NOT EXISTS will always return rows from t_left that have value set to NULL, even is there are rows with value IS NULL in t_right.

NOT IN, however, behaves differently.

IN predicate (unlike EXISTS) is trivalent, i. e. it can return TRUE, FALSE or NULL:

TRUE is returned when the non-NULL value in question is found in the list
FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values
NULL is returned when the value is NULL, or the non-NULL value is not found in the list and the list contains at least one NULL value

IN predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL values on either side of the expression, returning NULL instead.

This of course makes no difference when using the positive form of NULL: predicates returning NULL are filtered out by the WHERE clause as well as those returning FALSE.
However, NOT IN is different, since negation of NULL is NULL as well.

That's why NOT IN condition will never hold for any list with a NULL value in it.

If a row is found in the list, IN will return TRUE and NOT IN, therefore, will return FALSE
If a row is not found in the list, IN will return NULL, and NOT IN on its turn will also return NULL

Both conditions will of course be filtered out by the WHERE clause.

Let's illustrate it with two simple queries that compare (1, NULL) in t_left with (2, NULL) in t_right:

WITH    t_left AS
        (
        SELECT  1 AS value
        UNION ALL
        SELECT  NULL
        ),
        t_right AS
        (
        SELECT  2 AS value
        UNION ALL
        SELECT  NULL
        )
SELECT  l.*
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_right r
        WHERE   r.value = l.value
        )

value
1
NULL
2 rows fetched in 0.0001s (0.0006s)

This query, using NOT EXISTS, returns both values from t_left, since neither of them is equal to any of the values from t_right.

WITH    t_left AS
        (
        SELECT  1 AS value
        UNION ALL
        SELECT  NULL
        ),
        t_right AS
        (
        SELECT  2 AS value
        UNION ALL
        SELECT  NULL
        )
SELECT  l.*
FROM    t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    t_right
        )

value
0 rows fetched in 0.0001s (0.0005s)

This query, on the other hand, returns nothing. Since there is a NULL in t_right, NOT IN returns NULL rather than TRUE if the value is not found among the defined values. Just in case.
IN (and NOT IN) are too chicken to say something definite about lists with NULL unless they are completely sure that the value is there.

However, if the values in both tables are non-nullable, NULL, all three method describe above are semantically identical.

Efficiency comparison

Let's see how efficient are these methods.
To do that, we will create two sample tables:

Table creation details
Table t_left contains 100,000 rows with 10,000 distinct values.
Table t_right contains 1,000,000 rows with 10,000 distinct values.
There are 10 rows in t_left with values not present in t_right.
Let's run the queries against these tables.

NOT IN

SELECT  l.id, l.value
FROM    [20090915_anti].t_left l
WHERE   l.value NOT IN
        (
        SELECT  value
        FROM    [20090915_anti].t_right r
        )

View query results, details and execution plan

id	value
10000	0
20000	0
30000	0
40000	0
50000	0
60000	0
70000	0
80000	0
90000	0
100000	0
10 rows fetched in 0.0005s (0.2716s)

Table 't_left'. Scan count 1, logical reads 499, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_right'. Scan count 1, logical reads 3112, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 266 ms,  elapsed time = 271 ms.

  |--Merge Join(Right Anti Semi Join, MERGE:([r].[value])=([l].[value]), RESIDUAL:([test].[20090915_anti].[t_left].[value] as [l].[value]=[test].[20090915_anti].[t_right].[value] as [r].[value]))
       |--Stream Aggregate(GROUP BY:([r].[value]))
       |    |--Index Scan(OBJECT:([test].[20090915_anti].[t_right].[IX_right_value] AS [r]), ORDERED FORWARD)
       |--Index Scan(OBJECT:([test].[20090915_anti].[t_left].[IX_left_value] AS [l]), ORDERED FORWARD)

As we can see, this query uses Merge Anti Semi Join which is extremely efficient if there is a cheap way to obtain two ordered resultsets (like in example above). Since value is indexed in both tables, the indexes serve as such resulsets.
Merge Join means that the server iterates both resultsets from lower values to higher ones, keeping a pointer to the current value and advancing it in both resultsets.
Anti Semi Join above means that as soon as the engine meets a match in t_right it just skips all matching values in both t_left and t_right. Since values from t_right are pregrouped using Stream Aggregate (making the right resultset 100 times as small), the values are only skipped in t_left (10 at once).

The whole query takes as little as 0.271 s.

NOT EXISTS

SELECT  l.id, l.value
FROM    [20090915_anti].t_left l
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    [20090915_anti].t_right r
        WHERE   r.value = l.value
        )

View query results, details and execution plan

id	value
10000	0
20000	0
30000	0
40000	0
50000	0
60000	0
70000	0
80000	0
90000	0
100000	0
10 rows fetched in 0.0004s (0.2719s)

Table 't_left'. Scan count 1, logical reads 499, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_right'. Scan count 1, logical reads 3112, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 281 ms,  elapsed time = 272 ms.

  |--Merge Join(Right Anti Semi Join, MERGE:([r].[value])=([l].[value]), RESIDUAL:([test].[20090915_anti].[t_right].[value] as [r].[value]=[test].[20090915_anti].[t_left].[value] as [l].[value]))
       |--Stream Aggregate(GROUP BY:([r].[value]))
       |    |--Index Scan(OBJECT:([test].[20090915_anti].[t_right].[IX_right_value] AS [r]), ORDERED FORWARD)
       |--Index Scan(OBJECT:([test].[20090915_anti].[t_left].[IX_left_value] AS [l]), ORDERED FORWARD)

Exactly same plan and exactly same execution time as above.
In SQL Server, NOT IN and NOT EXISTS are complete synonyms in terms of the query plans and execution times (as long as both columns are NOT NULL).

LEFT JOIN / IS NULL

SELECT  l.id, l.value
FROM    [20090915_anti].t_left l
LEFT JOIN
        [20090915_anti].t_right r
ON      r.value = l.value
WHERE   r.value IS NULL

View query results, details and execution plan

id	value
10000	0
20000	0
30000	0
40000	0
50000	0
60000	0
70000	0
80000	0
90000	0
100000	0
10 rows fetched in 0.0004s (0.8125s)

Table 't_left'. Scan count 3, logical reads 1984, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 't_right'. Scan count 3, logical reads 3418, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. 

SQL Server Execution Times:
   CPU time = 1563 ms,  elapsed time = 810 ms.

  |--Parallelism(Gather Streams)
       |--Filter(WHERE:([test].[20090915_anti].[t_right].[value] as [r].[value] IS NULL))
            |--Hash Match(Left Outer Join, HASH:([l].[value])=([r].[value]))
                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([l].[value]))
                 |    |--Index Scan(OBJECT:([test].[20090915_anti].[t_left].[IX_left_value] AS [l]))
                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([r].[value]))
                      |--Index Scan(OBJECT:([test].[20090915_anti].[t_right].[IX_right_value] AS [r]))

Here, the results are the same but performance details are very different.
SQL Server's optimizer cannot discern an ANTI JOIN in a LEFT JOIN / IS NULL construct.
That's why it just build the complete resultset (as with a common LEFT JOIN) and filters out the matching values.

Since we have lots of values to filter in this case (almost 10,000,000), it's a hard job to filter such a tremendous lot of values. This operation is performed using quite an efficient Hash Match which can be and is parallelized, but filtering the values out still takes the most time.

That's why the LEFT JOIN / IS NULL query takes 810 ms, or 3 times as much as the NOT EXISTS / NOT IN query.

Dynamics AX

Tuesday, 22 December 2015

JOINS in SQL

Differences between the methods

Efficiency comparison

NOT IN

NOT EXISTS

LEFT JOIN / IS NULL

No comments:

Post a Comment