Which method (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL) is best to select values present in one table but missing in another one?
This:
, this:1.
SELECT
l.*
2.
FROM
t_left l
3.
LEFT
JOIN
4.
t_right r
5.
ON
r.value = l.value
6.
WHERE
r.value
IS
NULL
or this:1.
SELECT
l.*
2.
FROM
t_left l
3.
WHERE
l.value
NOT
IN
4.
(
5.
SELECT
value
6.
FROM
t_right r
7.
)
1.
SELECT
l.*
2.
FROM
t_left l
3.
WHERE
NOT
EXISTS
4.
(
5.
SELECT
NULL
6.
FROM
t_right r
7.
WHERE
r.value = l.value
8.
)
Differences between the methods
These methods are quite different.First of all,
LEFT JOIN / IS NULL
and NOT EXISTS
are semantically equivalent, while NOT IN
is not. These method differ in how they handle NULL
values in t_right
LEFT JOIN
is guaranteed to return every row from t_left
, and then filtering is applied to the values returned from t_right
. If for some row in t_left
there is no corresponding row in t_right
(which means no row with that exact value is present in t_right
), the row from t_left
will be returned once, and the NULL
values will be substituted instead of t_right
's actual values.Since
NULL
values can never satisfy an equality JOIN
condition, the NULL
values returned by the query are guaranteed to be substituted by the LEFT JOIN
, not fetched out of the actual t_right
's row. This means that LEFT JOIN / IS NULL
is guaranteed to return at most one row from t_left
, and these row's value
is not equal to one of those in t_right
.The same holds for
NOT EXISTS
. Since it's a predicate, not a JOIN
condition, the rows from t_left
can only be returned at most once too. EXISTS
always returns TRUE
or FALSE
and it will return TRUE
as soon as it finds only a single matching row in t_right
, or FALSE
, if it find none.NOT EXISTS
, therefore, will return TRUE
only if no row satisfying the equality condition is found in t_right
(same as for LEFT JOIN / IS NULL
).Note that
NULL
values do not safisfy the equality conditions, so both LEFT JOIN / IS NULL
and NOT EXISTS
will always return rows from t_left
that have value
set to NULL
, even is there are rows with value IS NULL
in t_right
.NOT IN
, however, behaves differently.IN
predicate (unlike EXISTS
) is trivalent, i. e. it can return TRUE
, FALSE
or NULL
:TRUE
is returned when the non-NULL
value in question is found in the listFALSE
is returned when the non-NULL
value is not found in the list and the list does not containNULL
valuesNULL
is returned when the value isNULL
, or the non-NULL
value is not found in the list and the list contains at least oneNULL
value
IN
predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL
values on either side of the expression, returning NULL
instead.This of course makes no difference when using the positive form of
NULL
: predicates returning NULL
are filtered out by the WHERE
clause as well as those returning FALSE
.However,
NOT IN
is different, since negation of NULL
is NULL
as well.That's why
NOT IN
condition will never hold for any list with a NULL
value in it.- If a row is found in the list,
IN
will returnTRUE
andNOT IN
, therefore, will returnFALSE
- If a row is not found in the list,
IN
will returnNULL
, andNOT IN
on its turn will also returnNULL
WHERE
clause.Let's illustrate it with two simple queries that compare
(1, NULL)
in t_left with (2, NULL)
in t_right
:01.
WITH
t_left
AS
02.
(
03.
SELECT
1
AS
value
04.
UNION
ALL
05.
SELECT
NULL
06.
),
07.
t_right
AS
08.
(
09.
SELECT
2
AS
value
10.
UNION
ALL
11.
SELECT
NULL
12.
)
13.
SELECT
l.*
14.
FROM
t_left l
15.
WHERE
NOT
EXISTS
16.
(
17.
SELECT
NULL
18.
FROM
t_right r
19.
WHERE
r.value = l.value
20.
)
value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
NULL | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2 rows fetched in 0.0001s (0.0006s) |
This query, using
NOT EXISTS
, returns both values from t_left
, since neither of them is equal to any of the values from t_right
.01.
WITH
t_left
AS
02.
(
03.
SELECT
1
AS
value
04.
UNION
ALL
05.
SELECT
NULL
06.
),
07.
t_right
AS
08.
(
09.
SELECT
2
AS
value
10.
UNION
ALL
11.
SELECT
NULL
12.
)
13.
SELECT
l.*
14.
FROM
t_left l
15.
WHERE
l.value
NOT
IN
16.
(
17.
SELECT
value
18.
FROM
t_right
19.
)
value | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows fetched in 0.0001s (0.0005s) |
NULL
in t_right
, NOT IN
returns NULL
rather than TRUE
if the value is not found among the defined values. Just in case.IN
(and NOT IN
) are too chicken to say something definite about lists with NULL
unless they are completely sure that the value is there.However, if the values in both tables are non-nullable,
NULL
, all three method describe above are semantically identical.Efficiency comparison
Let's see how efficient are these methods.To do that, we will create two sample tables:
Table creation details
Table
t_left
contains 100,000 rows with 10,000 distinct values.Table
t_right
contains 1,000,000 rows with 10,000 distinct values.There are 10 rows in
t_left
with values not present in t_right
.Let's run the queries against these tables.
NOT IN
1.
SELECT
l.id, l.value
2.
FROM
[20090915_anti].t_left l
3.
WHERE
l.value
NOT
IN
4.
(
5.
SELECT
value
6.
FROM
[20090915_anti].t_right r
7.
)
As we can see, this query uses
Merge Anti Semi Join
which is extremely efficient if there is a cheap way to obtain two ordered resultsets (like in example above). Since value
is indexed in both tables, the indexes serve as such resulsets.Merge Join
means that the server iterates both resultsets from lower values to higher ones, keeping a pointerto the current value and advancing it in both resultsets.
Anti Semi Join
above means that as soon as the engine meets a match in t_right
it just skips all matching values in both t_left
and t_right
. Since values from t_right
are pregrouped using Stream Aggregate
(making the right resultset 100 times as small), the values are only skipped in t_left
(10 at once).The whole query takes as little as 0.271 s.
NOT EXISTS
1.
SELECT
l.id, l.value
2.
FROM
[20090915_anti].t_left l
3.
WHERE
NOT
EXISTS
4.
(
5.
SELECT
NULL
6.
FROM
[20090915_anti].t_right r
7.
WHERE
r.value = l.value
8.
)
Exactly same plan and exactly same execution time as above.
In SQL Server,
NOT IN
and NOT EXISTS
are complete synonyms in terms of the query plans and execution times (as long as both columns are NOT NULL
).LEFT JOIN / IS NULL
1.
SELECT
l.id, l.value
2.
FROM
[20090915_anti].t_left l
3.
LEFT
JOIN
4.
[20090915_anti].t_right r
5.
ON
r.value = l.value
6.
WHERE
r.value
IS
NULL
Here, the results are the same but performance details are very different.
SQL Server's optimizer cannot discern an
ANTI JOIN
in a LEFT JOIN / IS NULL
construct.That's why it just build the complete resultset (as with a common
LEFT JOIN
) and filters out the matching values.Since we have lots of values to filter in this case (almost 10,000,000), it's a hard job to filter such a tremendous lot of values. This operation is performed using quite an efficient
Hash Match
which can be and is parallelized, but filtering the values out still takes the most time.That's why the
LEFT JOIN / IS NULL
query takes 810 ms, or 3 times as much as the NOT EXISTS
/ NOT IN
query.
No comments:
Post a Comment