June 08, 2012

NOT IN with NULLs in the Subquery

A coworker came to me with a perplexing issue. He wanted to know why these two queries were not returning the same results:

mysql> SELECT COUNT(*) 
    -> FROM parent
    -> WHERE id NOT IN (SELECT parent_id FROM child);
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (7.84 sec)
mysql> SELECT COUNT(*)
    -> FROM parent p
    -> WHERE NOT EXISTS(SELECT 1 
    ->                  FROM child c
    ->                  WHERE p.id = c.parent_id);
+----------+
| count(*) |
+----------+
|     5575 |
+----------+
1 row in set (2.95 sec)

At first (and second, and third) glance these two queries look identical. It obviously is an exclusion join and because the MySQL optimizer is what it is, I decided to rewrite it as a LEFT JOIN to see what results came back:

mysql> SELECT COUNT(*) FROM parent p
    -> LEFT JOIN child c ON p.id = c.parent_id
    -> WHERE c.id IS NULL;
+----------+
| COUNT(*) |
+----------+
|     5575 |
+----------+
1 row in set (2.90 sec)

The LEFT JOIN returned the same results as the NOT EXISTS version. Why did the NOT IN query return 0 results? Since it was a beautiful day and I am completely unable to solve problems while sitting at the computer, I went for a walk.

After I returned and changed my thunderstorm-soaked clothes, I had a thought of something…

mysql> SELECT COUNT(*)FROM child WHERE parent_id IS NULL;
+----------+
| count(*) |
+----------+
|     3686 |
+----------+
1 row in set (0.20 sec)

There are NULLs in the pseudo-foreign key (this database does not have explicit foreign key constraints). What if the NULLs are excluded from the dependent subquery in the NOT IN clause?

mysql> SELECT COUNT(*)
    -> FROM parent
    -> WHERE id NOT IN (SELECT parent_id 
    ->                  FROM child 
    ->                  WHERE parent_id IS NOT NULL);
+----------+
| count(*) |
+----------+
|     5575 |
+----------+
1 row in set (7.67 sec)

Sure enough, the NULLs were the issue. But, why? Let’s think about a simpler case with constant values instead of subqueries. These two queries (using the sakila database) are logically equivalent:

mysql> SELECT COUNT(*) FROM actor WHERE actor_id NOT IN (1,2,3,4);
+----------+
| COUNT(*) |
+----------+
|      199 |
+----------+
1 row in set (0.13 sec)

mysql> SELECT COUNT(*) FROM actor 
    -> WHERE actor_id <> 1 
    ->   AND actor_id <> 2
    ->   AND actor_id <> 3
    ->   AND actor_id <> 4;
+----------+
| COUNT(*) |
+----------+
|      199 |
+----------+
1 row in set (0.06 sec)

What if there is a NULL in there?

mysql> SELECT COUNT(*) FROM actor
    -> WHERE actor_id <> 1
    ->   AND actor_id <> 2
    ->   AND actor_id <> 3
    ->   AND actor_id <> 4
    ->   AND actor_id <> NULL ;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

The reason why this returns 0 results is that column <> NULL (or column = NULL, for that matter) always evaluates to NULL. NULL doesn’t equal anything. It is just NULL. TRUE AND TRUE AND TRUE AND NULL always evaluates to NULL. An illustration:

mysql> SELECT TRUE AND FALSE;
+----------------+
| TRUE AND FALSE |
+----------------+
|              0 |
+----------------+
1 row in set (0.00 sec)

mysql> SELECT TRUE AND TRUE;
+---------------+
| TRUE AND TRUE |
+---------------+
|             1 |
+---------------+
1 row in set (0.00 sec)

mysql> SELECT TRUE AND NULL;
+---------------+
| TRUE AND NULL |
+---------------+
|          NULL |
+---------------+
1 row in set (0.00 sec)

mysql> SELECT 1=NULL;
+--------+
| 1=NULL |
+--------+
|   NULL |
+--------+
1 row in set (0.00 sec)

Here’s the final proof of the NULL problem:

mysql> SELECT COUNT(*) FROM actor WHERE actor_id NOT IN (1,2,3,4,NULL);
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

The moral of the story is, if you use NOT IN with a dependent subquery, make sure you exclude NULLs. This problem would also go away if there had been explicit foreign key constraints, which would enforce referential integrity. Additionally, as you can see from the performance differences above, NOT IN with a subquery is horribly optimized in MySQL pre-5.6, so just don’t do this at all on MySQL and use one of the other two derivations instead.