We know that we need to take special attention about NULL values while writing queries because they may lead us into writing incorrect query logics. NULLs may also affect the performance of aggregation function MAX() and MIN(). I have submitted this issue to Microsoft Connect. There are number of ways to get around it. I think the improvement can also be done at query engine level. Hope this can be fixed in next or future version of SQL Server. But for now, we need to know it and know how to get around it.

First, let’s produce the issue.

use tempdb
if object_id('test') is not null
	drop table test
create table test(id int, data char(7000) not null default(replicate('a', 7000)))
create clustered index id on test(id)
insert into test(id) values(null)
go 100
insert into test(id) values(1)
select * from test
-- return 101 rows

Now I create test table in a none unique index with 101 rows. Each row will take a data page in SQL Server. Now let’s execute

set statistics io on
select MAX(id) from test
--(1 row(s) affected)
--Table 'test'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Results from the query indicate that 2 logical reads take place for the execution. Query Engine performs 2 logical reads. That’s great. There is no performance issue at all. Now let’s do the same for MIN

select MIN(id) from test
--(1 row(s) affected)
--Table 'test'. Scan count 1, logical reads 103, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Notice the last result from previous query, it takes 103 logical reads! This is very significant – My table gets fully scanned. Think about this, if my table has billions of rows, scanning such table isn’t a fun at all. Full scan is what we should eliminate in our query in this case at least! Comparing the query plan of those two, they are exactly the same – index scan + stream aggregate. But why the performance is so different?

This happens when there are massive amount of NULLs in the index. Let’s assume our index is in ascending order. While calculating MAX value, SQL Server gets the root page of the index, navigates the B-Tree, finds the maximum value at the right side, and returns the first value from that end (the last value of the index). So it only needs to access 2 pages, root and leaf, in this case, to get the expected value. Although the query plan says table scan, SQL Server will actually not scan entire table since there is no need for that.

While calculating MIN value, SQL Server use the same algorithm, access the root page, navigates the B-Tree, and finds the minimum values at the left side of the tree. Before returning the minimum value, SQL Server has to verify the value to be return is not NULL. To do so, SQL Server scans the pages at leaf level through doubly linked chain. In this case, the first not null values is at the end of the link, so SQL Server scans entire index to reach the first not null value – entire table is scanned.

Obviously, the query processor can’t use NULL as an implicit predicate to utilize the B-Tree efficiently. To get around this issue, there are 2 kind of approaches.

  1. Change existing algorithm
  2. Use existing algorithm but add additional index

Change existing algorithm
This is sample, just add an WHERE clause to force the query engine to seek the index. For instance:

select MIN(id) from test where id is not null

Use existing algorithm but add additional index
This approach applies to the environment that when you are maintaining a third party software where code change will violate service agreements.

  • Change/Add index with filter where id is not null
  • create an index view where id is not null. The indexes of the view can be utilized even your query is not referencing to the index view.

This is brought you by John Huang, http://www.sqlnotes.info

NULL Values Impact the Performance of MAX() and MIN()

You May Also Like

3 thoughts on “NULL Values Impact the Performance of MAX() and MIN()

  1. If I just re-create the cluster index to descend order, there is no impact to the performance for both MIN() and MAX(). But a non-cluster index on id column regardless the sorting order does reduce the logical read for MIN() to 2 as well. Any thoughts about this?

    1. Thanks Guang, you made great point here. It seems descending order of the index will not help the situation. I have corrected my post. I did reproduced your results using non-clustered index – the performance for both MAX and MIN is good. But if you insert 1000 NULLs to the table, then do a test again with none clustered index, you will see the differences.

      1. You are right! The MIN() takes much more logical reads than MAX() using non-clustered index. It should be the similar reason as clustered index that causes the slowness.

Leave a Reply

Your email address will not be published. Required fields are marked *

C# | HTML | Plain Text | SQL | XHTML | XML | XSLT |

This site uses Akismet to reduce spam. Learn how your comment data is processed.