In my last post, I described Hash Joins which is showing as Hash Match physical operator in the graphical execution plan. Another place hash match is used is for aggregation. When columns in group by clause do have have indexes or SQL Server cannot determine whether the rows are sorted or not, SQL Server will perform a Hash match to get aggregates. If distinct keywork appears on the select list, but there is no indexes on the selected columns, a hash match will be used as well. This operation is called hash aggregate.
While aggregating, records are read from the source one by one. The source can be a table or a set of the result of joins. Hashed keys are build based on the columns in the group-by clause. Each value combination of columns in the group by clause will generate one unique hash key in memory – hash list (table). When a record is read from the source,
- If the hashed key is not in the hash table, hashed key will be added.
- If the hash key is already in the hash table, the aggregate values in the hash list will be updated.
Hash aggregate requires CPU power to perform hash algorithm and hash list search. It also requires memory to hold the hash table and aggregates. Although it’s just one row (in hash table) for a group of values, it’s still possible that query is not able to allocate enough memory for the hash table. When this happen, SQL Server will put some hash partitions in to tempdb. That’s great! But what will happen next?
Now let’s say we have part of the hash list in tempdb and part of them in memory. When a new row comes into the pipe, if the hash is in the hash partition(s) in memory, perfrom the aggregate. If the hash is not in the hash partition(s) in memory, then the hash value will be directly saved in the tempdb without any aggregation. Until all rows are processed, the hash table in memory will be sent to client and discarded. The hash partitions in tempdb will be read into memeory to have aggregates calculated and sent to client. Until all hash partitions in tempdb are processed, the query execution completes.
This is a very heavy process especially when hash list runs out of memory. They way to make it efficient is to have values in the group by list sorted. This will not make Hash Match efficient but it will lead Query Processor to use different physical operator to perform aggregate — Stream Aggregate, i will talk about it in my next post.
I find your posts very informative and so far I have not find a book that can help me perform efficient query tuning. Most people will give you the basics that everyone else already knows but I wish there was a book that would go in detail about each operator and when the optimizer decides to use it. If you could include a sample TSQL for us readers to try it ourselves. I plan to come up with a few examples to see when the Hash aggregates are used.
Thank you.
Alex, thanks for visiting my blog. I should have few examples here. For the hash aggregate, when your query does aggregate without index on group by columns and with many rows taking part of the aggregate, SQL Server will use hash aggregate.