How do you generate MD5 hash in SQL Server? You might immediately tell that it can be generated MD5 by calling HASHBYTES built-in function. That’s true, however, it only accepts generating MD5 hashes from variables which has less than 8000 bytes. How would you generate MD5 hash from big variables?
First Let’s do a quick test here. We will generate MD5 from a string below
declare @value varbinary(max) select @value = cast(replicate('a', 8000) as varbinary(max)) select LEN(@value) select HASHBYTES ('md5', @value) /* -------------------- 8000 (1 row(s) affected) ------------------------------------ 0x030D638F1246092F18DB00C1C29A915D (1 row(s) affected) */
The lengh of the string is 8000. We successfully get MD5 of it. Now let’s add one more bytes to @values
declare @value varbinary(max) select @value = cast(replicate('a', 8000) as varbinary(max)) + 0x60 select LEN(@value) select HASHBYTES ('md5', @value) /* -------------------- 8001 (1 row(s) affected) ------------------------------------------- Msg 8152, Level 16, State 10, Line 4 String or binary data would be truncated. */
In this case, you get error message 8152, “String or binary data would be truncated.” This is because HashBytes only accept 8000 characters.
MD5 hash is one of the fastest hash algorithm in SQL Server one-way encryption. Many data warehouse applications use it to detect row changes. For instance, a table has a row with 50 fields with varchar(max) or image field in the row. Now the updates are delivered by ETL application from the source systems. If the values in a row before the updates are the same as the values after the updates, then the updates should be ignored. To implement this, one option is that you can put all fields into the predicate(where clause) in your update statement. If no changes, then don’t perform the update. However, not all the data types support comparison, such as image data type. For resolving such issues, you can employ MD5 hash.
Before the data is inserted to the table, calculate MD5 on all fields combined and save the MD5 value to the row as a field. Before updating the row, generate MD5 for the new row and then compare the new MD5 with existing MD5. If they are the same, then discard the changes, otherwise, perform update with the new MD5. With the limitatioin as stated above of HASHBYTES system function, you are also limiting the size of the row. You can always develop a CLR function to call .net library to perfrom MD5 for big field. But actually, SQL Server already provides such functionality in master database. The procedure name is called sys.fn_repl_hash_binary. It can only perform MD5 hash from a binary, which is not as diverse as HASHBYTES. However, it simplifies a lot of work for you, such as development, test, deployment, of CLR etc.
declare @value varbinary(max) select @value = cast(replicate('a', 8000) as varbinary(max)) + 0x60 select LEN(@value) select master.sys.fn_repl_hash_binary(@value) /* -------------------- 8001 (1 row(s) affected) ---------------------------------- 0x5475642AF686A60B809A8959FE04A3E8 (1 row(s) affected) */
13 thoughts on “Generate MD5 Value from Big Data”
Great article! Thank you very much!
Is this function: fn_repl_hash_binary new in Sql Server 2008?
I can’t find it in SQL Server 2005.
That function is written by .Net as well. It does not exist in all the version of SQL Server after version 2005
This doesn’t work for strings over the 8000 character limit. If I run the following the hashes are the same, when they should differ because of the final character over the 8000 character limit. If you replicate one less time and repeat, the hashes are different, so only the first 8000 characters are considered when calculating the hash
declare @value varbinary(max)
select @value = cast(replicate(‘a’, 8000) + ‘a’ as varbinary(max)) + 0x60
select @value = cast(replicate(‘a’, 8000) + ‘b’ as varbinary(max)) + 0x60
Paul, it’s not the limitation of hash_binary. It’s the limitation of replicate function. try this
Noticed it won’t persist in a computed column, which is an issue if you want to index. Still cool, though.
unless you write a CLR function for MD5 hashing
So, is there a similar function for SHA1 of strings > 8000 bytes in SQL?
No, I don’t think so.
This function is available with no need to compile, etc in the SQL# library (which I am the author of). There are two versions of it,
Util_Hashwhich returns a string version of the bytes and
Util_HashBinarywhich returns a binary version of the bytes. These functions support the following algorithms: MD5, SHA1, SHA256, SHA384, and SHA512.
In SQL Server 2017 the 8000 character limit on HASHBYTES has been lifted, so it can be used for any data.
To bring things up to date, note that SQL Server 2017 removes the 8000 character limit on HASHBYTES.