While rereading my earlier post, Generate MD5 Value from Big Data, I realized I should have mentioned MD5 collisions. The issue lies in the MD5 algorithm itself, not in SQL Server. Collisions rarely happen when hashing character-based data, so MD5 is still a very reliable algorithm for change detection, but you need to be aware of them. Here is a code snippet that demonstrates an MD5 collision: two different 128-byte values that produce the same hash.
declare @a varbinary(130) = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70
declare @b varbinary(130) = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70
select LEN(@a) LengthA, LEN(@b) LengthB
select case when @a = @b then 1 else 0 end [@a = @b ?]
select HASHBYTES('MD5', @a) [Hash from @a], HASHBYTES('MD5', @b) [hash from @b];
/*
LengthA     LengthB
----------- -----------
128         128

(1 row(s) affected)

@a = @b ?
-----------
0

(1 row(s) affected)

Hash from @a                       hash from @b
---------------------------------- ----------------------------------
0x79054025255FB1A26E4BC422AEF54EB4 0x79054025255FB1A26E4BC422AEF54EB4

(1 row(s) affected)
*/
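If a collision like this is a concern for your change detection, one option is to hash with a stronger algorithm. The following is only a sketch: it reuses the same two values and compares their hashes under both MD5 and SHA2_256. It assumes SQL Server 2012 or later, where HASHBYTES accepts 'SHA2_256'; the column aliases are just illustrative.

```sql
-- Same colliding pair as above
declare @a varbinary(130) = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70
declare @b varbinary(130) = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70

-- 1 means the two hashes are equal, 0 means they differ
select
    case when HASHBYTES('MD5', @a) = HASHBYTES('MD5', @b) then 1 else 0 end [MD5 equal ?],
    case when HASHBYTES('SHA2_256', @a) = HASHBYTES('SHA2_256', @b) then 1 else 0 end [SHA2_256 equal ?];
```

The MD5 column should come back as 1, matching the output above, while the SHA2_256 column should come back as 0 because the pair was crafted against MD5 only. The trade-off is a 32-byte hash instead of a 16-byte one.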
4 thoughts on “MD5 Collision”
I knew that collisions could happen but I hadn’t seen an actual example before so this was interesting. I had an application where I had to detect duplicate email attachments so I don’t load the same data twice just because someone did a reply all. I chose SHA1 rather than MD5 and it hasn’t let me down. However, my risk isn’t great as it isn’t a security issue if a collision did occur and we would get a followup if a collision caused my application to ignore something that wasn’t a duplicate. This is something we have to keep an eye on as you can see here:
Wow, John, interesting post. I assume you found this example collision by brute force looping through sample data? Or did you use some other method? I’m just wondering what the statistical likelihood of a collision is for any given set of data…
The MD5 algorithm is broken (wiki). You can pretty easily find even 2-block collisions (2^18), or generate a file with the same hash as some other file has (common prefix attack). Example of some other collisions.
I don’t think I could find two different binary values with the same MD5 in my lifetime by brute-force looping ;) I found this pair a long time ago and saved it on my hard drive. I think someone else used a mathematical approach to figure it out.
You might need to check Google to find the statistical likelihood of a collision. In the database world, a collision is very unlikely, except perhaps when comparing small images.
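As a rough guide to that likelihood, the standard birthday-bound approximation gives the chance of at least one accidental collision among n random values hashed to b bits as about n(n-1) / 2^(b+1). Here is a minimal sketch of that arithmetic; the row count is a made-up figure, and the estimate only applies to non-adversarial data, not to deliberately crafted pairs like the one in the post.

```sql
-- Birthday-bound approximation: p ~ n(n-1) / 2^(bits+1)
declare @n float = 1e9;      -- hypothetical number of hashed values
declare @bits float = 128;   -- MD5 output size in bits

select @n * (@n - 1) / POWER(CAST(2 AS float), @bits + 1) [Approx collision probability];
```

For a billion values this works out to something on the order of 1e-21, which is why accidental collisions are not a practical concern for change detection.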