The Taxonomy lacks “Limitation”, simply because this topic derives a certain complexity. We need to differ between the argument count limitation, and the bit length limitation. As the taxonomy already reflects, CHECKSUM and BINARY_CHECKSUM do accept more than one column as an input. On the other hand, HASHBYTES only accepts one input column. With that said, based on the chosen hash function, an additional limitation may apply: The bit-length of an argument.
|SQL Server 2014 (12.0) and older||SQL Server 2016 (13.0) and newer|
|CHECKSUM||No limit||No limit|
|BINARY_CHECKSUM||varbinary: No limit|
varchar: 255 characters
|varbinary: No limit|
varchar: 255 characters
|HASHBYTES||8000 bytes||No limit|
In order to analyse the uniqueness, lets have a closer look on the functions return values. (We consider using SHA2_512 with HASHBYTES)
|Data-Type||Max. Unique Representations|
|CHECKSUM||integer (4 bytes)||4 294 967 296|
|BINARY_CHECKSUM||integer (4 bytes)||4 294 967 296|
|HASHBYTES||16/20/32/64 bytes||18 446 744 073 709 551 616|
Even thought CHECKSUM and BINARY_CHECKSUM do have no input bit-length limitation (more or less) they only do return 4 bytes, in order to represent a hash value. Even if we manage to hash 4.294.967.296 unique data entries, we will run consequently into hash intersections upon the next insert. (Best case) HASHBYTES does offer the best unique representation and narrows down the possibility intersections.
CHECKSUM and BINARY_CHECKSUM is a Transact-SQL proprietary implementation and its source code is not available for the public. Because those two functions are only available on the SQL Server, one won’t be able to pre-calculate hash values outside of the domain. Hence, if you require to calculate the hashes within your application, you are out of luck.
There is an reverse engineered version for CHECKSUM. Sadly, it only works for Strings; but not for binaries.
In other words; if you plan to target not only TSQL, consider using HASHBYTES right away. Furthermore, HASHBYTES offers more than one implementation: MD2, MD4, MD5, SHA, SHA1, SHA2_256, SHA2_512.
The benchmark will surely vary among different system specifications. But at the end of the day the overall result should reflect a rough insight. The benchmark hashes a 500KB file.
|Execution time in MS|
CHECKSUM + BINARY_CHECKSUM vs HASHBYTES
From my point of view, the three hashing algorithm do serve a different purpose. HASHBYTES focuses on a single value, while CHECKSUM and BINARY_CHECKSUM are capable of hashing an entire row; which may be handy in some cases. E.g.: to determin whether a row has been changed. Furthermore, HASHBYTES is system independent, comprehensible and suitable for cross database/platform software.
CHECKSUM vs BINARY_CHECKSUM
Here the fun begins.
- Both accept multiple columns
- Both accept the same input data types
- Both decline the same input data type
- Both return a 4 byte sequence result
- Both are proprietary
- Both are quite performant
The one and only BIG difference is their way of interpreting information.
- CHECKSUM interprets the provided data as an case insensitive String, depending on the SQL Servers collation and case configuration.
- BINARY_CHECKSUM interprests the provided data as a binary sequence.
So what, may you think. Well, lets have a look at following hash results.
|“Hello World!”||“hello world!”||“2Volvo Director 20”||“3Volvo Director 30”|
“Hello World!” and “hello world!” is, for a case insensitive SQL Server configuration, the same for the CHECKSUM function. However, the BINARY_CHECKSUM reads different bits; and thus, returns a different result.
“2Volvo Director20” and “3Volvo Director 30” yields the same binary representation, thus, returns the same hash value. However, CHECKSUM detects the String difference and returns a different hash value.
- Use CHECKSUM for nvarchar and varchar columns
- Use BINARY_CHECKSUM for varbinary columns
- Don’t store a HASHBYTES result in an integer or bigint. Else byte information will be lost and hash collisions are likely to occur
- Don’t compare CHECKSUM and BINARY_CHECKSUM hashes. They don’t have to, but may vary, based on the SQL Server configuration.