CHECKSUM vs BINARY_CHECKSUM vs HASHBYTES

Limitations

The Taxonomy lacks “Limitation”, simply because this topic derives a certain complexity. We need to differ between the argument count limitation, and the bit length limitation. As the taxonomy already reflects, CHECKSUM and BINARY_CHECKSUM do accept more than one column as an input. On the other hand, HASHBYTES only accepts one input column. With that said, based on the chosen hash function, an additional limitation may apply: The bit-length of an argument.

SQL Server 2014 (12.0) and olderSQL Server 2016 (13.0) and newer
CHECKSUMNo limitNo limit
BINARY_CHECKSUMvarbinary: No limit
varchar: 255 characters
varbinary: No limit
varchar: 255 characters
HASHBYTES8000 bytesNo limit

Uniqueness

In order to analyse the uniqueness, lets have a closer look on the function’s return values. (We consider using SHA2_512 with HASHBYTES)

Data-TypeMax. Unique Representations
CHECKSUMinteger (4 bytes)4 294 967 296‬
BINARY_CHECKSUMinteger (4 bytes)4 294 967 296‬
HASHBYTES16/20/32/64 bytes18 446 744 073 709 551 616

Even thought CHECKSUM and BINARY_CHECKSUM do have no input bit-length limitation (more or less, See Limitations) they only do return 4 bytes, in order to represent a hash value. Even if we manage to hash 4.294.967.296 unique data entries, we will run consequently into hash collision upon the next insert. (Best case) HASHBYTES does offer the best unique representation and narrows down the possibility of intersections.

Compatibility

CHECKSUM and BINARY_CHECKSUM is a Transact-SQL proprietary implementation and its source code is not available for the public. Because those two functions are only available on the SQL Server, one won’t be able to pre-calculate hash values outside of the domain. Hence, if you require to calculate the hashes within your application, you are out of luck.

There is an reverse engineered version for CHECKSUM. Sadly, it only works for Strings; but not for binaries.

In other words; if you plan to target not only TSQL, consider using HASHBYTES right away. Furthermore, HASHBYTES offers more than one implementation: MD2, MD4, MD5, SHA, SHA1, SHA2_256, SHA2_512.

Performance

The benchmark will surely vary among different system specifications. But at the end of the day the overall result should reflect a rough insight. The benchmark hashes a 500KB file.

Execution time in MS
CHECKSUM53
BINARY_CHECKSUM67
HASHBYTES MD2140
HASHBYTES MD471
HASHBYTES MD579
HASHBYTES SHA65
HASHBYTES SHA1101
HASHBYTES SHA2_25669
HASHBYTES SHA2_51288

CHECKSUM + BINARY_CHECKSUM vs HASHBYTES

From my point of view, the three hashing algorithm do serve a different purpose. HASHBYTES focuses on a single value, while CHECKSUM and BINARY_CHECKSUM are capable of hashing an entire row; which may be handy in some cases. E.g.: to determin whether a row has been changed. Furthermore, HASHBYTES is system independent, comprehensible and suitable for cross database/platform software.

CHECKSUM vs BINARY_CHECKSUM

Here the fun begins.

  • Both accept multiple columns
  • Both accept the same input data types
  • Both decline the same input data type
  • Both return a 4 byte sequence result
  • Both are proprietary
  • Both are quite performant

The one and only BIG difference is their way of interpreting information.

  • CHECKSUM interprets the provided data as an case insensitive String, depending on the SQL Servers collation and case configuration.
  • BINARY_CHECKSUM interprests the provided data as a binary sequence.

So what, may you think. Well, lets have a look at following hash results.

“Hello World!”“hello world!”“2Volvo Director 20”“3Volvo Director 30”
CHECKSUM-354637007-354637007-341465450-341453853
BINARY_CHECKSUM-719541813-685856309-1356512636-1356512636

“Hello World!” and “hello world!” is, for a case insensitive SQL Server configuration, the same for the CHECKSUM function. However, the BINARY_CHECKSUM reads different bits; and thus, returns a different result.

“2Volvo Director 20” and “3Volvo Director 30” yields, due to its hashing-algorithm implementation, the same hash value. However, CHECKSUM detects the String difference and returns a different hash value.

  • Use CHECKSUM for nvarchar and varchar columns
  • Use BINARY_CHECKSUM for varbinary columns

Pitfalls

  • Don’t store a HASHBYTES result in an integer or bigint. Else byte information will be lost and hash collisions are likely to occur
  • Don’t compare CHECKSUM and BINARY_CHECKSUM hashes. They don’t have to, but may vary, based on the SQL Server configuration.

Taxonomy

CompatibilityMulti-ColumnSemi UniquenessPerformance
CHECKSUM
BINARY_CHECKSUM
HASHBYTES

Resources

Microsoft Docs

CheckSum Reverse Engineered Code

0 0 vote
Article Rating
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments