A checksum is a function that computes an integer value from a string of bytes; the value is used to detect errors in transmission or storage. Each checksum function has slightly different speed and robustness properties, so which one should you use? In my opinion, new applications should use the CRC32C algorithm: it is very robust and is supported in hardware by newer Intel CPUs. This article discusses checksums in general, then provides an implementation and some performance advice.
At the very robust end of the spectrum are cryptographic hash functions, whose outputs are typically quite large (16 bytes or more). They are also computationally expensive, so they should only be used when reliability is a necessity and the computational cost is acceptable. High-throughput applications need cheaper functions.
On the "fast" end of the spectrum are functions designed to be easily computed in software. The Internet checksum, used for IPv4, TCP, and UDP headers, is 16 bits/2 bytes long; it is very fast but also very weak, allowing errors to go undetected. Adler-32 is 32 bits/4 bytes long and stronger, while still being very efficient to compute in software. Unfortunately, it is particularly weak for very short messages (see the rationale for switching the SCTP protocol from Adler-32 to CRC32C). Fletcher's checksum is slightly stronger, at the cost of being slightly slower to compute.
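To make the short-message weakness concrete, here is a minimal Adler-32, written from its definition in RFC 1950 (the function name is mine, not zlib's API). For a short input, both running sums stay far below the modulus 65521, so much of the 32-bit output space is never used:

```c
#include <stdint.h>
#include <stddef.h>

/* Adler-32 per RFC 1950: two running sums modulo 65521,
   the largest prime below 2^16. */
static uint32_t adler32(const uint8_t* data, size_t len) {
    const uint32_t MOD = 65521;
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % MOD;   /* sum of bytes */
        b = (b + a) % MOD;         /* sum of sums: adds position sensitivity */
    }
    return (b << 16) | a;         /* b in the high half, a in the low half */
}
```

For example, `adler32("abc")` yields 0x024D0127: the high half (0x024D = 589) and low half (0x0127 = 295) are tiny compared to their 16-bit range.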
Cyclic redundancy checks (CRCs) are functions that are a good middle ground. They have been well studied by mathematicians and can be easily implemented in hardware. In particular, the CRC known as CRC32C (Castagnoli) has been found to be particularly strong for communication applications. This particular CRC is now implemented in hardware in Intel CPUs as part of the SSE4.2 extensions.
This hardware support changes things significantly. On CPUs that support the CRC32 instruction, computing this relatively strong checksum is even cheaper than computing Adler-32 checksums, so there is no need to use these weaker functions. As these CPUs are replaced and upgraded, the instruction will be available everywhere. The only risk is that currently AMD, Via, and Intel Atom CPUs do not support it. Thus, a software fallback is needed, and on these CPUs the computation will be slower than Adler-32 or Fletcher's checksum.
Implementing CRC32C efficiently in software has been well studied. The best algorithm is called Slicing-by-8, which uses an 8 kB lookup table. The experimental results in the authors' paper show that even with cold caches, it outperforms the typical byte-at-a-time algorithm (with a 1 kB lookup table) once the message is longer than approximately 200 bytes. I've taken Intel's source code and ported it to GCC, and included it in my implementation (see link below).
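The following is a sketch of Slicing-by-8 under my own naming and table layout, not Intel's actual code. The first 256-entry table is the classic 1 kB byte-at-a-time table; the other seven let the inner loop consume eight bytes per iteration with eight independent lookups. The 32-bit halves are assembled byte-by-byte so the sketch stays endian-portable:

```c
#include <stdint.h>
#include <stddef.h>

/* CRC32C uses the Castagnoli polynomial; 0x82F63B78 is its reflected form. */
static uint32_t slice_table[8][256];

static void crc32c_init(void) {
    for (int i = 0; i < 256; i++) {
        uint32_t c = (uint32_t)i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0x82F63B78u : c >> 1;
        slice_table[0][i] = c;           /* classic 1 kB table */
    }
    /* table[t][i] = effect of byte i followed by t zero bytes */
    for (int t = 1; t < 8; t++)
        for (int i = 0; i < 256; i++)
            slice_table[t][i] = (slice_table[t-1][i] >> 8)
                              ^ slice_table[0][slice_table[t-1][i] & 0xFF];
}

static uint32_t load_le32(const uint8_t* p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint32_t crc32c_slice8(uint32_t crc, const void* buf, size_t len) {
    const uint8_t* p = buf;
    crc = ~crc;
    while (len >= 8) {                   /* eight bytes per iteration */
        uint32_t lo = crc ^ load_le32(p);
        uint32_t hi = load_le32(p + 4);
        crc = slice_table[7][lo & 0xFF] ^ slice_table[6][(lo >> 8) & 0xFF]
            ^ slice_table[5][(lo >> 16) & 0xFF] ^ slice_table[4][lo >> 24]
            ^ slice_table[3][hi & 0xFF] ^ slice_table[2][(hi >> 8) & 0xFF]
            ^ slice_table[1][(hi >> 16) & 0xFF] ^ slice_table[0][hi >> 24];
        p += 8; len -= 8;
    }
    while (len--)                        /* trailing bytes, one at a time */
        crc = slice_table[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}
```

The standard check value for CRC32C is the checksum of the ASCII string "123456789", which should come out to 0xE3069283.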
The CRC32 instruction is straightforward to use via GCC's built-in functions. I experimented with this a bit, and it turns out there are two tweaks for getting slightly better performance. First, aligned memory accesses are typically faster. However, my results show that for the CRC32 instruction, alignment only matters for data blocks larger than 4096 bytes, and even then the difference is small. Since I'm typically using small blocks, my implementation doesn't bother to align the memory accesses. Second, when computing the CRC on the trailing bytes, my tests show that it is slightly better to use a switch that combines the various sized CRC32 instructions than to loop over the single-byte CRC32 instruction.
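As an illustration of both points, here is a hypothetical hardware-path function (the name and structure are mine, not the code in the linked tarball). It assumes GCC or Clang on x86-64, uses unaligned 8-byte reads via `memcpy` in the bulk loop, and finishes the trailing bytes with the 4-, 2-, and 1-byte instruction forms in order, which is one way to realize the size-combining dispatch described above. It must only be called after SSE4.2 support has been verified:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Hardware CRC32C via GCC's SSE4.2 built-ins. Hypothetical sketch;
   assumes x86-64 and a CPU with SSE4.2 (check before calling). */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const void* buf, size_t len) {
    const uint8_t* p = buf;
    uint64_t c = ~crc;                       /* usual init convention */
    while (len >= 8) {                       /* bulk: 8 bytes per instruction */
        uint64_t w;
        memcpy(&w, p, 8);                    /* memcpy sidesteps alignment UB */
        c = __builtin_ia32_crc32di(c, w);
        p += 8; len -= 8;
    }
    uint32_t c32 = (uint32_t)c;
    /* Trailing 0..7 bytes: largest instruction forms first, in byte order. */
    if (len & 4) { uint32_t w; memcpy(&w, p, 4); c32 = __builtin_ia32_crc32si(c32, w); p += 4; }
    if (len & 2) { uint16_t h; memcpy(&h, p, 2); c32 = __builtin_ia32_crc32hi(c32, h); p += 2; }
    if (len & 1) c32 = __builtin_ia32_crc32qi(c32, *p);
    return ~c32;
}
```

Note that the trailing bytes must still be processed in order; that is why the 4-byte form runs before the 2- and 1-byte forms rather than the other way around.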
Finally, to use the CRC32 instruction, you need to detect whether it is supported, and fall back to the software implementation if it is not. To do this, you need to issue the CPUID instruction and check for SSE4.2 support. My approach is to use a function pointer that initially points at a detection routine. On the first call, the detection routine checks if the CPU supports the CRC32 instruction, then modifies the function pointer to point to the appropriate implementation.
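This self-patching pointer pattern can be sketched as follows. The names are mine, and the two implementations are deliberately minimal byte-at-a-time versions for illustration; real code would use the faster variants. It assumes GCC or Clang on x86, where `__builtin_cpu_supports` wraps the CPUID check:

```c
#include <stdint.h>
#include <stddef.h>

/* Hardware path: one byte per CRC32 instruction (illustrative only). */
__attribute__((target("sse4.2")))
static uint32_t crc32c_sse42(uint32_t crc, const void* buf, size_t len) {
    const uint8_t* p = buf;
    uint32_t c = ~crc;
    while (len--) c = __builtin_ia32_crc32qi(c, *p++);
    return ~c;
}

/* Portable fallback: bitwise, reflected polynomial 0x82F63B78. */
static uint32_t crc32c_bitwise(uint32_t crc, const void* buf, size_t len) {
    const uint8_t* p = buf;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return ~crc;
}

typedef uint32_t (*crc32c_fn)(uint32_t, const void*, size_t);
static uint32_t crc32c_detect(uint32_t, const void*, size_t);

/* The pointer starts at the detection routine... */
static crc32c_fn crc32c = crc32c_detect;

static uint32_t crc32c_detect(uint32_t crc, const void* buf, size_t len) {
    /* ...which patches itself out on the first call. */
    crc32c = __builtin_cpu_supports("sse4.2") ? crc32c_sse42 : crc32c_bitwise;
    return crc32c(crc, buf, len);
}
```

After the first call through `crc32c`, every subsequent call dispatches directly to the chosen implementation with no per-call detection cost.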
Compiles with GCC on Linux and Mac OS X: crc32c.tar.bz2 (BSD license)