Fix checksumming in the presence of large, mostly-zero buffers.
Tools like ASan allocate large (multi-TB) regions of memory that are generally sparsely populated via demand paging. Currently, when checksumming a process that has such regions, we attempt to read the entire multi-TB region into a std::vector, which obviously does not work. Switch our checksum method to crc32c for performance, and further improve performance by:
- Reading the pagemap to detect pages that were allocated but are currently still zero, waiting to be demand-paged in. These can be fast-forwarded using a precomputed crc operator.
- Capping the amount of memory read at once at a reasonable maximum buffer size to avoid thrashing.
The crc32c implementation here is copied from Julia, which itself cobbled it together from various places around the web. It's not the prettiest, but keeping it aligned with the Julia version will make it easier to port over any future hardware-acceleration improvements, if necessary.
How much of a performance improvement is this fancy crc32 implementation?
You can see the benchmark results from a while ago, when I implemented the ARM versions: https://github.com/JuliaLang/julia/pull/22385. AFAICT there aren't any benchmark results for the x86 version from when it was originally added: https://github.com/JuliaLang/julia/pull/18297
I think the pagemap optimization makes sense because the speedup is nearly infinite in those pathological cases with lots of unreserved memory.
I'm less convinced about the benefits of accelerating the CRC implementation for data that we've had to read from the tracee. Wouldn't it be good enough to use the crc32() function from zlib? We could replace the existing crc code in util.cc with that too.
Personally I almost never use the memory checksumming. Maybe you and Keno use it a lot?