tinygraph icon indicating copy to clipboard operation
tinygraph copied to clipboard

Implement Muła's algorithm for AVX2-based popcount within a basic block

Open daniel-j-h opened this issue 1 year ago • 3 comments

Our succinct rank/select data structures heavily rely on popcount. At the moment we are using the popcount intrinsic but by now there are faster ways of running popcount over a long sequence of bytes. We should explore them.

  • https://arxiv.org/abs/1611.07612
  • https://news.ycombinator.com/item?id=11277891
  • http://0x80.pl/notesen/2008-05-24-sse-popcount.html#changelog

libraries

  • https://github.com/kimwalisch/libpopcnt
  • https://github.com/WojciechMula/sse-popcount

daniel-j-h avatar Feb 27 '25 14:02 daniel-j-h

I have read the paper and checked out the libraries linked above. Here's a quick summary 3. The paper introduces the AVX2 "Harley-Seal" algorithm and finds it 2.4x faster than the popcnt instruction 4. Harley-Seal works on 16 x 256 bit registers a time, that makes it great for popcounting long inputs 5. The paper shows another AVX2 impl: Muła's algorithm: it computes 4 popcounts (4x64bit numbers in a 256bit register)

With rank/select implementations e.g. as in

  • https://github.com/tinygraph/tinygraph/pull/41

the basic block is of size 512 bit (64 bytes, 1 cache line) and the implementation works as follows

  • rank: create an index to look up the 512bit block to rank in, then rank in the 512bit block. Here Muła's algorithm (or a slight adaption of it) would help with ranking in the 512bit block. Harley-Seal does not work on 512bit blocks, it would work for 4096bit blocks, tho, making the index smaller but the within-block search computationally heavier.
  • select: we sample 1s and then scan the rank index to find a block where the ith bit has to be in. Then we select within a 512bit block. Selecting within a 512bit block runs popcounts until it overshoots and then finds the bit in the word before. Again Harley-Seal will not work here, and Muła's algorithm will need slight adaptions. What could help is a block size of at least 4096bits in the rank structure so that in select scanning over the rank blocks is faster and we can make use of Harley-Seal but that requires changing the rank select approach altogether.

A few more notes on what it means for our library

  1. The approach depends on what instruction set is available: we'd need different implementations for AVX512, AVX2, SSE, and ARM and PPC64, a popcnt fallback, and a scalar fallback for the worst case. This makes it complicated.
  2. The libpopcnt library implements these fast algorithms for all these architectures and instruction sets and is a single file. We could just inline it here and make use of it. The downside is currently our project doesn't support AVX512 or SSE or ARM or PPC64 in the first place so having a fast popcount implementation is a good start but not good enough.

Conclusion: The simplest way forward is to work with AVX2 for now since the rest of the library is only built for that. For ranking within a 512bit block we try out Muła's algorithm and adapt it so that it works for our rank and select implementations. This also means only changing tinygraph_bits_rank_512 and tinygraph_bits_select_512. The expected performance boost for popcount is around x1.2-1.4 based on the paper's micro benchmarks but best to see and benchmark end to end on our own since we do a bit more than simply popcounting.

daniel-j-h avatar Mar 03 '25 10:03 daniel-j-h

Note that the libpopcnt is slightly different compared to the previous version found in the paper

  • https://github.com/kimwalisch/libpopcnt/blob/74dd1c1e3f277f339fdfaf6f33f77b6ae0e075a0/libpopcnt.h#L392-L415
  • https://github.com/kimwalisch/libpopcnt/blob/74dd1c1e3f277f339fdfaf6f33f77b6ae0e075a0/LICENSE

and from the author

  • https://github.com/WojciechMula/sse-popcount/blob/138c91e21c3e6dab7875521b5d33b995e0e4c85e/popcnt-avx2-lookup.cpp#L1-L116
  • https://github.com/WojciechMula/sse-popcount/blob/138c91e21c3e6dab7875521b5d33b995e0e4c85e/LICENSE

daniel-j-h avatar Mar 03 '25 10:03 daniel-j-h

Note: first have our rank/select data structure and use it for graph traversal; only then implement this and benchmark

  • https://github.com/tinygraph/tinygraph/pull/41

daniel-j-h avatar Mar 03 '25 20:03 daniel-j-h