bitintr
Use faster PEXT/PDEP implementation on older/non-Intel CPUs
The ZP7 implementation (https://github.com/zwegner/zp7) by Zach Wegner claims to be faster than the built-in instructions on some AMD architectures for most input masks. According to this tweet, PEXT/PDEP performance on some AMD CPUs is input-dependent and much worse than the 1-cycle throughput on Intel.
The ZP7 code is branch-free and is probably also faster than the naive loop currently used as the software fallback in bitintr. It uses CLMUL if available.
If I find the time, I will write a Rust implementation and benchmark it.
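
To make the comparison concrete, here is a minimal sketch of the two approaches being contrasted. The function names are hypothetical and this is not bitintr's actual fallback code, nor a port of ZP7. `pext_naive` and `pdep_naive` are the obvious one-iteration-per-mask-bit loops; `pext_prefix` is the classic parallel-prefix "compress" routine from Hacker's Delight, widened to 64 bits, which always runs a fixed six rounds regardless of the mask. The shift-and-XOR cascade in each round computes a prefix XOR, which is the kind of step a carry-less multiply by an all-ones constant can do in one instruction; how ZP7 uses CLMUL in detail should be checked against its source.

```rust
/// Naive PEXT: walk the set bits of `mask` from low to high and pack the
/// corresponding bits of `x` into the low end of the result.
pub fn pext_naive(x: u64, mut mask: u64) -> u64 {
    let mut result = 0u64;
    let mut out_bit = 0u32;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg(); // isolate lowest set bit
        if x & lowest != 0 {
            result |= 1u64 << out_bit;
        }
        out_bit += 1;
        mask &= mask - 1; // clear lowest set bit
    }
    result
}

/// Naive PDEP: scatter the low bits of `x` to the set-bit positions of `mask`.
pub fn pdep_naive(x: u64, mut mask: u64) -> u64 {
    let mut result = 0u64;
    let mut in_bit = 0u32;
    while mask != 0 {
        let lowest = mask & mask.wrapping_neg();
        if (x >> in_bit) & 1 != 0 {
            result |= lowest;
        }
        in_bit += 1;
        mask &= mask - 1;
    }
    result
}

/// PEXT via the parallel-prefix "compress" method (Hacker's Delight),
/// extended to 64 bits. Runs six rounds for every mask.
pub fn pext_prefix(x: u64, mut mask: u64) -> u64 {
    let mut x = x & mask;
    let mut mk = !mask << 1; // count zeros to the right of each bit
    for i in 0..6 {
        // Prefix XOR of `mk` (parity of the number of zeros to the right).
        let mut mp = mk ^ (mk << 1);
        mp ^= mp << 2;
        mp ^= mp << 4;
        mp ^= mp << 8;
        mp ^= mp << 16;
        mp ^= mp << 32;
        let mv = mp & mask; // bits that must move right by 2^i
        mask = (mask ^ mv) | (mv >> (1 << i));
        let t = x & mv;
        x = (x ^ t) | (t >> (1 << i));
        mk &= !mp;
    }
    x
}
```

A benchmark would want to compare all of these against the hardware instruction across sparse, dense, and random masks, since the naive loops get slower as the mask's popcount grows while the prefix version does the same work for every mask.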