toys
SWAR movemask need not use a multiply-high operation
Multiply-high instructions are typically more expensive, in both latency and throughput, than ordinary (low-half) multiply instructions, and certain ISAs/compilers (looking at you, MSVC) cannot emit a multiply-high at all without a function call.
While writing a SWAR fallback for my Accelerated-Zig-Parser, I realized that the SWAR movemask operation does not need a multiply-high. Instead, an ordinary wrapping multiply can concentrate the bits in the upper byte, and a right shift then moves them down to the lowest byte.
Effectively, the constants in your article can be shifted right by a number of bits equal to the input's size in bytes (8 for 64-bit inputs, 4 for 32-bit inputs):
input | old (multiply-high) | new (plain multiply) |
---|---|---|
64-bit | 0x204081020408100 | 0x2040810204081 |
32-bit | 0x2040810 | 0x204081 |
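Why the shifted constants work can be sketched as follows, assuming (as with movemask) that the input has already been masked with 0x80 in every byte, so the flag of byte i is bit 8i + 7; b_i denotes that flag and c the new 64-bit constant.

```math
\begin{aligned}
m &= \sum_{i=0}^{7} b_i \, 2^{8i+7}, \qquad
c = \mathtt{0x2040810204081} = \sum_{k=0}^{7} 2^{7k}, \\
m \cdot c &= \sum_{i=0}^{7} \sum_{k=0}^{7} b_i \, 2^{8i+7k+7}, \qquad
k = 7 - i \;\Rightarrow\; 8i + 7k + 7 = 56 + i, \\
\left\lfloor \frac{m \cdot c \bmod 2^{64}}{2^{56}} \right\rfloor &= \sum_{i=0}^{7} b_i \, 2^{i}.
\end{aligned}
```

The last step holds because the exponents 8i + 7k + 7 are pairwise distinct for 0 <= i, k <= 7 (8(i - i') = 7(k' - k) forces i = i' and k = k'), so the partial products never carry into each other, and the eight terms with k = 7 - i already occupy bits 56 through 63. The old constant equals c * 2^8, which moves the same bits to positions 64 + i, i.e. into the high half of the 128-bit product. The 32-bit case is the same with i, k in {0, 1, 2, 3} and c = 0x204081: the term with k = 3 - i puts b_i at bit 28 + i, while the old constant (c * 2^4) puts it at bit 32 + i.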
For 64-bit integers, we shift right by 56 to keep the upper 8 bits; for 32-bit integers, we shift right by 28 to keep the upper 4 bits.
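A minimal C sketch of the idea, assuming each byte's flag is its most significant bit, as with x86 movemask. The function names (`swar_movemask64`, `swar_movemask32`, `movemask_ref`, `make_input`, `swar_movemask_selfcheck`) are hypothetical and not taken from Accelerated-Zig-Parser. Only an ordinary wrapping multiply and a shift are used, so no 128-bit type or multiply-high intrinsic is involved. A bit-by-bit reference and an exhaustive `assert`-based self-check are included.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only: names are hypothetical, not from Accelerated-Zig-Parser.
   Assumes each byte's flag is its most significant bit (0x80), as with x86
   movemask, and that byte i (bits 8i..8i+7) maps to bit i of the result. */

/* 64-bit: a plain wrapping multiply gathers the 8 flag bits into bits 56..63. */
static inline uint8_t swar_movemask64(uint64_t x) {
    uint64_t msbs = x & UINT64_C(0x8080808080808080);
    uint64_t prod = msbs * UINT64_C(0x2040810204081); /* wraps mod 2^64 */
    return (uint8_t)(prod >> 56);
}

/* 32-bit: the 4 flag bits land in bits 28..31. */
static inline uint8_t swar_movemask32(uint32_t x) {
    uint32_t msbs = x & UINT32_C(0x80808080);
    uint32_t prod = msbs * UINT32_C(0x204081); /* kept to the low 32 bits */
    return (uint8_t)(prod >> 28);
}

/* Bit-by-bit reference: copy the MSB of byte i into bit i. */
static uint8_t movemask_ref(uint64_t x, int nbytes) {
    uint8_t mask = 0;
    for (int i = 0; i < nbytes; i++)
        mask |= (uint8_t)(((x >> (8 * i + 7)) & 1u) << i);
    return mask;
}

/* Builds a word whose byte i has its MSB set iff bit i of flags is set; the
   low 7 bits of every byte are 1s, as noise the 0x80 mask must discard. */
static uint64_t make_input(unsigned flags, int nbytes) {
    uint64_t x = 0;
    for (int i = 0; i < nbytes; i++)
        x |= (uint64_t)(((flags >> i) & 1u) ? 0xFF : 0x7F) << (8 * i);
    return x;
}

/* Exhaustive check of both widths against the expected mask and the reference. */
static void swar_movemask_selfcheck(void) {
    for (unsigned m = 0; m < 256; m++) {
        uint64_t x = make_input(m, 8);
        assert((unsigned)swar_movemask64(x) == m);
        assert(swar_movemask64(x) == movemask_ref(x, 8));
    }
    for (unsigned m = 0; m < 16; m++) {
        uint32_t x = (uint32_t)make_input(m, 4);
        assert((unsigned)swar_movemask32(x) == m);
        assert(swar_movemask32(x) == movemask_ref(x, 4));
    }
}
```

For example, `swar_movemask64(0x8000800000800080)` returns `0xA5`, because bytes 0, 2, 5 and 7 have their high bit set. Only the low 64 (or 32) bits of the product are ever read, which is what lets an ordinary multiply replace the multiply-high.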