Modulo could be accelerated with dot-product instructions to range-reduce the LHS

Open abadams opened this issue 3 years ago • 0 comments

Let A, B, C, D be the bytes of a 32-bit x, in big-endian order, so:

x = A * (1 << 24) + B * (1 << 16) + C * (1 << 8) + D

Reducing both sides modulo y, we get:

x % y = (A * ((1 << 24) % y) + B * ((1 << 16) % y) + C * ((1 << 8) % y) + D) % y

That RHS, except for the final modulo, can be computed cheaply using a dot product instruction (e.g. udot) on the bytes of x (provided that all those coefficients < 256, which is guaranteed if y < 256). The final modulo may now be much cheaper, because it only has to be correct on a range-reduced input. For y < 64, the dot product is guaranteed to fit in 16-bit, and a 16-bit modulo is much cheaper than a 32-bit modulo.

Consider, e.g. computing x % 3 on x86. The coefficients are all one, so the dot-product is just a horizontal add with a factor of four, which can be done with two pmaddubsw instructions and one phaddw. That produces a 10-bit intermediate. Reducing a 10-bit value modulo 3 requires only three more instructions (pmulhuw, pmullw, paddw). Reducing a 32-bit value modulo 3 is much more expensive (~25 instructions), because it requires 64-bit multiplies.

This would rely on the just-merged #6853 to view the 32-bit vector as its bytes.

Jul 20 '22 00:07 abadams