constantine
Vectorized table select
The CMOV instruction used for conditional copy is likely optimal for a single Fp element of 4~6 limbs.
From Agner Fog's instruction tables
https://www.agner.org/optimize/instruction_tables.pdf
CMOV has a reciprocal throughput of 0.5 cycles, so 2 independent CMOVs can be issued per cycle and a 4-6 limb Fp element takes 2-3 cycles to copy.
However, a table precomputed for scalar multiplication/signing holds 8 EC elements, each composed of 3 Fp coordinates of 4-6 limbs. Using SSE or AVX we can issue 2 vector loads per cycle (bottlenecked by memory speed), i.e. 2x2 or 2x4 64-bit limbs per cycle (2x4 or 2x8 if counted as 32-bit words).
This would reduce the overhead of table access. Note that LSB set recoding (#73) uses tables with 64 to 256 EC elements (192+ Fp elements, hence hundreds to thousands of limbs: 64 x 3 x 4 = 768 at the low end, 256 x 3 x 6 = 4608 at the high end).
The code to vectorize is the table lookup here: https://github.com/mratsim/constantine/blob/00ff59910618d683c96b5bd4ec3972ba92990ce1/constantine/elliptic/ec_endomorphism_accel.nim#L200-L206
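
To make the idea concrete, here is a minimal sketch in portable C (not the Nim code linked above) of a table select that scans every entry and combines them with AND/OR masks instead of CMOV. The names `ec_point`, `table_select`, `LIMBS` and `COORDS` are hypothetical, and the flat 3 x 4-limb layout is an assumption. The inner loop is plain AND/OR over contiguous limbs, which is the shape that maps to SSE2/AVX2 `pand`/`por`, either through compiler auto-vectorization or hand-written intrinsics. Whether a given compiler keeps it branch-free and actually vectorizes it would need to be checked in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

#define LIMBS 4                      /* assumption: 4 x 64-bit limbs per Fp */
#define COORDS 3                     /* assumption: X, Y, Z coordinates */
#define EC_LIMBS (LIMBS * COORDS)

/* Hypothetical flat layout: 3 Fp coordinates stored back to back. */
typedef struct { uint64_t limb[EC_LIMBS]; } ec_point;

/* Copy table[index] into out while reading every table entry, so the
   memory access pattern does not depend on the (secret) index. */
static void table_select(ec_point *out, const ec_point *table,
                         size_t table_len, size_t index)
{
    for (size_t j = 0; j < EC_LIMBS; j++)
        out->limb[j] = 0;

    for (size_t i = 0; i < table_len; i++) {
        uint64_t diff = (uint64_t)(i ^ index);
        /* mask is all ones when i == index, zero otherwise, no branch */
        uint64_t mask = ((diff | (0 - diff)) >> 63) - 1;

        /* Candidate loop for vectorization: AND with mask, OR into out */
        for (size_t j = 0; j < EC_LIMBS; j++)
            out->limb[j] |= table[i].limb[j] & mask;
    }
}
```

For an 8-entry table, `table_select(&out, table, 8, k)` leaves `out` equal to `table[k]`. The trade-off compared with a CMOV chain is that the mask has to be broadcast across the vector lanes once per entry, so any gain depends on the table being large enough for load throughput to dominate.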