flint
flint copied to clipboard
Faster NMOD_RED using different precomputation?
NMOD_RED
reduces a single limb modulo some given n
. It currently calls NMOD_RED2
which reduces a two-limb (a_hi, a_lo)
integer mod n
, where a_hi
must be < n
(when called for NMOD_RED
, it is zero).
Another way to reduce a single limb a
mod n
is to use the approach of n_mulmod_shoup
to compute 1*a mod n
, so with the fixed operand 1:
- precompute
q = floor(2**64 / n)
(which is the output ofn_mulmod_precomp_shoup
with input1
) - reduce using
n_mulmod_shoup(1L, a, q, n)
The precomputation is... a precomputation. It only depends on n
, not on a
.
The call to n_mulmod_shoup
simply does one umul_ppmm
to get some p_hi
which is either floor(a/n)
or floor(a/n)-1
, then one single-word multiplication res = a - p_hi * n
, and finally a single correction step (if res >= n
, then return res - n
else return res
).
In this particular case, this actually works even when n
has 64 bits, there is absolutely no restriction on n
or a
(usually n_mulmod_shoup
requires n
to have at most 63 bits).
NMOD_RED2
has an additional add_ssaaaa
, some shifts here and there, and two correction steps.
Using the mulmod_shoup
approach seems beneficial, for example when running nmod_vec/profile/p-reduce
on zen4 I get the following values (left is existing code, right is using mulmod_shoup
as showed below):
NMOD_RED n_mulmod_shoup
bits 53, c/l = 4.0 bits 53, c/l = 2.4
bits 54, c/l = 4.1 bits 54, c/l = 2.4
bits 55, c/l = 4.1 bits 55, c/l = 2.4
bits 56, c/l = 4.1 bits 56, c/l = 2.4
bits 57, c/l = 4.1 bits 57, c/l = 2.4
bits 58, c/l = 4.1 bits 58, c/l = 2.4
bits 59, c/l = 4.1 bits 59, c/l = 2.4
bits 60, c/l = 4.1 bits 60, c/l = 2.4
bits 61, c/l = 4.1 bits 61, c/l = 2.4
bits 62, c/l = 4.1 bits 62, c/l = 2.4
bits 63, c/l = 4.1 bits 63, c/l = 2.4
bits 64, c/l = 2.8 bits 64, c/l = 2.4
(2.4 goes down to 2.0 if explicitly unrolling the loop; unrolling does not seem to help the NMOD_RED
variant)
These timings were obtained by simply changing the reduce function, adding the precomputation and calling n_mulmod_shoup
instead of NMOD_RED
:
void _nmod_vec_reduce(nn_ptr res, nn_srcptr vec, slong len, nmod_t mod)
{
const ulong one_precomp = n_mulmod_precomp_shoup(1L, mod.n);
slong i;
for (i = 0 ; i < len; i++)
{
//NMOD_RED(res[i], vec[i], mod);
res[i] = n_mulmod_shoup(1L, vec[i], one_precomp, mod.n);
}
}
There should be other places where this may be beneficial. However for more generalized use, it would probably help to add this one_precomp
to the struct nmod_t
, but I suspect there are good reasons for trying to avoid augmenting this struct?