flint icon indicating copy to clipboard operation
flint copied to clipboard

Faster NMOD_RED using different precomputation?

Open vneiger opened this issue 5 months ago • 3 comments

NMOD_RED reduces a single limb modulo some given n. It currently calls NMOD_RED2 which reduces a two-limb (a_hi, a_lo) integer mod n, where a_hi must be < n (when called for NMOD_RED, it is zero).

Another way to reduce a single limb a mod n is to use the approach of n_mulmod_shoup to compute 1*a mod n, so with the fixed operand 1:

  • precompute q = floor(2**64 / n) (which is the output of n_mulmod_precomp_shoup with input 1)
  • reduce using n_mulmod_shoup(1L, a, q, n)

The precomputation is... a precomputation. It only depends on n, not on a.

The call to n_mulmod_shoup simply does one umul_ppmm to get some p_hi which is either floor(a/n) or floor(a/n)-1, then one single-word multiplication res = a - p_hi * n, and finally a single correction step (if res >= n, then return res - n else return res).

In this particular case, this actually works even when n has 64 bits, there is absolutely no restriction on n or a (usually n_mulmod_shoup requires n to have at most 63 bits).

NMOD_RED2 has an additional add_ssaaaa, some shifts here and there, and two correction steps.

Using the mulmod_shoup approach seems beneficial, for example when running nmod_vec/profile/p-reduce on zen4 I get the following values (left is existing code, right is using mulmod_shoup as showed below):

NMOD_RED		n_mulmod_shoup

bits 53, c/l = 4.0      bits 53, c/l = 2.4
bits 54, c/l = 4.1      bits 54, c/l = 2.4
bits 55, c/l = 4.1      bits 55, c/l = 2.4
bits 56, c/l = 4.1      bits 56, c/l = 2.4
bits 57, c/l = 4.1      bits 57, c/l = 2.4
bits 58, c/l = 4.1      bits 58, c/l = 2.4
bits 59, c/l = 4.1      bits 59, c/l = 2.4
bits 60, c/l = 4.1      bits 60, c/l = 2.4
bits 61, c/l = 4.1      bits 61, c/l = 2.4
bits 62, c/l = 4.1      bits 62, c/l = 2.4
bits 63, c/l = 4.1      bits 63, c/l = 2.4
bits 64, c/l = 2.8      bits 64, c/l = 2.4

(2.4 goes down to 2.0 if explicitly unrolling the loop; unrolling does not seem to help the NMOD_RED variant)

These timings were obtained by simply changing the reduce function, adding the precomputation and calling n_mulmod_shoup instead of NMOD_RED:

void _nmod_vec_reduce(nn_ptr res, nn_srcptr vec, slong len, nmod_t mod)
{
    const ulong one_precomp = n_mulmod_precomp_shoup(1L, mod.n);
    slong i;
    for (i = 0 ; i < len; i++)
    {
        //NMOD_RED(res[i], vec[i], mod);
        res[i] = n_mulmod_shoup(1L, vec[i], one_precomp, mod.n);
    }
}

There should be other places where this may be beneficial. However for more generalized use, it would probably help to add this one_precomp to the struct nmod_t, but I suspect there are good reasons for trying to avoid augmenting this struct?

vneiger avatar Sep 04 '24 10:09 vneiger