Add hardcoded flint_mpn_aors_n for ARM and x86
These are generated from dev/gen_ARCH_aors.jl. Also add tests for them. I haven't tested this on ARM machines yet.
The x86 code should probably not reside within Broadwell since it is Nehalem-compatible, but basically all x86 processors today are Broadwell-compatible.
@fredrik-johansson
Quick timings:
```
 n   mpn_add_n   flint_      speedup
 1   2.300e-09   1.810e-09   1.271
 2   2.510e-09   1.810e-09   1.387
 3   2.960e-09   2.040e-09   1.451
 4   2.760e-09   2.050e-09   1.346
 5   3.010e-09   2.080e-09   1.447
 6   3.280e-09   2.120e-09   1.547
 7   3.710e-09   2.350e-09   1.579
 8   3.330e-09   2.180e-09   1.528
 9   3.600e-09   2.430e-09   1.481
10   3.840e-09   2.630e-09   1.460
11   4.300e-09   2.850e-09   1.509
12   3.860e-09   3.060e-09   1.261
13   4.130e-09   3.360e-09   1.229
14   4.600e-09   3.590e-09   1.281
15   5.050e-09   3.780e-09   1.336
16   4.890e-09   4.020e-09   1.216
17   5.320e-09   5.530e-09   0.962
18   5.690e-09   5.500e-09   1.035
19   6.210e-09   6.320e-09   0.983
20   5.320e-09   5.270e-09   1.009
21   5.540e-09   5.570e-09   0.995
22   5.840e-09   5.900e-09   0.990
23   6.120e-09   6.140e-09   0.997
24   6.320e-09   6.340e-09   0.997
25   6.520e-09   6.560e-09   0.994
26   6.850e-09   6.830e-09   1.003
27   6.990e-09   6.960e-09   1.004
28   7.250e-09   7.270e-09   0.997
29   7.550e-09   7.570e-09   0.997
30   7.700e-09   7.690e-09   1.001
31   7.940e-09   7.980e-09   0.995
32   8.180e-09   8.160e-09   1.002
```
Looks worth extending a bit beyond 16.
By the way, we often call `mpn_aors_n(rp, xp, yp, n)` with the same operand for `rp` and `xp`. Would functions specialized for this case gain anything?
> Quick timings:
>
> ...
>
> Looks worth extending a bit beyond 16.
Perhaps I have to fine tune the thing first. Seems a little bit slow, but I may be wrong.
> By the way, we often call `mpn_aors_n(rp, xp, yp, n)` with the same operand for `rp` and `xp`. Would functions specialized for this case gain anything?
For ARM, no. For x86, it may be the case that we can use `OPC reg, mem`, i.e. act directly on memory. At least this would remove one store instruction, so it's worth a try.
> Quick timings: ... Looks worth extending a bit beyond 16.
>
> Perhaps I have to fine tune the thing first. Seems a little bit slow, but I may be wrong.
It looks reasonable considering that the O(1) function call overhead makes up a good chunk of the time for small n.
One can also do something like this (ignoring the carry-out for now):
```c
if (n <= 8)
{
    switch (n)
    {
        case 0: break;
        case 1: z[0] = x[0] + y[0]; break;
        case 2: NN_ADD_2(z, x, y); break;
        case 3: NN_ADD_3(z, x, y); break;
        case 4: NN_ADD_4(z, x, y); break;
        case 5: NN_ADD_5(z, x, y); break;
        case 6: NN_ADD_6(z, x, y); break;
        case 7: NN_ADD_7(z, x, y); break;
        case 8: NN_ADD_8(z, x, y); break;
    }
}
else
{
    flint_mpn_add_n(z, x, y, n);
}
```
We get something like this:
```
 n   mpn_add_n   flint_...   inlined     speedup   flint_.../inlined
 1   2.270e-09   1.830e-09   9.680e-10   1.240     1.890
 2   2.510e-09   1.830e-09   8.100e-10   1.372     2.259
 3   2.970e-09   2.050e-09   9.730e-10   1.449     2.107
 4   2.760e-09   2.080e-09   1.210e-09   1.327     1.719
 5   2.990e-09   2.100e-09   1.380e-09   1.424     1.522
 6   3.230e-09   2.140e-09   1.640e-09   1.509     1.305
 7   3.670e-09   2.360e-09   1.890e-09   1.555     1.249
 8   3.270e-09   2.330e-09   2.070e-09   1.403     1.126
 9   3.520e-09   2.500e-09   2.650e-09   1.408     0.943
10   3.750e-09   2.660e-09   2.880e-09   1.410     0.924
11   4.190e-09   2.940e-09   3.150e-09   1.425     0.933
12   3.810e-09   3.170e-09   3.350e-09   1.202     0.946
13   4.060e-09   3.380e-09   3.560e-09   1.201     0.949
14   4.300e-09   3.620e-09   3.790e-09   1.188     0.955
15   4.750e-09   3.860e-09   4.040e-09   1.231     0.955
16   4.840e-09   4.060e-09   4.240e-09   1.192     0.958
17   5.410e-09   5.150e-09   5.980e-09   1.050     0.861
18   5.630e-09   5.680e-09   6.370e-09   0.991     0.892
19   6.350e-09   6.320e-09   7.040e-09   1.005     0.898
20   5.220e-09   5.180e-09   5.690e-09   1.008     0.910
21   5.520e-09   5.560e-09   5.640e-09   0.993     0.986
```
Then we gain a bit for small n but lose a bit for larger n because of the extra logic.
BTW: I think I mentioned this before: the `add_sssss....` macros seem to be inefficient for large n because the compiler doesn't know how to interleave the move and add instructions. I guess we could have more efficient `NN_ADD_N` macros specifically for in-memory operands, with inline asm versions of your functions and their explicit move/addressing instructions.
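For reference, here is a minimal portable sketch of what a fixed-length `NN_ADD_2` fallback on in-memory operands could look like. The macro name matches the dispatch sketch above, but the body is purely illustrative and not FLINT's; an inline-asm version would replace it with explicit mov/add/adc so the load/add scheduling is fixed rather than left to the compiler. Carry-out is ignored, as in the dispatch sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable fallback for a fixed-length NN_ADD_2 macro:
   a 2-limb add on in-memory operands, carry-out ignored. An inline-asm
   variant would use explicit mov/add/adc instructions instead, so the
   compiler cannot mis-schedule the loads against the adds. */
#define NN_ADD_2(z, x, y)                            \
    do {                                             \
        uint64_t __s0 = (x)[0] + (y)[0];             \
        uint64_t __cy = (__s0 < (x)[0]);             \
        (z)[0] = __s0;                               \
        (z)[1] = (x)[1] + (y)[1] + __cy;             \
    } while (0)
```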
Yes, but I recall Clang being sort of good at this, at least with intrinsics.
Using flint_mpn_aors_n instead of mpn_aors_n in mpn_mod_mat_mul seems to give a slight slowdown. However, there seems to be a very slight (~1%) speedup specializing the inner loops to call flint_mpn_aors_N directly for various N. I guess one needs to find more addition-heavy workloads for this to really pay off.
Keep in mind that the speedups for addition are not too significant in themselves. Multiplication is a more important routine to optimize since it is heavier; optimizing it can lead to more time saved in absolute terms.
Note to self: if we want n > 16, we need to shift `ap`, `bp` and `rp` so that we do not get larger instructions. With offsets larger than 127, the instruction size increases, which I think would negatively impact the performance of these functions (at least it would increase the risk of instruction-cache misses when using them).
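A rough C sketch of the pointer-biasing idea (the function name is made up; the real payoff is in the handwritten asm, where x86 displacements in [-128, 127] encode in a single byte, so biasing each pointer by 16 limbs keeps every offset for limbs 0..31 in the short-encoding range):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch, not FLINT code: bias the limb pointers by
   16 limbs (128 bytes) so that the byte offsets touched for a 32-limb
   add span -128 .. +120, all encodable as one-byte displacements. */
static void add_32_biased(uint64_t *rp, const uint64_t *ap, const uint64_t *bp)
{
    uint64_t *r = rp + 16;
    const uint64_t *a = ap + 16, *b = bp + 16;
    uint64_t cy = 0;
    for (int i = -16; i < 16; i++)  /* byte offsets -128 .. +120 */
    {
        uint64_t s = a[i] + cy;
        cy = (s < cy);
        r[i] = s + b[i];
        cy += (r[i] < s);
    }
}
```

A compiler will not necessarily preserve this addressing, so the trick only really matters in the generated asm itself.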
This will fail on ARM, but now aorsrsh is implemented for x86.
I believe using this flint_mpn_add_n in actual code slows things down because the penalty of a mispredicted indirect jump (and maybe a predicted one too?) outweighs the speedup of the addition itself.
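For illustration, a toy model of the kind of table dispatch that introduces the indirect jump. This is an assumption about the structure, not FLINT's actual code, and the carry handling is simplified; the point is that `add_tab[n](...)` is an indirect call whose misprediction cost can rival the addition itself for small n:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*add_fn_t)(uint64_t *, const uint64_t *, const uint64_t *);

/* Two fixed-length kernels standing in for hardcoded asm routines. */
static uint64_t add_1(uint64_t *r, const uint64_t *x, const uint64_t *y)
{
    r[0] = x[0] + y[0];
    return r[0] < x[0];
}

static uint64_t add_2(uint64_t *r, const uint64_t *x, const uint64_t *y)
{
    uint64_t cy = add_1(r, x, y);
    uint64_t s = x[1] + cy;
    cy = (s < cy);
    r[1] = s + y[1];
    return cy + (r[1] < s);
}

static const add_fn_t add_tab[] = { 0, add_1, add_2 };

/* The indirect jump the comment above refers to happens here. */
static uint64_t model_flint_mpn_add_n(uint64_t *r, const uint64_t *x,
                                      const uint64_t *y, long n)
{
    return add_tab[n](r, x, y);
}
```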
So maybe there shouldn't be a flint_mpn_add_n, but there could still be flint_mpn_add_12 for use in routines with hardcoded lengths; we could define such fixed-length functions on all architectures by having a macro for mpn_add_n(..., 12) as the fallback. (We should actually do this for the multiplication routines too.)
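A sketch of that fallback scheme — `generic_add_n` and the guard macro name are made up here to keep the example self-contained; in FLINT the fallback target would be GMP's mpn_add_n, and architectures with a hardcoded kernel would define the native symbol instead:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for mpn_add_n so this sketch compiles on its own:
   n-limb add with carry-out. */
static uint64_t generic_add_n(uint64_t *rp, const uint64_t *ap,
                              const uint64_t *bp, long n)
{
    uint64_t cy = 0;
    for (long i = 0; i < n; i++)
    {
        uint64_t s = ap[i] + cy;
        cy = (s < cy);
        rp[i] = s + bp[i];
        cy += (rp[i] < s);
    }
    return cy;
}

/* Hypothetical guard: if no hardcoded asm kernel exists for this
   architecture, fall back to the generic routine at fixed length. */
#ifndef FLINT_HAVE_NATIVE_mpn_add_12
#define flint_mpn_add_12(rp, ap, bp) generic_add_n(rp, ap, bp, 12)
#endif
```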
I would like to get this in, but maybe for now don't add flint_add_n/flint_sub_n; just do the fixed-length versions and macro fallbacks like `#define flint_mpn_add_3 mpn_add_n(..., 3)`?
But does it really yield improvements? I don't know if I would be happy having these merged.
Yes, it can yield improvements when one calls the hardcoded versions directly, for example in the hardcoded Karatsuba routines: https://github.com/fredrik-johansson/flint/commit/302e15711ef48cb130d46136b3cddbb2f7df0583
Not really beneficial, so I'll close this.