Add hardcoded flint_mpn_aors_n for ARM and x86
These are generated from dev/gen_ARCH_aors.jl. Also add tests for them. I haven't tested this on ARM machines yet.
The x86 code should probably not reside within Broadwell since it is Nehalem-compatible, but basically all x86 processors today are Broadwell-compatible.
@fredrik-johansson
Quick timings:
```
 n   mpn_add_n   flint_      speedup
 1   2.300e-09   1.810e-09   1.271
 2   2.510e-09   1.810e-09   1.387
 3   2.960e-09   2.040e-09   1.451
 4   2.760e-09   2.050e-09   1.346
 5   3.010e-09   2.080e-09   1.447
 6   3.280e-09   2.120e-09   1.547
 7   3.710e-09   2.350e-09   1.579
 8   3.330e-09   2.180e-09   1.528
 9   3.600e-09   2.430e-09   1.481
10   3.840e-09   2.630e-09   1.460
11   4.300e-09   2.850e-09   1.509
12   3.860e-09   3.060e-09   1.261
13   4.130e-09   3.360e-09   1.229
14   4.600e-09   3.590e-09   1.281
15   5.050e-09   3.780e-09   1.336
16   4.890e-09   4.020e-09   1.216
17   5.320e-09   5.530e-09   0.962
18   5.690e-09   5.500e-09   1.035
19   6.210e-09   6.320e-09   0.983
20   5.320e-09   5.270e-09   1.009
21   5.540e-09   5.570e-09   0.995
22   5.840e-09   5.900e-09   0.990
23   6.120e-09   6.140e-09   0.997
24   6.320e-09   6.340e-09   0.997
25   6.520e-09   6.560e-09   0.994
26   6.850e-09   6.830e-09   1.003
27   6.990e-09   6.960e-09   1.004
28   7.250e-09   7.270e-09   0.997
29   7.550e-09   7.570e-09   0.997
30   7.700e-09   7.690e-09   1.001
31   7.940e-09   7.980e-09   0.995
32   8.180e-09   8.160e-09   1.002
```
Looks worth extending a bit beyond 16.
By the way, we often call `mpn_aors_n(rp, xp, yp, n)` with the same operand for `rp` and `xp`. Would functions specialized for this case gain anything?
> Quick timings:
>
> ...
>
> Looks worth extending a bit beyond 16.
Perhaps I have to fine tune the thing first. Seems a little bit slow, but I may be wrong.
> By the way, we often call `mpn_aors_n(rp, xp, yp, n)` with the same operand for `rp` and `xp`. Would functions specialized for this case gain anything?
For ARM, no. For x86, it may be the case that we can use `OPC reg, mem`, i.e. act directly on memory. At least this would remove one store instruction, so it's worth a try.
> Quick timings: ... Looks worth extending a bit beyond 16.
>
> Perhaps I have to fine tune the thing first. Seems a little bit slow, but I may be wrong.
It looks reasonable considering that the O(1) function call overhead makes up a good chunk of the time for small n.
One can also do something like this (ignoring the carry-out for now):
```c
if (n <= 8)
{
    switch (n)
    {
        case 0: break;
        case 1: z[0] = x[0] + y[0]; break;
        case 2: NN_ADD_2(z, x, y); break;
        case 3: NN_ADD_3(z, x, y); break;
        case 4: NN_ADD_4(z, x, y); break;
        case 5: NN_ADD_5(z, x, y); break;
        case 6: NN_ADD_6(z, x, y); break;
        case 7: NN_ADD_7(z, x, y); break;
        case 8: NN_ADD_8(z, x, y); break;
    }
}
else
{
    flint_mpn_add_n(z, x, y, n);
}
```
We get something like this:
```
 n   mpn_add_n   flint_...   inlined     speedup   flint_.../inlined
 1   2.270e-09   1.830e-09   9.680e-10   1.240     1.890
 2   2.510e-09   1.830e-09   8.100e-10   1.372     2.259
 3   2.970e-09   2.050e-09   9.730e-10   1.449     2.107
 4   2.760e-09   2.080e-09   1.210e-09   1.327     1.719
 5   2.990e-09   2.100e-09   1.380e-09   1.424     1.522
 6   3.230e-09   2.140e-09   1.640e-09   1.509     1.305
 7   3.670e-09   2.360e-09   1.890e-09   1.555     1.249
 8   3.270e-09   2.330e-09   2.070e-09   1.403     1.126
 9   3.520e-09   2.500e-09   2.650e-09   1.408     0.943
10   3.750e-09   2.660e-09   2.880e-09   1.410     0.924
11   4.190e-09   2.940e-09   3.150e-09   1.425     0.933
12   3.810e-09   3.170e-09   3.350e-09   1.202     0.946
13   4.060e-09   3.380e-09   3.560e-09   1.201     0.949
14   4.300e-09   3.620e-09   3.790e-09   1.188     0.955
15   4.750e-09   3.860e-09   4.040e-09   1.231     0.955
16   4.840e-09   4.060e-09   4.240e-09   1.192     0.958
17   5.410e-09   5.150e-09   5.980e-09   1.050     0.861
18   5.630e-09   5.680e-09   6.370e-09   0.991     0.892
19   6.350e-09   6.320e-09   7.040e-09   1.005     0.898
20   5.220e-09   5.180e-09   5.690e-09   1.008     0.910
21   5.520e-09   5.560e-09   5.640e-09   0.993     0.986
```
Then we gain a bit for small n but lose a bit for larger n because of the extra logic.
BTW: I think I mentioned this before: the `add_sssss....` macros seem to be inefficient for large n because the compiler doesn't know how to interleave the move and add instructions. I guess we could have more efficient `NN_ADD_N` macros specifically for in-memory operands, with inline asm versions of your functions and their explicit move/addressing instructions.
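For reference, here is a minimal portable sketch of what a fixed-length `NN_ADD_2` fallback on in-memory operands could look like. The macro name matches the dispatch sketch above, but the body is purely illustrative and not FLINT's; an inline-asm version would replace it with explicit mov/add/adc so the load/add scheduling is fixed rather than left to the compiler. Carry-out is ignored, as in the dispatch sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable fallback for a fixed-length NN_ADD_2 macro:
   a 2-limb add on in-memory operands, carry-out ignored. An inline-asm
   variant would use explicit mov/add/adc instructions instead, so the
   compiler cannot mis-schedule the loads against the adds. */
#define NN_ADD_2(z, x, y)                            \
    do {                                             \
        uint64_t __s0 = (x)[0] + (y)[0];             \
        uint64_t __cy = (__s0 < (x)[0]);             \
        (z)[0] = __s0;                               \
        (z)[1] = (x)[1] + (y)[1] + __cy;             \
    } while (0)
```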
Yes, but I recall Clang being sort of good at this, at least with intrinsics.
Using flint_mpn_aors_n instead of mpn_aors_n in mpn_mod_mat_mul seems to give a slight slowdown. However, there seems to be a very slight (~1%) speedup specializing the inner loops to call flint_mpn_aors_N directly for various N. I guess one needs to find more addition-heavy workloads for this to really pay off.
Keep in mind that the speedups for addition are not too significant in themselves. Multiplication is a more important routine to optimize since it is heavier; optimizing it can lead to more time saved in absolute terms.
Note to self: if we want n > 16, we need to shift `ap`, `bp` and `rp` so that we do not get larger instructions. With offsets larger than 127, the instruction size increases, which I think would negatively impact the performance of these functions (at least it would increase the risk of instruction-cache misses when using them).
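A rough C sketch of the pointer-biasing idea (the function name is made up; the real payoff is in the handwritten asm, where x86 displacements in [-128, 127] encode in a single byte, so biasing each pointer by 16 limbs keeps every offset for limbs 0..31 in the short-encoding range):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch, not FLINT code: bias the limb pointers by
   16 limbs (128 bytes) so that the byte offsets touched for a 32-limb
   add span -128 .. +120, all encodable as one-byte displacements. */
static void add_32_biased(uint64_t *rp, const uint64_t *ap, const uint64_t *bp)
{
    uint64_t *r = rp + 16;
    const uint64_t *a = ap + 16, *b = bp + 16;
    uint64_t cy = 0;
    for (int i = -16; i < 16; i++)  /* byte offsets -128 .. +120 */
    {
        uint64_t s = a[i] + cy;
        cy = (s < cy);
        r[i] = s + b[i];
        cy += (r[i] < s);
    }
}
```

A compiler will not necessarily preserve this addressing, so the trick only really matters in the generated asm itself.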
This will fail on ARM, but now aorsrsh is implemented for x86.
I believe using this flint_mpn_add_n in actual code slows things down because the penalty of a mispredicted indirect jump (and maybe a predicted one too?) outweighs the speedup of the addition itself.
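For illustration, a toy model of the kind of table dispatch that introduces the indirect jump. This is an assumption about the structure, not FLINT's actual code, and the carry handling is simplified; the point is that `add_tab[n](...)` is an indirect call whose misprediction cost can rival the addition itself for small n:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*add_fn_t)(uint64_t *, const uint64_t *, const uint64_t *);

/* Two fixed-length kernels standing in for hardcoded asm routines. */
static uint64_t add_1(uint64_t *r, const uint64_t *x, const uint64_t *y)
{
    r[0] = x[0] + y[0];
    return r[0] < x[0];
}

static uint64_t add_2(uint64_t *r, const uint64_t *x, const uint64_t *y)
{
    uint64_t cy = add_1(r, x, y);
    uint64_t s = x[1] + cy;
    cy = (s < cy);
    r[1] = s + y[1];
    return cy + (r[1] < s);
}

static const add_fn_t add_tab[] = { 0, add_1, add_2 };

/* The indirect jump the comment above refers to happens here. */
static uint64_t model_flint_mpn_add_n(uint64_t *r, const uint64_t *x,
                                      const uint64_t *y, long n)
{
    return add_tab[n](r, x, y);
}
```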
So maybe there shouldn't be a flint_mpn_add_n, but there could still be flint_mpn_add_12 for use in routines with hardcoded lengths; we could define such fixed-length functions on all architectures by having a macro for mpn_add_n(..., 12) as the fallback. (We should actually do this for the multiplication routines too.)
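A sketch of that fallback scheme — `generic_add_n` and the guard macro name are made up here to keep the example self-contained; in FLINT the fallback target would be GMP's mpn_add_n, and architectures with a hardcoded kernel would define the native symbol instead:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for mpn_add_n so this sketch compiles on its own:
   n-limb add with carry-out. */
static uint64_t generic_add_n(uint64_t *rp, const uint64_t *ap,
                              const uint64_t *bp, long n)
{
    uint64_t cy = 0;
    for (long i = 0; i < n; i++)
    {
        uint64_t s = ap[i] + cy;
        cy = (s < cy);
        rp[i] = s + bp[i];
        cy += (rp[i] < s);
    }
    return cy;
}

/* Hypothetical guard: if no hardcoded asm kernel exists for this
   architecture, fall back to the generic routine at fixed length. */
#ifndef FLINT_HAVE_NATIVE_mpn_add_12
#define flint_mpn_add_12(rp, ap, bp) generic_add_n(rp, ap, bp, 12)
#endif
```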
I would like to get this in, but maybe for now don't add flint_add_n/flint_sub_n; just do the fixed-length versions and macro fallbacks like `#define flint_mpn_add_3 mpn_add_n(..., 3)`?
But does it really yield improvements? I don't know if I would be happy having these merged.
Yes, it can yield improvements when one calls the hardcoded versions directly, for example in the hardcoded Karatsuba routines: https://github.com/fredrik-johansson/flint/commit/302e15711ef48cb130d46136b3cddbb2f7df0583
Not really beneficial, so I'll close this.