flint
flint copied to clipboard
Strange nmod8 performance
Consider this benchmark which does Strassen multiplication:
#include "gr.h"
#include "gr_mat.h"
#include "profiler.h"
int main()
{
gr_ctx_t ctx;
gr_mat_t A, B, C;
slong i, j, n = 512;
gr_ctx_init_nmod8(ctx, 251);
gr_mat_init(A, n, n, ctx);
gr_mat_init(B, n, n, ctx);
gr_mat_init(C, n, n, ctx);
flint_rand_t state;
flint_randinit(state);
gr_mat_randtest(A, state, ctx);
gr_mat_randtest(B, state, ctx);
TIMEIT_START
gr_mat_mul(C, A, B, ctx);
TIMEIT_STOP
TIMEIT_START
gr_mat_mul(C, A, B, ctx);
TIMEIT_STOP
TIMEIT_START
gr_mat_mul(C, A, B, ctx);
TIMEIT_STOP
}
This turns out to be sensitive to whether _nmod8_vec_add and _nmod8_vec_sub are implemented using nmod_add / _nmod_add, nmod_sub / _nmod_sub. What is surprising is that best choice isn't symmetric on my machine (Zen 3). I get:
1: nmod_add + nmod_sub: 0.0354 seconds
2: _nmod_add + nmod_sub: 0.0311 seconds
3: nmod_add + _nmod_sub: 0.0373 seconds
4: _nmod_add + _nmod_sub: 0.0373 seconds
Can anyone else reproduce this? Why is combination (2) so much faster than the others?
And how can we implement these performance-critical functions more efficiently? Clearly they ought to be using unrolled, branchless SIMD code.
Have you tried using native code for nmod8? It could be that the compiler is doing some weird.