Strange nmod8 performance

Open fredrik-johansson opened this issue 2 years ago • 1 comments

Consider this benchmark which does Strassen multiplication:

#include "gr.h"
#include "gr_mat.h"
#include "profiler.h"

int main()
{
    gr_ctx_t ctx;
    gr_mat_t A, B, C;
    slong i, j, n = 512;

    gr_ctx_init_nmod8(ctx, 251);

    gr_mat_init(A, n, n, ctx);
    gr_mat_init(B, n, n, ctx);
    gr_mat_init(C, n, n, ctx);

    flint_rand_t state;
    flint_randinit(state);

    gr_mat_randtest(A, state, ctx);
    gr_mat_randtest(B, state, ctx);

    TIMEIT_START
    gr_mat_mul(C, A, B, ctx);
    TIMEIT_STOP
    TIMEIT_START
    gr_mat_mul(C, A, B, ctx);
    TIMEIT_STOP
    TIMEIT_START
    gr_mat_mul(C, A, B, ctx);
    TIMEIT_STOP
}

This turns out to be sensitive to whether _nmod8_vec_add and _nmod8_vec_sub are implemented using nmod_add / _nmod_add, nmod_sub / _nmod_sub. What is surprising is that best choice isn't symmetric on my machine (Zen 3). I get:

1: nmod_add + nmod_sub: 0.0354 seconds
2: _nmod_add + nmod_sub: 0.0311 seconds
3: nmod_add + _nmod_sub: 0.0373 seconds
4: _nmod_add + _nmod_sub: 0.0373 seconds

Can anyone else reproduce this? Why is combination (2) so much faster than the others?

And how can we implement these performance-critical functions more efficiently? Clearly they ought to be using unrolled, branchless SIMD code.

Oct 21 '23 10:10 fredrik-johansson

Have you tried using native code for nmod8? It could be that the compiler is doing some weird.

May 20 '24 14:05 albinahlback