
avx2 performance is lower than sse2?

Open pramodk opened this issue 4 years ago • 8 comments

I was quickly testing avx2 vs sse2 on an Ubuntu box and saw the following:

Sample mod file:

NEURON {
    SUFFIX hh
    NONSPECIFIC_CURRENT il
    RANGE x, minf, mtau, gl, el
}

STATE {
    m
}

ASSIGNED {
    x
    v (mV)
    minf
    mtau (ms)
}

BREAKPOINT {
    SOLVE states METHOD cnexp
    il = gl*(v - el)
}

DERIVATIVE states {
     mtau = exp(m) + exp(minf) + (m / minf)
     : printf("->%lf %lf %lf\n", mtau, m, minf) : can print values by running vector width 1
}

Running with SSE2:

LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
  ./bin/nmodl ../test.mod llvm --ir --vector-width 2 --veclib SVML --opt \
  benchmark --run --instance-size 20000000 --repeat 5  \
  --libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
  --opt-level-ir 3 --opt-level-codegen  3 --backend sse2

....
VOID nrn_state_hh(INSTANCE_STRUCT *mech){
    INTEGER id
    INTEGER node_id
    DOUBLE v
    for(id = 0; id<mech->node_count-1; id = id+2) {
        node_id = mech->node_index[id]
        v = mech->voltage[node_id]
        mech->mtau[id] = exp(mech->m[id])+exp(mech->minf[id])+(mech->m[id]/mech->minf[id])
    }
    INTEGER epilogue_node_id
    DOUBLE epilogue_v
    for(; id<mech->node_count; id = id+1) {
        epilogue_node_id = mech->node_index[id]
        epilogue_v = mech->voltage[epilogue_node_id]
        mech->mtau[id] = exp(mech->m[id])+exp(mech->minf[id])+(mech->m[id]/mech->minf[id])
    }
}
[NMODL] [info] :: Running LLVM optimisation passes
Created LLVM IR module from NMODL AST in 0.01741513

Backend: sse2
Disabling features:
-avx
-avx2
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.131291314
Experiment 1: compute time = 0.090968684
Experiment 2: compute time = 0.090951637
Experiment 3: compute time = 0.090918753
Experiment 4: compute time = 0.090925391
Average compute time = 0.0990111558

And running with AVX2:

$ LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
  ./bin/nmodl ../test.mod llvm --ir --vector-width 4 --veclib SVML --opt \
  benchmark --run --instance-size 20000000 --repeat 5  \
  --libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
  --opt-level-ir 3 --opt-level-codegen  3 --backend avx2
....

[NMODL] [info] :: Running LLVM optimisation passes
Created LLVM IR module from NMODL AST in 0.017372994

Backend: avx2
Disabling features:
-sse
-sse2
-sse3
-sse4.1
-sse4.2
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.174501459
Experiment 1: compute time = 0.130532601
Experiment 2: compute time = 0.130379685
Experiment 3: compute time = 0.130569437
Experiment 4: compute time = 0.130534945
Average compute time = 0.139303625

🤔 @georgemitenkov: do you see the same thing? e.g. also locally on your laptop?

pramodk avatar May 12 '21 06:05 pramodk

@pramodk what do the assembly dumps look like?

I will double-check on my laptop a bit later. Actually, my guess is that if we disable the “sse” features for avx, the generated code is somehow less efficient?

Also, some loop optimisations are not there yet, so we load 4-wide vectors many times in the avx case; that could be a bottleneck.

georgemitenkov avatar May 12 '21 06:05 georgemitenkov

By disabling sse features, I mean that we still have a scalar epilogue loop to process and optimise.

georgemitenkov avatar May 12 '21 06:05 georgemitenkov

> By disabling sse features, I mean that we still have a scalar epilogue loop to process and optimise.

But with an instance size of 20000000, the epilogue loop is not executed, i.e. the trip count is divisible by both 4 and 2.

I haven't checked the assembly dumps. Maybe @castigli can have a quick look.

pramodk avatar May 12 '21 06:05 pramodk

> But with an instance size of 20000000, the epilogue loop is not executed, i.e. the trip count is divisible by both 4 and 2.

That’s right! Maybe it's the 4-wide loads then, but looking at the assembly will make things clearer.
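A quick sanity check of the trip-count argument (the instance size and widths are from the runs above; the helper function is just illustrative):

```c
/* The scalar epilogue of a loop vectorized at `width` runs n % width
 * iterations. For the benchmark's instance size of 20000000, both
 * vector widths divide evenly, so the epilogue never executes. */
static long epilogue_trip_count(long n, long width) {
    return n % width;
}
```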

georgemitenkov avatar May 12 '21 06:05 georgemitenkov

Copying from Slack:

The --backend XX option works in a somewhat odd way: we disable CPU features based on this option, and that always causes SSE registers to be used.

So if we don't specify --backend avx2 or --backend sse but use the default --backend default, then it works as expected:

LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
  ./bin/nmodl ../test.mod llvm --ir --vector-width 2 --veclib SVML --opt \
  benchmark --run --instance-size 20000000 --repeat 5 \
  --libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
  --opt-level-ir 3 --opt-level-codegen 3 --backend default
...
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.09469293
Experiment 1: compute time = 0.090705488
Experiment 2: compute time = 0.090669446
Experiment 3: compute time = 0.090671005
Experiment 4: compute time = 0.090674679
Average compute time = 0.0914827096

LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
  ./bin/nmodl ../test.mod llvm --ir --vector-width 4 --veclib SVML --opt \
  benchmark --run --instance-size 20000000 --repeat 5 \
  --libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
  --opt-level-ir 3 --opt-level-codegen 3 --backend default
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.058610805
Experiment 1: compute time = 0.054555707
Experiment 2: compute time = 0.054569554
Experiment 3: compute time = 0.054505829
Experiment 4: compute time = 0.054483736
Average compute time = 0.0553451262

# bad one with avx2 that just uses sse
LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
  ./bin/nmodl ../test.mod llvm --ir --vector-width 4 --veclib SVML --opt \
  benchmark --run --instance-size 20000000 --repeat 5 \
  --libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
  --opt-level-ir 3 --opt-level-codegen 3 --backend avx2
Backend: avx2
Disabling features:
-sse
-sse2
-sse3
-sse4.1
-sse4.2
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.134491371
Experiment 1: compute time = 0.130429418
Experiment 2: compute time = 0.130184682
Experiment 3: compute time = 0.130341302
Experiment 4: compute time = 0.130260874
Average compute time = 0.131141529
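To put rough numbers on it: the width-4 default run is about 1.65x faster than the width-2 default run, while the avx2-backend run is more than 2x slower than the width-4 default. The averages below are copied from the logs above; the helper function is just illustrative:

```c
/* Average compute times (seconds) copied from the benchmark logs. */
static const double t_default_w2 = 0.0914827096; /* --backend default, width 2 */
static const double t_default_w4 = 0.0553451262; /* --backend default, width 4 */
static const double t_avx2_w4    = 0.131141529;  /* --backend avx2, width 4 (uses SSE regs) */

/* Ratio > 1 means `candidate` is faster than `baseline`. */
static double speedup(double baseline, double candidate) {
    return baseline / candidate;
}
```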

pramodk avatar May 12 '21 19:05 pramodk

That's right! Also from the assembly:

Running the avx2 backend:

0000000000000000 _nrn_state_hh:
       0: 8b 4f 64                     	movl	100(%rdi), %ecx
       3: 8d 41 fd                     	leal	-3(%rcx), %eax
       6: 85 c0                        	testl	%eax, %eax
       8: 7e 68                        	jle	104 <_nrn_state_hh+0x72>
       a: 31 c0                        	xorl	%eax, %eax
       c: 0f 1f 40 00                  	nopl	(%rax)
      10: 48 98                        	cltq
      12: 48 8b 4f 08                  	movq	8(%rdi), %rcx
      16: 48 8b 57 10                  	movq	16(%rdi), %rdx
      1a: 66 0f 28 04 c1               	movapd	(%rcx,%rax,8), %xmm0
      1f: 66 0f 28 4c c1 10            	movapd	16(%rcx,%rax,8), %xmm1
      25: 48 8b 4f 18                  	movq	24(%rdi), %rcx
      29: 66 0f 28 14 c1               	movapd	(%rcx,%rax,8), %xmm2
      2e: 66 0f 28 5c c1 10            	movapd	16(%rcx,%rax,8), %xmm3
      34: 66 0f 28 e2                  	movapd	%xmm2, %xmm4
      38: 66 0f 5e e0                  	divpd	%xmm0, %xmm4
      3c: 66 0f 28 eb                  	movapd	%xmm3, %xmm5
      40: 66 0f 5e e9                  	divpd	%xmm1, %xmm5
      44: 66 0f 58 c2                  	addpd	%xmm2, %xmm0
      48: 66 0f 58 c4                  	addpd	%xmm4, %xmm0
      4c: 66 0f 58 cb                  	addpd	%xmm3, %xmm1
      50: 66 0f 58 cd                  	addpd	%xmm5, %xmm1
      54: 66 0f 29 04 c2               	movapd	%xmm0, (%rdx,%rax,8)
      59: 66 0f 29 4c c2 10            	movapd	%xmm1, 16(%rdx,%rax,8)

Running default:

0000000000000000 _nrn_state_hh:
       0: 8b 4f 64                     	movl	100(%rdi), %ecx
       3: 8d 41 fd                     	leal	-3(%rcx), %eax
       6: 85 c0                        	testl	%eax, %eax
       8: 7e 42                        	jle	66 <_nrn_state_hh+0x4c>
       a: 31 c0                        	xorl	%eax, %eax
       c: 0f 1f 40 00                  	nopl	(%rax)
      10: 48 98                        	cltq
      12: 48 8b 4f 08                  	movq	8(%rdi), %rcx
      16: 48 8b 57 10                  	movq	16(%rdi), %rdx
      1a: c5 fd 28 04 c1               	vmovapd	(%rcx,%rax,8), %ymm0
      1f: 48 8b 4f 18                  	movq	24(%rdi), %rcx
      23: c5 fd 28 0c c1               	vmovapd	(%rcx,%rax,8), %ymm1
      28: c5 f5 5e d0                  	vdivpd	%ymm0, %ymm1, %ymm2
      2c: c5 fd 58 c1                  	vaddpd	%ymm1, %ymm0, %ymm0
      30: c5 ed 58 c0                  	vaddpd	%ymm0, %ymm2, %ymm0
      34: c5 fd 29 04 c2               	vmovapd	%ymm0, (%rdx,%rax,8)

georgemitenkov avatar May 12 '21 21:05 georgemitenkov

Yes, exactly! I saw the same.

When you implemented this, did you see how clang or other frameworks "prefer" a specific instruction set? I just started skimming through numba but haven't looked at the internal implementation yet: https://github.com/numba/numba/pull/5962/files
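For comparison, one way clang/gcc let a program pick an instruction set without globally disabling features is per-function target attributes plus runtime dispatch. A hedged sketch under that approach (the `axpy_*` kernels are hypothetical examples, not NMODL code):

```c
/* Sketch: compile several variants of the same kernel with different
 * per-function target attributes and dispatch on the CPU at runtime,
 * rather than disabling whole feature sets globally. */
__attribute__((target("sse2")))
static void axpy_sse2(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];   /* compiler may use 128-bit xmm registers */
}

__attribute__((target("avx2")))
static void axpy_avx2(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];   /* compiler may use 256-bit ymm registers */
}

static void axpy(int n, double a, const double *x, double *y) {
    if (__builtin_cpu_supports("avx2"))
        axpy_avx2(n, a, x, y);   /* AVX2 path on capable CPUs */
    else
        axpy_sse2(n, a, x, y);   /* SSE2 fallback */
}
```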

pramodk avatar May 12 '21 21:05 pramodk

Thanks for the link: I will open an issue for that and think of a nicer way of forcing LLVM to generate AVX-512 code, etc. Meanwhile, we can benchmark with --backend default.

georgemitenkov avatar May 13 '21 07:05 georgemitenkov