AVX2 performance is lower than SSE2?
I am quickly testing AVX2 vs SSE2 on an Ubuntu box and seeing the following:
Sample mod file:
NEURON {
SUFFIX hh
NONSPECIFIC_CURRENT il
RANGE x, minf, mtau, gl, el
}
STATE {
m
}
ASSIGNED {
x
v (mV)
minf
mtau (ms)
}
BREAKPOINT {
SOLVE states METHOD cnexp
il = gl*(v - el)
}
DERIVATIVE states {
mtau = exp(m) + exp(minf) + (m / minf)
: printf("->%lf %lf %lf\n", mtau, m, minf) : can print values by running vector width 1
}
Running with SSE2:
LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
./bin/nmodl ../test.mod llvm --ir --vector-width 2 --veclib SVML --opt \
benchmark --run --instance-size 20000000 --repeat 5 \
--libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
--opt-level-ir 3 --opt-level-codegen 3 --backend sse2
....
VOID nrn_state_hh(INSTANCE_STRUCT *mech){
INTEGER id
INTEGER node_id
DOUBLE v
for(id = 0; id<mech->node_count-1; id = id+2) {
node_id = mech->node_index[id]
v = mech->voltage[node_id]
mech->mtau[id] = exp(mech->m[id])+exp(mech->minf[id])+(mech->m[id]/mech->minf[id])
}
INTEGER epilogue_node_id
DOUBLE epilogue_v
for(; id<mech->node_count; id = id+1) {
epilogue_node_id = mech->node_index[id]
epilogue_v = mech->voltage[epilogue_node_id]
mech->mtau[id] = exp(mech->m[id])+exp(mech->minf[id])+(mech->m[id]/mech->minf[id])
}
}
[NMODL] [info] :: Running LLVM optimisation passes
Created LLVM IR module from NMODL AST in 0.01741513
Backend: sse2
Disabling features:
-avx
-avx2
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.131291314
Experiment 1: compute time = 0.090968684
Experiment 2: compute time = 0.090951637
Experiment 3: compute time = 0.090918753
Experiment 4: compute time = 0.090925391
Average compute time = 0.0990111558
And running with AVX2:
$ LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
./bin/nmodl ../test.mod llvm --ir --vector-width 4 --veclib SVML --opt \
benchmark --run --instance-size 20000000 --repeat 5 \
--libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
--opt-level-ir 3 --opt-level-codegen 3 --backend avx2
....
[NMODL] [info] :: Running LLVM optimisation passes
Created LLVM IR module from NMODL AST in 0.017372994
Backend: avx2
Disabling features:
-sse
-sse2
-sse3
-sse4.1
-sse4.2
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.174501459
Experiment 1: compute time = 0.130532601
Experiment 2: compute time = 0.130379685
Experiment 3: compute time = 0.130569437
Experiment 4: compute time = 0.130534945
Average compute time = 0.139303625
🤔 @georgemitenkov: do you see the same thing? e.g. also locally on your laptop?
@pramodk what are the assembly dumps?
I will double-check on my laptop a bit later. Actually, my guess is that if we disable the “sse” features for AVX, the generated code is somehow less efficient?
Also, some loop optimisations are not there yet, so we load 4-wide vectors many times in the AVX case; that could be a bottleneck?
By disabling SSE features I mean that we still have a scalar epilogue loop to process and optimise.
But with an instance size of 20000000, the epilogue loop is not executed, i.e. the trip count is divisible by both 4 and 2.
I haven't checked the assembly dumps. Maybe @castigli can have a quick look.
That’s right! Maybe 4-wide loads then, but looking at assembly will make things clearer.
Copying from Slack:
The --backend XX option behaves somewhat oddly. We disable CPU features based on this option, and that causes SSE registers to always be used.
So if we don't specify --backend avx2 or --backend sse but use the default --backend default, then it works as expected:
LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
./bin/nmodl ../test.mod llvm --ir --vector-width 2 --veclib SVML --opt \
benchmark --run --instance-size 20000000 --repeat 5 \
--libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
--opt-level-ir 3 --opt-level-codegen 3 --backend default
...
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.09469293
Experiment 1: compute time = 0.090705488
Experiment 2: compute time = 0.090669446
Experiment 3: compute time = 0.090671005
Experiment 4: compute time = 0.090674679
Average compute time = 0.0914827096
LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
./bin/nmodl ../test.mod llvm --ir --vector-width 4 --veclib SVML --opt \
benchmark --run --instance-size 20000000 --repeat 5 \
--libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
--opt-level-ir 3 --opt-level-codegen 3 --backend default
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.058610805
Experiment 1: compute time = 0.054555707
Experiment 2: compute time = 0.054569554
Experiment 3: compute time = 0.054505829
Experiment 4: compute time = 0.054483736
Average compute time = 0.0553451262
# bad one with avx2 that just uses sse
LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/ \
./bin/nmodl ../test.mod llvm --ir --vector-width 4 --veclib SVML --opt \
benchmark --run --instance-size 20000000 --repeat 5 \
--libs /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin/libsvml.so \
--opt-level-ir 3 --opt-level-codegen 3 --backend avx2
Backend: avx2
Disabling features:
-sse
-sse2
-sse3
-sse4.1
-sse4.2
Benchmarking kernel 'nrn_state_hh, with 1296.99718 MBs
Experiment 0: compute time = 0.134491371
Experiment 1: compute time = 0.130429418
Experiment 2: compute time = 0.130184682
Experiment 3: compute time = 0.130341302
Experiment 4: compute time = 0.130260874
Average compute time = 0.131141529
That's right! Also from the assembly:
Running the avx2 backend:
0000000000000000 _nrn_state_hh:
0: 8b 4f 64 movl 100(%rdi), %ecx
3: 8d 41 fd leal -3(%rcx), %eax
6: 85 c0 testl %eax, %eax
8: 7e 68 jle 104 <_nrn_state_hh+0x72>
a: 31 c0 xorl %eax, %eax
c: 0f 1f 40 00 nopl (%rax)
10: 48 98 cltq
12: 48 8b 4f 08 movq 8(%rdi), %rcx
16: 48 8b 57 10 movq 16(%rdi), %rdx
1a: 66 0f 28 04 c1 movapd (%rcx,%rax,8), %xmm0
1f: 66 0f 28 4c c1 10 movapd 16(%rcx,%rax,8), %xmm1
25: 48 8b 4f 18 movq 24(%rdi), %rcx
29: 66 0f 28 14 c1 movapd (%rcx,%rax,8), %xmm2
2e: 66 0f 28 5c c1 10 movapd 16(%rcx,%rax,8), %xmm3
34: 66 0f 28 e2 movapd %xmm2, %xmm4
38: 66 0f 5e e0 divpd %xmm0, %xmm4
3c: 66 0f 28 eb movapd %xmm3, %xmm5
40: 66 0f 5e e9 divpd %xmm1, %xmm5
44: 66 0f 58 c2 addpd %xmm2, %xmm0
48: 66 0f 58 c4 addpd %xmm4, %xmm0
4c: 66 0f 58 cb addpd %xmm3, %xmm1
50: 66 0f 58 cd addpd %xmm5, %xmm1
54: 66 0f 29 04 c2 movapd %xmm0, (%rdx,%rax,8)
59: 66 0f 29 4c c2 10 movapd %xmm1, 16(%rdx,%rax,8)
Running the default backend:
0000000000000000 _nrn_state_hh:
0: 8b 4f 64 movl 100(%rdi), %ecx
3: 8d 41 fd leal -3(%rcx), %eax
6: 85 c0 testl %eax, %eax
8: 7e 42 jle 66 <_nrn_state_hh+0x4c>
a: 31 c0 xorl %eax, %eax
c: 0f 1f 40 00 nopl (%rax)
10: 48 98 cltq
12: 48 8b 4f 08 movq 8(%rdi), %rcx
16: 48 8b 57 10 movq 16(%rdi), %rdx
1a: c5 fd 28 04 c1 vmovapd (%rcx,%rax,8), %ymm0
1f: 48 8b 4f 18 movq 24(%rdi), %rcx
23: c5 fd 28 0c c1 vmovapd (%rcx,%rax,8), %ymm1
28: c5 f5 5e d0 vdivpd %ymm0, %ymm1, %ymm2
2c: c5 fd 58 c1 vaddpd %ymm1, %ymm0, %ymm0
30: c5 ed 58 c0 vaddpd %ymm0, %ymm2, %ymm0
34: c5 fd 29 04 c2 vmovapd %ymm0, (%rdx,%rax,8)
Yes, exactly! I saw the same.
When you implemented this, did you see how Clang or other frameworks "prefer" a specific instruction set? I just started skimming through Numba but haven't looked at the internal implementation yet: https://github.com/numba/numba/pull/5962/files
Thanks for the link! I will make an issue for that and think of a nicer way of forcing LLVM to generate AVX-512 code, etc. Meanwhile, we can benchmark with the default backend.