Higher simulation time in NEURON 9 compared to NEURON 8.2.4
Context
Testing NEURON 9 on a recent CPU shows longer simulation time compared to NEURON 8.2.4
Overview of the issue
Brand-new installation of Ubuntu 23.10 on a 16-core Zen 4 AMD processor. With the same model, same code, same number of cores, etc., NEURON 9 completes a simulation in 90 s whereas NEURON 8.2.4 completes it in 37 s.
Expected result/behavior
Similar or better performance when moving from NEURON 8 to NEURON 9.
NEURON setup
- Version: VERSION 9.0a-176-g17be061e0 master (17be061e0) 2024-03-13, and NEURON 8.2.4
- Installation method: CMake from source in both cases, with:
  -DCMAKE_BUILD_TYPE=RelWithDebInfo
  -DNRN_ENABLE_INTERVIEWS=ON
  -DNRN_ENABLE_MPI=ON
  -DNRN_ENABLE_PYTHON=ON
  -DNRN_ENABLE_RX3D:BOOL=ON
  -DCORENRN_ENABLE_NMODL=OFF
  -DNRN_ENABLE_CORENEURON=OFF
- OS + Version: Ubuntu 23.10, latest upgrades
- Compiler + Version: gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)
Minimal working example - MWE
https://modeldb.science/266806?tab=1 — use the first model, "Morphology_1". Change protocol 01_SS.py to use "p.change_nthread(12,1)" and h.tstop = 5000, then:

nrnivmodl mod_files
time nrniv -python protocols/01_SS.py
Thanks for the notice. I'm seeing pretty much the same thing. It will need to be looked into in greater detail and resolved before the v9 release. On my machine (also 16 cores), with tstop=1000:
nthread | v8.2.4 (s) | master (s) |
---|---|---|
1 | 46 | 87 |
2 | 21 | 47 |
4 | 17 | 29 |
8 | 12 | 24 |
16 | 10 | 21 |
If I did it right, a bisect seems to point to 6c529a415:
2c3a21d89: 49.67user 2.12system 0:11.58elapsed 447%CPU (0avgtext+0avgdata 67344maxresident)k
a3bbaab7c: 50.71user 2.00system 0:12.08elapsed 436%CPU (0avgtext+0avgdata 93484maxresident)k
936c7d3ae: 85.93user 1.98system 0:20.43elapsed 430%CPU (0avgtext+0avgdata 80288maxresident)k
522c86681: 46.77user 1.94system 0:10.98elapsed 443%CPU (0avgtext+0avgdata 93448maxresident)k
67a672a2c: 46.29user 2.03system 0:11.16elapsed 432%CPU (0avgtext+0avgdata 93764maxresident)k
baf3b145a: 85.80user 1.98system 0:20.13elapsed 435%CPU (0avgtext+0avgdata 80360maxresident)k
fb1712738: 85.58user 2.00system 0:20.18elapsed 434%CPU (0avgtext+0avgdata 80544maxresident)k
dfdee8442: 46.50user 2.15system 0:11.24elapsed 432%CPU (0avgtext+0avgdata 93196maxresident)k
d2d811040: 46.34user 2.15system 0:11.41elapsed 424%CPU (0avgtext+0avgdata 93560maxresident)k
6c529a415: 86.48user 2.06system 0:20.35elapsed 434%CPU (0avgtext+0avgdata 80244maxresident)k
b4ade5509: 46.42user 2.06system 0:11.39elapsed 425%CPU (0avgtext+0avgdata 93228maxresident)k
The good ones run in under 50 s of user time; the ones above that threshold I marked bad.
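(For the record, a scan like this can be automated with git bisect run; a rough sketch, where time_mwe.sh is a hypothetical helper that rebuilds NEURON, times the MWE, and exits 0 when the user time is under the ~50 s threshold, 1 when it is above, and 125 for commits that fail to build:)

git bisect start
git bisect bad <known-slow-commit>
git bisect good <known-fast-ancestor>
git bisect run ./time_mwe.sh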
Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?
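(Even after a PR's source branch is deleted, GitHub keeps its commits reachable through the pull ref, so they can be fetched and inspected locally; a sketch:)

git fetch origin pull/2027/head:pr-2027   # materialize PR #2027's head as a local branch
git log --oneline pr-2027                 # its intermediate commits are now available to bisect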
Edited after fixing the Caliper region for state-update in version 8.2
Thanks! That is a great hint. Meanwhile, I tried Caliper to see where the performance differs most. Here is what seems to me to be the relevant selection.
master
fadvance 0.242032 78.523456 0.304608 98.825198
timestep 0.476875 78.281423 0.600168 98.520590
deliver-events 0.210294 0.285367 0.264664 0.359147
check-threshold 0.075073 0.075073 0.094483 0.094483
setup-tree-matrix 2.034998 5.770918 2.561134 7.262953
...
matrix-solver 0.126477 0.126477 0.159177 0.159177
second-order-cur 0.069821 0.069821 0.087873 0.087873
update 0.083985 0.083985 0.105698 0.105698
state-update 0.917186 71.467980 1.154318 89.945574
state-cdp5StCmod 49.011724 49.011724 61.683394 61.683394
state-Cav2_3 0.225994 0.225994 0.284424 0.284424
state-Cav3_1 0.353412 0.353412 0.444785 0.444785
state-GRC_CA 0.997766 0.997766 1.255732 1.255732
state-GRC_KM 0.106439 0.106439 0.133958 0.133958
state-HCN1 0.080375 0.080375 0.101155 0.101155
state-HCN2 0.078690 0.078690 0.099035 0.099035
state-Kca1_1 4.351285 4.351285 5.476282 5.476282
state-Kca2_2 1.352222 1.352222 1.701831 1.701831
state-Kca3_1 0.074793 0.074793 0.094131 0.094131
state-Kv1_1 0.074700 0.074700 0.094013 0.094013
state-Kv3_4 0.242558 0.242558 0.305269 0.305269
state-Kv4_3 0.089487 0.089487 0.112623 0.112623
state-Leak 0.070016 0.070016 0.088118 0.088118
state-Nav1_6 13.441334 13.441334 16.916505 16.916505
real 1m21.576s
8.2
fadvance 0.132501 50.456262 0.259746 98.911012
timestep 0.470033 50.323761 0.921421 98.651267
deliver-events 0.208092 0.278673 0.407930 0.546292
check-threshold 0.070581 0.070581 0.138362 0.138362
setup-tree-matrix 2.143723 5.734394 4.202409 11.241314
...
matrix-solver 0.169878 0.169878 0.333018 0.333018
second-order-cur 0.070763 0.070763 0.138719 0.138719
update 0.109914 0.109914 0.215469 0.215469
state-update 0.906806 43.490105 1.777640 85.255034
state-cdp5StCmod 24.232049 24.232049 47.502855 47.502855
state-Cav2_3 0.229292 0.229292 0.449488 0.449488
state-Cav3_1 0.348932 0.348932 0.684023 0.684023
state-GRC_CA 0.518979 0.518979 1.017370 1.017370
state-GRC_KM 0.093657 0.093657 0.183599 0.183599
state-HCN1 0.078251 0.078251 0.153397 0.153397
state-HCN2 0.077570 0.077570 0.152064 0.152064
state-Kca1_1 4.129362 4.129362 8.094920 8.094920
state-Kca2_2 1.515642 1.515642 2.971161 2.971161
state-Kca3_1 0.072181 0.072181 0.141499 0.141499
state-Kv1_1 0.072002 0.072002 0.141148 0.141148
state-Kv3_4 0.235797 0.235797 0.462241 0.462241
state-Kv4_3 0.078983 0.078983 0.154832 0.154832
state-Leak 0.068475 0.068475 0.134233 0.134233
state-Nav1_6 10.832127 10.832127 21.234563 21.234563
real 0m52.622s
I interpret, for example,

fadvance 0.242032 78.523456 ...

to mean that the exclusive time within fadvance was 0.24 s while the time interval between entry and exit (inclusive time) was 78.5 s. The last two numbers on each line are supposed to be the corresponding percentages.
From this perspective the largest master lines
state-update 0.917186 71.467980 1.154318 89.945574
state-cdp5StCmod 49.011724 49.011724 61.683394 61.683394
state-Kca1_1 4.351285 4.351285 5.476282 5.476282
state-Nav1_6 13.441334 13.441334 16.916505 16.916505
suggest for example that state integration is dominated by cdp5StCmod. The equivalent lines for 8.2 are
state-update 0.906806 43.490105 1.777640 85.255034
state-cdp5StCmod 24.232049 24.232049 47.502855 47.502855
state-Kca1_1 4.129362 4.129362 8.094920 8.094920
state-Nav1_6 10.832127 10.832127 21.234563 21.234563
I conclude that the largest part of the master performance problem is with the KINETIC block of cdp5StCmod and I'll continue by focusing my attention there.
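(For anyone wanting to reproduce tables like these: the region timings come from NEURON's optional Caliper instrumentation. A sketch of one way to collect them, assuming Caliper is installed; the cmake flag spellings are those that appear in the build invocations quoted later in this thread, and CALI_CONFIG=runtime-report is Caliper's built-in report recipe:)

cmake .. -DNRN_ENABLE_PROFILING=ON -DNRN_PROFILER=caliper ...
export CALI_CONFIG=runtime-report   # print per-region exclusive/inclusive times at exit
time nrniv -python protocols/01_SS.py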
> Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?

I hadn't realized that yet when I wrote my comment above. That PR being the introduction of SoA and data handles, it may be that many of the intermediate commits will not result in a working system.
Now might be a good time to mention that I'm using a slightly modified version of git@github.com:ModelDBRepository/266806 for 8.2, master, and coreneuron alike. Those mods are:
~/models/modeldb/266806/Morphology_1$ git diff --staged
diff --git a/Morphology_1/mod_files/Hcn1.mod b/Morphology_1/mod_files/Hcn1.mod
index 1df9e10..f985669 100755
--- a/Morphology_1/mod_files/Hcn1.mod
+++ b/Morphology_1/mod_files/Hcn1.mod
@@ -15,7 +15,7 @@ NEURON {
NONSPECIFIC_CURRENT ih
RANGE Q10_diff,Q10_channel,gbar_Q10, ic
RANGE o_fast_inf, o_slow_inf, tau_f, tau_s, Erev
- RANGE gbar,r,g, o
+ RANGE gbar,g, o
}
UNITS {
diff --git a/Morphology_1/mod_files/cdp5StCmod.mod b/Morphology_1/mod_files/cdp5StCmod.mod
index bbdd811..9be5b05 100755
--- a/Morphology_1/mod_files/cdp5StCmod.mod
+++ b/Morphology_1/mod_files/cdp5StCmod.mod
@@ -123,9 +123,10 @@ ASSIGNED {
mgi (mM)
vrat (1)
icazz (nA)
+ cao (mM)
}
-CONSTANT { cao = 2 (mM) }
+:CONSTANT { cao = 2 (mM) }
STATE {
: ca[0] is equivalent to cai
@@ -246,10 +247,11 @@ PROCEDURE factors() {
}
-LOCAL dsq, dsqvol : can't define local variable in KINETIC block
+: LOCAL dsq, dsqvol : can't define local variable in KINETIC block
: or use in COMPARTMENT statement
KINETIC state {
+ LOCAL dsq, dsqvol
COMPARTMENT diam*diam*vrat {ca mg Buff1 Buff1_ca Buff2 Buff2_ca BTC BTC_ca DMNPE DMNPE_ca PV PV_ca PV_mg}
COMPARTMENT (1e10)*parea {pump pumpca}
and the temporary changes (there is currently an array bug in NMODL):
~/models/modeldb/266806/Morphology_1$ git diff
diff --git a/Morphology_1/mod_files/Cav2_3.mod b/Morphology_1/mod_files/Cav2_3.mod
index 0d10ca1..db53b82 100755
--- a/Morphology_1/mod_files/Cav2_3.mod
+++ b/Morphology_1/mod_files/Cav2_3.mod
@@ -10,7 +10,7 @@ NEURON {
THREADSAFE
USEION ca READ eca WRITE ica
RANGE gcabar, m, h, g, gmax, ica
- RANGE inf, tau
+ RANGE inf0, inf1, tau0, tau1
}
UNITS {
@@ -29,8 +29,10 @@ STATE { m h } : unknown activation and inactivation parameters to be
ASSIGNED { : parameters needed to solve DE
ica (mA/cm2)
- inf[2]
- tau[2] (ms)
+ inf0
+ tau0 (ms)
+ inf1
+ tau1 (ms)
g (mho/cm2)
gmax (mho/cm2)
}
@@ -46,8 +48,8 @@ BREAKPOINT {
INITIAL {
mhn(v)
- m = inf[0]
- h = inf[1]
+ m = inf0
+ h = inf1
g = gcabar*m*m*m*h
ica = g*(v - eca) : initial Ca++ current value
gmax = g
@@ -55,8 +57,8 @@ INITIAL {
DERIVATIVE states {
mhn(v)
- m' = (inf[0] - m)/tau[0]
- h' = (inf[1] - h)/tau[1]
+ m' = (inf0 - m)/tau0
+ h' = (inf1 - h)/tau1
}
FUNCTION varss(v (mV), i) {
@@ -79,11 +81,11 @@ FUNCTION vartau(v (mV), i) (ms) {
}
PROCEDURE mhn(v (mV)) {LOCAL a, b :rest = -70
- TABLE inf, tau DEPEND celsius FROM -100 TO 100 WITH 200
- FROM i=0 TO 1 {
- tau[i] = vartau(v,i)
- inf[i] = varss(v,i)
- }
+: TABLE inf, tau DEPEND celsius FROM -100 TO 100 WITH 200
+ tau0 = vartau(v,0)
+ inf0 = varss(v,0)
+ tau1 = vartau(v,1)
+ inf1 = varss(v,1)
}
diff --git a/Morphology_1/protocols/01_SS.py b/Morphology_1/protocols/01_SS.py
index 4e5cc4b..ff0e622 100755
--- a/Morphology_1/protocols/01_SS.py
+++ b/Morphology_1/protocols/01_SS.py
@@ -21,9 +21,10 @@ Fixed_step.active(0)
cpu = multiprocessing.cpu_count()
h.load_file("parcom.hoc")
p = h.ParallelComputeTool()
-p.change_nthread(cpu,1)
-p.multisplit(1)
-print(cpu)
+if h.mycpu > 1.0:
+ p.change_nthread(h.mycpu,1)
+ p.multisplit(1)
+print(h.mycpu)
#Voltage graph
h('load_file("vm.ses")')
@@ -37,13 +38,23 @@ h.celsius = 32
h.tstop = 1000
h.v_init = -65
+pc = h.ParallelContext()
+pc.set_gid2node(0, 0)
+pc.cell(0, cell.nc_spike)
+pc.set_maxstep(10)
+from neuron import coreneuron
+coreneuron.enable = 0
+
#Initialization
def initialize():
h.finitialize()
h.run()
+# h.stdinit()
+# pc.psolve(h.tstop)
initialize()
+'''
#Save the results into an image
fig, ax = plt.subplots()
ax.plot(np.array(cell.time_vector), np.array(cell.vm), 'b', label='spikes')
@@ -59,4 +70,4 @@ plt.ylabel("membrane voltage (mv) ")
plt.savefig('01_SS_trace.eps')
plt.close()
-
+'''
@ramcdougal > Given that (https://github.com/neuronsimulator/nrn/pull/2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?
I gave it a quick shot, but many of the commits don't even build; one of the only ones I got to build was 1bb5bf11c, and it was slow, so that at least partly brackets which of the many commits it was. However, out of the 19 other commits I randomly tried, none of them built without error.
@nrnhines > Those mods are:

I did something similar; I haven't compared the output between the fast commit and the regression, though.
A couple of other short notes.

perf stat for b4ade5509 (fast commit):
Performance counter stats for 'python 01_SS.py':
48,928.26 msec task-clock # 4.263 CPUs utilized
631,731 context-switches # 0.013 M/sec
29 cpu-migrations # 0.001 K/sec
28,020 page-faults # 0.573 K/sec
234,176,288,198 cycles # 4.786 GHz (31.06%)
555,016,050,075 instructions # 2.37 insn per cycle (38.92%)
70,189,788,913 branches # 1434.545 M/sec (38.80%)
99,889,950 branch-misses # 0.14% of all branches (38.56%)
224,807,810,920 L1-dcache-loads # 4594.642 M/sec (38.56%)
1,585,023,985 L1-dcache-load-misses # 0.71% of all L1-dcache hits (38.43%)
329,502,640 LLC-loads # 6.734 M/sec (30.57%)
1,118,627 LLC-load-misses # 0.34% of all LL-cache hits (30.47%)
<not supported> L1-icache-loads
625,499,048 L1-icache-load-misses (30.37%)
225,842,474,219 dTLB-loads # 4615.788 M/sec (30.51%)
784,591 dTLB-load-misses # 0.00% of all dTLB cache hits (30.62%)
45,054,175 iTLB-loads # 0.921 M/sec (30.96%)
28,229,053 iTLB-load-misses # 62.66% of all iTLB cache hits (31.07%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
11.476448534 seconds time elapsed
46.780273000 seconds user
3.108108000 seconds sys
perf stat for 6c529a415 (regression):
Performance counter stats for 'python 01_SS.py':
88,420.11 msec task-clock # 4.289 CPUs utilized
700,525 context-switches # 0.008 M/sec
38 cpu-migrations # 0.000 K/sec
23,665 page-faults # 0.268 K/sec
424,431,007,847 cycles # 4.800 GHz (30.92%)
1,157,209,550,388 instructions # 2.73 insn per cycle (38.58%)
168,992,981,295 branches # 1911.251 M/sec (38.30%)
112,577,199 branch-misses # 0.07% of all branches (38.41%)
474,884,904,576 L1-dcache-loads # 5370.779 M/sec (38.33%)
2,880,849,265 L1-dcache-load-misses # 0.61% of all L1-dcache hits (38.44%)
464,556,719 LLC-loads # 5.254 M/sec (30.96%)
943,476 LLC-load-misses # 0.20% of all LL-cache hits (30.84%)
<not supported> L1-icache-loads
884,842,762 L1-icache-load-misses (30.84%)
476,144,529,914 dTLB-loads # 5385.025 M/sec (30.63%)
1,151,626 dTLB-load-misses # 0.00% of all dTLB cache hits (30.74%)
62,166,895 iTLB-loads # 0.703 M/sec (30.76%)
31,708,346 iTLB-load-misses # 51.01% of all iTLB cache hits (30.83%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
20.613380152 seconds time elapsed
85.971388000 seconds user
3.464232000 seconds sys
It looks like many more instructions are being executed; it's not cache misses or branch misses.
perf record / perf report for b4ade5509 (fast commit):
Overhead Command Shared Object Symbol
23.57% python libnrnmech.so [.] _nrn_state__cdp5StCmod
17.57% python libnrnmech.so [.] state__cdp5StCmod
7.86% python libnrnmech.so [.] _nrn_state__Nav1_6
6.57% python libnrnmech.so [.] neuron::scopmath::_nrn_thread_getelm
6.31% python libnrnmech.so [.] activation__Nav1_6
5.81% python libnrnmech.so [.] neuron::scopmath::sparse_thread<double*, int (*)(void*, double*, double*, Datum*, Datum*, NrnThread*), int*, double*&, Datum*&, Datum*&, NrnThread*&>
5.43% python libm-2.31.so [.] __ieee754_pow_fma
3.41% python libm-2.31.so [.] __ieee754_exp_fma
3.14% python libnrnmech.so [.] activation__Kca1_1
and for 6c529a415 (regression):
Overhead Command Shared Object Symbol
18.32% python libnrnmech.so [.] state__cdp5StCmod
15.51% python libnrnmech.so [.] _nrn_state__cdp5StCmod
14.35% python libnrnmech.so [.] neuron::scopmath::_nrn_thread_getelm
11.44% python libnrnmech.so [.] _ZNK6neuron9container19generic_data_handlecvNS0_11data_handleIT_EEIdEEv
4.42% python libnrnmech.so [.] _nrn_state__Nav1_6
3.43% python libnrnmech.so [.] activation__Nav1_6
3.20% python libnrnmech.so [.] neuron::container::data_handle<double>::get_ptr_helper<neuron::container::data_handle<double> >
2.86% python libm-2.31.so [.] __ieee754_pow_fma
2.21% python libnrnmech.so [.] _nrn_state__Kca1_1
Since it's a statistical profiler, we'll miss things, but it seems that _ZNK6neuron9container19generic_data_handlecvNS0_11data_handleIT_EEIdEEv might be something to look into.
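(The mangled name can be decoded with c++filt:)

echo _ZNK6neuron9container19generic_data_handlecvNS0_11data_handleIT_EEIdEEv | c++filt

which shows it is the const conversion operator from generic_data_handle to data_handle<double>.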
@mgeplf I like that tool (perf?) you are using, but I'm not familiar with it. (I was going to use gprof to delve more into the mod file functions but was wondering if that itself would change the profile considerably.) Please send a link for me to get started. Your fragments above show %, which seems more indirect than actual time. Does one just multiply % by total time? I see that _nrn_thread_getelm went from 7% to 14%, but that seems too small to account for half the time. But I'm likely misinterpreting.
@nrnhines > @mgeplf I like that tool (perf?)

It's this one: https://en.wikipedia.org/wiki/Perf_(Linux)
It's like a console version of VTune (or whatever it's called now), but is nowhere near as slick. Whenever I turn to it, I use this page to remember some useful examples: https://www.brendangregg.com/perf.html
Unfortunately, it seems the DWARF debug info isn't fully propagated into libnrnmech.so, so the (very useful) --call-graph dwarf option can't be used with perf record; I haven't spent the time to figure out how to get it to show up.
> Does one just multiply % by total time?

That may be a loose heuristic, but the nature of sampling profilers means it won't be accurate. I'd probably try valgrind's callgrind tool if I wanted accurate timings; that would probably illuminate things better. I mainly used perf to check that the hardware counters weren't showing red flags like cache misses or branch misses.

> I see that _nrn_thread_getelm went from 7% to 14%

That seems suspicious, too.
Here is an interesting observation. The time difference between 8.2 and master narrows considerably (running under perf):

57.909s master
47.478s 8.2

merely by introducing an ASSIGNED variable d, setting d = diam in the INITIAL block, and replacing every use of diam with d in the KINETIC state block of cdp5StCmod.mod.
This change was suggested by a comparison of the 8.2 and master perf results.

8.2:
# Samples: 195K of event 'cycles:Pu'
# Event count (approx.): 178697138584
#
# Overhead Samples Command Shared Object Symbol >
# ........ ............ ....... ................................................. ................................................>
#
26.40% 49954 nrniv libnrniv.so [.] matsol
17.62% 33393 nrniv libnrniv.so [.] _nrn_thread_getelm
14.13% 26805 nrniv libnrnmech.so [.] state__cdp5StCmod
11.72% 22198 nrniv libnrniv.so [.] sparse_thread
5.55% 10506 nrniv libnrnmech.so [.] _nrn_thread_getelm@plt
3.62% 6861 nrniv libnrnmech.so [.] activation__Nav1_6
3.07% 5820 nrniv libm.so.6 [.] __ieee754_pow_fma
2.83% 5357 nrniv libm.so.6 [.] __ieee754_exp_fma
1.90% 3596 nrniv libm.so.6 [.] exp@@GLIBC_2.29
1.77% 3368 nrniv libnrnmech.so [.] activation__Kca1_1
1.07% 2022 nrniv libnrnmech.so [.] rates__Nav1_6.constprop.0.isra.0
1.01% 1915 nrniv libnrniv.so [.] hoc_Exp
0.81% 1539 nrniv libnrnmech.so [.] kin__Kca2_2
0.62% 1180 nrniv libm.so.6 [.] pow@@GLIBC_2.29
with the master perf result still using diam directly:

master with diam
# Samples: 310K of event 'cycles:Pu'
# Event count (approx.): 305901402865
#
# Overhead Samples Command Shared Object Symbol >
# ........ ............ ....... ................................................. ........................................................................................................>
#
31.96% 97514 nrniv libnrniv.so [.] double* neuron::container::generic_data_handle::get<double*>() const
15.67% 47797 nrniv libnrnmech.so [.] neuron::scopmath::detail::sparse_thread::matsol(neuron::scopmath::SparseObj*)
12.22% 37287 nrniv libnrnmech.so [.] state__cdp5StCmod(void*, double*, neuron::cache::MechanismRange<65ul, 8ul>*, unsigned long, neuron::>
9.18% 27980 nrniv libnrnmech.so [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)
4.95% 15139 nrniv libnrnmech.so [.] _nrn_state__cdp5StCmod(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
3.39% 10338 nrniv libnrnmech.so [.] activation__Nav1_6(void*, double*, neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron:>
3.03% 9227 nrniv libnrnmech.so [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)@plt
2.19% 6667 nrniv libm.so.6 [.] __ieee754_pow_fma
1.66% 5073 nrniv libm.so.6 [.] __ieee754_exp_fma
1.54% 4699 nrniv libnrnmech.so [.] _nrn_state__Nav1_6(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
1.32% 4038 nrniv libnrnmech.so [.] activation__Kca1_1(void*, double*, neuron::cache::MechanismRange<53ul, 5ul>*, unsigned long, neuron:>
1.15% 3502 nrniv libnrnmech.so [.] rates__Nav1_6(neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron::container::generic_d>
1.11% 3393 nrniv libnrnmech.so [.] _nrn_state__Kca1_1(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
0.99% 3026 nrniv libm.so.6 [.] exp@@GLIBC_2.29
0.66% 2025 nrniv libnrnmech.so [.] double* neuron::container::generic_data_handle::get<double*>() const@plt
0.57% 1724 nrniv libnrnmech.so [.] states__GRC_CA(neuron::cache::MechanismRange<30ul, 3ul>*, unsigned long, neuron::container::generic_>
0.56% 1711 nrniv libnrniv.so [.] hoc_Exp(double)
Note the top entry, generic_data_handle::get<double*>. Avoiding that with d = diam in the INITIAL block, the perf results become:

master with d=diam
# Samples: 215K of event 'cycles:Pu'
# Event count (approx.): 209229806356
#
# Overhead Samples Command Shared Object Symbol >
# ........ ............ ....... ................................................. ...............................................................................................................>
#
23.10% 48383 nrniv libnrnmech.so [.] neuron::scopmath::detail::sparse_thread::matsol(neuron::scopmath::SparseObj*)
18.91% 39595 nrniv libnrnmech.so [.] state__cdp5StCmod(void*, double*, neuron::cache::MechanismRange<66ul, 8ul>*, unsigned long, neuron::contain>
13.94% 29208 nrniv libnrnmech.so [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)
6.72% 14087 nrniv libnrnmech.so [.] _nrn_state__cdp5StCmod(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
5.02% 10527 nrniv libnrnmech.so [.] activation__Nav1_6(void*, double*, neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron::contai>
4.52% 9464 nrniv libnrnmech.so [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)@plt
3.31% 6940 nrniv libm.so.6 [.] __ieee754_pow_fma
2.43% 5103 nrniv libm.so.6 [.] __ieee754_exp_fma
2.32% 4873 nrniv libnrnmech.so [.] _nrn_state__Nav1_6(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
1.91% 4000 nrniv libnrnmech.so [.] activation__Kca1_1(void*, double*, neuron::cache::MechanismRange<53ul, 5ul>*, unsigned long, neuron::contai>
1.71% 3579 nrniv libnrnmech.so [.] rates__Nav1_6(neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron::container::generic_data_han>
1.45% 3038 nrniv libm.so.6 [.] exp@@GLIBC_2.29
Just for my reference. Install perf
sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`
Avoid the error
$ perf record nrniv -c 'quit()'
Error:
Access to performance monitoring and observability operations is limited.
...
with
sudo sysctl kernel.perf_event_paranoid=2
I find the sample counts helpful, so am using
perf report --stdio -n
It would be helpful to collect useful idioms in https://nrn.readthedocs.io/en/latest/install/debug.html#profiling-and-performance-benchmarking
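(Putting the above together for this MWE, the collect-then-inspect cycle is, e.g.:)

perf record -o perf.data nrniv -python protocols/01_SS.py
perf report --stdio -n -i perf.data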
I ran this under valgrind --tool=callgrind, and kcachegrind seems to point to neuron::container::generic_data_handle:
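(For reference, a minimal collect-and-view sequence; callgrind slows the run by an order of magnitude or more, so a reduced tstop is advisable. callgrind writes callgrind.out.<pid> by default:)

valgrind --tool=callgrind nrniv -python protocols/01_SS.py
kcachegrind callgrind.out.*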
I forked git@github.com:ModelDBRepository/266806; my changes are in git@github.com:nrnhines/266806 on branch hines/v9.
Another performance data point... I installed Intel oneAPI in hopes of using VTune (I have not yet figured out how to focus on the data handle performance), but here are the results for CC=icx CXX=icpx for master and 8.2. For the model, I'm using the hines/v9 branch mentioned above.
icx compiler:

nthread | master | v8.2 |
---|---|---|
1 | 32.9031 | 38.4446 |
2 | 19.5026 | 22.4375 |
4 | 13.5151 | 14.2496 |
8 | 11.7145 | 8.35472 |
16 | 11.6525 | 6.88371 |
To check whether the recent improvements (now all merged) are sufficient, I used this version of ModelDB: https://github.com/nrnhines/266806/commit/c57abdba4833919dad878e556e667d5811214ec5
When compiled with:
cmake -DCMAKE_INSTALL_PREFIX=build/install -DNRN_ENABLE_TESTS=ON -DNRN_ENABLE_PYTHON=ON -DNRN_ENABLE_DOCS=OFF -DNRN_ENABLE_RX3D=OFF -DNRN_ENABLE_CORENEURON=ON -DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_PROFILING=OFF -DNRN_ENABLE_MPI_DYNAMIC=OFF -DCORENRN_ENABLE_OPENMP=OFF -DCMAKE_BUILD_TYPE=Release -B build .
I get the following runtimes for 9.0:
ncores: [ 2 , 4 , 8 ]
time [s]: [ 28.8, 19.1, 15.6 ]
[ 30.2, 20.9, 16.0 ]
[ 20.4, 15.8 ]
[ , 15.9 ]
and for 8.2.4:
ncores: [ 2 , 4 , 8 ]
time [s]: [ 28.9, 17.7, 15.8 ]
[ 25.3, 16.9, 15.7 ]
[ 26.4, 17.5, 15.9 ]
[ , 16.0 ]
It seems most of the regression is fixed, but there's still a performance gap. We have a couple of options:
- Consider it good enough and close the issue.
- Keep the issue open, but only as a reminder that there is still a performance gap; development efforts would move on and we wouldn't consider it blocking for 9.0.
- Keep working on the issue.
Thought I'd look at the present performance status on my Apple M1.
ncore | 8.2.4-10-g4ad8f84d2 | 9.0a-269-g72034bb09 |
---|---|---|
1 | 54.26 | 44.81 |
2 | 30.78 | 27.03 |
4 | 18.41 | 18.45 |
8 | 16.85 | 17.79 |
Note that the times above come from
hines@michaels-macbook-pro-2 Morphology_1 % git diff
diff --git a/Morphology_1/protocols/01_SS.py b/Morphology_1/protocols/01_SS.py
index d134bcc..4486a37 100755
--- a/Morphology_1/protocols/01_SS.py
+++ b/Morphology_1/protocols/01_SS.py
@@ -47,8 +47,11 @@ pc.set_maxstep(10)
#coreneuron.file_mode = 0
#Initialization
def initialize():
+ x = h.startsw()
h.finitialize()
h.run()
+ x = h.startsw() - x
+ print("nthread=%d time=%g" % (pc.nthread(), x))
# h.stdinit()
# pc.psolve(h.tstop)
I launched with commands like

hines@michaels-macbook-pro-2 Morphology_1 % time nrniv -c 'mycpu=8' protocols/01_SS.py

but noticed a puzzling requirement nowadays for the master: I needed

hines@michaels-macbook-pro-2 Morphology_1 % export PYTHONPATH=$PYTHONPATH:`pwd`

in order for python to find the Golgi2020_morpho_1.py imported by protocols/01_SS.py.
I believe @pramodk has demonstrated (please correct me if I'm mistaken) on x86_64 that the performance differences disappear when the Intel compiler is used; it would be nice to document that here.
I vote to close this issue, or perhaps to have a discussion about it at the next developer meeting. The thread here is getting long and there remain many avenues for further exploration, e.g. good developer documentation on how to diagnose performance issues, multisplit, and KINETIC memory access ordering.
It occurred to me that my last performance results
ncore | 8.2.4-10-g4ad8f84d2 | 9.0a-269-g72034bb09 |
---|---|---|
1 | 54.26 | 44.81 |
2 | 30.78 | 27.03 |
4 | 18.41 | 18.45 |
8 | 16.85 | 17.79 |
might be due to the master suffering from cacheline sharing between threads. To look into this, I started an experimental branch, hines/thread-padding, to see if it would be easy to take advantage of the permutability of the SoA storage to add some padding rows between the threads. The first commit added a padding field (default 0) to NrnThread and Memb_list. The second commit added some debugging printfs to indicate the data padding that would be needed in the backing std::vector<double> for each range variable. Launching the model of this issue with 1 and 4 threads gives:
Morphology_1 % nrniv -c 'mycpu=1' protocols/01_SS.py
NEURON -- VERSION 9.0a-270-gac8d660b2+ hines/thread-padding (ac8d660b2+) 2024-06-24
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2022
See http://neuron.yale.edu/neuron/credits
loading membrane mechanisms from arm64/.libs/libnrnmech.so
Additional mechanisms from files
"mod_files/Cav12.mod" "mod_files/Cav13.mod" "mod_files/Cav2_3.mod" "mod_files/Cav3_1.mod" "mod_files/GOLGI_Ampa_mossy_det_vi.mod" "mod_files/GOLGI_Ampa_pf_aa_det_vi.mod" "mod_files/GRC_CA.mod" "mod_files/GRC_KM.mod" "mod_files/Hcn1.mod" "mod_files/Hcn2.mod" "mod_files/Kca11.mod" "mod_files/Kca22.mod" "mod_files/Kca31.mod" "mod_files/Kv11.mod" "mod_files/Kv34.mod" "mod_files/Kv43.mod" "mod_files/Leak.mod" "mod_files/Nav16.mod" "mod_files/PC_NMDA_NR2B.mod" "mod_files/cdp5StCmod.mod"
ZZZ nrn_ensure_model_data_are_sorted needs to sort. nrn_nthread=1
ZZZ nodes id=0 end=0 padding=0
5389 lines read
1.0
ZZZ nrn_ensure_model_data_are_sorted needs to sort. nrn_nthread=1
ZZZ nodes id=0 end=545 padding=7
ZZZ capacitance id=0 count=317 padding=3
ZZZ na_ion id=0 count=317 padding=3
ZZZ k_ion id=0 count=317 padding=3
ZZZ ca_ion id=0 count=317 padding=3
ZZZ Cav2_3 id=0 count=123 padding=5
ZZZ Cav3_1 id=0 count=124 padding=4
ZZZ GRC_CA id=0 count=90 padding=6
ZZZ GRC_KM id=0 count=5 padding=3
ZZZ HCN1 id=0 count=5 padding=3
ZZZ HCN2 id=0 count=5 padding=3
ZZZ Kca1_1 id=0 count=213 padding=3
ZZZ Kca2_2 id=0 count=207 padding=1
ZZZ Kca3_1 id=0 count=1 padding=7
ZZZ Kv1_1 id=0 count=1 padding=7
ZZZ Kv3_4 id=0 count=105 padding=7
ZZZ Kv4_3 id=0 count=1 padding=7
ZZZ Leak id=0 count=317 padding=3
ZZZ Nav1_6 id=0 count=317 padding=3
ZZZ nrvc_ion id=0 count=317 padding=3
ZZZ cdp5StCmod id=0 count=317 padding=3
and
Morphology_1 % nrniv -c 'mycpu=4' protocols/01_SS.py
NEURON -- VERSION 9.0a-270-gac8d660b2+ hines/thread-padding (ac8d660b2+) 2024-06-24
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2022
See http://neuron.yale.edu/neuron/credits
loading membrane mechanisms from arm64/.libs/libnrnmech.so
Additional mechanisms from files
"mod_files/Cav12.mod" "mod_files/Cav13.mod" "mod_files/Cav2_3.mod" "mod_files/Cav3_1.mod" "mod_files/GOLGI_Ampa_mossy_det_vi.mod" "mod_files/GOLGI_Ampa_pf_aa_det_vi.mod" "mod_files/GRC_CA.mod" "mod_files/GRC_KM.mod" "mod_files/Hcn1.mod" "mod_files/Hcn2.mod" "mod_files/Kca11.mod" "mod_files/Kca22.mod" "mod_files/Kca31.mod" "mod_files/Kv11.mod" "mod_files/Kv34.mod" "mod_files/Kv43.mod" "mod_files/Leak.mod" "mod_files/Nav16.mod" "mod_files/PC_NMDA_NR2B.mod" "mod_files/cdp5StCmod.mod"
ZZZ nrn_ensure_model_data_are_sorted needs to sort. nrn_nthread=1
ZZZ nodes id=0 end=0 padding=0
5389 lines read
ZZZ nrn_ensure_model_data_are_sorted needs to sort. nrn_nthread=4
ZZZ nodes id=0 end=127 padding=1
ZZZ nodes id=1 end=152 padding=0
ZZZ nodes id=2 end=129 padding=7
ZZZ nodes id=3 end=157 padding=3
ZZZ capacitance id=0 count=72 padding=0
ZZZ capacitance id=1 count=85 padding=3
ZZZ capacitance id=2 count=70 padding=2
ZZZ capacitance id=3 count=90 padding=6
ZZZ na_ion id=0 count=72 padding=0
ZZZ na_ion id=1 count=85 padding=3
ZZZ na_ion id=2 count=70 padding=2
ZZZ na_ion id=3 count=90 padding=6
ZZZ k_ion id=0 count=72 padding=0
ZZZ k_ion id=1 count=85 padding=3
ZZZ k_ion id=2 count=70 padding=2
ZZZ k_ion id=3 count=90 padding=6
ZZZ ca_ion id=0 count=72 padding=0
ZZZ ca_ion id=1 count=85 padding=3
ZZZ ca_ion id=2 count=70 padding=2
ZZZ ca_ion id=3 count=90 padding=6
ZZZ Cav2_3 id=0 count=33 padding=7
ZZZ Cav2_3 id=1 count=28 padding=4
ZZZ Cav2_3 id=2 count=50 padding=6
ZZZ Cav2_3 id=3 count=12 padding=4
ZZZ Cav3_1 id=0 count=33 padding=7
ZZZ Cav3_1 id=1 count=29 padding=3
ZZZ Cav3_1 id=2 count=50 padding=6
ZZZ Cav3_1 id=3 count=12 padding=4
ZZZ GRC_CA id=0 count=39 padding=1
ZZZ GRC_CA id=1 count=5 padding=3
ZZZ GRC_CA id=2 count=20 padding=4
ZZZ GRC_CA id=3 count=26 padding=6
ZZZ GRC_KM id=3 count=5 padding=3
ZZZ HCN1 id=3 count=5 padding=3
ZZZ HCN2 id=3 count=5 padding=3
ZZZ Kca1_1 id=0 count=72 padding=0
ZZZ Kca1_1 id=1 count=33 padding=7
ZZZ Kca1_1 id=2 count=70 padding=2
ZZZ Kca1_1 id=3 count=38 padding=2
ZZZ Kca2_2 id=0 count=72 padding=0
ZZZ Kca2_2 id=1 count=32 padding=0
ZZZ Kca2_2 id=2 count=70 padding=2
ZZZ Kca2_2 id=3 count=33 padding=7
ZZZ Kca3_1 id=1 count=1 padding=7
ZZZ Kv1_1 id=1 count=1 padding=7
ZZZ Kv3_4 id=1 count=53 padding=3
ZZZ Kv3_4 id=3 count=52 padding=4
ZZZ Kv4_3 id=1 count=1 padding=7
ZZZ Leak id=0 count=72 padding=0
ZZZ Leak id=1 count=85 padding=3
ZZZ Leak id=2 count=70 padding=2
ZZZ Leak id=3 count=90 padding=6
ZZZ Nav1_6 id=0 count=72 padding=0
ZZZ Nav1_6 id=1 count=85 padding=3
ZZZ Nav1_6 id=2 count=70 padding=2
ZZZ Nav1_6 id=3 count=90 padding=6
ZZZ nrvc_ion id=0 count=72 padding=0
ZZZ nrvc_ion id=1 count=85 padding=3
ZZZ nrvc_ion id=2 count=70 padding=2
ZZZ nrvc_ion id=3 count=90 padding=6
ZZZ cdp5StCmod id=0 count=72 padding=0
ZZZ cdp5StCmod id=1 count=85 padding=3
ZZZ cdp5StCmod id=2 count=70 padding=2
ZZZ cdp5StCmod id=3 count=90 padding=6
With that amount of cacheline sharing it seems reasonable to pursue this. One of the purposes of the SoA/DataHandle design was to be able to easily do permutation experiments on performance, and a case in point is the present segregation of backing data into threads, which is merely a permutation. However, I have to admit that the introduction of "holes" of unused padding data into the backing store makes me nervous, even though it is still a "kind" of permutation. It seems like it ought to be conceptually straightforward to create the padding data at the tail end of the storage and then permute it into place at the end of each thread region. What is a bit foggy at the moment is whether unused node and mechanism data that computationally never participates in a simulation and, at least from some perspectives, is not available to the interpreter, can be created/destroyed/managed with a small amount of code. Also worrying are the different item sizes in the std::vector for each range variable when the items are arrays (SoAoS).
@pramodk @1uc I'd be happy to have your opinions/suggestions.
I don't know how we can easily add "false" instances of a mechanism to achieve the required padding. I'll let you know if I think of something.
There are three reasons I'm not particularly leaning towards false sharing:

1. The last cache line of thread i shares data with the first cache line of thread i+1. For the costly mechanisms (cdp5StCmod, Nav1_6, Kca1_1 and Kca2_2) there are typically 70+ rows per thread. The L1 cache on this machine holds about 40-50 rows of cdp5StCmod per core. Hence, unless the threads are really out of sync, by the time thread i needs the cache line it shares with thread i+1, thread i+1 is likely done with that line, and has likely evicted it already.

2. The most expensive mechanism, cdp5StCmod, is KINETIC with a 24x24 matrix; the other three are also kinetic, with slightly smaller matrices. Each of these requires one or more linear solves per instance.

3. If the uncertainty in the runtime is similar to what I was seeing on my device, then at 4 and 8 threads I'm not convinced we can see a performance difference between 9.0 and 8.2.4.
A counter to this argument is that only about 50% of the runtime is spent in the "expensive" mechanisms, meaning the tail is heavy and optimizations that don't seem important for the heavy mechanisms might still matter.
It feels like one should be able to measure some of these things, e.g. using LIKWID; that seems like something interesting to try out. LIKWID specifically will likely not work on laptop and desktop hardware, only on server hardware like BB5. VTune should also be able to measure certain metrics, but I'm not sure we can tell it to measure "per mechanism".
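(A sketch of the kind of per-core counter measurement meant here, assuming LIKWID's likwid-perfctr tool; performance group names vary by CPU, so list the available ones first with likwid-perfctr -a:)

likwid-perfctr -C 0-3 -g MEM nrniv -c 'mycpu=4' protocols/01_SS.py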
I don't have answers to the above questions, but I quickly want to provide some additional timing numbers that I got on my x86_64 desktop:

- I used the model from my fork. I have reverted the diam-related change in the MOD file and made some other small changes to measure timing (all trivial): https://github.com/pramodk/266806/tree/hines/v9
- I built three versions: 8.2.4, latest master, and presoa, which is the commit just before Olli's SoA PR. You have the commit links to all 3 versions in the table below.
- I built these versions in the standard way, i.e. -DCMAKE_BUILD_TYPE=RelWithDebInfo:

cmake .. -DNRN_ENABLE_MOD_COMPATIBILITY=OFF -DCORENRN_ENABLE_OPENMP=OFF \
 -DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_RX3D=OFF \
 -DCMAKE_INSTALL_PREFIX=`pwd`/install -DNRN_ENABLE_CORENEURON=OFF \
 -DNRN_ENABLE_TESTS=OFF -DNRN_ENABLE_PROFILING=OFF \
 -DNRN_PROFILER=caliper -DCMAKE_BUILD_TYPE=RelWithDebInfo

- The comparison is between GCC 11.4.0 and Intel 2023.2.0.
GCC Compiler
Version | Threads | Repetition 1 | Repetition 2 |
---|---|---|---|
master-latest | 1 | 3.99 | 4.10 |
 | 2 | 2.22 | 2.27 |
 | 4 | 1.26 | 1.32 |
 | 8 | 1.15 | 1.15 |
presoa | 1 | 2.77 | 2.71 |
 | 2 | 1.46 | 1.42 |
 | 4 | 0.78 | 0.77 |
 | 8 | 0.72 | 0.74 |
824 | 1 | 3.48 | 3.30 |
 | 2 | 1.75 | 1.75 |
 | 4 | 0.97 | 0.98 |
 | 8 | 0.88 | 0.87 |
Intel Compiler
Version | Threads | Repetition 1 | Repetition 2 |
---|---|---|---|
master-latest | 1 | 2.77 | 2.76 |
 | 2 | 1.55 | 1.56 |
 | 4 | 0.92 | 0.92 |
 | 8 | 0.84 | 0.84 |
presoa | 1 | 2.29 | 2.26 |
 | 2 | 1.23 | 1.23 |
 | 4 | 0.68 | 0.68 |
 | 8 | 0.58 | 0.61 |
824 | 1 | 3.37 | 3.38 |
 | 2 | 1.83 | 1.83 |
 | 4 | 1.00 | 1.08 |
 | 8 | 0.80 | 0.81 |
@1uc and I briefly looked at the presoa vs master-latest difference. I can add more info here once we generate and compare the profile outputs.
Finally, I got dedicated time and a "working" Linux desktop machine without hiccups to look into detailed profiling.
Compiler | Build Type | Threads | Average Time (seconds) |
---|---|---|---|
gcc | master + PR #2966 | 1 | 2.78 |
gcc | master + PR #2966 | 4 | 0.95 |
gcc | master + PR #2966 | 8 | 0.82 |
gcc | 824 | 1 | 3.32 |
gcc | 824 | 4 | 1.06 |
gcc | 824 | 8 | 0.89 |
clang | master + PR #2966 | 1 | 2.92 |
clang | master + PR #2966 | 4 | 0.99 |
clang | master + PR #2966 | 8 | 0.88 |
clang | 824 | 1 | 3.34 |
clang | 824 | 4 | 1.04 |
clang | 824 | 8 | 0.83 |
intel | master + PR #2966 | 1 | 2.60 |
intel | master + PR #2966 | 4 | 0.89 |
intel | master + PR #2966 | 8 | 0.81 |
intel | 824 | 1 | 3.37 |
intel | 824 | 4 | 1.02 |
intel | 824 | 8 | 0.81 |
With all 3 compilers (GCC, Intel and Clang), master with #2966 is equal or better than v8.2.4.
I should expand these notes into dev docs, but here are some brief points already:
- For VTune profiling with a Python script, itt-python is handy to pause/resume around the solver part:
$ git diff
diff --git a/Morphology_1/protocols/01_SS.py b/Morphology_1/protocols/01_SS.py
index bbdeff7..2dd1111 100755
--- a/Morphology_1/protocols/01_SS.py
+++ b/Morphology_1/protocols/01_SS.py
@@ -12,6 +12,9 @@ import numpy as np
import sys
import os
+import itt
+itt.pause()
+
mycpu = multiprocessing.cpu_count()
mycpu = int(os.environ['NRN_THREADS']) if 'NRN_THREADS' in os.environ else mycpu
@@ -53,9 +56,11 @@ pc.set_maxstep(10)
def run():
import time
h.stdinit()
+ itt.resume()
t0 = time.time()
pc.psolve(h.tstop)
t1 = time.time()
+ itt.pause()
print("NEURON RUN with %d threads took %f " % (mycpu, t1-t0))
Vtune profiling is as usual:
analysis=uarch-exploration
result_dir=vtune_uarch
time taskset -c 0-7 vtune -collect $analysis -start-paused -result-dir=${result_dir} $compile_dir/x86_64/special -python protocols/01_SS.py
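(The collected result can also be summarized from the command line; the comparison views mentioned below need the GUI:)

vtune -report summary -result-dir ${result_dir}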
- VTune samples a lot of functions, and the generated profile data is a bit cluttered if we want to use the "profile comparison" functionality. We can focus on specific parts like sparse_thread() by modifying nocmodl as:
+++ b/src/nmodl/deriv.cpp
@@ -19,6 +19,8 @@ void copylist(List*, Item*);
List* massage_list_;
List* netrec_cnexp;
+extern int requires_itt_notify;
+
/* SmallBuf size */
#undef SB
#define SB 256
@@ -183,6 +185,7 @@ void solv_diffeq(Item* qsol,
fun->name,
listnum);
vectorize_substitute(qsol, buf);
+ requires_itt_notify = 1;
}
}
dtsav_for_nrn_state = 1;
diff --git a/src/nmodl/noccout.cpp b/src/nmodl/noccout.cpp
index a2ceef6b9..0670e558f 100644
--- a/src/nmodl/noccout.cpp
+++ b/src/nmodl/noccout.cpp
@@ -33,6 +33,8 @@ extern Symbol* cvode_nrn_cur_solve_;
extern Symbol* cvode_nrn_current_solve_;
extern List* state_discon_list_;
+int requires_itt_notify = 0;
+
/* VECTORIZE has not been optional for years. We leave the define there but */
/* we no longer update the #else clauses. */
extern int vectorize;
@@ -532,6 +534,9 @@ void c_out_vectorize() {
/* things which must go first and most declarations */
P("/* VECTORIZED */\n#define NRN_VECTORIZED 1\n");
P("#include <stdio.h>\n#include <stdlib.h>\n#include <math.h>\n#include \"mech_api.h\"\n");
+ if (requires_itt_notify) {
+ P("#include <ittnotify.h>\n");
+ }
P("#undef PI\n");
P("#define nil 0\n");
P("#define _pval pval\n"); // due to some old models using _pval
@@ -775,6 +780,9 @@ void c_out_vectorize() {
P("double _dtsav = dt;\n"
"if (secondorder) { dt *= 0.5; }\n");
}
+ if (requires_itt_notify) {
+ P("__itt_resume();\n");
+ }
P("#if CACHEVEC\n");
P(" _ni = _ml->_nodeindices;\n");
P("#endif\n");
@@ -797,6 +805,9 @@ void c_out_vectorize() {
if (dtsav_for_nrn_state && nrnstate) {
P(" dt = _dtsav;");
}
+ if (requires_itt_notify) {
+ P("__itt_pause();\n");
+ }
i.e. in this case we are only recording the nrn_state functions that use sparse solvers.
- When comparing the profiles with VTune, using the custom view to compare timings per file might be a good first step. This is helpful because functions and namespaces changed between 8.2.4 and master, which makes comparison a bit difficult.
- In the bottom-right corner, one can easily toggle inline functions in the profile view. This also helps to quickly find out which functions are or are not inlined.
- If we compare Intel vs GCC profiles for the same master version, we see the following:
  - the total instructions executed are significantly higher with GCC;
  - the same is reflected in sparse_thread.hpp or cdp5StCmod.cpp;
  - the same can be seen in the Event Count tab;
  - note that counts are empty in the Intel version because the functions are inlined there.
- The issue is reproducible if we install NEURON with the Intel compiler and then run nrnivmodl with the GCC compiler, i.e.:

// install neuron with icx icpx
// now compile the mod files with gcc; link against libirc for intel functions like fast copy
CXX=g++ nrnivmodl -loadflags "<home>/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libirc.a" mod_files

  Or, compiling just cdp5StCmod.cpp with g++ is sufficient. Since the issue is reproducible like this, it points to a code generation/compilation issue.
- Looking at the assembly view for the two executions, we can see that in one case we have explicit function calls but not in the other (i.e. an inlining issue):
This got closed when I merged the PR, but I am happy to clarify if there are any further comments/questions. I will take some time to document some of the instructions above.
@SteMasoli: I tested up to 8 threads, and the results are consistent now. When you have time, could you pull the latest master branch and check if the results are as expected on your machine? That would be great!
We tried NEURON+NMODL (with a small tweak) and get:
ncores: [ 2 , 4 , 8 , 16 ]
time [s]: [ 34.2, 20.0, 13.8, 23.4 ]
[ 34.5, 20.2, 14.2, 24.4 ]
[ 34.6, 19.5, 14.2, -- ]
[ 34.7, 20.2, 14.2, -- ]
compared to NEURON master + NOCMODL:
ncores: [ 2 , 4 , 8 , 16 ]
time [s]: [ 34.9, 20.3, 14.0, 22.3 ]
[ 34.7, 20.2, 14.2, 22.0 ]
[ 35.6, 19.6, 13.6, -- ]
[ 34.6, 19.9, 14.5, -- ]
(I'm on a different device from the previous measurements.)
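(For anyone reproducing the NEURON+NMODL build: the transpiler is selected at configure time. A sketch, assuming the NRN_ENABLE_NMODL option in current master; verify the exact spelling with cmake -LA on your checkout:)

cmake .. -DNRN_ENABLE_NMODL=ON ...   # generate mechanism code with NMODL instead of NOCMODL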