nrn Slow compile time with rx3d module and -O3

When we build rx3d, Cython-generated files like ctng.cpp take minutes to compile with -O3 (-O0 is quite fast).

Oct 07 '19 07:10 nrnhines

The cmake build system provides the NRN_RX3D_OPT_LEVEL option. The default is

-DNRN_RX3D_OPT_LEVEL=0

and you can get the old autotools level with

-DNRN_RX3D_OPT_LEVEL=2

Nov 08 '19 13:11 nrnhines

Instead of creating a new issue, I will use this one to discuss the problem and a possible solution.

I was trying to get an idea of what takes up the compilation time. Using -ftime-report gives the following:

→ time g++-9 -I/usr/local/opt/ruby/include -arch i386 -arch x86_64 -pipe -I/Users/kumbhar/workarena/repos/bbp/nn/share/lib/python/neuron/rxd/geometry3d -I. -I/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c ../../share/lib/python/neuron/rxd/geometry3d/surfaces.cpp -o surfaces.o -O0 -ftime-report -O3
g++-9: warning: x86_64 conflicts with i386 (arch flags ignored)
In file included from /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/ndarraytypes.h:1760,
                 from /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/ndarrayobject.h:17,
                 from /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/arrayobject.h:4,
                 from ../../share/lib/python/neuron/rxd/geometry3d/surfaces.cpp:598:
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
   15 | #warning "Using deprecated NumPy API, disable it by " \
      |  ^~~~~~~

Time variable                                   usr           sys          wall               GGC
 phase setup                        :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)    1481 kB (  1%)
 phase parsing                      :   1.63 ( 13%)   0.70 ( 53%)   2.43 ( 17%)   89009 kB ( 34%)
 phase lang. deferred               :   0.04 (  0%)   0.02 (  2%)   0.05 (  0%)    3439 kB (  1%)
 phase opt and generate             :  10.52 ( 86%)   0.59 ( 45%)  11.71 ( 82%)  164208 kB ( 64%)
 |name lookup                       :   0.25 (  2%)   0.11 (  8%)   0.35 (  2%)    2179 kB (  1%)
 |overload resolution               :   0.14 (  1%)   0.03 (  2%)   0.13 (  1%)    3727 kB (  1%)
 dump files                         :   0.03 (  0%)   0.01 (  1%)   0.04 (  0%)       0 kB (  0%)
 callgraph construction             :   0.03 (  0%)   0.00 (  0%)   0.03 (  0%)    4394 kB (  2%)
 callgraph optimization             :   0.05 (  0%)   0.00 (  0%)   0.01 (  0%)       8 kB (  0%)
 ipa function summary               :   0.03 (  0%)   0.00 (  0%)   0.02 (  0%)     852 kB (  0%)
 ipa cp                             :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)     824 kB (  0%)
 ipa inlining heuristics            :   0.03 (  0%)   0.00 (  0%)   0.03 (  0%)    1361 kB (  1%)
 ipa function splitting             :   0.02 (  0%)   0.00 (  0%)   0.00 (  0%)     188 kB (  0%)
 ipa various optimizations          :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     528 kB (  0%)
 ipa icf                            :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 ipa SRA                            :   0.11 (  1%)   0.02 (  2%)   0.13 (  1%)   13899 kB (  5%)
 cfg construction                   :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     857 kB (  0%)
 cfg cleanup                        :   0.14 (  1%)   0.00 (  0%)   0.17 (  1%)    1424 kB (  1%)
 trivially dead code                :   0.04 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 df scan insns                      :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)       2 kB (  0%)
 df multiple defs                   :   0.05 (  0%)   0.01 (  1%)   0.06 (  0%)       0 kB (  0%)
 df reaching defs                   :   0.11 (  1%)   0.02 (  2%)   0.18 (  1%)       0 kB (  0%)
 df live regs                       :   0.44 (  4%)   0.00 (  0%)   0.48 (  3%)       0 kB (  0%)
 df live&initialized regs           :   0.31 (  3%)   0.00 (  0%)   0.35 (  2%)       0 kB (  0%)
 df must-initialized regs           :   0.02 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 df use-def / def-use chains        :   0.06 (  0%)   0.00 (  0%)   0.03 (  0%)       0 kB (  0%)
 df reg dead/unused notes           :   0.20 (  2%)   0.00 (  0%)   0.11 (  1%)    1745 kB (  1%)
 register information               :   0.04 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 alias analysis                     :   0.07 (  1%)   0.00 (  0%)   0.08 (  1%)    4152 kB (  2%)
 alias stmt walking                 :   0.24 (  2%)   0.10 (  8%)   0.42 (  3%)      45 kB (  0%)
 register scan                      :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)     137 kB (  0%)
 rebuild jump labels                :   0.04 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 preprocessing                      :   0.34 (  3%)   0.26 ( 20%)   0.63 (  4%)   11877 kB (  5%)
 parser (global)                    :   0.40 (  3%)   0.16 ( 12%)   0.55 (  4%)   24151 kB (  9%)
 parser struct body                 :   0.11 (  1%)   0.04 (  3%)   0.17 (  1%)    6392 kB (  2%)
 parser function body               :   0.51 (  4%)   0.15 ( 11%)   0.63 (  4%)   28038 kB ( 11%)
 parser inl. func. body             :   0.10 (  1%)   0.02 (  2%)   0.15 (  1%)    8510 kB (  3%)
 parser inl. meth. body             :   0.06 (  0%)   0.03 (  2%)   0.09 (  1%)    2785 kB (  1%)
 template instantiation             :   0.09 (  1%)   0.06 (  5%)   0.16 (  1%)   10506 kB (  4%)
 constant expression evaluation     :   0.05 (  0%)   0.00 (  0%)   0.10 (  1%)      75 kB (  0%)
 inline parameters                  :   0.03 (  0%)   0.00 (  0%)   0.05 (  0%)     988 kB (  0%)
 integration                        :   0.06 (  0%)   0.01 (  1%)   0.07 (  0%)    7978 kB (  3%)
 tree gimplify                      :   0.06 (  0%)   0.00 (  0%)   0.09 (  1%)   11135 kB (  4%)
 tree eh                            :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)     726 kB (  0%)
 tree CFG construction              :   0.01 (  0%)   0.01 (  1%)   0.02 (  0%)    3896 kB (  2%)
 tree CFG cleanup                   :   0.09 (  1%)   0.00 (  0%)   0.19 (  1%)     202 kB (  0%)
 tree tail merge                    :   0.05 (  0%)   0.00 (  0%)   0.04 (  0%)     942 kB (  0%)
 tree VRP                           :   0.25 (  2%)   0.01 (  1%)   0.29 (  2%)    5185 kB (  2%)
 tree Early VRP                     :   0.06 (  0%)   0.02 (  2%)   0.05 (  0%)    2473 kB (  1%)
 tree copy propagation              :   0.09 (  1%)   0.00 (  0%)   0.03 (  0%)     465 kB (  0%)
 tree PTA                           :   0.13 (  1%)   0.01 (  1%)   0.15 (  1%)     874 kB (  0%)
 tree PHI insertion                 :   0.00 (  0%)   0.00 (  0%)   0.00 (  0%)    3686 kB (  1%)
 tree SSA rewrite                   :   0.01 (  0%)   0.00 (  0%)   0.17 (  1%)    2947 kB (  1%)
 tree SSA other                     :   0.00 (  0%)   0.02 (  2%)   0.07 (  0%)      22 kB (  0%)
 tree SSA incremental               :   0.14 (  1%)   0.01 (  1%)   0.22 (  2%)    2466 kB (  1%)
 tree operand scan                  :   0.12 (  1%)   0.12 (  9%)   0.30 (  2%)    6937 kB (  3%)
 dominator optimization             :   0.34 (  3%)   0.02 (  2%)   0.39 (  3%)    5617 kB (  2%)
 backwards jump threading           :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)     483 kB (  0%)
 tree SRA                           :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      13 kB (  0%)
 isolate eroneous paths             :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 tree CCP                           :   0.12 (  1%)   0.01 (  1%)   0.06 (  0%)     855 kB (  0%)
 tree reassociation                 :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      12 kB (  0%)
 tree PRE                           :   0.29 (  2%)   0.03 (  2%)   0.30 (  2%)    5989 kB (  2%)
 tree FRE                           :   0.22 (  2%)   0.03 (  2%)   0.27 (  2%)    2260 kB (  1%)
 tree forward propagate             :   0.04 (  0%)   0.00 (  0%)   0.07 (  0%)    1101 kB (  0%)
 tree conservative DCE              :   0.07 (  1%)   0.01 (  1%)   0.09 (  1%)     108 kB (  0%)
 tree aggressive DCE                :   0.06 (  0%)   0.02 (  2%)   0.08 (  1%)     441 kB (  0%)
 PHI merge                          :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)     626 kB (  0%)
 tree loop invariant motion         :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)       1 kB (  0%)
 tree loop unswitching              :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      41 kB (  0%)
 complete unrolling                 :   0.03 (  0%)   0.00 (  0%)   0.04 (  0%)     560 kB (  0%)
 tree slp vectorization             :   0.09 (  1%)   0.00 (  0%)   0.04 (  0%)    4579 kB (  2%)
 tree iv optimization               :   0.02 (  0%)   0.00 (  0%)   0.00 (  0%)     927 kB (  0%)
 tree SSA uncprop                   :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 tree switch lowering               :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)     152 kB (  0%)
 gimple widening/fma detection      :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       3 kB (  0%)
 dominance frontiers                :   0.02 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 dominance computation              :   0.09 (  1%)   0.00 (  0%)   0.07 (  0%)       0 kB (  0%)
 control dependences                :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 out of ssa                         :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)      13 kB (  0%)
 expand vars                        :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)    1363 kB (  1%)
 expand                             :   0.07 (  1%)   0.00 (  0%)   0.08 (  1%)   13960 kB (  5%)
 post expand cleanups               :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)     380 kB (  0%)
 varconst                           :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      46 kB (  0%)
 forward prop                       :   0.04 (  0%)   0.01 (  1%)   0.08 (  1%)     753 kB (  0%)
 CSE                                :   0.19 (  2%)   0.00 (  0%)   0.13 (  1%)     312 kB (  0%)
 dead code elimination              :   0.04 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 dead store elim1                   :   0.08 (  1%)   0.00 (  0%)   0.04 (  0%)     778 kB (  0%)
 dead store elim2                   :   0.05 (  0%)   0.00 (  0%)   0.05 (  0%)    3239 kB (  1%)
 loop init                          :   0.06 (  0%)   0.00 (  0%)   0.05 (  0%)    2508 kB (  1%)
 loop invariant motion              :   0.01 (  0%)   0.00 (  0%)   0.02 (  0%)      18 kB (  0%)
 loop fini                          :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 CPROP                              :   0.13 (  1%)   0.00 (  0%)   0.15 (  1%)    2835 kB (  1%)
 PRE                                :   0.71 (  6%)   0.00 (  0%)   0.78 (  5%)     208 kB (  0%)
 code hoisting                      :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 CSE 2                              :   0.08 (  1%)   0.00 (  0%)   0.14 (  1%)     174 kB (  0%)
 branch prediction                  :   0.02 (  0%)   0.01 (  1%)   0.03 (  0%)     394 kB (  0%)
 combiner                           :   0.15 (  1%)   0.00 (  0%)   0.15 (  1%)    3166 kB (  1%)
 if-conversion                      :   0.01 (  0%)   0.01 (  1%)   0.01 (  0%)     312 kB (  0%)
 integrated RA                      :   0.39 (  3%)   0.01 (  1%)   0.45 (  3%)   10852 kB (  4%)
 LRA non-specific                   :   0.24 (  2%)   0.00 (  0%)   0.20 (  1%)    2615 kB (  1%)
 LRA virtuals elimination           :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)    2106 kB (  1%)
 LRA reload inheritance             :   0.02 (  0%)   0.00 (  0%)   0.05 (  0%)     276 kB (  0%)
 LRA create live ranges             :   0.28 (  2%)   0.01 (  1%)   0.28 (  2%)     266 kB (  0%)
 LRA hard reg assignment            :   0.04 (  0%)   0.00 (  0%)   0.05 (  0%)       0 kB (  0%)
 LRA rematerialization              :   0.04 (  0%)   0.00 (  0%)   0.05 (  0%)       0 kB (  0%)
 reload                             :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 reload CSE regs                    :   0.30 (  2%)   0.00 (  0%)   0.30 (  2%)    3568 kB (  1%)
 load CSE after reload              :   1.71 ( 14%)   0.01 (  1%)   1.76 ( 12%)      88 kB (  0%)
 ree                                :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      50 kB (  0%)
 thread pro- & epilogue             :   0.03 (  0%)   0.00 (  0%)   0.01 (  0%)     210 kB (  0%)
 if-conversion 2                    :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       6 kB (  0%)
 peephole 2                         :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)     643 kB (  0%)
 hard reg cprop                     :   0.03 (  0%)   0.00 (  0%)   0.05 (  0%)      46 kB (  0%)
 scheduling 2                       :   0.68 (  6%)   0.02 (  2%)   0.80 (  6%)     435 kB (  0%)
 reorder blocks                     :   0.01 (  0%)   0.00 (  0%)   0.03 (  0%)    1062 kB (  0%)
 shorten branches                   :   0.02 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 final                              :   0.12 (  1%)   0.01 (  1%)   0.05 (  0%)    2590 kB (  1%)
 variable output                    :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      23 kB (  0%)
 straight-line strength reduction   :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)      66 kB (  0%)
 store merging                      :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      25 kB (  0%)
 unaccounted optimizations          :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 rest of compilation                :   0.09 (  1%)   0.00 (  0%)   0.13 (  1%)    1090 kB (  0%)
 remove unused locals               :   0.05 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 address taken                      :   0.01 (  0%)   0.01 (  1%)   0.02 (  0%)       0 kB (  0%)
 repair loop structures             :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 TOTAL                              :  12.20          1.31         14.21         258152 kB

real	0m14.462s
user	0m12.424s
sys	0m1.370s

Not very helpful: it just says that phase opt and generate takes most of the time.
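
One way to make the report easier to read is to rank the individual timers by wall time. The following is a minimal Python sketch, not part of NEURON or GCC: the helper name top_passes and the regular expression are assumptions based on the row layout shown above, and the sample rows are copied from that report.

```python
import re

# Matches rows such as
# " PRE   :   0.71 (  6%)   0.00 (  0%)   0.78 (  5%)   208 kB (  0%)"
# The TOTAL row has no percentages, so it is skipped automatically.
ROW = re.compile(
    r"^\s*\|?(?P<name>.+?)\s*:\s*"
    r"(?P<usr>\d+\.\d+)\s*\(\s*\d+%\)\s*"
    r"(?P<sys>\d+\.\d+)\s*\(\s*\d+%\)\s*"
    r"(?P<wall>\d+\.\d+)\s*\(\s*\d+%\)"
)

def top_passes(report, n=5):
    """Return the n timers with the largest wall time as (name, seconds)."""
    rows = []
    for line in report.splitlines():
        m = ROW.match(line)
        if m is None:
            continue
        name = m.group("name").strip()
        if name.startswith("phase "):
            continue  # phase rows are summaries, not individual passes
        rows.append((float(m.group("wall")), name))
    rows.sort(reverse=True)
    return [(name, wall) for wall, name in rows[:n]]

# A few rows copied from the report above.
SAMPLE = """
 phase opt and generate             :  10.52 ( 86%)   0.59 ( 45%)  11.71 ( 82%)  164208 kB ( 64%)
 |name lookup                       :   0.25 (  2%)   0.11 (  8%)   0.35 (  2%)    2179 kB (  1%)
 PRE                                :   0.71 (  6%)   0.00 (  0%)   0.78 (  5%)     208 kB (  0%)
 load CSE after reload              :   1.71 ( 14%)   0.01 (  1%)   1.76 ( 12%)      88 kB (  0%)
 scheduling 2                       :   0.68 (  6%)   0.02 (  2%)   0.80 (  6%)     435 kB (  0%)
 TOTAL                              :  12.20          1.31         14.21         258152 kB
"""

for name, wall in top_passes(SAMPLE, 3):
    print(f"{wall:6.2f} s  {name}")
```

Applied to the full report above, the same ranking puts load CSE after reload (1.76 s), scheduling 2 (0.80 s) and PRE (0.78 s) first, so no single pass accounts for more than about 12% of the 14.21 s wall time.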

Dec 31 '19 05:12 pramodk

Just wondering: what would be the use case for O3?

ninja build time on my Ubuntu 22 machine:

# O3
real    1m51.433s
user    12m17.868s
sys     1m0.699s

vs:

# O2 
real    1m24.762s
user    11m22.291s
sys     0m58.137s
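
To put the two wall-clock figures side by side, here is a small Python sketch. The helper name seconds is an assumption made for illustration; the inputs are the two real values quoted above.

```python
import re

def seconds(stamp):
    """Convert a `time`-style value such as '1m51.433s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

o3 = seconds("1m51.433s")  # real time of the O3 build above
o2 = seconds("1m24.762s")  # real time of the O2 build above
extra = o3 - o2

print(f"O3 adds {extra:.1f} s ({100 * extra / o2:.0f}% over O2)")
# prints: O3 adds 26.7 s (31% over O2)
```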

Jan 18 '23 13:01 alexsavulescu

Is this still an issue? Or should we close it?

Using the defaults, NEURON compiles in maybe 1 minute on my computer.

Jul 01 '23 01:07 ramcdougal

I propose to drop NRN_RX3D_OPT_LEVEL given that we usually compile with -O2.

Jul 10 '23 07:07 alexsavulescu

Presently

set(NRN_RX3D_OPT_LEVEL_DEFAULT "0")

I don't object to setting it to 2 (or 3), but I'd like to see a comparison of build times for a parallel make between 0 and 3 before removing NRN_RX3D_OPT_LEVEL.

A fresh build (no external git downloads) in the sense of

rm -r -f build ; mkdir build; cd build; cmake .. -G Ninja ...
time ninja install

gives these times for the default level on my Ubuntu 22.04 desktop

real	3m34.182s
user	23m25.897s
sys	2m8.105s

and for cmake ... -DNRN_RX3D_OPT_LEVEL=3

real	5m31.861s
user	25m47.618s
sys	2m7.635s
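
The two timings can also be compared component by component. The Python sketch below does that; the helper names seconds and parse_time are assumptions made for illustration, and the inputs are the two time outputs quoted above.

```python
import re

def seconds(stamp):
    """Convert a `time`-style value such as '3m34.182s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

def parse_time(block):
    """Parse the real/user/sys lines printed by the shell `time` builtin."""
    pairs = (line.split() for line in block.strip().splitlines())
    return {key: seconds(value) for key, value in pairs}

default = parse_time("""
real 3m34.182s
user 23m25.897s
sys 2m8.105s
""")

o3 = parse_time("""
real 5m31.861s
user 25m47.618s
sys 2m7.635s
""")

# Report the absolute and relative change for each component.
for key in ("real", "user", "sys"):
    delta = o3[key] - default[key]
    print(f"{key:4s} {delta:+8.1f} s ({100 * delta / default[key]:+.0f}%)")
```

With these numbers, wall time grows by about 118 s (55%) while user CPU time grows by about 142 s (10%). One possible reading, which these measurements alone do not confirm, is that most of the extra work sits in a few long-running compile units that the parallel build cannot overlap with anything else.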

That extra two minutes is excruciating, but I guess the real lesson is that I need a new desktop.

Jul 10 '23 19:07 nrnhines