Slow compile time with rx3d module and -O3
When we build rx3d, Cython-generated files like ctng.cpp take minutes to compile with -O3 (-O0 is quite fast).
The cmake build system provides the NRN_RX3D_OPT_LEVEL option. The default is
-DNRN_RX3D_OPT_LEVEL=0
and you can get the old autotools level with
-DNRN_RX3D_OPT_LEVEL=2
Instead of creating a new issue, I will use this one to discuss the problem and possible solutions.
I was trying to get an idea of what takes up the compilation time. Using -ftime-report gives the following:
→ time g++-9 -I/usr/local/opt/ruby/include -arch i386 -arch x86_64 -pipe -I/Users/kumbhar/workarena/repos/bbp/nn/share/lib/python/neuron/rxd/geometry3d -I. -I/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c ../../share/lib/python/neuron/rxd/geometry3d/surfaces.cpp -o surfaces.o -O0 -ftime-report -O3
g++-9: warning: x86_64 conflicts with i386 (arch flags ignored)
In file included from /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/ndarraytypes.h:1760,
from /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/ndarrayobject.h:17,
from /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/arrayobject.h:4,
from ../../share/lib/python/neuron/rxd/geometry3d/surfaces.cpp:598:
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
15 | #warning "Using deprecated NumPy API, disable it by " \
| ^~~~~~~
Time variable usr sys wall GGC
phase setup : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 1481 kB ( 1%)
phase parsing : 1.63 ( 13%) 0.70 ( 53%) 2.43 ( 17%) 89009 kB ( 34%)
phase lang. deferred : 0.04 ( 0%) 0.02 ( 2%) 0.05 ( 0%) 3439 kB ( 1%)
phase opt and generate : 10.52 ( 86%) 0.59 ( 45%) 11.71 ( 82%) 164208 kB ( 64%)
|name lookup : 0.25 ( 2%) 0.11 ( 8%) 0.35 ( 2%) 2179 kB ( 1%)
|overload resolution : 0.14 ( 1%) 0.03 ( 2%) 0.13 ( 1%) 3727 kB ( 1%)
dump files : 0.03 ( 0%) 0.01 ( 1%) 0.04 ( 0%) 0 kB ( 0%)
callgraph construction : 0.03 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 4394 kB ( 2%)
callgraph optimization : 0.05 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 8 kB ( 0%)
ipa function summary : 0.03 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 852 kB ( 0%)
ipa cp : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 824 kB ( 0%)
ipa inlining heuristics : 0.03 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 1361 kB ( 1%)
ipa function splitting : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 188 kB ( 0%)
ipa various optimizations : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 528 kB ( 0%)
ipa icf : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
ipa SRA : 0.11 ( 1%) 0.02 ( 2%) 0.13 ( 1%) 13899 kB ( 5%)
cfg construction : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 857 kB ( 0%)
cfg cleanup : 0.14 ( 1%) 0.00 ( 0%) 0.17 ( 1%) 1424 kB ( 1%)
trivially dead code : 0.04 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
df scan insns : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 2 kB ( 0%)
df multiple defs : 0.05 ( 0%) 0.01 ( 1%) 0.06 ( 0%) 0 kB ( 0%)
df reaching defs : 0.11 ( 1%) 0.02 ( 2%) 0.18 ( 1%) 0 kB ( 0%)
df live regs : 0.44 ( 4%) 0.00 ( 0%) 0.48 ( 3%) 0 kB ( 0%)
df live&initialized regs : 0.31 ( 3%) 0.00 ( 0%) 0.35 ( 2%) 0 kB ( 0%)
df must-initialized regs : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
df use-def / def-use chains : 0.06 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
df reg dead/unused notes : 0.20 ( 2%) 0.00 ( 0%) 0.11 ( 1%) 1745 kB ( 1%)
register information : 0.04 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
alias analysis : 0.07 ( 1%) 0.00 ( 0%) 0.08 ( 1%) 4152 kB ( 2%)
alias stmt walking : 0.24 ( 2%) 0.10 ( 8%) 0.42 ( 3%) 45 kB ( 0%)
register scan : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 137 kB ( 0%)
rebuild jump labels : 0.04 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
preprocessing : 0.34 ( 3%) 0.26 ( 20%) 0.63 ( 4%) 11877 kB ( 5%)
parser (global) : 0.40 ( 3%) 0.16 ( 12%) 0.55 ( 4%) 24151 kB ( 9%)
parser struct body : 0.11 ( 1%) 0.04 ( 3%) 0.17 ( 1%) 6392 kB ( 2%)
parser function body : 0.51 ( 4%) 0.15 ( 11%) 0.63 ( 4%) 28038 kB ( 11%)
parser inl. func. body : 0.10 ( 1%) 0.02 ( 2%) 0.15 ( 1%) 8510 kB ( 3%)
parser inl. meth. body : 0.06 ( 0%) 0.03 ( 2%) 0.09 ( 1%) 2785 kB ( 1%)
template instantiation : 0.09 ( 1%) 0.06 ( 5%) 0.16 ( 1%) 10506 kB ( 4%)
constant expression evaluation : 0.05 ( 0%) 0.00 ( 0%) 0.10 ( 1%) 75 kB ( 0%)
inline parameters : 0.03 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 988 kB ( 0%)
integration : 0.06 ( 0%) 0.01 ( 1%) 0.07 ( 0%) 7978 kB ( 3%)
tree gimplify : 0.06 ( 0%) 0.00 ( 0%) 0.09 ( 1%) 11135 kB ( 4%)
tree eh : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 726 kB ( 0%)
tree CFG construction : 0.01 ( 0%) 0.01 ( 1%) 0.02 ( 0%) 3896 kB ( 2%)
tree CFG cleanup : 0.09 ( 1%) 0.00 ( 0%) 0.19 ( 1%) 202 kB ( 0%)
tree tail merge : 0.05 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 942 kB ( 0%)
tree VRP : 0.25 ( 2%) 0.01 ( 1%) 0.29 ( 2%) 5185 kB ( 2%)
tree Early VRP : 0.06 ( 0%) 0.02 ( 2%) 0.05 ( 0%) 2473 kB ( 1%)
tree copy propagation : 0.09 ( 1%) 0.00 ( 0%) 0.03 ( 0%) 465 kB ( 0%)
tree PTA : 0.13 ( 1%) 0.01 ( 1%) 0.15 ( 1%) 874 kB ( 0%)
tree PHI insertion : 0.00 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 3686 kB ( 1%)
tree SSA rewrite : 0.01 ( 0%) 0.00 ( 0%) 0.17 ( 1%) 2947 kB ( 1%)
tree SSA other : 0.00 ( 0%) 0.02 ( 2%) 0.07 ( 0%) 22 kB ( 0%)
tree SSA incremental : 0.14 ( 1%) 0.01 ( 1%) 0.22 ( 2%) 2466 kB ( 1%)
tree operand scan : 0.12 ( 1%) 0.12 ( 9%) 0.30 ( 2%) 6937 kB ( 3%)
dominator optimization : 0.34 ( 3%) 0.02 ( 2%) 0.39 ( 3%) 5617 kB ( 2%)
backwards jump threading : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 483 kB ( 0%)
tree SRA : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 13 kB ( 0%)
isolate eroneous paths : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
tree CCP : 0.12 ( 1%) 0.01 ( 1%) 0.06 ( 0%) 855 kB ( 0%)
tree reassociation : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 12 kB ( 0%)
tree PRE : 0.29 ( 2%) 0.03 ( 2%) 0.30 ( 2%) 5989 kB ( 2%)
tree FRE : 0.22 ( 2%) 0.03 ( 2%) 0.27 ( 2%) 2260 kB ( 1%)
tree forward propagate : 0.04 ( 0%) 0.00 ( 0%) 0.07 ( 0%) 1101 kB ( 0%)
tree conservative DCE : 0.07 ( 1%) 0.01 ( 1%) 0.09 ( 1%) 108 kB ( 0%)
tree aggressive DCE : 0.06 ( 0%) 0.02 ( 2%) 0.08 ( 1%) 441 kB ( 0%)
PHI merge : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 626 kB ( 0%)
tree loop invariant motion : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 1 kB ( 0%)
tree loop unswitching : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 41 kB ( 0%)
complete unrolling : 0.03 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 560 kB ( 0%)
tree slp vectorization : 0.09 ( 1%) 0.00 ( 0%) 0.04 ( 0%) 4579 kB ( 2%)
tree iv optimization : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 927 kB ( 0%)
tree SSA uncprop : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
tree switch lowering : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 152 kB ( 0%)
gimple widening/fma detection : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 3 kB ( 0%)
dominance frontiers : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
dominance computation : 0.09 ( 1%) 0.00 ( 0%) 0.07 ( 0%) 0 kB ( 0%)
control dependences : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
out of ssa : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 13 kB ( 0%)
expand vars : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 1363 kB ( 1%)
expand : 0.07 ( 1%) 0.00 ( 0%) 0.08 ( 1%) 13960 kB ( 5%)
post expand cleanups : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 380 kB ( 0%)
varconst : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 46 kB ( 0%)
forward prop : 0.04 ( 0%) 0.01 ( 1%) 0.08 ( 1%) 753 kB ( 0%)
CSE : 0.19 ( 2%) 0.00 ( 0%) 0.13 ( 1%) 312 kB ( 0%)
dead code elimination : 0.04 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
dead store elim1 : 0.08 ( 1%) 0.00 ( 0%) 0.04 ( 0%) 778 kB ( 0%)
dead store elim2 : 0.05 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 3239 kB ( 1%)
loop init : 0.06 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 2508 kB ( 1%)
loop invariant motion : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 18 kB ( 0%)
loop fini : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
CPROP : 0.13 ( 1%) 0.00 ( 0%) 0.15 ( 1%) 2835 kB ( 1%)
PRE : 0.71 ( 6%) 0.00 ( 0%) 0.78 ( 5%) 208 kB ( 0%)
code hoisting : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
CSE 2 : 0.08 ( 1%) 0.00 ( 0%) 0.14 ( 1%) 174 kB ( 0%)
branch prediction : 0.02 ( 0%) 0.01 ( 1%) 0.03 ( 0%) 394 kB ( 0%)
combiner : 0.15 ( 1%) 0.00 ( 0%) 0.15 ( 1%) 3166 kB ( 1%)
if-conversion : 0.01 ( 0%) 0.01 ( 1%) 0.01 ( 0%) 312 kB ( 0%)
integrated RA : 0.39 ( 3%) 0.01 ( 1%) 0.45 ( 3%) 10852 kB ( 4%)
LRA non-specific : 0.24 ( 2%) 0.00 ( 0%) 0.20 ( 1%) 2615 kB ( 1%)
LRA virtuals elimination : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 2106 kB ( 1%)
LRA reload inheritance : 0.02 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 276 kB ( 0%)
LRA create live ranges : 0.28 ( 2%) 0.01 ( 1%) 0.28 ( 2%) 266 kB ( 0%)
LRA hard reg assignment : 0.04 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 0 kB ( 0%)
LRA rematerialization : 0.04 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 0 kB ( 0%)
reload : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
reload CSE regs : 0.30 ( 2%) 0.00 ( 0%) 0.30 ( 2%) 3568 kB ( 1%)
load CSE after reload : 1.71 ( 14%) 0.01 ( 1%) 1.76 ( 12%) 88 kB ( 0%)
ree : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 50 kB ( 0%)
thread pro- & epilogue : 0.03 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 210 kB ( 0%)
if-conversion 2 : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 6 kB ( 0%)
peephole 2 : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 643 kB ( 0%)
hard reg cprop : 0.03 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 46 kB ( 0%)
scheduling 2 : 0.68 ( 6%) 0.02 ( 2%) 0.80 ( 6%) 435 kB ( 0%)
reorder blocks : 0.01 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 1062 kB ( 0%)
shorten branches : 0.02 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
final : 0.12 ( 1%) 0.01 ( 1%) 0.05 ( 0%) 2590 kB ( 1%)
variable output : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 23 kB ( 0%)
straight-line strength reduction : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 66 kB ( 0%)
store merging : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 25 kB ( 0%)
unaccounted optimizations : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
rest of compilation : 0.09 ( 1%) 0.00 ( 0%) 0.13 ( 1%) 1090 kB ( 0%)
remove unused locals : 0.05 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
address taken : 0.01 ( 0%) 0.01 ( 1%) 0.02 ( 0%) 0 kB ( 0%)
repair loop structures : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
TOTAL : 12.20 1.31 14.21 258152 kB
real 0m14.462s
user 0m12.424s
sys 0m1.370s
Not very helpful; it just says that phase opt and generate takes most of the time.
Just wondering what the use case for -O3 would be?
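To get a bit more out of the report than the phase totals, the individual pass rows can be ranked by user time. The following is a minimal sketch, not part of the NEURON build; the function names and the regular expression are my own, and it assumes the row layout shown in the g++-9 output above (name, then usr/sys/wall each followed by a percentage). On the report above, the largest non-phase entries are load CSE after reload (1.71 s), PRE (0.71 s) and scheduling 2 (0.68 s).

```python
import re

# One row of -ftime-report output, as printed by g++-9 in the report above:
#   name : usr ( p%) sys ( p%) wall ( p%) mem kB ( p%)
# The TOTAL row has no percentages, so it does not match and is skipped.
ROW = re.compile(
    r"^\s*\|?\s*(?P<name>.+?)\s*:\s*"
    r"(?P<usr>\d+\.\d+)\s*\(\s*\d+%\)\s*"
    r"(?P<sys>\d+\.\d+)\s*\(\s*\d+%\)\s*"
    r"(?P<wall>\d+\.\d+)\s*\(\s*\d+%\)"
)


def parse_time_report(text):
    """Return (name, usr_seconds, wall_seconds) for every per-pass row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        name = match.group("name")
        # "phase ..." rows are aggregates of the rows below them.
        if name.startswith("phase "):
            continue
        rows.append((name, float(match.group("usr")), float(match.group("wall"))))
    return rows


def slowest_passes(text, count=5):
    """Return the `count` passes with the largest user time, slowest first."""
    return sorted(parse_time_report(text), key=lambda row: row[1], reverse=True)[:count]


# A few rows copied from the report above.
SAMPLE = """\
phase opt and generate : 10.52 ( 86%) 0.59 ( 45%) 11.71 ( 82%) 164208 kB ( 64%)
|name lookup : 0.25 ( 2%) 0.11 ( 8%) 0.35 ( 2%) 2179 kB ( 1%)
parser function body : 0.51 ( 4%) 0.15 ( 11%) 0.63 ( 4%) 28038 kB ( 11%)
PRE : 0.71 ( 6%) 0.00 ( 0%) 0.78 ( 5%) 208 kB ( 0%)
load CSE after reload : 1.71 ( 14%) 0.01 ( 1%) 1.76 ( 12%) 88 kB ( 0%)
scheduling 2 : 0.68 ( 6%) 0.02 ( 2%) 0.80 ( 6%) 435 kB ( 0%)
TOTAL : 12.20 1.31 14.21 258152 kB
"""

for name, usr, wall in slowest_passes(SAMPLE, count=3):
    print(f"{name:<24} usr={usr:.2f}s wall={wall:.2f}s")
```

Saving the compiler's stderr to a file and passing its contents to slowest_passes would do the same for ctng.cpp or any other generated file.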
ninja build time on my Ubuntu22 machine:
# O3
real 1m51.433s
user 12m17.868s
sys 1m0.699s
vs:
# O2
real 1m24.762s
user 11m22.291s
sys 0m58.137s
Is this still an issue? Or should we close it?
Using the defaults, NEURON compiles in maybe 1 minute on my computer.
I propose to drop NRN_RX3D_OPT_LEVEL given that we usually compile with -O2.
Presently
set(NRN_RX3D_OPT_LEVEL_DEFAULT "0")
I don't object to setting it to 2 (or 3), but I'd like to see a comparison of build times for a parallel make between 0 and 3 before removing NRN_RX3D_OPT_LEVEL.
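A comparison like that could be scripted roughly as follows. This is only a sketch under assumptions: the helper names, the build directory layout and the extra_args parameter are hypothetical, it relies on cmake and ninja being on PATH, and any further configure options a real build needs would have to be passed in by the caller.

```python
import shutil
import subprocess
import time
from pathlib import Path


def configure_command(source_dir, opt_level, extra_args=()):
    """Build the cmake command line for one rx3d optimization level."""
    return [
        "cmake",
        str(source_dir),
        "-G",
        "Ninja",
        f"-DNRN_RX3D_OPT_LEVEL={opt_level}",
        *extra_args,
    ]


def time_fresh_build(source_dir, build_dir, opt_level, extra_args=()):
    """Wipe build_dir, configure, and return the wall time of `ninja install`.

    Note: this deletes build_dir, like `rm -r -f build` would.
    """
    source_dir = Path(source_dir).resolve()
    build_dir = Path(build_dir)
    shutil.rmtree(build_dir, ignore_errors=True)
    build_dir.mkdir(parents=True)
    subprocess.run(
        configure_command(source_dir, opt_level, extra_args),
        cwd=build_dir,
        check=True,
    )
    start = time.perf_counter()
    subprocess.run(["ninja", "install"], cwd=build_dir, check=True)
    return time.perf_counter() - start


def compare_levels(source_dir, levels=(0, 3)):
    """Time one fresh build per level, each in its own build directory."""
    return {
        level: time_fresh_build(source_dir, f"build-opt{level}", level)
        for level in levels
    }
```

Calling compare_levels(".") from the top of a source checkout would return a dict mapping each level to its wall-clock build time in seconds; only wall time is measured here, not the user and sys times that the shell's time reports.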
A fresh build (no external git downloads) in the sense of
rm -r -f build ; mkdir build; cd build; cmake .. -G Ninja ...
time ninja install
gives these timings for the default level on my Ubuntu 22.04 desktop
real 3m34.182s
user 23m25.897s
sys 2m8.105s
and for cmake ... -DNRN_RX3D_OPT_LEVEL=3
real 5m31.861s
user 25m47.618s
sys 2m7.635s
That extra two minutes is excruciating, but I guess the real lesson is that I need a new desktop.
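For reference, the differences between the timings posted in this thread can be worked out with a few lines of Python. This is just arithmetic on the numbers quoted above; the helper names are my own, and the labels mirror how each poster described their runs. The two pairs come from different machines and build setups, so they should not be compared with each other.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '1m51.433s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def compare(base, opt3):
    """Return extra wall seconds, wall-time ratio and extra user seconds."""
    base_real, base_user = (to_seconds(value) for value in base)
    opt3_real, opt3_user = (to_seconds(value) for value in opt3)
    return {
        "extra_real_s": opt3_real - base_real,
        "real_ratio": opt3_real / base_real,
        "extra_user_s": opt3_user - base_user,
    }


# (real, user) pairs copied from the posts above.
RUNS = {
    "ninja build, O2 vs O3": (("1m24.762s", "11m22.291s"), ("1m51.433s", "12m17.868s")),
    "ninja install, default vs 3": (("3m34.182s", "23m25.897s"), ("5m31.861s", "25m47.618s")),
}

for label, (base, opt3) in RUNS.items():
    result = compare(base, opt3)
    print(
        f"{label}: +{result['extra_real_s']:.1f}s wall "
        f"({result['real_ratio']:.2f}x), +{result['extra_user_s']:.1f}s user"
    )
```

In the second pair, wall time grows by about 118 s while user time grows by only about 142 s. That would be consistent with most of the extra work sitting in a few long-running compile jobs that the parallel build cannot overlap, though the thread does not confirm this.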