Performance regression (slowdown) in ALICE event generation
Check duplicate issues.
- [X] Checked for duplicates
Description
10-20fold slowdown of running a particular ALICE event generation process, where the generator is compiled just-in-time from a ROOT macro (see reproducer). Slowdown observed when switching from ROOT tag 6.28.04 to 6.28.08.
In particular, bisecting this slowdown shows that it is caused by this commit https://github.com/root-project/root/pull/11631/commits/4e4545fcd4dcb80af05cefd9747f7f585fbe75d3 (from #11631). Reverting this single commit brings back the original performance.
Reproducer
On lxplus7.cern.ch, run:
#!/usr/bin/env bash
# good software version
/cvmfs/alice.cern.ch/bin/alienv printenv O2sim::v20231207-1 > env.txt
# bad software version
#/cvmfs/alice.cern.ch/bin/alienv printenv O2sim::v20231208-1 > env.txt
source env.txt
time o2-sim-dpl-eventgen --nEvents 5 --generator external --configFile ${O2DPG_ROOT}/MC/config/PWGEM/ini/GeneratorEMCock
ROOT version
Everything having https://github.com/root-project/root/pull/11631/commits/4e4545fcd4dcb80af05cefd9747f7f585fbe75d3 included.
Installation method
build from source
Operating system
Linux
Additional context
No response
Attaching callgrind output files showing that the slowdown comes from:
(a) ROOT spends much more time/instructions (roughly a factor 9x) in libCore with ROOT6.30.01.
Notably the function TClassEdit::CleanType is now called much more often than before.
(b) We also spend more time in libc/libstdc++, but that is probably a consequence of more libCore activity.
(c) libCling increased by a factor of 2.
Hi @sawenzel . I encountered a problem reproducing:
$ lsb_release --all
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009
Codename: Core
$ /cvmfs/alice.cern.ch/bin/alienv printenv O2sim::v20231207-1
libgcc_s.so.1 must be installed for pthread_cancel to work
/cvmfs/alice.cern.ch/bin/alienv: line 291: 9641 Aborted (core dumped) tclsh > /dev/null 2>&1 <<EOF
EOF
init.c(556):ERROR:161: Cannot initialize TCL
test 0 = 1
$ /cvmfs/alice.cern.ch/bin/alienv printenv O2sim::v20231208-1
libgcc_s.so.1 must be installed for pthread_cancel to work
/cvmfs/alice.cern.ch/bin/alienv: line 291: 9795 Aborted (core dumped) tclsh > /dev/null 2>&1 <<EOF
EOF
init.c(556):ERROR:161: Cannot initialize TCL
Ok, using something like v20240108-1 works and uses a recent 6.30. I am starting with that one.
I spoke too soon. The workflow crashes.
Hi @dpiparo. There was a copy-paste error for the command to run which is now corrected. But I am puzzled why you see these Tcl errors (maybe @ktf could comment). I am literally only doing ssh lxplus7.cern.ch + executing the above lines inside /tmp/swenzel. There is also nothing peculiar in my .bashrc and I checked that it even works inside env -i bash --noprofile --norc ... so that remains puzzling. For the record, I was on lxplus713.cern.ch for these checks.
I confirm that with the new command the processing starts. Debugging for real now.
The PR speeds up by factors (5x-7x) the simple reproducer of the issue we now have thanks to @sawenzel :
// Inspired by Alice production code, extracted from the surrounding context
#include "TF1.h"
void Foo(int N) {
double am = 90;
double awidth = 1;
for (int i = 0; i < N; ++i) {
static TF1 rbw("rbw", "pow([1],2)*pow([0],2)/(pow(x*x-[0]*[0],2)+pow(x*x*[1]/[0],2))", -10, 10);
rbw.SetRange(am - 5 * awidth, am + 5 * awidth);
rbw.SetParameter(0, am);
rbw.SetParameter(1, awidth);
am = rbw.GetRandom();
if (am <-100000000) std::cout << am << std::endl; // to avoid compiler optimisations
}
}
int main()
{
Foo(1000);
return 0;
}
The code is faster because the fast code path is taken in the PluginHandler because the right type is used in the GSLIntegratorPlugin (unsigned long, via the size_t typedef), and not just a compatible one (unsigned int).
I confirm that the issue is fixed via #15816. Thank you!
great. I am backporting the change and closing the issue. Thanks a lot for your help in debugging this problem!