root icon indicating copy to clipboard operation
root copied to clipboard

Performance regression (slowdown) in ALICE event generation

Open sawenzel opened this issue 1 year ago • 1 comments

Check duplicate issues.

  • [X] Checked for duplicates

Description

10-20fold slowdown of running a particular ALICE event generation process, where the generator is compiled just-in-time from a ROOT macro (see reproducer). Slowdown observed when switching from ROOT tag 6.28.04 to 6.28.08.

In particular, bisecting this slowdown shows that it is caused by this commit https://github.com/root-project/root/pull/11631/commits/4e4545fcd4dcb80af05cefd9747f7f585fbe75d3 (from #11631). Reverting this single commit brings back the original performance.

Reproducer

On lxplus7.cern.ch, run:

#!/usr/bin/env bash

# good software version
/cvmfs/alice.cern.ch/bin/alienv printenv O2sim::v20231207-1 > env.txt

# bad software version
#/cvmfs/alice.cern.ch/bin/alienv printenv O2sim::v20231208-1 > env.txt
source env.txt

time o2-sim-dpl-eventgen --nEvents 5 --generator external --configFile ${O2DPG_ROOT}/MC/config/PWGEM/ini/GeneratorEMCock

ROOT version

Everything having https://github.com/root-project/root/pull/11631/commits/4e4545fcd4dcb80af05cefd9747f7f585fbe75d3 included.

Installation method

build from source

Operating system

Linux

Additional context

No response

sawenzel avatar May 21 '24 09:05 sawenzel

callgrind_generator.zip

Attaching callgrind output files showing that the slowdown comes from:

(a) ROOT spends much more time/instructions (roughly a factor 9x) in libCore with ROOT6.30.01.
Notably the function TClassEdit::CleanType is now called much more often than before.

(b) We also spend more time in libc/libstdc++, but that is probably a consequence of more libCore activity.

(c) libCling increased by a factor of 2.

sawenzel avatar May 21 '24 09:05 sawenzel

Hi @sawenzel . I encountered a problem reproducing:

$ lsb_release --all
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.9.2009 (Core) 
Release:	7.9.2009
Codename:	Core

$ /cvmfs/alice.cern.ch/bin/alienv printenv O2sim::v20231207-1 
libgcc_s.so.1 must be installed for pthread_cancel to work
/cvmfs/alice.cern.ch/bin/alienv: line 291:  9641 Aborted                 (core dumped) tclsh  > /dev/null 2>&1 <<EOF
EOF

init.c(556):ERROR:161: Cannot initialize TCL
 test 0 = 1

$ /cvmfs/alice.cern.ch/bin/alienv printenv O2sim::v20231208-1
libgcc_s.so.1 must be installed for pthread_cancel to work
/cvmfs/alice.cern.ch/bin/alienv: line 291:  9795 Aborted                 (core dumped) tclsh  > /dev/null 2>&1 <<EOF
EOF

init.c(556):ERROR:161: Cannot initialize TCL

dpiparo avatar Jun 09 '24 11:06 dpiparo

Ok, using something like v20240108-1 works and uses a recent 6.30. I am starting with that one.

dpiparo avatar Jun 09 '24 12:06 dpiparo

I spoke too soon. The workflow crashes.

dpiparo avatar Jun 09 '24 12:06 dpiparo

Hi @dpiparo. There was a copy-paste error for the command to run which is now corrected. But I am puzzled why you see these Tcl errors (maybe @ktf could comment). I am literally only doing ssh lxplus7.cern.ch + executing the above lines inside /tmp/swenzel. There is also nothing peculiar in my .bashrc and I checked that it even works inside env -i bash --noprofile --norc ... so that remains puzzling. For the record, I was on lxplus713.cern.ch for these checks.

sawenzel avatar Jun 10 '24 08:06 sawenzel

I confirm that with the new command the processing starts. Debugging for real now.

dpiparo avatar Jun 11 '24 06:06 dpiparo

The PR speeds up by factors (5x-7x) the simple reproducer of the issue we now have thanks to @sawenzel :

// Inspired by Alice production code, extracted from the surrounding context
#include "TF1.h"

void Foo(int N) {
   double am  = 90;
   double awidth = 1;
   for (int i = 0; i < N; ++i) {
      static TF1 rbw("rbw", "pow([1],2)*pow([0],2)/(pow(x*x-[0]*[0],2)+pow(x*x*[1]/[0],2))", -10, 10);
      rbw.SetRange(am - 5 * awidth, am + 5 * awidth);
      rbw.SetParameter(0, am);
      rbw.SetParameter(1, awidth);
      am = rbw.GetRandom();
      if (am <-100000000) std::cout << am << std::endl; // to avoid compiler optimisations
   }
}

int main()
{
   Foo(1000);
   return 0;
}

The code is faster because the fast code path is taken in the PluginHandler because the right type is used in the GSLIntegratorPlugin (unsigned long, via the size_t typedef), and not just a compatible one (unsigned int).

dpiparo avatar Jun 11 '24 18:06 dpiparo

Profiles

dpiparo avatar Jun 12 '24 07:06 dpiparo

I confirm that the issue is fixed via #15816. Thank you!

sawenzel avatar Jun 12 '24 11:06 sawenzel

great. I am backporting the change and closing the issue. Thanks a lot for your help in debugging this problem!

dpiparo avatar Jun 12 '24 12:06 dpiparo