ROOT 6.36.04 does a factor 2 more stats when loading a library than 6.32.08
Check duplicate issues.
- [ ] Checked for duplicates
Description
We see a large number of (unneeded?) stats when trying to load a library. This results in greatly reduced performance on startup to the point it prevents us to start our datataking workflows.
In particular doing:
strace -f root -q -e 'gSystem->Load("libO2DataFormatFV0s.so")' 2>&1 | grep BUILD
results in 15843 matches while with the old release we have 6049. Notice in both cases none of the files actually exists. So this is really ROOT looking up in some unrelocated path, which itself is wrong.
Moreover it looks like the same stat is done multiple times. In both logs I have patterns like:
→ cat stats-new.txt | head -n 100
newfstatat(AT_FDCWD, "/home/alibuild/sw/BUILD/<build-hash>/ROOT/lib/libc.pcm", 0x7ffd02cf75e0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include", 0x7ffd02cf5fc0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include", 0x7ffd02cf6430, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stddef.h", 0x7ffd02cf61c0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stddef.h", 0x7ffd02cf6630, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/__stddef_size_t.h", 0x7ffd02cf6130, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/__stddef_size_t.h", 0x7ffd02cf65a0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/__stddef_size_t.h", 0x7ffd02cf61c0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/__stddef_size_t.h", 0x7ffd02cf6630, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/__stddef_null.h", 0x7ffd02cf6130, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/__stddef_null.h", 0x7ffd02cf65a0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/__stddef_null.h", 0x7ffd02cf61c0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/__stddef_null.h", 0x7ffd02cf6630, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stdint.h", 0x7ffd02cf6130, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stdint.h", 0x7ffd02cf65a0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stdint.h", 0x7ffd02cf61c0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stdint.h", 0x7ffd02cf6630, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stdarg.h", 0x7ffd02cf6130, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stdarg.h", 0x7ffd02cf65a0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stdarg.h", 0x7ffd02cf61c0, 0) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/stdarg.h", 0x7ffd02cf6630, 0) = -1 ENOENT (No such file or directory)
which show repetitive, failing stats for the same file.
Reproducer
See above
ROOT version
6.36.04-alice2
Installation method
aliBuild
Operating system
Linux
Additional context
No response
Here another place where we see stats https://github.com/root-project/root/issues/14296 .
@ktf do you see the paths encoded in the library, i.e. can you show the path data/sw/BUILD/<build-hash>/ROOT/etc/cling/lib/clang/18/include/ in the library libO2DataFormatFV0s.so with the strings command?
Actually, I do:
[pdp@alio2-cr1-qc06 ~]$ strings /opt/o2/lib/libO2DataFormatsFV0.so | grep BUILD
/home/alibuild/sw/BUILD/7dfefc80e207f4fc7884ab4ef983296ca64c87e6/O2/DataFormats/Detectors/FIT/FV0
/home/alibuild/sw/BUILD/7dfefc80e207f4fc7884ab4ef983296ca64c87e6/O2/DataFormats/Detectors/FIT/FV0/src
/data/sw/BUILD/7dfefc80e207f4fc7884ab4ef983296ca64c87e6/O2/DataFormats/Detectors/FIT/FV0/
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init2039
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init1584
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init1909
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init86
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init1974
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init129
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init193
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init1519
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init225
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init257
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init289
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init161
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init1649
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init321
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init353
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init385
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init1714
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init1779
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init423
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init466
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init498
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init1844
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0ReadRules_G__O2DataFormatsFV02144
dIhomedIalibuilddIswdIBUILDdI7dfefc80e207f4fc7884ab4ef983296ca64c87e6dIO2dIDataFormatsdIDetectorsdIFITdIFV0dIG__O2DataFormatsFV0Init2104
/home/alibuild/sw/BUILD/7dfefc80e207f4fc7884ab4ef983296ca64c87e6/O2
/home/alibuild/sw/BUILD/7dfefc80e207f4fc7884ab4ef983296ca64c87e6/O2/DataFormats/Detectors/FIT/FV0/G__O2DataFormatsFV0.cxx
/home/alibuild/sw/BUILD/7dfefc80e207f4fc7884ab4ef983296ca64c87e6/O2/DataFormats/Detectors/FIT/FV0
Also related: https://github.com/root-project/root/issues/12462
Not the one you asked for, though... I assume that comes from ROOT?
Indeed it looks like https://github.com/root-project/root/issues/14296 is the actual culprit, the new ROOT merely looks for way more things.
That said, why checking so many times?
I managed to make in ALICE-agnostic. It's enough to do:
strace -f root -q -e 'gSystem->Load("libCore.so")' 2>&1 | grep newfstatat | wc -l
to see that in one case you have 10K stats (which is already quite incredible) while the new vesion has over 20K.
Hi,
I just want to stress the fact that it is a real issue for us (ALICE DAQ). It now takes so long to start processes that we time out. We can increase the timeout from 30 seconds to 45 seconds, but this is becoming very high. Is there a way at least to avoid the duplicate calls to stat ?
@ktf is it possible to have the profile reported in the internal O2 issue tracker also in here for easier consultation?
@silverweed yes, sure.
@ktf is it possible to have the profile reported in the internal O2 issue tracker also in here for easier consultation?
Comment by Piotr Konopka reposted from the O2 internal issue:
technical notes for those interested
To confirm regression, I compared startup time of almost the same TPC QC workflow on production (FLP suite 1.72.0, env ID 2yb4GSsdS3s, 48 CPUs on QC node) and staging (FLP suite 1.73.0, env ID 2yb4GcBGw66, 40 CPUs on QC node). With 1.72.0, startup until DEPLOYED takes 6s, while with 1.73.0 it takes 16s. Such a big difference could not be attributed to the difference in CPUs. The results are consistent across serveral tries with margin of ~1s.
I can also see the performance regression if I run the TPC QC workflow generation with --dump (1.0s vs. 1.9s):
time bash -c "source /etc/profile.d/o2.sh && export DPL_RAWPROXY_OVERRIDE_ORBITRESET=1547590800000 && o2-qc --config 'apricot://alio2-cr1-hv-mvs00.cern.ch:32188/o2/components/qc/ANY/any/tpc-full-qcmn?run_type=SYNTHETIC&beam_type=pp&ctp_readout_enabled=false' --remote --shm-metadata-msg-size 100000 -b --dump"
staging:
real 0m1.927s
user 0m1.716s
sys 0m0.206s
prod:
real 0m1.032s
user 0m0.923s
sys 0m0.105s
I confirmed that libO2Framework.so and libQualityControl.so had the same compiler flags with:
readelf --debug-dump=info /opt/o2/lib/libO2Framework.so | grep -i "producer"
I also checked performance with perf. Node-wise (perf record -a -g), we can see an increase of time spent in strcmp. By running perf record --call-graph dwarf against the workflow generation, I see an unexpected appearance of libCling and exception catching. By expanding the call chain, I can see strcmp being called there:
- 28.78% 0.00% o2-qc ld-linux-x86-64.so.2 [.] __GI__dl_catch_exception (inlined)
- __GI__dl_catch_exception (inlined)
+ 24.67% dlopen_doit
+ 2.98% openaux
+ 0.82% dl_open_worker
+ 28.28% 0.12% o2-qc ld-linux-x86-64.so.2 [.] _dl_map_object_deps
+ 28.23% 3.51% o2-qc ld-linux-x86-64.so.2 [.] _dl_map_object
+ 27.99% 0.04% o2-qc ld-linux-x86-64.so.2 [.] openaux
+ 26.94% 0.00% o2-qc o2-qc [.] _Z11mainNoCatchiPPc
- 26.89% 0.02% o2-qc libCling.so.6.36.04 [.] _ZN6TCling14RegisterModuleEPKcPS1_S2_S1_S1_PFvvERKSt6vectorISt4pairINSt7__cxx1▒
- 26.87% _ZN6TCling14RegisterModuleEPKcPS1_S2_S1_S1_PFvvERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiESaISD_EES2_bb
- 22.55% dlopen@GLIBC_2.2.5
- _dlerror_run
- 22.52% _dl_catch_error
__GI__dl_catch_exception (inlined)
dlopen_doit
_dl_open
__GI__dl_catch_exception (inlined)
dl_open_worker
__GI__dl_catch_exception (inlined)
- dl_open_worker_begin
- 22.46% _dl_map_object_deps
- 22.24% __GI__dl_catch_exception (inlined)
- 22.18% openaux
- _dl_map_object
11.81% strcmp
- 7.75% _dl_name_match_p
strcmp
+ 3.24% _ZN6TCling7LoadPCMENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
+ 26.78% 0.00% o2-qc libO2Framework.so [.] _Z8callMainiPPcPFiiS0_E
+ 26.72% 0.00% o2-qc o2-qc [.] main
For comparison, these are the results for the old sw:
+ 48.30% 0.00% o2-qc o2-qc [.] _Z11mainNoCatchiPPc
+ 47.92% 0.00% o2-qc libO2Framework.so [.] _Z8callMainiPPcPFiiS0_E
+ 47.75% 0.00% o2-qc o2-qc [.] main
+ 47.73% 0.00% o2-qc libc.so.6 [.] __libc_start_call_main
+ 46.95% 0.00% o2-qc libc.so.6 [.] __libc_start_main@@GLIBC_2.34
+ 46.86% 0.00% o2-qc o2-qc [.] _start
+ 36.09% 8.26% o2-qc o2-qc [.] _ZN5boost11multi_index21multi_index_containerISt4pairIKNSt7__cxx1112basic_string▒
+ 35.59% 0.00% o2-qc o2-qc [.] _ZN5boost11multi_index6detail8copy_mapINS1_20sequenced_index_nodeINS1_18ordered_▒
+ 35.59% 0.00% o2-qc o2-qc [.] _ZN5boost11multi_index6detail8copy_mapINS1_20sequenced_index_nodeINS1_18ordered_▒
+ 35.45% 0.00% o2-qc o2-qc [.] _ZNSt16allocator_traitsISaIN5boost11multi_index6detail20sequenced_index_nodeINS2▒
+ 35.45% 0.00% o2-qc o2-qc [.] _ZSt12construct_atISt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE▒
+ 35.45% 0.00% o2-qc o2-qc [.] _ZNSt4pairIKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN5boost13propert▒
+ 35.38% 0.00% o2-qc o2-qc [.] _ZN5boost13property_tree11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traits▒
+ 35.36% 0.00% o2-qc o2-qc [.] _ZN5boost11multi_index21multi_index_containerISt4pairIKNSt7__cxx1112basic_string▒
+ 34.34% 0.00% o2-qc libO2QualityControl.so [.] _ZNSt17_Function_handlerIFSt8functionIFvRN2o29framework17ProcessingContextEEERNS▒
+ 34.34% 0.00% o2-qc libO2QualityControl.so [.] _ZNSt14_Function_base13_Base_managerIZN2o29framework13adaptFromTaskINS1_15qualit▒
+ 31.12% 0.00% o2-qc o2-qc [.] _ZN5boost13property_tree11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traits▒
+ 28.63% 0.00% o2-qc libO2QualityControl.so [.] _ZN2o215quality_control14postprocessing26PostProcessingRunnerConfigC2ERKS2_
+ 27.42% 0.75% o2-qc libstdc++.so.6.0.33 [.] _Znwm
+ 26.90% 8.12% o2-qc libc.so.6 [.] malloc
Comparing the number of syscalls, I can see much more activity coming from ROOT, e.g. file status calls (newfstatat):
le or directory)
3117358 newfstatat(AT_FDCWD, "/data/sw/BUILD/243fcf32f76320a685aba110cc33756f444595aa/ROOT/lib/Tree.pcm", 0x7ffe1694b1b0, 0) = -1 ENOENT (No such file o
r directory)
3117358 newfstatat(AT_FDCWD, "/data/sw/BUILD/243fcf32f76320a685aba110cc33756f444595aa/ROOT/lib/Gui.pcm", 0x7ffe1694b1b0, 0) = -1 ENOENT (No such file or
directory)
3117358 newfstatat(AT_FDCWD, "/data/sw/BUILD/243fcf32f76320a685aba110cc33756f444595aa/ROOT/lib/Gpad.pcm", 0x7ffe1694b1b0, 0) = -1 ENOENT (No such file o
r directory)
3117358 newfstatat(AT_FDCWD, "/data/sw/BUILD/243fcf32f76320a685aba110cc33756f444595aa/ROOT/lib/Graf.pcm", 0x7ffe1694b1b0, 0) = -1 ENOENT (No such file o
r directory)
3117358 newfstatat(AT_FDCWD, "/data/sw/BUILD/243fcf32f76320a685aba110cc33756f444595aa/ROOT/lib/Hist.pcm", 0x7ffe1694b1b0, 0) = -1 ENOENT (No such file o
r directory)
The total strace output doubled (56021 lines vs. 23574 before).
After the ALICE-ROOT joint investigation which was carried out, should this item be kept open or is the issue understood, i.e. no particular new behaviour of 6.36 is causing excessive stats?
Yes, for the stats the actual issue seems to be https://github.com/root-project/root/issues/14296, which has been reported since a while and it's not 6.36 specific. The number remains extremely high. It would be nice it had high priority on the PoW.
Thanks @ktf , noted.