Infinite recursion seen in TFile::Open
A production job at RAL exceeded 70GB of memory and was killed by the site. A subsequent change to stop 'lazy download' appears to have avoided the problem.
Looking at the log file from one failed job, the trace back shows all threads but 1 waiting, where the only active thread looks like
14-May-2024 07:55:48 UTC Initiating request to open file root://xrootd.echo.stfc.ac.uk//store/mc/RunIISummer20ULPrePremix/Neutrino_E-10_gun/PREMIX/UL18_106X_upgrade2018_realistic_v11_L1v1-v2/230003/012B7EF9-8FD3-6340-9AFF-5DD7830C2C60.root
A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.
...
#4 <signal handler called>
#5 0x000014c44b7faaf5 in Matchs(char const*, unsigned long, unsigned short const*, char const**) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#6 0x000014c44b84f83e in TRegexp::Index(TString const&, int*, int) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#7 0x000014c44b84f8dc in TString::Index(TRegexp const&, int) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#8 0x000014c44b8407a4 in TPluginHandler::CanHandle(char const*, char const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#9 0x000014c44b840886 in TPluginManager::FindHandler(char const*, char const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#10 0x000014c44bd27c2d in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#11 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#13 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
...
#12267140 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12267141 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12267142 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12267143 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
note that the debugger just gave up after going 12 million deep in the stack (so the stack was likely larger).
This job was using CMSSW_10_6_17_patch1 and running
cms-bot internal usage
A new Issue was created by @Dr15Jones.
@makortel, @sextonkennedy, @smuzaffar, @antoniovilela, @Dr15Jones, @rappoccio can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
The full log information can be found
https://cernbox.cern.ch/s/CQvjzkONaprof65 you should be able to find the WMTaskSpace dir in the vocms0253.cern.ch-283595-1 tarball.
assign root
assign core
type root
@pcanal Please take a look
New categories assigned: core
@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
As a guess, I'd say the recursion is from this line
https://github.com/root-project/root/blob/88c9a5e76f7e118e7c8a3edc1eb9cf6128b5e0a0/io/io/src/TFile.cxx#L4289
The file that appears to be being opened is a premix file. As far as I can tell, the Embedded root source (used by the mixing module) does NOT make use of the lazy-download system as that code does not call storage::StorageFactory::get()->stagein(...)
The recursion could also happen here:
https://github.com/root-project/root/blob/88c9a5e76f7e118e7c8a3edc1eb9cf6128b5e0a0/io/io/src/TFile.cxx#L4294-L4299
as TNetXNGFile does have the ability to return a non blank GetNewUrl.
Indeed both places are missing a test that the new attempt is different from the current ...
See https://github.com/root-project/root/issues/15590 and https://github.com/root-project/root/pull/15591
Do we want to backport the fix, and if so, how far back? 10_6_X (where the issue occurred) might be too far back, but maybe 14_0_X would make sense?
+core
Backport down to ROOT 6.30 and CMSSW_14_1_X was done in https://github.com/cms-sw/cmsdist/pull/9222
@cmsbuild, please close
This issue is fully signed and ready to be closed.