cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Infinite recursion seen in TFile::Open

Open Dr15Jones opened this issue 1 year ago • 14 comments

A production job at RAL exceeded 70GB of memory and was killed by the site. A subsequent change to stop 'lazy download' appears to have avoided the problem.

Looking at the log file from one failed job, the trace back shows all threads but 1 waiting, where the only active thread looks like

14-May-2024 07:55:48 UTC  Initiating request to open file root://xrootd.echo.stfc.ac.uk//store/mc/RunIISummer20ULPrePremix/Neutrino_E-10_gun/PREMIX/UL18_106X_upgrade2018_realistic_v11_L1v1-v2/230003/012B7EF9-8FD3-6340-9AFF-5DD7830C2C60.root


A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

...

#4  <signal handler called>
#5  0x000014c44b7faaf5 in Matchs(char const*, unsigned long, unsigned short const*, char const**) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#6  0x000014c44b84f83e in TRegexp::Index(TString const&, int*, int) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#7  0x000014c44b84f8dc in TString::Index(TRegexp const&, int) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#8  0x000014c44b8407a4 in TPluginHandler::CanHandle(char const*, char const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#9  0x000014c44b840886 in TPluginManager::FindHandler(char const*, char const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#10 0x000014c44bd27c2d in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#11 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#13 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
...
#12267140 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12267141 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12267142 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12267143 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so

note that the debugger just gave up after going 12 million deep in the stack (so the stack was likely larger).

This job was using CMSSW_10_6_17_patch1 and running

Dr15Jones avatar May 17 '24 14:05 Dr15Jones

cms-bot internal usage

cmsbuild avatar May 17 '24 14:05 cmsbuild

A new Issue was created by @Dr15Jones.

@makortel, @sextonkennedy, @smuzaffar, @antoniovilela, @Dr15Jones, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar May 17 '24 14:05 cmsbuild

The full log information can be found

https://cernbox.cern.ch/s/CQvjzkONaprof65 you should be able to find the WMTaskSpace dir in the vocms0253.cern.ch-283595-1 tarball.

Dr15Jones avatar May 17 '24 14:05 Dr15Jones

assign root

Dr15Jones avatar May 17 '24 14:05 Dr15Jones

assign core

makortel avatar May 17 '24 15:05 makortel

type root

makortel avatar May 17 '24 15:05 makortel

@pcanal Please take a look

makortel avatar May 17 '24 15:05 makortel

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar May 17 '24 15:05 cmsbuild

As a guess, I'd say the recursion is from this line

https://github.com/root-project/root/blob/88c9a5e76f7e118e7c8a3edc1eb9cf6128b5e0a0/io/io/src/TFile.cxx#L4289

Dr15Jones avatar May 17 '24 16:05 Dr15Jones

The file that appears to be being opened is a premix file. As far as I can tell, the Embedded root source (used by the mixing module) does NOT make use of the lazy-download system as that code does not call storage::StorageFactory::get()->stagein(...)

Dr15Jones avatar May 17 '24 16:05 Dr15Jones

The recursion could also happen here:

https://github.com/root-project/root/blob/88c9a5e76f7e118e7c8a3edc1eb9cf6128b5e0a0/io/io/src/TFile.cxx#L4294-L4299

as TNetXNGFile does have the ability to return a non blank GetNewUrl.

Dr15Jones avatar May 17 '24 16:05 Dr15Jones

Indeed both places are missing a test that the new attempt is different from the current ...

pcanal avatar May 17 '24 17:05 pcanal

See https://github.com/root-project/root/issues/15590 and https://github.com/root-project/root/pull/15591

pcanal avatar May 21 '24 19:05 pcanal

Do we want to backport the fix, and if so, how far back? 10_6_X (where the issue occurred) might be too far back, but maybe 14_0_X would make sense?

makortel avatar May 22 '24 17:05 makortel

+core

Backport down to ROOT 6.30 and CMSSW_14_1_X was done in https://github.com/cms-sw/cmsdist/pull/9222

makortel avatar Jun 03 '24 18:06 makortel

@cmsbuild, please close

makortel avatar Jun 03 '24 18:06 makortel

This issue is fully signed and ready to be closed.

cmsbuild avatar Jun 03 '24 18:06 cmsbuild