cmssw
Problem overwriting/unlink simulation output file
I'm running the PPS Full Simulation with a particle gun, but when I run it for the second time I get the following error:
----- Begin Fatal Exception 11-Mar-2024 15:47:36 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
   Additional Info:
      [a] Fatal Root Error: @SUB=TStorageFactorySystem::Unlink
Unsupported
----- End Fatal Exception -------------------------------------------------
The error disappears if I delete the output ROOT file and re-run the simulation. It started happening a few weeks ago and had never happened before. It happens in CMSSW_14_0_0 but also in other releases, on both lxplus and lxplus7. The file I'm running is https://github.com/cms-sw/cmssw/blob/master/SimPPS/Configuration/test/pg_step1_GEN_SIM_2021.py
A new Issue was created by @fabferro.
@rappoccio, @makortel, @Dr15Jones, @smuzaffar, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.
assign core
New categories assigned: core
@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
I'm not able to reproduce on lxplus8 or lxplus9, on either /tmp or AFS.
Could you give more details, e.g. on what filesystem you are running? Are you using https://github.com/cms-sw/cmssw/blob/master/SimPPS/Configuration/test/pg_step1_GEN_SIM_2021.py exactly as it is, or do you change process.o1.fileName in any way?
I ran it as it is. I tried modifying it but things don't change.
Trying some differential analysis: I installed two brand new releases (14_0_0 and 13_3_2) on the same machine (lxplus958) in the same shell. The problem appears only in 14_0_0 not in 13_3_2. I also ran a RECO script and it does the same. The output root file can't be re-written, as if it was locked.
One more piece of information: it works fine with "pure" AFS, so it seems to be related to some bad interplay between EOS and CMSSW_14_0_0
The last working release is CMSSW_14_0_0_pre1; _pre2 is the first one showing this issue.
I can reproduce it when running the job in a directory on EOS (via the FUSE mount). A major difference between 14_0_0_pre1 and pre2 is that pre1 used ROOT 6.26, and pre2 uses ROOT 6.30.
Here is a stack trace for the exception
(gdb) where
#0 0x00007ffff5ead0f1 in __cxxabiv1::__cxa_throw (obj=0x7fffa3fb6b80, tinfo=0x7ffff79a3650 <typeinfo for edm::Exception>, dest=0x7ffff796d010 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1 0x00007ffff173099d in (anonymous namespace)::RootErrorHandlerImpl(int, char const*, char const*) [clone .cold] () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2 0x00007ffff6ceea5b in ErrorHandler () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libCore.so
#3 0x00007ffff6c3e214 in TObject::Error(char const*, char const*, ...) const () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libCore.so
#4 0x00007ffff238a56d in TStorageFactorySystem::Unlink(char const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolTFileAdaptor.so
#5 0x00007ffff238dba5 in TStorageFactoryFile::Initialize(char const*, char const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolTFileAdaptor.so
#6 0x00007ffff238dd54 in TStorageFactoryFile::TStorageFactoryFile(char const*, char const*, char const*, int, int, bool) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolTFileAdaptor.so
#7 0x00007fffe94d20b9 in ?? ()
#8 0x00007fff00000000 in ?? ()
#9 0x00007fffa3aa1640 in ?? ()
#10 0x00007fffa3aa1640 in ?? ()
#11 0x00007fffffff2a90 in ?? ()
#12 0x00007fffbc5aa4f1 in ?? () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#13 0x00007fffffff2999 in ?? ()
#14 0x00007ffff3572920 in ?? ()
#15 0x00007fffea710062 in TClingCallFunc::IFacePtr() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libCling.so
#16 0x0000000400000000 in ?? ()
#17 0x00007fffe3ba22a0 in ?? ()
#18 0x00007fffa3a7c580 in ?? ()
#19 0x00007fffffff2a10 in ?? ()
#20 0x00007fffffff2ee0 in ?? ()
#21 0x00007fffffff2b50 in ?? ()
#22 0x00007ffff7154852 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libRIO.so
#23 0x00007ffff7153d69 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libRIO.so
#24 0x00007fffbc59a8cc in edm::RootOutputFile::RootOutputFile(edm::PoolOutputModule*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#25 0x00007fffbc589437 in edm::PoolOutputModule::reallyOpenFile() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#26 0x00007fffbc589591 in virtual thunk to edm::PoolOutputModule::openFile(edm::FileBlock const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#27 0x00007ffff7deb0b8 in edm::Schedule::openOutputFiles(edm::FileBlock&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#28 0x00007ffff7d4210d in edm::EventProcessor::openOutputFiles() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#29 0x00007ffff7d4776e in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#30 0x00000000004074f5 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#31 0x00007ffff6f0f96d in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0_pre2_SKYLAKEAVX512-el9_amd64_gcc12/build/CMSSW_14_0_0_pre2_SKYLAKEAVX512-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-c38983dfefd2a4afa504d4856ead176c/tbb-v2021.9.0/src/tbb/arena.cpp:688
#32 0x0000000000408ee2 in main::{lambda()#1}::operator()() const ()
#33 0x000000000040517c in main ()
The TStorageFactorySystem::Unlink() is called from
https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TStorageFactoryFile.cc#L167-L169
and our TStorageFactorySystem::Unlink() is indeed implemented as
https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TStorageFactorySystem.cc#L45-L48
The TStorageFactorySystem is registered to ROOT in
https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TFileAdaptor.cc#L52
https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TFileAdaptor.cc#L60-L61
As for why the underlying filesystem makes a difference, I have no clue at the moment.
Two possible workarounds:
- Use AFS or "local disk" for running CMSSW instead of EOS
- Add process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root"))) to the configuration file
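Besides these, the manual fix from the original report (deleting the output file before re-running) could be scripted as a stopgap. This is only a minimal sketch: the helper name and the file name "output.root" are made up for illustration, and the real name is whatever process.o1.fileName is set to in the configuration.

```python
from pathlib import Path


def remove_stale_output(filename: str) -> bool:
    """Delete a leftover output file so the next job does not have to overwrite it.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    path = Path(filename)
    if path.is_file():
        path.unlink()
        return True
    return False


# Hypothetical usage before launching cmsRun:
# remove_stale_output("output.root")
```

This does not address the underlying problem; it only avoids the overwrite of an existing file that triggers the exception.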
With gdb I found that when running on EOS, the path in
https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TStorageFactoryFile.cc#L169
is root://eoshome-m.cern.ch/${PWD}<filename>. I'd bet this somehow keeps ROOT's TUnixSystem from unlinking the file, which leads to our TStorageFactorySystem::Unlink() being called.
I checked the behavior on 14_0_0_pre1, and there the path was just the <filename>.
type root
@pcanal Did ROOT gain the ability to find out whether a local file is on (CERN) EOS, and in that case to prepend root://eoshome-m.cern.ch/ (or similar) to the file path, somewhere between 6.26 and 6.30?
This workaround seems to work too
process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))
Setting the output file as file:<filename> does not have an impact.
I found https://github.com/root-project/root/pull/11644. It pointed to another workaround: adding
TFile.CrossProtocolRedirects: 0
to $HOME/.rootrc.
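For anyone who wants to apply the .rootrc workaround from a setup script, here is a rough sketch that appends the setting only if it is not already there. The function name is hypothetical; the setting line itself is the one quoted above.

```python
from pathlib import Path

SETTING = "TFile.CrossProtocolRedirects: 0"


def ensure_rootrc_setting(rootrc: Path, setting: str = SETTING) -> bool:
    """Append `setting` to `rootrc` unless an identical line already exists.

    Returns True if the file was modified, False if it was left untouched.
    """
    text = rootrc.read_text() if rootrc.exists() else ""
    if any(line.strip() == setting for line in text.splitlines()):
        return False
    # Start on a fresh line if the existing file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with rootrc.open("a") as f:
        f.write(prefix + setting + "\n")
    return True


# Hypothetical usage:
# ensure_rootrc_setting(Path.home() / ".rootrc")
```

Note that the sketch only looks for an exact match of the line, so a different existing value for the same key would not be replaced.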
Yes, in v6.28 (the PR you found).
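To check whether a given output path would be affected, one could probe the file from Python. This is a sketch under an assumption that is not stated in this thread: that the EOS FUSE mount advertises the XRootD URL through an extended attribute named eos.url.xroot, and that this is what ROOT looks at. Please verify that against the ROOT PR before relying on it.

```python
import os
from typing import Optional

# Assumption: the EOS FUSE mount exposes the XRootD URL of a file through an
# extended attribute with this name. Not confirmed in this thread.
EOS_XROOT_ATTR = "eos.url.xroot"


def eos_xrootd_url(path: str) -> Optional[str]:
    """Return the XRootD URL advertised for `path`, or None if there is none."""
    try:
        raw = os.getxattr(path, EOS_XROOT_ATTR)
    except (OSError, AttributeError):
        # OSError: attribute missing, file missing, or xattrs not supported.
        # AttributeError: os.getxattr is only available on Linux.
        return None
    return raw.decode(errors="replace").rstrip("\x00")
```

A return value of None would suggest the file is opened as a plain local file, while a root:// URL would suggest the redirection applies.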
@pcanal Is there a way to choose the behavior per TFile? (I'm thinking of something like allowing this redirection for input files but disabling it for output files.) From the PR I'd guess "no".
If you know it is a local file and want to stay local, you can use new TFile instead of TFile::Open.