Problem overwriting/unlink simulation output file

Open fabferro opened this issue 11 months ago • 20 comments

I'm running the PPS Full Simulation with a particle gun, but when I run it for the second time I get the following error:

----- Begin Fatal Exception 11-Mar-2024 15:47:36 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
   Additional Info:
      [a] Fatal Root Error: @SUB=TStorageFactorySystem::Unlink Unsupported
----- End Fatal Exception -------------------------------------------------

The error disappears if I delete the output ROOT file and re-run the simulation. It started happening a few weeks ago and had never happened before. It happens in CMSSW_14_0_0 but also in other releases, on both lxplus and lxplus7. The configuration I'm running is https://github.com/cms-sw/cmssw/blob/master/SimPPS/Configuration/test/pg_step1_GEN_SIM_2021.py

fabferro avatar Mar 11 '24 15:03 fabferro
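
Since deleting the output file makes the error go away, one stopgap is to remove any leftover output before each run. The following is a minimal sketch of that idea using only the Python standard library; the helper name `remove_stale_output` and the file name in the usage note are hypothetical, not part of CMSSW or of the configuration above.

```python
from pathlib import Path


def remove_stale_output(path):
    """Delete a leftover output file; return True if a file was removed.

    Hypothetical helper: it only automates the manual "delete the old
    output and re-run" step described in the report.
    """
    target = Path(path)
    if target.is_file():
        target.unlink()
        return True
    return False
```

It could be called as `remove_stale_output("my_output.root")` (a placeholder name) from a wrapper script before launching the job. It sidesteps the failing overwrite but does not address its cause.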

cms-bot internal usage

cmsbuild avatar Mar 11 '24 15:03 cmsbuild

A new Issue was created by @fabferro.

@rappoccio, @makortel, @Dr15Jones, @smuzaffar, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Mar 11 '24 15:03 cmsbuild

assign core

makortel avatar Mar 11 '24 15:03 makortel

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Mar 11 '24 15:03 cmsbuild

I'm not able to reproduce on lxplus8 or lxplus9 on either /tmp or on AFS.

Could you give more details, e.g. on what filesystem you are running? Are you using https://github.com/cms-sw/cmssw/blob/master/SimPPS/Configuration/test/pg_step1_GEN_SIM_2021.py exactly as it is, or do you change process.o1.fileName in any way?

makortel avatar Mar 11 '24 16:03 makortel

I ran it as it is. I tried modifying it but things don't change.

fabferro avatar Mar 12 '24 08:03 fabferro

Trying some differential analysis: I installed two brand-new releases (14_0_0 and 13_3_2) on the same machine (lxplus958) in the same shell. The problem appears only in 14_0_0, not in 13_3_2. I also ran a RECO script and it fails in the same way. The output ROOT file can't be rewritten, as if it were locked.

fabferro avatar Mar 12 '24 08:03 fabferro

One more piece of information: it works fine with "pure" AFS, so it seems to be related to some bad interplay between EOS and CMSSW_14_0_0.

fabferro avatar Mar 12 '24 10:03 fabferro
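
To check whether a given working directory is affected, a quick test of whether it sits under the EOS mount may help. This is a rough sketch that assumes the EOS FUSE mount is at `/eos` (an assumption, adjust as needed); the function name `is_under_mount` is hypothetical.

```python
from pathlib import PurePosixPath


def is_under_mount(path, mount="/eos"):
    """True if the absolute POSIX path lies inside the given mount point.

    The default mount point "/eos" is an assumption about the site setup.
    The comparison is purely textual: symlinks are not resolved here.
    """
    path_parts = PurePosixPath(path).parts
    mount_parts = PurePosixPath(mount).parts
    return path_parts[:len(mount_parts)] == mount_parts
```

Because symlinks are not resolved, passing `os.path.realpath(os.getcwd())` rather than the raw working directory is the safer choice.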

The last working release is CMSSW_14_0_0_pre1; _pre2 is the first one showing this issue.

fabferro avatar Mar 12 '24 11:03 fabferro

I can reproduce the problem when running the job in a directory on EOS (via the FUSE mount). A major difference between 14_0_0_pre1 and pre2 is that pre1 used ROOT 6.26, whereas pre2 uses ROOT 6.30.

Here is a stack trace for the exception:

(gdb) where
#0  0x00007ffff5ead0f1 in __cxxabiv1::__cxa_throw (obj=0x7fffa3fb6b80, tinfo=0x7ffff79a3650 <typeinfo for edm::Exception>, dest=0x7ffff796d010 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007ffff173099d in (anonymous namespace)::RootErrorHandlerImpl(int, char const*, char const*) [clone .cold] () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007ffff6ceea5b in ErrorHandler () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libCore.so
#3  0x00007ffff6c3e214 in TObject::Error(char const*, char const*, ...) const () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libCore.so
#4  0x00007ffff238a56d in TStorageFactorySystem::Unlink(char const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolTFileAdaptor.so
#5  0x00007ffff238dba5 in TStorageFactoryFile::Initialize(char const*, char const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolTFileAdaptor.so
#6  0x00007ffff238dd54 in TStorageFactoryFile::TStorageFactoryFile(char const*, char const*, char const*, int, int, bool) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolTFileAdaptor.so
#7  0x00007fffe94d20b9 in ?? ()
#8  0x00007fff00000000 in ?? ()
#9  0x00007fffa3aa1640 in ?? ()
#10 0x00007fffa3aa1640 in ?? ()
#11 0x00007fffffff2a90 in ?? ()
#12 0x00007fffbc5aa4f1 in ?? () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#13 0x00007fffffff2999 in ?? ()
#14 0x00007ffff3572920 in ?? ()
#15 0x00007fffea710062 in TClingCallFunc::IFacePtr() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libCling.so
#16 0x0000000400000000 in ?? ()
#17 0x00007fffe3ba22a0 in ?? ()
#18 0x00007fffa3a7c580 in ?? ()
#19 0x00007fffffff2a10 in ?? ()
#20 0x00007fffffff2ee0 in ?? ()
#21 0x00007fffffff2b50 in ?? ()
#22 0x00007ffff7154852 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libRIO.so
#23 0x00007ffff7153d69 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libRIO.so
#24 0x00007fffbc59a8cc in edm::RootOutputFile::RootOutputFile(edm::PoolOutputModule*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#25 0x00007fffbc589437 in edm::PoolOutputModule::reallyOpenFile() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#26 0x00007fffbc589591 in virtual thunk to edm::PoolOutputModule::openFile(edm::FileBlock const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#27 0x00007ffff7deb0b8 in edm::Schedule::openOutputFiles(edm::FileBlock&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#28 0x00007ffff7d4210d in edm::EventProcessor::openOutputFiles() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#29 0x00007ffff7d4776e in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#30 0x00000000004074f5 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#31 0x00007ffff6f0f96d in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0_pre2_SKYLAKEAVX512-el9_amd64_gcc12/build/CMSSW_14_0_0_pre2_SKYLAKEAVX512-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-c38983dfefd2a4afa504d4856ead176c/tbb-v2021.9.0/src/tbb/arena.cpp:688
#32 0x0000000000408ee2 in main::{lambda()#1}::operator()() const ()
#33 0x000000000040517c in main ()

The TStorageFactorySystem::Unlink() is called from https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TStorageFactoryFile.cc#L167-L169 and our TStorageFactorySystem::Unlink() is indeed implemented as https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TStorageFactorySystem.cc#L45-L48

The TStorageFactorySystem is registered to ROOT in https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TFileAdaptor.cc#L52 and https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TFileAdaptor.cc#L60-L61

As for why the underlying filesystem makes a difference, I have no clue at the moment.

makortel avatar Mar 12 '24 19:03 makortel

Two possible workarounds:

  1. Use AFS or "local disk" for running CMSSW instead of EOS
  2. Add process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root"))) to the configuration file

makortel avatar Mar 12 '24 19:03 makortel

With gdb I found that when running on EOS, the path in https://github.com/cms-sw/cmssw/blob/d389c1c21ce701feb341cd00236d5b432160d2b5/IOPool/TFileAdaptor/src/TStorageFactoryFile.cc#L169 is root://eoshome-m.cern.ch/${PWD}<filename>. I'd bet this somehow keeps ROOT's TUnixSystem from unlinking the file, which leads to our TStorageFactorySystem::Unlink() being called instead.

I checked the behavior on 14_0_0_pre1, and there the path was just the <filename>.

makortel avatar Mar 12 '24 21:03 makortel
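
To make the difference between the two path forms concrete, here is a small illustrative sketch that splits a file name into scheme, host and path with Python's `urllib.parse`. It is not how ROOT or CMSSW dispatch on the name; the helper names and the example path in the usage note are hypothetical.

```python
from urllib.parse import urlsplit


def describe_path(name):
    """Split a file name into (scheme, host, path).

    The scheme is '' for a plain local name such as the one seen in
    14_0_0_pre1, and 'root' for a redirected root://host/... name.
    """
    parts = urlsplit(name)
    return parts.scheme, parts.netloc, parts.path


def is_remote(name):
    """True when the name carries a non-file URL scheme such as root://."""
    scheme, _, _ = describe_path(name)
    return scheme not in ("", "file")
```

With these helpers, a bare file name and a `file:` name both count as local, while a name like `root://eoshome-m.cern.ch//eos/user/e/example/out.root` (placeholder path) counts as remote, which is the form observed in the debugger on EOS.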

type root

makortel avatar Mar 12 '24 21:03 makortel

@pcanal Did ROOT gain, somewhere between 6.26 and 6.30, the ability to find out whether a local file is on (CERN) EOS and, if so, to prepend root://eoshome-m.cern.ch/ (or similar) to the file path?

makortel avatar Mar 12 '24 21:03 makortel

This workaround seems to work too

process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))

makortel avatar Mar 12 '24 22:03 makortel

Setting the output file name to file:<filename> has no effect.

makortel avatar Mar 12 '24 22:03 makortel

I found https://github.com/root-project/root/pull/11644. It pointed to another workaround: adding

TFile.CrossProtocolRedirects: 0

to $HOME/.rootrc.

makortel avatar Mar 12 '24 22:03 makortel
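
For anyone who prefers to script the `.rootrc` change, the following is a minimal sketch that adds the setting if it is missing and rewrites it if it has another value. Only the key `TFile.CrossProtocolRedirects` and the value `0` come from the comment above; the function name is hypothetical, and the parsing assumes simple `Key: value` lines.

```python
from pathlib import Path

SETTING = "TFile.CrossProtocolRedirects"


def disable_cross_protocol_redirects(rootrc):
    """Ensure the file contains 'TFile.CrossProtocolRedirects: 0'.

    Returns True if the file was created or modified, False otherwise.
    Assumes plain 'Key: value' lines; commented-out entries are left alone.
    """
    path = Path(rootrc)
    lines = path.read_text().splitlines() if path.exists() else []
    wanted = f"{SETTING}: 0"
    found = False
    changed = False
    out = []
    for line in lines:
        key = line.split(":", 1)[0].strip()
        if key == SETTING:
            found = True
            if line.strip() != wanted:
                line = wanted  # overwrite a differing value
                changed = True
        out.append(line)
    if not found:
        out.append(wanted)
        changed = True
    if changed:
        path.write_text("\n".join(out) + "\n")
    return changed
```

A typical call would be `disable_cross_protocol_redirects(Path.home() / ".rootrc")`. Note that this changes the behavior for every ROOT session of that user, not only for the affected job.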

Yes, ROOT gained that ability in v6.28 (via the PR you found).

pcanal avatar Mar 12 '24 22:03 pcanal

@pcanal Is there a way to choose the behavior per TFile? (I'm thinking of allowing this redirection for input files but disabling it for output files.) From the PR I'd guess "no".

makortel avatar Mar 12 '24 22:03 makortel

If you know it is a local file and want it to stay local, use new TFile instead of TFile::Open.

pcanal avatar Mar 13 '24 01:03 pcanal