DeepTauId throws (in event with zero taus?)
The original issue #40437 has taken a different path so I reopen here just for this.
Reminder of relevant post: https://github.com/cms-sw/cmssw/issues/40437#issue-1522624055 https://github.com/cms-sw/cmssw/issues/40437#issuecomment-1373734266 https://github.com/cms-sw/cmssw/issues/28358 (closed. WHY?)
a log file https://cms-unified.web.cern.ch/cms-unified/joblogs/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693/8001/DataProcessing/020da37f-6871-4688-a86e-2b7e8a6bc683-26-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log
here the "shooting gun": trying to reproduce the event happen to have ZERO taus https://github.com/cms-sw/cmssw/issues/40437#issuecomment-1633730153
A new Issue was created by @VinInn Vincenzo Innocente.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
I still do not understand how a memory corruption can modify the values in an otherwise empty vector (pointer, size, capacity, all 0) w/o causing a segfault before reaching the throw.
I'n my opinion there is a legit vector there with "something" in it that reasonably fake a tau.
the tau collection is a view: no way to survive even the first deference
it sounds also there is a reproducibility problem. The PR tests cover the single thread cases somewhat (even though not on the right SSE-only hardware)
Could it be multithreaded reproducibility issue? Perhaps just running reco comparisons on MT job outputs is enough to get an idea.
assign reconstruction
New categories assigned: reconstruction
@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks
Let me tag @cms-sw/tau-pog-l2 also in this issue
Just for the record, I'm hitting into this issue while trying to run an HLT job through VTune. Without VTune the configuration works, but within VTune the job crashes in a few different ways, this exception being one of them.
Running VTune on a single-thread job in ASAN build resulted in
AddressSanitizer: CHECK failed: asan_allocator.cpp:188 "((old)) == ((kAllocBegMagic))" (0x302030303a303020, 0xcc6e96b9cc6e96b9) (tid=2772)
#0 0x7fddac36d4ba in CheckUnwind ../../../../libsanitizer/asan/asan_rtl.cpp:67
#1 0x7fddac38de65 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
#2 0x7fddac2d845b in __asan::LargeChunkHeader::Set(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:188
#3 0x7fddac2d845b in __asan::LargeChunkHeader::Set(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:178
#4 0x7fddac2d845b in __asan::QuarantineCallback::Recycle(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:204
#5 0x7fddac2d8655 in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::DoRecycle(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:193
#6 0x7fddac2d8b3d in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Recycle(unsigned long, __asan::QuarantineCallback) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:181
#7 0x7fddac2d411c in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Put(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback, __asan::AsanChunk*, unsigned long) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:112
#8 0x7fddac2d411c in __asan::Allocator::QuarantineChunk(__asan::AsanChunk*, void*, __sanitizer::BufferedStackTrace*) ../../../../libsanitizer/asan/asan_allocator.cpp:665
#9 0x7fddac3655ad in operator delete(void*, unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cpp:164
#10 0x7fdd6f44320f in std::_Sp_counted_ptr_inplace<BasicSingleTrajectoryState, churn_allocator<BasicSingleTrajectoryState>, (__gnu_cxx::_Lock_policy)2>::_M_destroy() (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libTrackingToolsTrajectoryState.so+0x5e20f)
#11 0x7fdc6843beed in CkfTrajectoryBuilder::limitedCandidates(std::shared_ptr<TrajectorySeed const> const&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<Trajectory, std::allocator<Trajectory> >&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf4eed)
#12 0x7fdc6843fbb6 in CkfTrajectoryBuilder::limitedCandidates(TrajectorySeed const&, TempTrajectory&, std::vector<Trajectory, std::allocator<Trajectory> >&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf8bb6)
#13 0x7fdc6844079c in CkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf979c)
#14 0x7fdc683c9306 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0x82306)
#15 0x7fdc683d04fe in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0x894fe)
#16 0x7fddac0f6dd0 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x986dd0)
that hints towards ASAN own data structures being overwritten
Running VTune on a single-thread job of cmsRunGlibC resulted in
*** Error in `cmsRunGlibC': corrupted size vs. prev_size: 0x0000000047f1cc00 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f474)[0x7f8d1b99f474]
/lib64/libc.so.6(+0x816a4)[0x7f8d1b9a16a4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9eadc44)[0x7f8c7f1a0c44]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9eae8bb)[0x7f8c7f1a18bb]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9fd2efd)[0x7f8c7f2c5efd]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9af7e27)[0x7f8c7edeae27]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x97995d3)[0x7f8c7ea8c5d3]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x71123b7)[0x7f8c7c4053b7]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZNK5Eigen30TensorContractionEvaluatorBaseINS_15TensorEvaluatorIKNS_19TensorContractionOpIKNS_5arrayINS_9IndexPairIlEELm1EEEKNS_17TensorReshapingOpIKNS_6DSizesIlLi2EEEKNS_18TensorImagePatchOpILln1ELln1EKNS_9TensorMapINS_6TensorIKfLi4ELi1ElEELi16ENS_11MakePointerEEEEEEEKNS8_ISB_SJ_EEKN10tensorflow33LaunchFusedConv2DWithOutputKernelIfE19OutputKernelWrapperEEENS_16ThreadPoolDeviceEEEE15evalGemmPartialILb1ELb1ELb0ELi0ELb1EEEvPflli+0x7ad)[0x7f8c7c455c6d]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN5Eigen8internal14TensorExecutorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIfLi4ELi1ElEELi16ENS_11MakePointerEEEKNS_17TensorReshapingOpIKNS_6DSizesIlLi4EEEKNS_19TensorContractionOpIKNS_5arrayINS_9IndexPairIlEELm1EEEKNS8_IKNS9_IlLi2EEEKNS_18TensorImagePatchOpILln1ELln1EKNS3_INS4_IKfLi4ELi1ElEELi16ES6_EEEEEEKNS8_ISJ_SO_EEKN10tensorflow33LaunchFusedConv2DWithOutputKernelIfE19OutputKernelWrapperEEEEEEENS_16ThreadPoolDeviceELb1ELNS0_15TiledEvaluationE0EE3runERS15_RKS16_+0x95)[0x7f8c7c487365]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x7195164)[0x7f8c7c488164]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow33LaunchFusedConv2DWithOutputKernelIfEclINS_19BiasAddOutputKernelIfNS_8IdentityEEEEEvRKT_PNS_15OpKernelContextERKNS_6TensorESD_PSB_+0x1ed)[0x7f8c7c4896ad]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13FusedConv2DOpIN5Eigen16ThreadPoolDeviceEfE7ComputeEPNS_15OpKernelContext[39/1803]0x7f8c7c48e1e5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(_ZN10tensorflow16ThreadPoolDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x4b)[0x7f8c74ae2f6b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1258ef1)[0x7f8c74bd0ef1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f8c74bd1fd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f8c81c8c99c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f8c81dcad78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x123e924)[0x7f8c74bb6924]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x124436e)[0x7f8c74bbc36e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x124dfa5)[0x7f8c74bc5fa5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x125820c)[0x7f8c74bd020c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f8c74bd1fd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f8c81c8c99c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f8c81dcad78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1244271)[0x7f8c74bbc271]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1248f08)[0x7f8c74bc0f08]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession11RunInternalElRKNS_10RunOptionsEPNS_18CallFrameInterfaceEPNS0_16ExecutorsAndKeysEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x7fa)[0x7f8c81ddd61a]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISD_EERKS4_ISB_SaISB_EESL_PS4_ISC_SaISC_EEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x96e)[0x7f8c81de034e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EERKN3tsl6thread17ThreadPoolOptionsE+0xa5)[0x7f8c876b54d5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EEPN3tsl6thread19ThreadPoolInterfaceE+0x19)[0x7f8c876b5589]
/build/mkortela/debug/hlt/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so(+0xa729a)[0x7f8bf985f29a]
/build/mkortela/debug/hlt/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so(+0x9bb74)[0x7f8bf9853b74]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm6stream21EDProducerAdaptorBase7doEventERKNS_19EventTransitionInfoEPNS_16ActivityRegistryEPKNS_20ModuleCallingContextE+0x141)[0x7f8d1e8f54b1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8a19)[0x7f8d1e865a19]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8fb4)[0x7f8d1e865fb4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so(+0x5f28)[0x7f8d253faf28]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x28083)[0x7f8d1cb4c083]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x13d21b)[0x7f8d1e7ea21b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor11processRunsEv+0x205)[0x7f8d1e7f3bc5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor15runToCompletionEv+0x23f)[0x7f8d1e7f418f]
cmsRunGlibC[0x4074ef]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x149bd)[0x7f8d1cb389bd]
cmsRunGlibC[0x408ed2]
cmsRunGlibC(main+0x14c)[0x40518c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f8d1b942555]
cmsRunGlibC[0x405451]
======= Memory map: ========
00400000-00404000 r--p 00000000 00:31 245399302 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00404000-00405000 r-xp 00004000 00:31 245399302 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00405000-0040a000 r-xp 00005000 00:31 245399302 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040a000-0040d000 r--p 0000a000 00:31 245399302 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040d000-0040e000 r--p 0000c000 00:31 245399302 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040e000-0040f000 rw-p 0000d000 00:31 245399302 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040f000-00410000 rwxp 00000000 00:00 0
00419000-4d534000 rw-p 00000000 00:00 0 [heap]
7f8bd8000000-7f8bd8021000 rw-p 00000000 00:00 0
7f8bd8021000-7f8bdc000000 ---p 00000000 00:00 0
7f8bdee6c000-7f8be100b000 rw-p 00000000 00:00 0
7f8be305c000-7f8be314c000 rwxp 00000000 00:00 0
7f8be314c000-7f8be8000000 rw-p 00000000 00:00 0
7f8be8000000-7f8be8021000 rw-p 00000000 00:00 0
7f8be8021000-7f8bec000000 ---p 00000000 00:00 0
7f8bec03d000-7f8bec35d000 rwxp 00000000 00:00 0
7f8bec35d000-7f8bec379000 rw-p 00000000 00:00 0
7f8bec379000-7f8bec3c9000 rwxp 00000000 00:00 0
7f8bec3e5000-7f8bec435000 rwxp 00000000 00:00 0
7f8bec444000-7f8bec44a000 r--p 00000000 00:31 231860432 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec450000-7f8bec4a0000 rwxp 00000000 00:00 0
7f8bec4a0000-7f8bec4a2000 r--p 00000000 00:31 231860432 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a2000-7f8bec4a3000 r-xp 00002000 00:31 231860432 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a3000-7f8bec4a4000 r--p 00003000 00:31 231860432 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a4000-7f8bec4a5000 r--p 00003000 00:31 231860432 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a5000-7f8bec4a6000 rw-p 00004000 00:31 231860432 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a6000-7f8bec507000 rwxp 00000000 00:00 0
Looks like memory corruption inside Tensorflow is not unheard of
- https://github.com/tensorflow/tensorflow/issues/60468
- https://github.com/tensorflow/tensorflow/issues/59394
- https://github.com/tensorflow/tensorflow/issues/47179
(no idea if any of these are in any way related)
We updated tensorflow to 2.12 in 13_3_0_pre1 (built on Aug 3). I wonder if there would be any temporal correlation with the RECO profiling crashes?
At least in CMSSW_13_2_X_2023-11-17-1100 the IgProf job on step3 of workflow 136.889 succeeds, whereas in CMSSW_13_3_X_2023-11-17-1100 the job crashes (but of course there are many other changes in between)
(maybe I'm going in a wrong direction, given that the DeepTauId throwing an exception on nan is known for several years https://github.com/cms-sw/cmssw/issues/28358)
In my HLT menu test case with VTune, if I "disable" the two DeepTauId modules by adding an always-rejecting EDFilter before them, the job succeeds. Enabling either one, or both, of the DeepTauId modules makes the job crash most of the time.
Hi @makortel. By "succeed", do you mean just to finish without crashing, or the outputs are always identical? Especially for other taggers that use TF. In the past, we couldn't conclude with 100% certainty whether memory corruption happens in the DeepTau module or it occures elsewhere and corrupts DeepTau-related memory. Is the machine on which one could reproduce the crash accessible by normal users? If yes, could you please share instructions on how one can reproduce the crash?
Hi @kandrosov
By "succeed", do you mean just to finish without crashing, or the outputs are always identical?
I mean "finishes without crashing". I'm not checking the outputs in any way.
I sent you privately the HLT recipe that I used. By adding an event rejecting EDFilter before/after the DeepTauId modules in the menu I see the behavior that
- if the EDFilter is right before DeepTauId module, the VTune profiling job (technically) succeeds
- if the EDFilter is right after DeepTauId module, the VTune profiling job crashes
I started also look into the step3 of the workflow 136.889, that is used in the IgProf profiling in IBs (which is currently crashing in 13_3_X and 14_0_X). Running the step3 as it is indeed crashes also in VTune, in a way that looks like memory corruption. If I run everything up to and including the DeepTauId modules there (*), the job still crashes. But, contrary to the HLT test case, if I run everything up to the modules DeepTauId consumes (**), but not DeepTauId, the job crashes as well. This behavior would indicate that something else than (or maybe in addition to?) DeepTauId would cause memory corruption when the job is run through VTune.
I'll continue to investigate (albeit slowly).
(*) by adding a snippet
process.deepTauSequence = cms.Sequence(process.deepTau2017v2p1ForMini+process.deepTau2018v2p5ForMini
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
process.raw2digi_step,
process.reconstruction_step,
process.tauPath
)
to the end of the step3 configuration
(**) by adding a snippet
process.deepTauSequence = cms.Sequence(
process.offlineSlimmedPrimaryVertices
+ process.packedPFCandidates
+ process.hpsPFTauTransverseImpactParameters
+ process.slimmedTausNoDeepIDs
+ process.slimmedElectrons
+ process.slimmedMuons
)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
process.raw2digi_step,
process.reconstruction_step,
process.tauPath
)
to the end of the step3 configuration
With the step3 of 136.889 I got to the point that running initialStep and everything it consumes crashes (under VTune), while running any of the modules initialStep consumes individually technically works. The smoking gun here is that initialStep module is of type TrackTfClassifier that uses Tensorflow. The stack trace of the working thread contains a few interesting frames, even if it is mostly corrupted
Thread 1 (Thread 0x7f2b0d5aa740 (LWP 21048) "cmsRun"):
#0 0x00007f2b0f464ddd in poll () from /lib64/libc.so.6
#1 0x00007f2b16f75841 in ?? ()
#2 0x01007f2b0d5a4858 in ?? ()
#3 0x00007ffc13af4400 in ?? ()
#4 0x00007ffc13af4920 in ?? ()
#5 0x00007ffc13af4410 in ?? ()
#6 0x00007ffc13af44a0 in ?? ()
#7 0x00007f2b176b1060 in ?? ()
#8 0x0000002100000001 in ?? ()
#9 0x00007ffc13af4390 in ?? ()
#10 0x00007f2b0d582240 in ?? ()
#11 0x00007f2b13d241c0 in ?? ()
#12 0x000000006567babf in ?? ()
#13 0x0000000000098c43 in ?? ()
#14 0x00007f2a1c780320 in ?? ()
#15 0x00007f2a71cc753b in tensorflow::SimplePropagatorState::SimplePropagatorState(tensorflow::ImmutableExecutorState const&, long, tensorflow::ImmutableExecutorState::FrameInfo const&, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2
#16 0x00007f2ad712db5f in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#17 0x00007f2ad70e650c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#18 0x00007f2ad70e6e70 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#19 0x00007f2b1715b0fe in ?? ()
#20 0x00007ffc13af5730 in ?? ()
#21 0x00007f2b15106698 in ?? ()
#22 0x000000000000000b in ?? ()
#23 0x00007ffc13af5730 in ?? ()
#24 0x00007ffc13af5170 in ?? ()
#25 0x00007f2b15106060 in ?? ()
#26 0x00007ffc13af5270 in ?? ()
#27 0x00007f2b1715c416 in ?? ()
#28 0x000000000000000b in ?? ()
#29 0x00007ffc13af5600 in ?? ()
#30 0x0000000000000000 in ?? ()
Recipe to reproduce (at CERN) is more or less
cmsrel CMSSW_14_0_0_pre0
cd CMSSW_14_0_0_pre0/src
cmsenv
runTheMatrix -l 136.889 --ibeos
cd 136.889_RunMET2018D
cat >>step3_RAW2DIGI_L1Reco_RECO_SKIM_PAT_ALCA_DQM.py <<EOF
def make(name):
print(f"Only up to {name}")
process.testSequence = cms.Sequence(getattr(process, name))
make("initialStep") # should crash
# immediate dependencies of initialStep
#make("offlineBeamSpot")
#make("firstStepPrimaryVertices")
#make("initialStepTracks")
process.testPath = cms.Path(process.testSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(process.raw2digi_step, process.reconstruction_step, process.testPath)
EOF
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
vtune -collect hotspots -- cmsRun step3_RAW2DIGI_L1Reco_RECO_SKIM_PAT_ALCA_DQM.py
To be again more specific for DeepTauId, I took workflow 140.201 that runs re-MINI, and was able to reproduce the behavior I saw in the earlier HLT job, i.e. including DeepTauId crashes (under VTune), but running only up to the modules consumed by the DeepTauId modules technically works.
Recipe to reproduce at CERN
cmsrel CMSSW_14_0_0_pre0
cd CMSSW_14_0_0_pre0/src
cmsenv
runTheMatrix -l 140.201 --ibeos
cd 140.201_RunJetMET2022D_reMINI
cat >>step2_PAT.py <<EOF
process.deepTauSequence = cms.Sequence(process.deepTau2017v2p1ForMini+process.deepTau2018v2p5ForMini)
# uncomment for the modules consumed by DeepTauId
#process.deepTauSequence = cms.Sequence(process.offlineSlimmedPrimaryVertices+process.packedPFCandidates+process.slimmedTausNoDeepIDs+process.slimmedElectrons+process.slimmedMuons)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(process.tauPath)
EOF
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
export SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local
vtune -collect hotspots -- cmsRun step2_PAT.py
I repeated my tests in https://github.com/cms-sw/cmssw/issues/42444#issuecomment-1832860800 and https://github.com/cms-sw/cmssw/issues/42444#issuecomment-1832866146 with IgProf (*), and see exactly the same behavior. I.e. running up to the modules that initialStep / DeepTauId consume technically works, but running initialStep / DeepTauId results in crash.
(*) replace
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
vtune -collect hotspots -- <command>
with
igprof -d -t cmsRun -pp -z -o igprof.perf.gz <command>
FYI @gartung . I think we should add crash detection to the IB profiling jobs. Something along after detecting a crash in some step of the workflow, not running the subsequent steps, and communicating the failure to the IB dashboard (e.g. similarly to the HLT validation tests that show a red backround when there are failures).
With IgProf I managed to run it together with Valgrind, and got (on 136.889 step 3 customized to run initialStep and its dependencies)
valgrind: m_mallocfree.c:278 (mk_plain_bszB): Assertion 'bszB != 0' failed.
valgrind: This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata. If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away. Please try that before reporting this as a bug.
host stacktrace:
==11218== at 0x5803F638: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218== by 0x5803F757: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218== by 0x5803F8DE: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218== by 0x58046B96: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218== by 0x58047679: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218== by 0x5800485C: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218== by 0x58004CE7: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218== by 0x58004EB9: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218== by 0x580933DB: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218== by 0x580DC945: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
sched status:
running_tid=1
Thread 1: status = VgTs_Runnable (lwpid 11218)
==11218== at 0x402EF11: operator new(unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==11218== by 0x55DEC210: tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape const&, tensorflow::Tensor**, tsl::AllocatorAttributes) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x55DEC6D0: tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape const&, tensorflow::Tensor**) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x4A65B582: tensorflow::FusedMatMulOp<Eigen::ThreadPoolDevice, float>::Compute(tensorflow::OpKernelContext*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218== by 0x5606CF6A: tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x5615AEF0: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x5615BFD5: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x4F8BC99B: tsl::thread::ThreadPool::Schedule(std::function<void ()>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218== by 0x4F9FAD77: std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&)::{lambda(std::function<void ()>)#6}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218== by 0x56140923: void tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::RunTask<std::_Bind<void (tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::*(tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>*, tensorflow::SimplePropagatorState::TaggedNode, long))(tensorflow::SimplePropagatorState::TaggedNode, long)> >(std::_Bind<void (tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::*(tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>*, tensorflow::SimplePropagatorState::TaggedNode, long))(tensorflow::SimplePropagatorState::TaggedNode, long)>&&, int) [clone .constprop.0] (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x5614636D: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ScheduleReady(absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x5614FFA4: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::NodeDone(tsl::Status const&, absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::NodeExecStatsInterface*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x5615A20B: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x5615BFD5: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x4F8BC99B: tsl::thread::ThreadPool::Schedule(std::function<void ()>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218== by 0x4F9FAD77: std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&)::{lambda(std::function<void ()>)#6}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218== by 0x56146270: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ScheduleReady(absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x5614AF07: tensorflow::(anonymous namespace)::ExecutorImpl::RunAsync(tensorflow::Executor::Args const&, std::function<void (tsl::Status const&)>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218== by 0x4FA0D619: tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218== by 0x4FA1034D: tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218== by 0x42F1D4D4: tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so)
==11218== by 0x42F1D588: tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolInterface*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so)
==11218== by 0x5F36B736: (anonymous namespace)::TfDnn::operator()(std::vector<reco::Track, std::allocator<reco::Track> > const&, reco::BeamSpot const&, std::vector<reco::Vertex, std::allocator<reco::Vertex> > const&) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so)
==11218== by 0x5F36C085: TrackMVAClassifier<(anonymous namespace)::TfDnn, void>::computeMVA(std::vector<reco::Track, std::allocator<reco::Track> > const&, reco::BeamSpot const&, std::vector<reco::Vertex, std::allocator<reco::Vertex> > const&, std::vector<std::pair<float, bool>, std::allocator<std::pair<float, bool> > >&) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so)
==11218== by 0x5F3E9AF7: TrackMVAClassifierBase::produce(edm::Event&, edm::EventSetup const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libRecoTrackerFinalTrackSelectors.so)
==11218== by 0x4C6C4B0: edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218== by 0x4C50F1D: edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218== by 0x4BDCA18: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218== by 0x4BDCFB3: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218== by 0x41E7F27: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so)
==11218== by 0x640B290: tbb::detail::r1::task_dispatcher::execute_and_wait(tbb::detail::d1::task*, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) (task_dispatcher.h:322)
==11218== by 0x4B6121A: edm::FinalWaitingTask::wait() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218== by 0x4B6ABC4: edm::EventProcessor::processRuns() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218== by 0x4B6B18E: edm::EventProcessor::runToCompletion() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218== by 0x4074EE: tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
==11218== by 0x63F79BC: tbb::detail::r1::task_arena_impl::execute(tbb::detail::d1::task_arena_base&, tbb::detail::d1::delegate_base&) (arena.cpp:688)
==11218== by 0x408ED1: main::{lambda()#1}::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
==11218== by 0x40518B: main (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
client stack range: [0x1FFEFAF000 0x1FFEFF6FFF] client SP: 0x1FFEFF1060
valgrind stack range: [0x1008AA6000 0x1008BA5FFF] top usage: 15736 of 1048576
Just shooting in the dark, I increased the stack size per thread from 10 MB to 20 MB with process.options.sizeOfStackForThreadsInKB = 20*1024, and with that I was able to run workflow 136.889 step 3 customized to run initialStep through IgProf. The full step 3 configuration, however, still crashed. So maybe stack size does not play a role.
the heap-allocator "meta-data" are most probably corrupted. is this with JeMalloc? Maybe Tensorflow makes assumptions that JeMalloc does not comply.( alignment?)
the heap-allocator "meta-data" are most probably corrupted. is this with JeMalloc? Maybe Tensorflow makes assumptions that JeMalloc does not comply.( alignment?)
My recent tests were indeed with jemalloc. My HLT test case crashed with glibc malloc as well (https://github.com/cms-sw/cmssw/issues/42444#issuecomment-1817056942) when ran through VTune. I could test the offline workflows and IgProf with glibc malloc too.
my tool shows that tensoerflow calls aligned_alloc function_taskedm:WaitingTaskList:announce():{lambda()#1}:execute;edm:Worker:RunModuleTask<edm:OccurrenceTraitsedm:EventPrincipal,(edm:BranchActionType)1>:execute;__exception_ptr:exception_ptredm:Worker:runModuleAfterAsyncPrefetch<edm:OccurrenceTraitsedm:EventPrincipal,(edm:BranchActionType)1>(__exception_ptr:exception_ptr,edm:OccurrenceTraitsedm:EventPrincipal,(edm:BranchActionType)1:TransitionInfoType,edm:StreamID,edm:ParentContext,edm:OccurrenceTraits<edm:EventPrincipal,;edm:WorkerTedm:stream:EDProducerAdaptorBase:implDo;edm:stream:EDProducerAdaptorBase:doEvent;DeepMETProducer:produce;tensorflow:run;tensorflow:run;tensorflow:DirectSession:Run;tensorflow:DirectSession:GetOrCreateExecutors;tensorflow:DirectSession:CreateExecutors;tensorflow:DirectSession:CreateGraphs;tensorflow:GraphExecutionState:BuildGraph;tensorflow:GraphExecutionState:OptimizeGraph;tensorflow:grappler:RunMetaOptimizer;tensorflow:grappler:MetaOptimizer:OptimizeConsumeItem;tensorflow:grappler:MetaOptimizer:OptimizeGraph;tensorflow:grappler:MetaOptimizer:OptimizeGraph;tensorflow:grappler:MetaOptimizer:RunOptimizer;mlir:tfg:TFGGrapplerOptimizer:Optimize;mlir:tfg:ImportGraphDef;mlir:tfg:(anonymousnamespace):GraphDefImporter:ConvertNodes(mlir:OpBuilder,mlir:tfg:;mlir:OperationState:OperationState;mlir:OperationName:OperationName;opnew;aligned_alloc;
do the AVX2 and AVX512 builds crash as well?
Workflow 136.889 step 3 (customized to run initialStep and its dependencies) crashes with cmsRunGlibC on both IgProf and VTune. IgProf crash stack trace didn't have any useful additional information, the VTune crash is a tiny bit different, but still points to Tensorflow
*** Error in `cmsRunGlibC': malloc(): smallbin double linked list corrupted: 0x00000000459a97d0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f474)[0x7f69fda4d474]
/lib64/libc.so.6(+0x82bb0)[0x7f69fda50bb0]
/lib64/libc.so.6(__libc_malloc+0x4c)[0x7f69fda5378c] [30/74]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64/libstdc++.so.6(_Znwm+0x1c)[0x7f69fe38486c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x358bca7)[0x7f695e3a4ca7]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow12ConcatBaseOpIN5Eigen16ThreadPoolDeviceEfLNS_16AxisArgumentNameE0EE7ComputeEPNS_15OpKernelContextE+0xff)[0x7f695e3b4cbf]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(_ZN10tensorflow16ThreadPoolDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x4b)[0x7f695a63cf6b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1258ef1)[0x7f695a72aef1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f695a72bfd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f69677b299c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f69678f0d78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1244271)[0x7f695a716271]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1248f08)[0x7f695a71af08]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession11RunInternalElRKNS_10RunOptionsEPNS_18CallFrameInterfaceEPNS0_16ExecutorsAndKeysEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x7fa)[0x7f696790361a]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISD_EERKS4_ISB_SaISB_EESL_PS4_ISC_SaISC_EEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x96e)[0x7f696790634e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EERKN3tsl6thread17ThreadPoolOptionsE+0xa5)[0x7f6975f794d5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EEPN3tsl6thread19ThreadPoolInterfaceE+0x19)[0x7f6975f79589]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so(+0xba737)[0x7f6945080737]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so(+0xbb086)[0x7f6945081086]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libRecoTrackerFinalTrackSelectors.so(_ZN22TrackMVAClassifierBase7produceERN3edm5EventERKNS0_10EventSetupE+0xaa8)[0x7f694549caf8]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm6stream21EDProducerAdaptorBase7doEventERKNS_19EventTransitionInfoEPNS_16ActivityRegistryEPKNS_20ModuleCallingContextE+0x141)[0x7f6a009a44b1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm7WorkerTINS_6stream21EDProducerAdaptorBaseEE6implDoERKNS_19EventTransitionInfoEPKNS_20ModuleCallingContextE+0x5e)[0x7f6a00988f1e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8a19)[0x7f6a00914a19]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8fb4)[0x7f6a00914fb4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so(+0x5f28)[0x7f6a074a8f28]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x28083)[0x7f69febfb083]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x13d21b)[0x7f6a0089921b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor11processRunsEv+0x205)[0x7f6a008a2bc5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor15runToCompletionEv+0x23f)[0x7f6a008a318f]
cmsRunGlibC[0x4074ef]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x149bd)[0x7f69febe79bd]
cmsRunGlibC[0x408ed2]
cmsRunGlibC(main+0x14c)[0x40518c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f69fd9f0555]
cmsRunGlibC[0x405451]
======= Memory map: ========
00400000-00404000 r--p 00000000 00:31 449008759 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00404000-00405000 r-xp 00004000 00:31 449008759 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00405000-0040a000 r-xp 00005000 00:31 449008759 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040a000-0040d000 r--p 0000a000 00:31 449008759 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040d000-0040e000 r--p 0000c000 00:31 449008759 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040e000-0040f000 rw-p 0000d000 00:31 449008759 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040f000-00410000 rwxp 00000000 00:00 0
014d1000-46ef0000 rw-p 00000000 00:00 0 [heap]
7f68d6da9000-7f68eb73c000 rw-p 00000000 00:00 0
7f68f9c51000-7f68fa061000 rwxp 00000000 00:00 0
7f68fa061000-7f68fa07d000 rw-p 00000000 00:00 0
7f68fa07d000-7f68fa0cd000 rwxp 00000000 00:00 0
7f68fa0e9000-7f68fa139000 rwxp 00000000 00:00 0
7f68fa154000-7f68fa1d6000 rwxp 00000000 00:00 0
7f68fa1d6000-7f68fe09d000 rw-p 00000000 00:00 0
7f68fe09d000-7f68fe0db000 r--p 00000000 00:31 443261081 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginEventFilterL1TXRawToDigiAuto.so
7f68fe0ef000-7f68fe119000 rwxp 00000000 00:00 0
7f68fe119000-7f68fe125000 r--p 00000000 00:31 443261081 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginEventFilterL1TXRawToDigiAuto.so
7f68fe125000-7f68fe13a000 r-xp 0000c000 00:31 443261081 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginEventFilterL1TXRawToDigiAuto.so
7f68fe13a000-7f68fe142000 r--p 00021000 00:31 443261081 /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginEventFilterL1TXRawToDigiAuto.so
do the AVX2 and AVX512 builds crash as well?
I haven't tested. On the machine I'm testing Tensorflow prints at the beginning of the job
2023-12-01 15:40:42.725466: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-01 15:40:42.758009: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
I know, and do not understand what "This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations" really mean. Tensorflow does some jitting and maybe produce vector code for the specific machine. Maybe vtune ((or igprof) can be used to see if it is the case or not)
Running workflow 140.201 step 2 (customized to run DeepTauId and its dependencies) with cmsRunGlibC resulted the nan exception with IgProf
----- Begin Fatal Exception 01-Dec-2023 16:14:20 CET-----------------------
An exception of category 'DeepTauId' occurred while
[0] Processing Event run: 357735 lumi: 12 event: 18233675 stream: 0
[1] Running path 'tauPath'
[2] Calling method for module DeepTauId/'deepTau2018v2p5ForMini'
Exception Message:
invalid prediction = nan for tau_index = 0, pred_index = 0
----- End Fatal Exception -------------------------------------------------
and on VTune the job crashed inside DeepTauId module with a corrupted stack.