HLT crash in run393240: segmentation violation in `ProtonReconstructionAlgorithm::reconstructFromSingleRP`
One HLT job crashed in run 393240 (release: CMSSW_15_0_7).
The full cmsRun log of the job is attached here:
old_hlt_run393240_pid2974290.log
Copied below is the output of the thread from which sig_dostack_then_abort was called. Maybe the crash originated in ProtonReconstructionAlgorithm?
Thread 38 (Thread 0x7fd9d3ffe700 (LWP 2976418) "cmsRun"):
#0 0x00007fdb2b16fac1 in poll () from /lib64/libc.so.6
#1 0x00007fdb241c2147 in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2 0x00007fdb241c2344 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007fdb2076fb9f in TSpline3::Eval(double) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/external/el8_amd64_gcc12/lib/libHist.so
#5 0x00007fda0cecb724 in ProtonReconstructionAlgorithm::reconstructFromSingleRP(edm::Ref<std::vector<CTPPSLocalTrackLite, std::allocator<CTPPSLocalTrackLite> >, CTPPSLocalTrackLite, edm::refhelper::FindUsingAdvance<std::vector<CTPPSLocalTrackLite, std::allocator<CTPPSLocalTrackLite> >, CTPPSLocalTrackLite> > const&, float, std::ostream&) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libRecoPPSProtonReconstruction.so
#6 0x00007fda0ceec813 in CTPPSProtonProducer::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/pluginRecoPPSProtonReconstructionAuto.so
#7 0x00007fdb2c097155 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#8 0x00007fdb2c07dd2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#9 0x00007fdb2bfff589 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#10 0x00007fdb2bfffa91 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#11 0x00007fdb2c28c2a8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#12 0x00007fdb2c1e1b3b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fd98002f900, waiter=..., this=0x7fdb25e42200) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#13 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7fdb25e42200) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#14 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/arena.cpp:137
#15 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/market.cpp:599
#16 0x00007fdb2c1e3cee in tbb::detail::r1::rml::private_worker::run (this=0x7fdb25e35600) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#17 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fdb25e35600) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#18 0x00007fdb2a8771ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007fdb2b0768d3 in clone () from /lib64/libc.so.6
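For context on where a crash like this can come from (a generic illustration, not the actual RecoPPS code): TSpline3::Eval essentially only reads the spline's internal arrays, so a segmentation violation inside it usually points to the TSpline3 object itself being invalid, e.g. a null or never-filled spline, a dangling pointer to a deleted spline, or an object modified concurrently by another thread. A minimal hypothetical sketch of the null/dangling case, with made-up names (evaluateOptics, splineXiVsX):

// Hypothetical illustration only, not the CMSSW implementation.
#include <stdexcept>
#include "TSpline.h"

double evaluateOptics(const TSpline3* splineXiVsX, double x) {
  if (splineXiVsX == nullptr) {
    // a defensive check like this turns a crash into a catchable exception
    throw std::runtime_error("optics spline not available");
  }
  return splineXiVsX->Eval(x);  // segfaults here if the spline is empty or dangling (the null case is caught above)
}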
I tried to reproduce the crash offline on hilton-c2b02-44-01 (a node with the same type of CPUs and GPUs as the HLT node where the crash happened) by running the script below (i.e. running 100 times on the problematic events), but so far I could not reproduce the crash.
#!/bin/bash
# dump the HLT menu of run 393240, reading the error-stream files of the crashed job
hltGetConfiguration run:393240 \
--globaltag 150X_dataRun3_HLT_v1 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
/store/group/tsg/FOG/error_stream_root/run393240/run393240_ls1114_index000338_fu-c2b05-34-01_pid2974290.root,\
/store/group/tsg/FOG/error_stream_root/run393240/run393240_ls1114_index000362_fu-c2b05-34-01_pid2974290.root,\
/store/group/tsg/FOG/error_stream_root/run393240/run393240_ls1114_index000388_fu-c2b05-34-01_pid2974290.root \
> tmp.py

# customisations applied on top of the dumped configuration
cat <<@EOF >> tmp.py
process.options.wantSummary = False
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
process.GlobalTag.recordsToDebug = []
del process.MessageLogger
process.load('FWCore.MessageLogger.MessageLogger_cfi')
@EOF

# run the job 100 times, and print the name of any log file containing a fatal error
for ntry in {00..99}; do
  hltLabel=hlt"${ntry}"
  echo "${hltLabel}"...
  cmsRun tmp.py &> "${hltLabel}".log
  grep -inrl fatal "${hltLabel}".log
done
unset ntry
cms-bot internal usage
A new Issue was created by @missirol.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign RecoPPS/ProtonReconstruction
assign hlt
- tentatively
New categories assigned: reconstruction,hlt
@jfernan2,@mandrenguyen,@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks
type ctpps
If the segmentation fault is not repeatable, then it would be good to add the full tracebacks for all threads as it could be a thread safety issue.
@Dr15Jones, I think this is provided in the attachment at the top of the issue's description.
@fabferro @obertino @vavati as CTPPS experts, please consider this issue too, thanks!!
@grzanka FYI too
@Dr15Jones, would you have time to review the stack trace? Does it contain any hint about what the problem (if any) could be?
I didn't spot anything obviously related to the TSpline3::Eval() call in the other threads.
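As background on the thread-safety hypothesis (again a purely illustrative sketch with made-up names, not the actual RecoPPS code): a race would not necessarily leave a visible trace in the other threads at the moment of the crash; it is enough that, at some earlier point, another thread replaced or deleted a spline that this stream was still evaluating. A pattern of this shape would produce exactly a segmentation violation with TSpline3::Eval as the top frame:

#include <memory>
#include <thread>
#include "TSpline.h"

// Hypothetical illustration only: one thread keeps a raw pointer to a spline
// while another thread replaces the owning shared_ptr. The old spline gets
// deleted, the reader keeps calling Eval() on freed memory, and the job
// crashes inside TSpline3::Eval.
void danglingSplineRace(std::shared_ptr<TSpline3>& sharedSpline) {
  const TSpline3* raw = sharedSpline.get();  // non-owning copy of the pointer

  std::thread reader([raw] {
    for (int i = 0; i < 1000; ++i)
      (void)raw->Eval(0.01 * i);  // may execute after the object has been deleted
  });
  std::thread writer([&sharedSpline] {
    sharedSpline = std::make_shared<TSpline3>();  // destroys the old spline while the reader may still use it
  });

  reader.join();
  writer.join();
}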