
HLT crash in run393240: segmentation violation in `ProtonReconstructionAlgorithm::reconstructFromSingleRP`

Open missirol opened this issue 6 months ago • 11 comments

One HLT job crashed in run-393240 (release: CMSSW_15_0_7).

The full cmsRun log of the job is attached: old_hlt_run393240_pid2974290.log

Copied below is the output of the thread from which sig_dostack_then_abort was called. Maybe the crash originated in ProtonReconstructionAlgorithm?

Thread 38 (Thread 0x7fd9d3ffe700 (LWP 2976418) "cmsRun"):
#0  0x00007fdb2b16fac1 in poll () from /lib64/libc.so.6
#1  0x00007fdb241c2147 in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007fdb241c2344 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fdb2076fb9f in TSpline3::Eval(double) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/external/el8_amd64_gcc12/lib/libHist.so
#5  0x00007fda0cecb724 in ProtonReconstructionAlgorithm::reconstructFromSingleRP(edm::Ref<std::vector<CTPPSLocalTrackLite, std::allocator<CTPPSLocalTrackLite> >, CTPPSLocalTrackLite, edm::refhelper::FindUsingAdvance<std::vector<CTPPSLocalTrackLite, std::allocator<CTPPSLocalTrackLite> >, CTPPSLocalTrackLite> > const&, float, std::ostream&) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libRecoPPSProtonReconstruction.so
#6  0x00007fda0ceec813 in CTPPSProtonProducer::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/pluginRecoPPSProtonReconstructionAuto.so
#7  0x00007fdb2c097155 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#8  0x00007fdb2c07dd2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#9  0x00007fdb2bfff589 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#10 0x00007fdb2bfffa91 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#11 0x00007fdb2c28c2a8 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_7/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#12 0x00007fdb2c1e1b3b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fd98002f900, waiter=..., this=0x7fdb25e42200) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#13 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7fdb25e42200) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#14 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/arena.cpp:137
#15 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/market.cpp:599
#16 0x00007fdb2c1e3cee in tbb::detail::r1::rml::private_worker::run (this=0x7fdb25e35600) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#17 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fdb25e35600) at /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-054885a1b1d4ef9ec998d1bcd72c9241/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#18 0x00007fdb2a8771ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007fdb2b0768d3 in clone () from /lib64/libc.so.6
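
Frame #4 places the segmentation violation inside TSpline3::Eval in libHist, called from ProtonReconstructionAlgorithm::reconstructFromSingleRP (frame #5). As a minimal hypothetical sketch (not the CMSSW code), a spline pointer that is null or dangling at the call site would crash in exactly that frame rather than in the caller:

// Hypothetical standalone sketch, not CMSSW code: evaluating a TSpline3
// through a null (or dangling) pointer crashes inside TSpline3::Eval in
// libHist, i.e. in the same frame as the trace above.
#include "TSpline.h"

double evalOptics(const TSpline3* spline, double xi) {
  // If 'spline' is null or dangling, the segfault surfaces here,
  // inside TSpline3::Eval, not at the calling code.
  return spline->Eval(xi);
}

int main() {
  const TSpline3* spline = nullptr;  // e.g. a spline that was never filled
  return evalOptics(spline, 0.05) > 0. ? 1 : 0;
}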

I tried to reproduce the crash offline on hilton-c2b02-44-01 (a node with the same type of CPUs and GPUs as the HLT node where the crash happened) by running the script below, which reruns the job 100 times on the problematic events, but so far I have not been able to reproduce the crash.

#!/bin/bash

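# dump the HLT menu of run 393240 as a cmsRun configuration, reading back the error-stream files of the crashed job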
hltGetConfiguration run:393240 \
  --globaltag 150X_dataRun3_HLT_v1 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input \
/store/group/tsg/FOG/error_stream_root/run393240/run393240_ls1114_index000338_fu-c2b05-34-01_pid2974290.root,\
/store/group/tsg/FOG/error_stream_root/run393240/run393240_ls1114_index000362_fu-c2b05-34-01_pid2974290.root,\
/store/group/tsg/FOG/error_stream_root/run393240/run393240_ls1114_index000388_fu-c2b05-34-01_pid2974290.root \
  > tmp.py

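# append job customisations (threads/streams, logging) to the dumped configuration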
cat <<@EOF >> tmp.py
process.options.wantSummary = False
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24

process.GlobalTag.recordsToDebug = []

del process.MessageLogger
process.load('FWCore.MessageLogger.MessageLogger_cfi')
@EOF

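# run the job 100 times; print the name of any log containing 'fatal' (case-insensitive)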
for ntry in {00..99}; do
  hltLabel=hlt"${ntry}"
  echo "${hltLabel}"...
  cmsRun tmp.py &> "${hltLabel}".log
  grep -inrl fatal "${hltLabel}".log
done
unset ntry

missirol (Jun 23 '25 09:06)

cms-bot internal usage

cmsbuild (Jun 23 '25 09:06)

A new Issue was created by @missirol.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild (Jun 23 '25 09:06)

assign RecoPPS/ProtonReconstruction

mmusich (Jun 23 '25 12:06)

assign hlt

  • tentatively

mmusich (Jun 23 '25 12:06)

New categories assigned: reconstruction,hlt

@jfernan2, @mandrenguyen, @Martin-Grunewald, @mmusich you have been requested to review this Pull request/Issue and eventually sign. Thanks

cmsbuild (Jun 23 '25 12:06)

type ctpps

mmusich (Jun 23 '25 12:06)

If the segmentation fault is not repeatable, then it would be good to add the full tracebacks for all threads as it could be a thread safety issue.
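
As an illustration of why the other threads' stacks matter (a hypothetical sketch, not the actual CMSSW code): one way an intermittent crash inside TSpline3::Eval can arise is another thread concurrently replacing or deleting a shared spline object while this thread is still evaluating it; only the full multi-thread traceback would show whether anything of that kind was happening at the time.

// Hypothetical sketch, not CMSSW code: a use-after-free race on a shared
// TSpline3. Each call looks harmless in isolation, but a reader can
// dereference freed memory and crash sporadically inside TSpline3::Eval.
#include <atomic>
#include <thread>
#include "TSpline.h"

std::atomic<TSpline3*> sharedSpline{nullptr};

void writer() {
  double x[] = {0., 1., 2.}, y[] = {0., 1., 4.};
  for (int i = 0; i < 1000; ++i) {
    TSpline3* s = new TSpline3("optics", x, y, 3);
    delete sharedSpline.exchange(s);  // frees an object readers may still be using
  }
}

void reader() {
  for (int i = 0; i < 100000; ++i) {
    if (TSpline3* s = sharedSpline.load())
      s->Eval(0.5);  // may touch freed memory -> intermittent SIGSEGV
  }
}

int main() {
  std::thread w(writer), r1(reader), r2(reader);
  w.join();
  r1.join();
  r2.join();
  delete sharedSpline.load();
}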

Dr15Jones (Jun 23 '25 15:06)

> If the segmentation fault is not repeatable, then it would be good to add the full tracebacks for all threads as it could be a thread safety issue.

@Dr15Jones, I think this is provided in the attachment at the top of the issue's description.

missirol (Jun 23 '25 15:06)

@fabferro @obertino @vavati as CTPPS experts, please consider this issue too, thanks!!

jfernan2 (Jun 23 '25 15:06)

@grzanka FYI too

jfernan2 (Jun 23 '25 15:06)

> If the segmentation fault is not repeatable, then it would be good to add the full tracebacks for all threads as it could be a thread safety issue.
>
> @Dr15Jones, I think this is provided in the attachment at the top of the issue's description.

@Dr15Jones, would you have time to review the stack trace? Does it contain any hint about what the problem (if any) could be?

missirol (Jun 26 '25 15:06)

I didn't spot anything obviously related to the TSpline3::Eval() call in the other threads.

makortel (Jul 02 '25 22:07)