HLT crashes in GPU and CPU in collision runs
Dear experts,
During the week of June 13-20, the following three types of HLT crashes happened in collision runs. HLT was using CMSSW_12_3_5.
type 1
cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_5-slc7_amd64_gcc10/build/CMSSW_12_3_5-build/tmp/BUILDROOT/32f4c0d8c5d5ff0fb0f1b58023d4424d/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray]: Assertion `tmpNtuplet.size() <= 4' failed.
A fatal system signal has occurred: abort signal
This crash happened on June 13th, during stable beams, with collisions at 900 GeV. Run number: 353709. The crash happened on a CPU node (fu-c2a05-35-01). Elog: http://cmsonline.cern.ch/cms-elog/1143438. Full crash report: https://swmukher.web.cern.ch/swmukher/hltcrash_June13_StableBeam.txt
type 2
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: PathStatusInserter:Dataset_ExpressPhysics
Module: EcalRawToDigi:hltEcalDigisLegacy
A fatal system signal has occurred: segmentation violation
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: CAHitNtupletCUDA:hltPixelTracksCPU
Module: none
Module: none
A fatal system signal has occurred: segmentation violation
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: none
Module: HcalCPURecHitsProducer:hltHbherecoFromGPU
A fatal system signal has occurred: segmentation violation
This type of crash happened on GPU nodes (for example fu-c2a02-35-01). It occurred during collision runs when no real collisions were ongoing: on June 14th (run 353744, Pixel subdetector was out), and on June 18th (runs 353932, 353935, 353941, Pixel and Tracker subdetectors were out).
type 3
[2] Prefetching for module MeasurementTrackerEventProducer/'hltSiStripClusters'
[3] Prefetching for module SiPixelDigiErrorsFromSoA/'hltSiPixelDigisFromSoA'
[4] Calling method for module SiPixelDigiErrorsSoAFromCUDA/'hltSiPixelDigiErrorsSoA'
Exception Message:
A std::exception was thrown.
cannot create std::vector larger than max_size()
This happened on fu-c2a02-39-01 (a GPU node), in collision run 353941 (Pixel and Tracker subdetectors were out); no real collisions were ongoing.
The causes of crashes (2) and (3) might even be related. Relevant elog on (2) and (3): http://cmsonline.cern.ch/cms-elog/1143515
Regards, Swagata (HLT DOC during June 13-20).
A new Issue was created by @swagata87 Swagata Mukherjee.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign hlt, reconstruction
New categories assigned: hlt,reconstruction
@jpata, @missirol, @clacaputo, @Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign. Thanks
@swagata87 could you provide the full stack traces for the job that failed with the segmentation violations?
Three examples are pasted below:
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Sat Jun 18 18:31:53 CEST 2022
Thread 1 (Thread 0x7fde7a331540 (LWP 194148) "cmsRun"):
#0 0x00007fde7c1d3ddd in poll () from /lib64/libc.so.6
#1 0x00007fde70bf428f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007fde70bf4c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007fde70bf756b in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fde7c2366a6 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007fddb6e786ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl >, std::less, std::allocator > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map >, std::less, std::allocator > > > > const*&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007fddb6e76fab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007fde7ec2dd83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007fde7ec16eaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007fde7eb720e5 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007fde7eb723db in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007fde7eb749c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007fde7eab8c45 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007fde7d2c1b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7fddb5dd2300, this=0x7fde799da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fde799da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007fde7eae2ac8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007fde7eaed8fb in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007fde7d2b015b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#21 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040971c in main ()
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: EcalRawToDigi:hltEcalDigisLegacy
Module: none
A fatal system signal has occurred: segmentation violation
[ message truncated - showing only crashed thread ]
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Tue Jun 14 06:45:22 CEST 2022
Thread 1 (Thread 0x7f1d0ef42540 (LWP 251002) "cmsRun"):
#0 0x00007f1d10de4ddd in poll () from /lib64/libc.so.6
#1 0x00007f1d057f428f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007f1d057f4c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007f1d057f756b in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f1d10e45d29 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007f1c4b0876ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl >, std::less, std::allocator > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map >, std::less, std::allocator > > > > const*&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007f1c4b085fab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007f1d1383fd83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007f1d13828eaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007f1d137840e5 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007f1d137843db in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007f1d137869c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007f1d136cac45 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007f1d11ed3b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f1c4a9a1500, this=0x7f1d0e5da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f1d0e5da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007f1d136f4ac8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007f1d136ff8fb in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007f1d11ec215b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#21 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040971c in main ()
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: none
Module: HcalHitReconstructor:hltHoreco
Module: HcalHitReconstructor:hltHoreco
A fatal system signal has occurred: segmentation violation
[ message truncated - showing only crashed thread ]
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Tue Jun 14 06:45:23 CEST 2022
Thread 1 (Thread 0x7f6148fd5540 (LWP 250893) "cmsRun"):
#0 0x00007f614ae77ddd in poll () from /lib64/libc.so.6
#1 0x00007f613f1f228f in full_read.constprop () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007f613f1f2c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007f613f1f556b in sig_dostack_then_abort () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f614aed8cb5 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007f60850e76ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl >, std::less, std::allocator > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map >, std::less, std::allocator > > > > const*&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007f60850e5fab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007f614d8d4d83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007f614d8bdeaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007f614d8190e5 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(edm::Worker::runModule >(edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00007f614d8193db in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007f614d81b9c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007f614d75fc45 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00007f614bf5fb8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f60849ad400, this=0x7f61485da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f61485da880) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#16 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#17 0x00007f614d789ac8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x00007f614d7948fb in edm::EventProcessor::runToCompletion() () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#19 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007f614bf4e15b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#21 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040971c in main ()
Current Modules:
Module: SiPixelDigiErrorsSoAFromCUDA:hltSiPixelDigiErrorsSoA (crashed)
Module: CAHitNtupletCUDA:hltPixelTracksCPU
Module: none
Module: none
A fatal system signal has occurred: segmentation violation
[ message truncated - showing only crashed thread ]
The full list is here:
- Run 353932 https://swmukher.web.cern.ch/swmukher/run_353932_f3mon_logtable_2022-06-19T06%2046%2029.068Z.txt
- Run 353935 https://swmukher.web.cern.ch/swmukher/run_353935_f3mon_logtable_2022-06-19T06%2045%2025.179Z.txt
- Run 353941 https://swmukher.web.cern.ch/swmukher/run_353941_f3mon_logtable_2022-06-19T06%2043%2038.432Z.txt
Experts are working on providing a recipe to reproduce the crashes offline (tagging @mzarucki and @fwyzard). Once that is available, it can be posted here so that the Tracker DPG can have a look. The code that triggered the crashes is under Tracker DPG responsibility.
Dear tracker DPG (@cms-sw/trk-dpg-l2),
I managed to reproduce the GPU crash that happened during run 353941 on the machine gputest-milan-01.cms at Point 5.
I used CMSSW_12_3_5.
$CMSSW_RELEASE_BASE is /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/
General instructions to set up a CMSSW area on the online GPU nodes are here: https://twiki.cern.ch/twiki/bin/viewauth/CMS/TriggerDevelopmentWithGPUs
The HLT configuration file is: https://swmukher.web.cern.ch/swmukher/hlt_v5.py
The .raw file I ran on is: run353941_ls0019_index000175_fu-c2a02-39-04_pid194400.raw
This .raw file and all the other .raw files are available in the online machines under /store/error_stream.
I have copied one .raw here: https://swmukher.web.cern.ch/swmukher/run353941_ls0019_index000175_fu-c2a02-39-04_pid194400.raw
In case it is useful, the HLT configuration file was obtained with the following command:
https_proxy=http://cmsproxy.cms:3128/ hltConfigFromDB --adg --configName /cdaq/physics/firstCollisions22/v2.4/HLT/V5 > hlt_v5.py
Then, at the end, the following block was added:
process.EvFDaqDirector = cms.Service(
    "EvFDaqDirector",
    runNumber=cms.untracked.uint32(353941),  #maybe_replace_me
    baseDir=cms.untracked.string("tmp"),
    buBaseDir=cms.untracked.string(
        "/nfshome0/swmukher/check/CMSSW_12_3_5/src"  #replace_me
    ),
    useFileBroker=cms.untracked.bool(False),
    fileBrokerKeepAlive=cms.untracked.bool(True),
    fileBrokerPort=cms.untracked.string("8080"),
    fileBrokerUseLocalLock=cms.untracked.bool(True),
    fuLockPollInterval=cms.untracked.uint32(2000),
    requireTransfersPSet=cms.untracked.bool(False),
    selectedTransferMode=cms.untracked.string(""),
    mergingPset=cms.untracked.string(""),
    outputAdler32Recheck=cms.untracked.bool(False),
)
process.source.fileNames = cms.untracked.vstring("file:run353941_ls0019_index000175_fu-c2a02-39-04_pid194400.raw") #maybe_replace_me
process.source.fileListMode = True
cmsRun hlt_v5.py reproduces the crash.
It will create a tmp folder (set by baseDir in the block above).
To reproduce the crash again, I had to remove that tmp folder before running cmsRun again.
Let me know if something was unclear.
@swagata87 thank you for providing these instructions !
@tsusa you can use the online GPU machines to reproduce the issue:
ssh gpu-c2a02-39-01.cms
mkdir -p /data/$USER
cd /data/$USER
source /data/cmssw/cmsset_default.sh
cmsrel CMSSW_12_3_5
cd CMSSW_12_3_5
mkdir run
cd run
cp ~hltpro/error/hlt_error_run353941.py .
cmsRun hlt_error_run353941.py
In my test the problem did not happen every time, I had to run the job a few times before it crashed:
while cmsRun hlt_error_run353941.py; do clear; rm -rf output; done
It eventually crashed, though I'm not 100% sure if it was due to the same problem :-/
Yes, looks like the same crash:
#4 <signal handler called>
#5 0x00007fbbf5c9f6a6 in __memcpy_ssse3_back () from /lib64/libc.so.6
#6 0x00007fbb34ed06ba in edm::OrphanHandle<SiPixelErrorsSoA> edm::Event::emplaceImpl<SiPixelErrorsSoA, int, SiPixelErrorCompact const*, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&>(unsigned int, int&&, SiPixelErrorCompact const*&&, std::map<unsigned int, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, std::vector<SiPixelRawDataError, std::allocator<SiPixelRawDataError> > > > > const*&) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#7 0x00007fbb34ecefab in non-virtual thunk to SiPixelDigiErrorsSoAFromCUDA::produce(edm::Event&, edm::EventSetup const&) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/pluginEventFilterSiPixelRawToDigiPlugins.so
#8 0x00007fbbf8696d83 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007fbbf867feaf in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /data/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
...
As a guess, I think the problem is that an extremely large amount of data is requested to be copied, which leads to a memory overwrite into protected memory space. This is just based on what edm::Event::emplaceImpl is doing, which is basically calling
https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/DataFormats/SiPixelRawData/interface/SiPixelErrorsSoA.h#L13-L14
So cms::cuda::SimpleVector does not initialize any of its member data in its constructor
https://github.com/cms-sw/cmssw/blob/1c3608474c4821e4dafbd6f4defb5fa03121cb6d/HeterogeneousCore/CUDAUtilities/interface/SimpleVector.h#L16
If the first call to SiPixelDigiErrorsSoAFromCUDA::acquire hits this condition https://github.com/cms-sw/cmssw/blob/d573dd29448b13dea818ed927bba7a63814ba29a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc#L54-L55
then this call in produce https://github.com/cms-sw/cmssw/blob/d573dd29448b13dea818ed927bba7a63814ba29a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc#L73
will just copy a random number of bytes from a random memory address.
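To illustrate the suspected mechanism, here is a small self-contained sketch (all names are stand-ins for the real CMSSW classes, and the logic is deliberately simplified):

// Simplified sketch of the suspected failure mode: a trivially-constructed SoA
// "view" whose size/data members are never set when acquire() returns early, and
// which produce() later uses to build a copy of the errors.
#include <vector>

struct ErrorCompact {  // stand-in for SiPixelErrorCompact
  unsigned int rawId;
  unsigned int word;
};

template <typename T>
struct TrivialVector {  // stand-in for cms::cuda::SimpleVector
  int m_size;           // intentionally not initialized by the (trivial) constructor
  T* m_data;            // intentionally not initialized by the (trivial) constructor
  int size() const { return m_size; }
  T const* data() const { return m_data; }
};

struct DigiErrorsProducer {  // stand-in for SiPixelDigiErrorsSoAFromCUDA
  TrivialVector<ErrorCompact> error_;  // default-constructed: garbage size()/data()

  void acquire(bool hasErrors) {
    if (!hasErrors)
      return;  // early return, like the condition linked above: error_ is never filled in
    // ... otherwise error_ would be set here to a valid (size, data) pair ...
  }

  // mimics what emplace / the SiPixelErrorsSoA constructor does: build a
  // std::vector from the range [data, data + size)
  std::vector<ErrorCompact> produce() const {
    return std::vector<ErrorCompact>(error_.data(), error_.data() + error_.size());
  }
};

int main() {
  DigiErrorsProducer producer;
  producer.acquire(false);  // no pixel errors in this event: early return
  // calling producer.produce() at this point would read the uninitialized members
  return 0;
}

Depending on the garbage left in the size member, this pattern would give either a "cannot create std::vector larger than max_size()" exception (as in the type-3 crash) or a segmentation violation during the copy (as in type 2), which would be consistent with the two crash types being related.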
@Dr15Jones thanks for investigating the issue.
So cms::cuda::SimpleVector does not initialize any of its member data in its constructor
This is intended, because a SimpleVector is often allocated by the host in GPU memory, so the constructor cannot be run.
However, that does leave open the possibility of using uninitialised memory :-(
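For context, the usual allocation pattern looks roughly like the sketch below (illustrative names only; the helper imitates cms::cuda::make_SimpleVector, which also appears in the diff below): the host allocates raw memory, so no constructor ever runs on the object that ends up in device memory, and the members are filled in explicitly instead.

// Illustrative sketch of why the constructor stays trivial: the view object is
// often placed in raw GPU memory, where no host-side constructor can run.
#include <cuda_runtime.h>

template <typename T>
struct VecView {  // stand-in for cms::cuda::SimpleVector
  int m_size;
  int m_capacity;
  T* m_data;
};

template <typename T>
VecView<T> make_view(int capacity, T* data) {  // in the spirit of cms::cuda::make_SimpleVector
  VecView<T> v;
  v.m_size = 0;
  v.m_capacity = capacity;
  v.m_data = data;
  return v;
}

int main() {
  float* d_buffer = nullptr;
  VecView<float>* d_view = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&d_buffer), 1024 * sizeof(float));
  cudaMalloc(reinterpret_cast<void**>(&d_view), sizeof(VecView<float>));  // raw bytes: no constructor runs here
  VecView<float> h_view = make_view(1024, d_buffer);                      // initialize the members on the host...
  cudaMemcpy(d_view, &h_view, sizeof(h_view), cudaMemcpyHostToDevice);    // ...then copy the bits to the device
  // kernels can now use *d_view; a host-side object that is only ever
  // default-constructed, on the other hand, keeps whatever garbage was in memory.
  cudaFree(d_view);
  cudaFree(d_buffer);
  return 0;
}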
A minimal fix could be
diff --git a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
index 4037b4d5061..554f1425cef 100644
--- a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
+++ b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
@@ -28,7 +28,7 @@ private:
edm::EDPutTokenT<SiPixelErrorsSoA> digiErrorPutToken_;
cms::cuda::host::unique_ptr<SiPixelErrorCompact[]> data_;
- cms::cuda::SimpleVector<SiPixelErrorCompact> error_;
+ cms::cuda::SimpleVector<SiPixelErrorCompact> error_ = cms::cuda::make_SimpleVector<SiPixelErrorCompact>(0, nullptr);
const SiPixelFormatterErrors* formatterErrors_ = nullptr;
};
With it I have been able to run over 20 times on the same input as before without triggering any errors.
PRs with this fix:
- CMSSW_12_5_X: https://github.com/cms-sw/cmssw/pull/38476
- CMSSW_12_4_X: https://github.com/cms-sw/cmssw/pull/38477
- CMSSW_12_3_X: https://github.com/cms-sw/cmssw/pull/38478
Hm, looks like I am late to the party... but if it's any help, here are instructions for the error seen in Run 353744 (AFAICT you have been testing with Run 353941). Running on the Hilton this time:
Input file: file:/nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/run353744_ls0009.root
CMSSW: CMSSW_12_3_5
GT: 123X_dataRun3_HLT_v7
Menu: /cdaq/physics/firstCollisions22/v2.4/HLT/V2
I also see the same problem; it crashes only every once in a while. It's probably the same bug, but I add it here for completeness.
I also have here the other crash, this one is fully reproducible:
Input file: file:/nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/run353709_ls0085.root
CMSSW: CMSSW_12_3_5
GT: 123X_dataRun3_HLT_v7
Menu: /cdaq/physics/firstCollisions22/v2.4/HLT/V2
It will always crash on the 52nd event, Run 353709, Event 76567528, LumiSection 85, with the message:
cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_5-slc7_amd64_gcc10/build/CMSSW_12_3_5-build/tmp/BUILDROOT/32f4c0d8c5d5ff0fb0f1b58023d4424d/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_5/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray<short unsigned int, 48> >; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc<unsigned int, 32769, 163840>; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray<unsigned int, 6>]: Assertion `tmpNtuplet.size() <= 4' failed.
PS: it's not needed to run on the Hilton at all; I was running in offline-like mode.
@trtomei could you clarify
- what was the original error stream file? Is it /store/error_stream/run353709/run353709_ls0085_index000141_fu-c2a05-35-01_pid90386.raw ?
- are you running with or without GPUs?
- does the error happen consistently, or only randomly?
Running online, I have not been able to reproduce the error using the .raw input file, either with or without GPUs.
@fwyzard To clarify:
- Original error stream file was indeed: /store/error_stream/run353709/run353709_ls0085_index000141_fu-c2a05-35-01_pid90386.raw
- I understand I am using GPUs, as I am using the Skylake machine (hilton-c2e36-35-04) with the process.options = cms.untracked.PSet( accelerators = cms.untracked.vstring( '*' ) ) option, and I see the lines
%MSG-i CUDAService: (NoModuleName) 26-Jun-2022 12:24:47 pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
%MSG
- For me, the error happens consistently.
Maybe sit together with me tomorrow and we can solve this.
@swagata87 @trtomei
Is this issue still relevant?
Is this issue still relevant?
Actually, yesterday we had a crash which looks like the type-1 crash mentioned in the issue description.
Here is some relevant information on yesterday's crash:
Run number: 360224
StartTime: Oct 12 2022, 02:52
EndTime: Oct 12 2022, 04:36
HLT Menu: /cdaq/physics/Run2022/2e34/v1.4.1/HLT/V1
CMSSW_12_4_9
Crash happened in: fu-c2b05-23-01
The error stream file has been copied to the Hilton. So I think FOG will check whether it is reproducible or not, and will follow up.
cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_9-el8_amd64_gcc10/build/CMSSW_12_4_9-build/tmp/BUILDROOT/dc6747a684df926e1faea7ef7c301e1a/opt/cmssw/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_9/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:293: void GPUCACell::find_ntuplets(const Hits&, GPUCACell*, GPUCACell::CellTracksVector&, GPUCACell::HitContainer&, cms::cuda::AtomicPairCounter&, GPUCACell::Quality*, GPUCACell::TmpTuple&, unsigned int, bool) const [with int DEPTH = 2; GPUCACell::Hits = TrackingRecHit2DSOAView; GPUCACell::CellTracksVector = cms::cuda::SimpleVector<cms::cuda::VecArray; GPUCACell::HitContainer = cms::cuda::OneToManyAssoc; GPUCACell::Quality = pixelTrack::Quality; GPUCACell::TmpTuple = cms::cuda::VecArray]: Assertion `tmpNtuplet.size() <= 4' failed.
The files in ROOT format and the HLT configuration are in: /afs/cern.ch/user/t/tomei/public/issue38453
This is reproducible in the Hilton with GPU:
%MSG-i ThreadStreamSetup: (NoModuleName) 14-Oct-2022 02:05:46 pre-events
setting # threads 4
setting # streams 4
%MSG
%MSG-i CUDAService: (NoModuleName) 14-Oct-2022 02:05:47 pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
@cms-sw/tracking-pog-l2
In this issue, one HLT crash is not yet solved, and I would say we need help from tracking experts in order to find a fix.
The crash is reproducible offline (see https://github.com/cms-sw/cmssw/issues/38453#issuecomment-1278364360); it comes from the (HLT) pixel reconstruction, and it only happens on CPU, not on GPU (from what we have seen so far).
Removing some assert calls, one can find a tmpNtuplet with size=5, but that's as far as my insight goes.
https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L293
I have a vague recollection of a comment from @VinInn saying that we should simply remove the assert...
I think now it's OK to have ntuplets with 5 hits, so an alternative could be to change the condition to <= 5 ?
At least, removing the asserts [1,2] does not lead to any other crashes, fwiw.
And just for my understanding: can it be expected that, for the same event, we do not see an ntuplet with size=5 on GPU?
[1] https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L293 [2] https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L334
It does not happen on GPU because the asserts are removed. This is a sort of sextuplet candidate (rare? impossible?). Anyhow, if it does not cause havoc on GPU, I would either change the condition following @fwyzard's advice or just remove the assert. Mind the assert at the end of the function as well.
It does not happen on GPU because the asserts are removed.
Okay, thanks. But still, I tried to just print the ntuplet size while running on GPU, and I didn't see a size=5...
I will try to have a look on CPU at how this track looks (and if possible compare to the one on GPU).
Thanks for having a look.
I checked that (unsurprisingly) the HLT runs fine on these 'error events', for both CPU and GPU, after changing the 4 to a 5 in the asserts, so in the meantime I'll open PRs with that change to gain time.
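Concretely, the change amounts to relaxing the condition in the two asserts linked above, i.e. (a sketch of the idea, not the exact diff in the PRs):
- assert(tmpNtuplet.size() <= 4);
+ assert(tmpNtuplet.size() <= 5);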
The PRs with the 4 -> 5 change are #39780 (12_6_X), #39781 (12_5_X), and #39782 (12_4_X).
@cms-sw/hlt-l2 (now speaking with the ORM hat, in order to better coordinate the creation of the next patch releases):
- will this issue be fully solved after the merge of the backports of https://github.com/cms-sw/cmssw/pull/39780 ?
- is there any other outstanding HLT crash with recent data that still needs to be followed up (outside of this ticket)?
will this issue be fully solved after the merge of the backports of https://github.com/cms-sw/cmssw/pull/39780 ?
Yes, that is my understanding.
is there any other outstanding HLT crash with recent data that still needs to be followed up (outside of this ticket)?
There are two more issues, but those crashes have been rare: #39568, which ECAL has promised to look into, and #38651, which might somehow have been a glitch (seen only once).
FOG (@trtomei) can tell us if there are any new online crashes without a CMSSW issue.