
HLT crashes in Run 380399


Crashes observed in collisions Run 380399. Stack traces:

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:33:54 CEST 2024
Thread 1 (Thread 0x7f9b37c4a640 (LWP 552137) "cmsRun"):
#0 0x00007f9b388160e1 in poll () from /lib64/libc.so.6
#1 0x00007f9b2eb792ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007f9b2eb2cafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007f9b2eb2d460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f9b38ac882b in __lll_lock_wait () from /lib64/libpthread.so.0
#6 0x00007f9b38ac1ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#7 0x00007f9ad36b5266 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt::tryReuseCachedBlock(cms::alpakatools::CachingAllocator >::BlockDescriptor&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#8 0x00007f9ad36b641f in alpaka_cuda_async::EcalRawToDigiPortable::produce(alpaka_cuda_async::device::Event&, alpaka_cuda_async::device::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#9 0x00007f9ad36b1f73 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#10 0x00007f9b3b27a47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007f9b3b25ec2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007f9b3b1e6f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007f9b3b1e74c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#14 0x00007f9b3b157bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007f9b39992281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#18 0x00007f9b3b16841b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007f9b3b17224d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#20 0x00007f9b3b1727b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#21 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007f9b3997e9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#23 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:25:57 CEST 2024
Thread 1 (Thread 0x7fcd588eb640 (LWP 2367851) "cmsRun"):
#0 0x00007fcd594b70e1 in poll () from /lib64/libc.so.6
#1 0x00007fcd4f8092ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007fcd4f7bcafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007fcd4f7bd460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fcd59765020 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#6 0x00007fcd3ae57e86 in ?? () from /lib64/libcuda.so.1
#7 0x00007fcd3ab754d7 in ?? () from /lib64/libcuda.so.1
#8 0x00007fcd3ac46009 in ?? () from /lib64/libcuda.so.1
#9 0x00007fcd3c7b41b7 in ?? () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#10 0x00007fcd3c7f0490 in cudaEventQuery () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#11 0x00007fcd464a00a5 in cms::alpakatools::EventCache<alpaka::EventUniformCudaHipRt::get(alpaka::DevUniformCudaHipRt) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#12 0x00007fcd464a436b in alpaka_cuda_async::detail::EDMetadataSentry::EDMetadataSentry(edm::StreamID) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#13 0x00007fccf4392ed5 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#14 0x00007fcd5bf1b47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007fcd5beffc2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fcd5be87f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fcd5be884c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fcd5bdf8bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007fcd5a633281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#20 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#21 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#22 0x00007fcd5be0941b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#23 0x00007fcd5be1324d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#24 0x00007fcd5be137b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#25 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#26 0x00007fcd5a61f9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#27 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#28 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

We tried to reproduce it with the following recipe, but the crash did not reproduce.

#!/bin/bash -ex

# CMSSW_14_0_6_MULTIARCHS

hltGetConfiguration run:380399 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000130_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000145_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000225_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000211_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000280_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000281_fu-c2b01-28-01_pid552137.root'\
> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Frame #8 in the first stack trace seems to point to the alpaka_cuda_async::EcalRawToDigiPortable::produce() method.

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

Best regards, Thiago (for FOG)

trtomei avatar May 07 '24 16:05 trtomei

cms-bot internal usage

cmsbuild avatar May 07 '24 16:05 cmsbuild

A new Issue was created by @trtomei.

@makortel, @smuzaffar, @sextonkennedy, @rappoccio, @Dr15Jones, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar May 07 '24 16:05 cmsbuild

Nota bene: the EcalRawToDigi may be a red herring; the original ELOG here mentions something completely different: http://cmsonline.cern.ch/cms-elog/1213820

The stack traces that I posted above are the ones I got from F3mon, though.

trtomei avatar May 07 '24 16:05 trtomei

For reference, here are the log files of the two jobs:

old_hlt_run380399_pid552137.log
old_hlt_run380399_pid2367851.log

As reported in the elog above, they both contain the following.

src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77:
void alpaka_cuda_async::vertexFinder::fitVertices(const TAcc &, reco::ZVertexLayout<128UL,
false>::ViewTemplateFreeParams<128UL, false, true, false> &, vertexFinder::PixelVertexWSSoALayout<128UL,
false>::ViewTemplateFreeParams<128UL, false, true, false> &, float) [with TAcc =
alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>]:
block: [0,0,0], thread: [24,0,0] Assertion `wv[i] > 0.f` failed.

missirol avatar May 07 '24 17:05 missirol

assign heterogeneous, reconstruction

makortel avatar May 07 '24 17:05 makortel

New categories assigned: heterogeneous,reconstruction

@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar May 07 '24 17:05 cmsbuild

So in both cases the assertion in https://github.com/cms-sw/cmssw/blob/8391245a8a5febd6fdaec1e66395b67f9d2dd3db/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h#L76-L80 fails, after which the processes get stuck until they get killed 6-7 hours later. (And because of the external termination, the stack traces of only "the crashed thread" are meaningless on their own.)
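
For illustration, here is a minimal standalone sketch of the invariant behind that assertion (not the actual CMSSW code; the zv/wv names just follow the log message): each vertex must accumulate a strictly positive total track weight before its fitted z position is normalized.

// Minimal sketch of the invariant behind `Assertion wv[i] > 0.f failed`.
// Not the actual fitVertices.h code; zv/wv naming follows the log message.
#include <cassert>
#include <cstdio>

int main() {
  constexpr int nVertices = 3;
  float zv[nVertices] = {2.4f, -0.2f, 9.3f};  // weighted sums of track z positions
  float wv[nVertices] = {2.0f, 0.5f, 3.0f};   // per-vertex sums of track weights

  for (int i = 0; i < nVertices; ++i) {
    // A vertex that ends up with no (or zero-weight) tracks would fail here;
    // without the assert, the division below would silently produce inf/NaN.
    assert(wv[i] > 0.f);
    zv[i] /= wv[i];  // normalize to get the fitted z position
  }
  for (int i = 0; i < nVertices; ++i)
    std::printf("vertex %d: z = %f\n", i, zv[i]);
  return 0;
}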

makortel avatar May 07 '24 18:05 makortel

The state of cmsRun seems similar to https://github.com/cms-sw/cmssw/issues/44769#issuecomment-2066613569: all the TBB threads are stuck acquiring locks (old_hlt_run380399_pid2367851.log has two TBB threads that don't have any work to do), and the CUDA threads are doing something and presumably holding the lock.
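
As a rough illustration of that state (a hedged sketch, not the actual CMSSW or CUDA locking):

// Illustration only, not CMSSW code: one thread ("the CUDA side") holds a
// mutex indefinitely, so every other thread ("the TBB workers") that tries
// to acquire the same lock blocks forever.
#include <chrono>
#include <mutex>
#include <thread>

std::mutex m;

int main() {
  std::thread holder([] {
    std::lock_guard<std::mutex> lk(m);                    // acquire the lock...
    std::this_thread::sleep_for(std::chrono::hours(24));  // ...and never release it
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(100));  // let holder win the race
  std::thread waiter([] {
    std::lock_guard<std::mutex> lk(m);  // blocks here, like the stuck TBB threads
  });
  holder.join();  // the process hangs until killed externally, as in the HLT jobs
  waiter.join();
  return 0;
}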

makortel avatar May 07 '24 18:05 makortel

FYI @AdrianoDee

makortel avatar May 07 '24 18:05 makortel

Nota bene: the EcalRawToDigi may be a red herring; the original ELOG here mentions something completely different:

@trtomei, all circumstantial evidence points to vertex reconstruction. Can you edit the title to remove the EcalRawToDigi mention?

mmusich avatar May 07 '24 18:05 mmusich

For emphasis: external termination request means it timed out, likely due to a deadlock. The timeout signal will be delivered to thread 1, but to diagnose the deadlock we need to see the stack traces of all the threads.

dan131riley avatar May 11 '24 13:05 dan131riley

Looks like this happened again. One LS (run 380624, LS 411) could not be "closed" because one process was stuck and had to be terminated manually hours later. The full log (see below) shows again

src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77: [..]
 block: [0,0,0], thread: [21,0,0] Assertion `wv[i] > 0.f` failed.

similarly to https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2098979035.

http://cmsonline.cern.ch/cms-elog/1214996

old_hlt_run380624_pid2375140.log

missirol avatar May 12 '24 10:05 missirol

There should be no asserts on the GPU in the first place. In my opinion this is something that was already seen in the past; in standard CUDA in CMSSW we never had the asserts enabled.

VinInn avatar May 12 '24 10:05 VinInn

Maybe I am missing something:

https://github.com/cms-sw/cmssw/blob/22f751f0592542ae94a797e3fc22294ddc4626cf/RecoTracker/PixelVertexFinding/plugins/gpuFitVertices.h#L70

mmusich avatar May 12 '24 10:05 mmusich

The asserts in CUDA were compiled out because they are slow. There is another thread on the subject. I understood that (by mistake) they are back since ALPAKA.

Of course the assert is still there on CPU, but it will not print block: [0,0,0], thread: [21,0,0], so this is coming from CUDA, and in the past it would not have triggered!

VinInn avatar May 12 '24 10:05 VinInn

Currently we do have asserts in Alpaka code, because the code is new and IMHO it's better to get meaningful errors with a well-defined location than random failures.

Within alpaka code:

  • assert is enabled as in host code, and so is enabled in CMSSW;
  • ALPAKA_ASSERT should behave in the same way as assert - but I don't think we use them;
  • ALPAKA_ASSERT_ACC can be conditionally enabled on CPU and disabled on GPU, but for the time being we keep them enabled also on GPU.

I understood that (by mistake) they are back since ALPAKA.

The plan is to change all assert(...) in alpaka code to ALPAKA_ASSERT_ACC(...), then measure the impact of enabling/disabling them. I'd probably keep them enabled for a 1%-2% slowdown, and disable them for a 5% slowdown.
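
As a sketch of what that change would look like in a device function (illustrative names, not the actual fitVertices.h code, and assuming the macro semantics described above):

// Illustrative alpaka device function; hypothetical names, not CMSSW code.
// Assumes ALPAKA_ASSERT_ACC behaves as described above: same as assert on
// host, and can later be compiled out on GPU only.
#include <alpaka/alpaka.hpp>

template <typename TAcc>
ALPAKA_FN_ACC void normalizeVertex(TAcc const& acc, float* zv, float const* wv, int i) {
  // before the planned change:
  //   assert(wv[i] > 0.f);          // always enabled, also in device code
  ALPAKA_ASSERT_ACC(wv[i] > 0.f);    // after: can be disabled on GPU only
  zv[i] /= wv[i];
}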

The fact that asserts may be causing deadlocks is a different matter: in principle, a failed assert in GPU code should cause an asynchronous error, and the next CUDA runtime function should detect it and fail.
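
For reference, a minimal standalone CUDA sketch of that expected behaviour (not CMSSW code): the failed device-side assert should surface as cudaErrorAssert at the next runtime call, instead of hanging the process.

// Standalone CUDA sketch (illustration only): a device-side assert failure
// is reported asynchronously by the next CUDA runtime call, after which
// the context is unusable.
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void alwaysFails() { assert(false); }

int main() {
  alwaysFails<<<1, 1>>>();
  cudaError_t err = cudaDeviceSynchronize();  // expected: cudaErrorAssert
  std::printf("cudaDeviceSynchronize: %s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 1;
}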

Do we have a way to reproduce the problem offline?

fwyzard avatar May 12 '24 15:05 fwyzard

Here is the old issue on CPU: #37820

VinInn avatar May 12 '24 16:05 VinInn

Did anybody reproduce the assert offline?

VinInn avatar May 12 '24 16:05 VinInn

This crash seems to appear again in collision run 381067. Attaching the error log to this comment: old_hlt_run381067_pid1000282.log

We were unable to reproduce the error after running the HLT config again on the error stream files.

gparida avatar May 24 '24 06:05 gparida

I do not see any sign of the assert in the log from run 381067: can somebody post a script to reproduce it offline?

VinInn avatar May 24 '24 07:05 VinInn

I do not see any sign of the assert in the log from run 381067:

it's at line 36196 in the log attached:

src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77: void alpaka_cuda_async::vertexFinder::fitVertices(const TAcc &, reco::ZVertexLayout<128UL, false>::ViewTemplateFreeParams<128UL, false, true, false> &, vertexFinder::PixelVertexWSSoALayout<128UL, false>::ViewTemplateFreeParams<128UL, false, true, false> &, float) [with TAcc = alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>]: block: [0,0,0], thread: [24,0,0] Assertion `wv[i] > 0.f` failed.

mmusich avatar May 24 '24 08:05 mmusich

can somebody post a script to reproduce it offline?

@gparida if you tried to reproduce offline, can you post the script you used here?

mmusich avatar May 24 '24 08:05 mmusich

Note that (also) in this case, the job was stuck not processing data for a while, and was eventually killed.

fwyzard avatar May 24 '24 08:05 fwyzard

I tried reproducing it offline with CMSSW_14_0_7_MULTIARCHS, unsuccessfully. I used the following:

#!/bin/bash -ex

# CMSSW_14_0_7_MULTIARCHS

hltGetConfiguration run:381067 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'/store/group/tsg/FOG/error_stream_root/run381067/run381067_ls0164_index000095_fu-c2b04-28-01_pid1000282.root,'\
'/store/group/tsg/FOG/error_stream_root/run381067/run381067_ls0164_index000109_fu-c2b04-28-01_pid1000282.root' > hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Also, I tried to rerun the job with

process.options.numberOfThreads = 32
process.options.numberOfStreams = 32

gparida avatar May 24 '24 08:05 gparida

Attaching the error log to this comment old_hlt_run381067_pid1000282.log

Umm, the log indicates the job had 992 threads spawned by OpenMP, all in state

Thread 1036 (Thread 0x7fb6e37e2700 (LWP 1012080) "cmsRun"):
#0  futex_wait (val=11872, addr=0x7fbae8882b44) at ../../../libgomp/config/linux/x86/futex.h:97
#1  do_wait (val=11872, addr=0x7fbae8882b44) at ../../../libgomp/config/linux/wait.h:67
#2  gomp_barrier_wait_end (bar=0x7fbae8882b40, state=11872) at ../../../libgomp/config/linux/bar.c:48
#3  0x00007fbb369263e8 in gomp_simple_barrier_wait (bar=0x7fbae8882b40) at ../../../libgomp/config/posix/simple-bar.h:60
#4  gomp_thread_start (xdata=<optimized out>) at ../../../libgomp/team.c:133
#5  0x00007fbc4c0191ca in start_thread () from /lib64/libpthread.so.0
#6  0x00007fbc4bc84e73 in clone () from /lib64/libc.so.6

What is bringing in OpenMP threading?

makortel avatar May 24 '24 13:05 makortel

First library in the list!

ldd RivetGeneratorInterfaceRivetInterface.so | grep gomp
libgomp.so.1 => /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02838/slc7_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64/libgomp.so.1 (0x00007f91271d0000)

VinInn avatar May 24 '24 15:05 VinInn

More:

libPhysicsToolsPatAlgos.so
libRecoEgammaPhotonIdentification.so
pluginGeneratorInterfaceRivetInterface_plugins.so
pluginPhysicsToolsNanoAODPlugins.so
pluginPhysicsToolsNanoAODRNuplePlugins.so
pluginPhysicsToolsPatAlgos_plugins.so
pluginPhysicsToolsPatExamples_plugins.so
pluginPhysicsToolsScoutingPlugins.so

and a few more

VinInn avatar May 24 '24 15:05 VinInn

Maybe we should set export OMP_NUM_THREADS=1.

VinInn avatar May 24 '24 15:05 VinInn

So, using gdb and info threads, it looks to me that each CMSSW thread spawns ncpu-1 OMP threads, all sitting in

 103  Thread 0x7fd468fb2700 (LWP 107236) "cmsRun"          futex_wait (val=32, addr=0x7fd5051e5dc4) at ../../../libgomp/config/linux/x86/futex.h:97

One can set a breakpoint on gomp_thread_start and see each of them starting.

VinInn avatar May 24 '24 15:05 VinInn

export OMP_NUM_THREADS=1 will mean gomp_thread_start is never called, as gomp uses the main thread as thread 0...
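
A trivial standalone check of that behaviour (illustration only, not from this thread):

// Trivial OpenMP program: with OMP_NUM_THREADS=1 the parallel region runs
// entirely on the main thread, so libgomp never starts a worker via
// gomp_thread_start.
#include <cstdio>
#include <omp.h>

int main() {
#pragma omp parallel
  {
#pragma omp single
    std::printf("OpenMP team size: %d\n", omp_get_num_threads());
  }
  return 0;
}

Built with g++ -fopenmp, running it as OMP_NUM_THREADS=1 ./a.out should report a team size of 1 with no extra threads started, while leaving the variable unset reports one thread per core.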

VinInn avatar May 24 '24 15:05 VinInn