cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

DQM Harvest jobs getting stuck at T0

Open germanfgv opened this issue 2 years ago • 55 comments

Tests of CMSSW_12_3_3_patch1 at T0 show some Express jobs of task ExpressMergewrite_StreamExpressCosmics_DQMIOEndOfRunDQMHarvestMerged taking more that 20h to complete.

Here we have the logs of one job that finished after 22h of execution:

%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  21-May-2022 17:33:51 CEST Run: 349840
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  21-May-2022 17:33:51 CEST Run: 349840
 Process name 'HLT' not found in registry!
%MSG
[2022-05-22 10:34:32.107229 +0200][Error  ][AsyncSock         ][  549] [p06253947b90717.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 11:33:25.164556 +0200][Error  ][AsyncSock         ][  549] [p06636710g40375.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 12:27:54.115905 +0200][Error  ][AsyncSock         ][  549] [p06253947g81422.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:23:41.509030 +0200][Error  ][AsyncSock         ][  549] [p06253947q54042.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:53:39.441991 +0200][Error  ][AsyncSock         ][  549] [st-096-gg50030g.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 14:56:59.860605 +0200][Error  ][AsyncSock         ][  549] [st-096-dd904d00.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:06:20.567969 +0200][Error  ][AsyncSock         ][  549] [st-048-388af89c.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:12:21.511984 +0200][Error  ][AsyncSock         ][  549] [p06636710n60578.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:20:03.375578 +0200][Error  ][AsyncSock         ][  549] [p06636710p75593.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
%MSG-e DQMGenericClient:  DQMGenericClient:hltMuonEfficiencies@endRun  22-May-2022 15:40:50 CEST End Run: 349840
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon !!!
%MSG
%MSG-e DQMGenericClient:  DQMGenericClient:hltMuonEfficienciesMR@endRun  22-May-2022 15:40:50 CEST End Run: 349840
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon/MR !!!
%MSG

The logs report nothing for 20h, after which there a few network error messages apparently related to XRootD and then the job continues to finish in a few minutes. It is worth mentioning that this jobs had already been tried, getting stuck in a similar way. That execution was manually interrupted after the first few network error messages showed up.

Full logs and PSet can be found here:

/afs/cern.ch/user/c/cmst0/public/LongExecution/CollisionMay2022/job_41469/SuccessfulExec/job/WMTaskSpace/cmsRun1

More information on the issue can be found here: CMS Talk post

germanfgv avatar May 23 '22 13:05 germanfgv

A new Issue was created by @germanfgv Germn Felipe Giraldo Villa.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar May 23 '22 13:05 cmsbuild

assign core

makortel avatar May 23 '22 13:05 makortel

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar May 23 '22 13:05 cmsbuild

One of this jobs is currently running for 24h. We have access to the node and here is the las few lines of the current cmsRun log:

%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  22-May-2022 22:56:19 CEST Run: 349840
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider:   EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun  22-May-2022 22:56:19 CEST Run: 349840
 Process name 'HLT' not found in registry!
%MSG
[2022-05-23 15:41:21.710046 +0200][Error  ][AsyncSock         ][  549] [p06636710t83205.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: interrupted system call
[2022-05-23 15:41:21.710514 +0200][Error  ][AsyncSock         ][  549] [p06636710k01172.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: interrupted system call

As you can see, there is nothing for 17h and then network errors. The current stack trace for the cmsRun process is the following:

Thread 7 (Thread 0x2b5990f33700 (LWP 198686) "cmsRun"):
#0  0x00002b590e587b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1  0x00002b590e587bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2  0x00002b590e587c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3  0x00002b591434a616 in XrdCl::JobManager::RunJobs() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#4  0x00002b591434a6d9 in RunRunnerThread () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#5  0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#6  0x00002b590e894b0d in clone () from target:/lib64/libc.so.6

Thread 6 (Thread 0x2b5990d32700 (LWP 198685) "cmsRun"):
#0  0x00002b590e587b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1  0x00002b590e587bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2  0x00002b590e587c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3  0x00002b591434a616 in XrdCl::JobManager::RunJobs() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#4  0x00002b591434a6d9 in RunRunnerThread () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#5  0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#6  0x00002b590e894b0d in clone () from target:/lib64/libc.so.6

Thread 5 (Thread 0x2b5990b31700 (LWP 198684) "cmsRun"):
#0  0x00002b590e587b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1  0x00002b590e587bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2  0x00002b590e587c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3  0x00002b591434a616 in XrdCl::JobManager::RunJobs() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#4  0x00002b591434a6d9 in RunRunnerThread () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#5  0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#6  0x00002b590e894b0d in clone () from target:/lib64/libc.so.6

Thread 4 (Thread 0x2b5990930700 (LWP 198683) "cmsRun"):
#0  0x00002b590e588e9d in nanosleep () from target:/lib64/libpthread.so.0
#1  0x00002b591409008f in XrdSysTimer::Wait(int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdUtils.so.3
#2  0x00002b59142b9fa3 in XrdCl::TaskManager::RunTasks() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#3  0x00002b59142ba0f9 in RunRunnerThread () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#4  0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00002b590e894b0d in clone () from target:/lib64/libc.so.6

Thread 3 (Thread 0x2b599072f700 (LWP 198682) "cmsRun"):
#0  0x00002b590e8950e3 in epoll_wait () from target:/lib64/libc.so.6
#1  0x00002b591408a4b2 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdUtils.so.3
#2  0x00002b5914086c3d in XrdSys::IOEvents::BootStrap::Start(void*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdUtils.so.3
#3  0x00002b591408f798 in XrdSysThread_Xeq () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdUtils.so.3
#4  0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00002b590e894b0d in clone () from target:/lib64/libc.so.6

Thread 2 (Thread 0x2b5933134700 (LWP 198559) "cmsRun"):
#0  0x00002b590e58875d in read () from target:/lib64/libpthread.so.0
#1  0x00002b591499822f in full_read.constprop () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2  0x00002b5914998b0f in edm::service::InitRootHandlers::stacktraceHelperThread() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  0x00002b590e15df90 in std::execute_native_thread_routine (__p=0x2b5920665640) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00002b590e894b0d in clone () from target:/lib64/libc.so.6

Thread 1 (Thread 0x2b5910918b00 (LWP 198351) "cmsRun"):
#0  0x00002b590e8eac70 in __memcpy_ssse3_back () from target:/lib64/libc.so.6
#1  0x00002b598b588fd9 in void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .isra.0] () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#2  0x00002b598b58d481 in GEMDQMHarvester::createInactiveChannelFracHist(edm::Service<dqm::legacy::DQMStore>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#3  0x00002b598b58f679 in GEMDQMHarvester::drawSummaryHistogram(edm::Service<dqm::legacy::DQMStore>&, int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#4  0x00002b598b58fb4f in GEMDQMHarvester::dqmEndLuminosityBlock(dqm::implementation::IBooker&, dqm::implementation::IGetter&, edm::LuminosityBlock const&, edm::EventSetup const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#5  0x00002b598b592f8d in DQMEDHarvester::endLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#6  0x00002b590c218bad in edm::one::EDProducerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#7  0x00002b590c201c90 in edm::WorkerT<edm::one::EDProducerBase>::implDoEnd(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#8  0x00002b590c10b194 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9  0x00002b590c10b547 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00002b590c10bad8 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00002b590c10bbc1 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunMo--Type <RET> for more, q to quit, c to continue without paging--
duleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00002b590befc075 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so
#13 0x00002b590d989b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x2b59901fa400, this=0x2b59115c7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x2b59115c7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#15 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#16 0x00002b590c0d49d8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#17 0x00002b590c0df82b in edm::EventProcessor::runToCompletion() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#19 0x00002b590d97815b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#20 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#21 0x000000000040971c in main ()

germanfgv avatar May 23 '22 13:05 germanfgv

assign dqm

Dr15Jones avatar May 23 '22 14:05 Dr15Jones

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar May 23 '22 14:05 cmsbuild

In a private chat @germanfgv showed me a stack trace from the same job as in https://github.com/cms-sw/cmssw/issues/38044#issuecomment-1134715457 but 20 minutes earlier (showing only thread 1)

Thread 1 (Thread 0x2b5910918b00 (LWP 198351) "cmsRun"):
#0  0x00002b590bd3d96c in _dl_update_slotinfo () from target:/lib64/ld-linux-x86-64.so.2
#1  0x00002b590bd2c098 in update_get_addr () from target:/lib64/ld-linux-x86-64.so.2
#2  0x00002b590bd429f8 in __tls_get_addr () from target:/lib64/ld-linux-x86-64.so.2
#3  0x00002b590d030391 in Format(char const*, __va_list_tag*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libCore.so
#4  0x00002b590d03523c in Form(char const*, ...) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libCore.so
#5  0x00002b598b58d37a in GEMDQMHarvester::createInactiveChannelFracHist(edm::Service<dqm::legacy::DQMStore>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#6  0x00002b598b58f679 in GEMDQMHarvester::drawSummaryHistogram(edm::Service<dqm::legacy::DQMStore>&, int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#7  0x00002b598b58fb4f in GEMDQMHarvester::dqmEndLuminosityBlock(dqm::implementation::IBooker&, dqm::implementation::IGetter&, edm::LuminosityBlock const&, edm::EventSetup const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#8  0x00002b598b592f8d in DQMEDHarvester::endLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#9  0x00002b590c218bad in edm::one::EDProducerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00002b590c201c90 in edm::WorkerT<edm::one::EDProducerBase>::implDoEnd(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00002b590c10b194 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00002b590c10b547 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00002b590c10bad8 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlock--Type <RET> for more, q to quit, c to continue without paging--
Principal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00002b590c10bbc1 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#15 0x00002b590befc075 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so
#16 0x00002b590d989b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x2b59901fa400, this=0x2b59115c7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#17 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x2b59115c7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#18 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#19 0x00002b590c0d49d8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#20 0x00002b590c0df82b in edm::EventProcessor::runToCompletion() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#21 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00002b590d97815b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#23 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040971c in main ()

It seems to me that the job is making progress, but GEMDQMHarvester::drawSummaryHistogram() takes a lot of time.

makortel avatar May 23 '22 14:05 makortel

The last traceback posted appear to be here

https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L685-L687

Dr15Jones avatar May 23 '22 14:05 Dr15Jones

That part of the code was recently modified in https://github.com/cms-sw/cmssw/pull/37852, entered first in CMSSW_12_3_3 Maybe @quark2 can help with debugging it

perrotta avatar May 23 '22 14:05 perrotta

On an (likely) unrelated note, I noticed this module is using DQMStore explicitly https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L174-L181 Shouldn't DQMEDHarvesters use the IBooker and IGetter interfaces instead?

makortel avatar May 23 '22 14:05 makortel

Maybe this won't help, but setBinLabel reminded me that there was an HLT-related PR that discussed how those calls are very slow (and in some cases, maybe not this case, setBinLabel was being called unnecessarily): https://github.com/cms-sw/cmssw/pull/36313.

missirol avatar May 23 '22 15:05 missirol

@Dr15Jones @makortel @quark2 anything new to report here? Do you need any extra information form our side?

germanfgv avatar May 23 '22 16:05 germanfgv

I ran the job myself with IgProf over 4 files (the original PSet.py had 237) files, the full profile is here https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_06_perf

The GEMDQMHarvester::dqmEndLuminosityBlock() takes ~74 % of the time spent in edm::one::EDProducerBase::doEndLuminosityBlock() https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_06_perf/18

Majority of the time is spent in dqm::impl::MonitorElement::getBinContent() calls within GEMDQMHarvester::createInactiveChannelFracHist() https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_06_perf/21 i.e. likely this loop https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L705-L713

Blind extrapolation from 4 files to 237 (on this machine which is likely different from Tier0) gives 28.94*237/4 = 1714 seconds, which is much faster than the 20 hours at Tier0. I'll profile again with more files, e.g. in case there is some component that scales with the number of files.

Since the job is DQM harvesting (i.e. uses DQMRootSource), the error messages from XrootD are likely a red herring, because DQMRootSource reads all input data before starting the data processing loop.

makortel avatar May 23 '22 16:05 makortel

unassign core

makortel avatar May 23 '22 16:05 makortel

I'll profile again with more files, e.g. in case there is some component that scales with the number of files.

Here is a profile with 10 files. The GEMDQMHarvester::dqmEndLuminosityBlock() took 183 seconds, which is much more than what would expect with simple extrapolation from 4 files (that would be 10/4*28.94 = 72 seconds). https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_16_perf/19

makortel avatar May 23 '22 17:05 makortel

20 files took 723 seconds https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_26_perf/19

(I'll stop here, scaling is clearly faster than linear)

makortel avatar May 23 '22 18:05 makortel

@germanfgv I think we have enough evidence towards where the problem lies (and it is reproducible), thanks.

makortel avatar May 23 '22 18:05 makortel

Thanks @makortel ! How can we fix this? I'm guessing we will need a patch. Who can make it?

germanfgv avatar May 23 '22 18:05 germanfgv

So do I understand correctly the issue is caused by the call to setBinLabel? Or we are having an issue connected with multiple concurrent files in harvesting and the direct access to DQMStore?

emanueleusai avatar May 23 '22 18:05 emanueleusai

How can we fix this?

I suppose as an emergency fix the GEM harvesting module could be excluded, but that is really for @cms-sw/dqm-l2 to comment.

I'm guessing we will need a patch. Who can make it?

Likely someone from @quark2 @cms-sw/gem-dpg-l2 @cms-sw/dqm-l2

makortel avatar May 23 '22 18:05 makortel

So do I understand correctly the issue is caused by the call to setBinLabel? Or we are having an issue connected with multiple concurrent files in harvesting and the direct access to DQMStore?

My guess is that something causes this loop https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L705-L713 to become long or called many times with more files (or more luminosity blocks).

makortel avatar May 23 '22 18:05 makortel

Is there an efficient way to go through all elements in the TH2? I made a quick search, but I ouldn't find if going through all Ybins in all Xbins is the most efficient way to loop through all them.

perrotta avatar May 23 '22 18:05 perrotta

createInactiveChannelFracHist is called in a loop.. entirely possible that that loop has an increasing length...

davidlange6 avatar May 23 '22 18:05 davidlange6

In the loop h2SrcChamberOcc has a MonitorElement type which add several checks in getBinContent function. Probably, could be casted into TH2 before the loop. Anyway, if number of bins nNumBinX/Y is big, maybe, something smarter could be proposed to store a number of active channels.

pmandrik avatar May 23 '22 18:05 pmandrik

I added a printout of listLayer_.size() before the loop https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L273-L278 and indeed the size increases by 5 for each LuminosityBlock.

makortel avatar May 23 '22 18:05 makortel

The only place adding elements to listLayer_ seems to be in GEMDQMHarvester::getGeometryInfo() https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L333-L339 which is called for every dqmEndLuminosityBlock(). The listLayer_ is cleared in the other branch of this if-else.

makortel avatar May 23 '22 18:05 makortel

So the getGeometry() calls listLayer_.clear() (and then fills it again) if the argument MonitorElement* is not nullptr https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L315-L317

I checked at the caller https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L216 that the h2SrcStatusEOH is always nullptr in this case.

This may be a long shot, but could it be that the input DQMIO files do not contain the histograms this harvester expects to operate on?

makortel avatar May 23 '22 18:05 makortel

Fewer virtual function calls are certainly better... Perhaps the axis ticks can get set in the constructor?

On May 23, 2022 8:42 PM, pmandrik @.***> wrote:

In the loop h2SrcChamberOcc has a MonitorElement type which add several checks in getBinContent function. Probably, could be casted into TH2 before the loop. Anyway, if number of bins nNumBinX/Y is big, maybe, something smarter could be proposed to store a number of active channels.

— Reply to this email directly, view it on GitHubhttps://github.com/cms-sw/cmssw/issues/38044#issuecomment-1135016109, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABGPFQ62IOTCVZG2IDEAN43VLPGPZANCNFSM5WV5XOMQ. You are receiving this because you commented.Message ID: @.***>

davidlange6 avatar May 23 '22 19:05 davidlange6

So, looking at

void GEMDQMHarvester::getGeometryInfo(edm::Service<DQMStore> &store, MonitorElement *h2Src) {
  if (h2Src != nullptr) {  // For online and offline
    listLayer_.clear();
    mapIdxLayer_.clear();
    mapNumChPerChamber_.clear();

maybe the calls to clear() should be moved outside of the if ?

void GEMDQMHarvester::getGeometryInfo(edm::Service<DQMStore> &store, MonitorElement *h2Src) {
  listLayer_.clear();
  mapIdxLayer_.clear();
  mapNumChPerChamber_.clear();

  if (h2Src != nullptr) {  // For online and offline

This would at least prevent the size of listLayer_ from growing indefinitely ...

fwyzard avatar May 23 '22 19:05 fwyzard

I have prepared #38052 which reverts https://github.com/cms-sw/cmssw/pull/37852, to have it ready just in case we decide to merge it.

Operationally I would propose:

  • Revert #37852 in 12_3_X, and prepare a patch release with it
  • Prepare a fix for the issue discussed here, and merge it in master and 12_4_X
  • A later backport of the original PR + fix could be integrated in the DQM 12_3_X release, if really wanted

perrotta avatar May 23 '22 20:05 perrotta