cmssw
cmssw copied to clipboard
DQM Harvest jobs getting stuck at T0
Tests of CMSSW_12_3_3_patch1 at T0 show some Express jobs of task ExpressMergewrite_StreamExpressCosmics_DQMIOEndOfRunDQMHarvestMerged
taking more that 20h to complete.
Here we have the logs of one job that finished after 22h of execution:
%MSG-e HLTConfigProvider: EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun 21-May-2022 17:33:51 CEST Run: 349840
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider: EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun 21-May-2022 17:33:51 CEST Run: 349840
Process name 'HLT' not found in registry!
%MSG
[2022-05-22 10:34:32.107229 +0200][Error ][AsyncSock ][ 549] [p06253947b90717.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 11:33:25.164556 +0200][Error ][AsyncSock ][ 549] [p06636710g40375.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 12:27:54.115905 +0200][Error ][AsyncSock ][ 549] [p06253947g81422.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:23:41.509030 +0200][Error ][AsyncSock ][ 549] [p06253947q54042.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 13:53:39.441991 +0200][Error ][AsyncSock ][ 549] [st-096-gg50030g.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 14:56:59.860605 +0200][Error ][AsyncSock ][ 549] [st-096-dd904d00.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:06:20.567969 +0200][Error ][AsyncSock ][ 549] [st-048-388af89c.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:12:21.511984 +0200][Error ][AsyncSock ][ 549] [p06636710n60578.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
[2022-05-22 15:20:03.375578 +0200][Error ][AsyncSock ][ 549] [p06636710p75593.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: resource temporarily unavailable
%MSG-e DQMGenericClient: DQMGenericClient:hltMuonEfficiencies@endRun 22-May-2022 15:40:50 CEST End Run: 349840
DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon !!!
%MSG
%MSG-e DQMGenericClient: DQMGenericClient:hltMuonEfficienciesMR@endRun 22-May-2022 15:40:50 CEST End Run: 349840
DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon/MR !!!
%MSG
The logs report nothing for 20h, after which there a few network error messages apparently related to XRootD and then the job continues to finish in a few minutes. It is worth mentioning that this jobs had already been tried, getting stuck in a similar way. That execution was manually interrupted after the first few network error messages showed up.
Full logs and PSet can be found here:
/afs/cern.ch/user/c/cmst0/public/LongExecution/CollisionMay2022/job_41469/SuccessfulExec/job/WMTaskSpace/cmsRun1
More information on the issue can be found here: CMS Talk post
A new Issue was created by @germanfgv Germn Felipe Giraldo Villa.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign core
New categories assigned: core
@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
One of this jobs is currently running for 24h. We have access to the node and here is the las few lines of the current cmsRun log:
%MSG-e HLTConfigProvider: EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun 22-May-2022 22:56:19 CEST Run: 349840
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider: EgHLTOfflineSummaryClient:egHLTOffDQMSummaryClient@beginRun 22-May-2022 22:56:19 CEST Run: 349840
Process name 'HLT' not found in registry!
%MSG
[2022-05-23 15:41:21.710046 +0200][Error ][AsyncSock ][ 549] [p06636710t83205.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: interrupted system call
[2022-05-23 15:41:21.710514 +0200][Error ][AsyncSock ][ 549] [p06636710k01172.cern.ch:1095.0] Socket error encountered: [ERROR] Socket error: interrupted system call
As you can see, there is nothing for 17h and then network errors. The current stack trace for the cmsRun process is the following:
Thread 7 (Thread 0x2b5990f33700 (LWP 198686) "cmsRun"):
#0 0x00002b590e587b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1 0x00002b590e587bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2 0x00002b590e587c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3 0x00002b591434a616 in XrdCl::JobManager::RunJobs() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#4 0x00002b591434a6d9 in RunRunnerThread () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#5 0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#6 0x00002b590e894b0d in clone () from target:/lib64/libc.so.6
Thread 6 (Thread 0x2b5990d32700 (LWP 198685) "cmsRun"):
#0 0x00002b590e587b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1 0x00002b590e587bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2 0x00002b590e587c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3 0x00002b591434a616 in XrdCl::JobManager::RunJobs() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#4 0x00002b591434a6d9 in RunRunnerThread () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#5 0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#6 0x00002b590e894b0d in clone () from target:/lib64/libc.so.6
Thread 5 (Thread 0x2b5990b31700 (LWP 198684) "cmsRun"):
#0 0x00002b590e587b3b in do_futex_wait.constprop () from target:/lib64/libpthread.so.0
#1 0x00002b590e587bcf in __new_sem_wait_slow.constprop.0 () from target:/lib64/libpthread.so.0
#2 0x00002b590e587c6b in sem_wait@@GLIBC_2.2.5 () from target:/lib64/libpthread.so.0
#3 0x00002b591434a616 in XrdCl::JobManager::RunJobs() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#4 0x00002b591434a6d9 in RunRunnerThread () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#5 0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#6 0x00002b590e894b0d in clone () from target:/lib64/libc.so.6
Thread 4 (Thread 0x2b5990930700 (LWP 198683) "cmsRun"):
#0 0x00002b590e588e9d in nanosleep () from target:/lib64/libpthread.so.0
#1 0x00002b591409008f in XrdSysTimer::Wait(int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdUtils.so.3
#2 0x00002b59142b9fa3 in XrdCl::TaskManager::RunTasks() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#3 0x00002b59142ba0f9 in RunRunnerThread () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdCl.so.3
#4 0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#5 0x00002b590e894b0d in clone () from target:/lib64/libc.so.6
Thread 3 (Thread 0x2b599072f700 (LWP 198682) "cmsRun"):
#0 0x00002b590e8950e3 in epoll_wait () from target:/lib64/libc.so.6
#1 0x00002b591408a4b2 in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdUtils.so.3
#2 0x00002b5914086c3d in XrdSys::IOEvents::BootStrap::Start(void*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdUtils.so.3
#3 0x00002b591408f798 in XrdSysThread_Xeq () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libXrdUtils.so.3
#4 0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#5 0x00002b590e894b0d in clone () from target:/lib64/libc.so.6
Thread 2 (Thread 0x2b5933134700 (LWP 198559) "cmsRun"):
#0 0x00002b590e58875d in read () from target:/lib64/libpthread.so.0
#1 0x00002b591499822f in full_read.constprop () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00002b5914998b0f in edm::service::InitRootHandlers::stacktraceHelperThread() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00002b590e15df90 in std::execute_native_thread_routine (__p=0x2b5920665640) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00002b590e581ea5 in start_thread () from target:/lib64/libpthread.so.0
#5 0x00002b590e894b0d in clone () from target:/lib64/libc.so.6
Thread 1 (Thread 0x2b5910918b00 (LWP 198351) "cmsRun"):
#0 0x00002b590e8eac70 in __memcpy_ssse3_back () from target:/lib64/libc.so.6
#1 0x00002b598b588fd9 in void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .isra.0] () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#2 0x00002b598b58d481 in GEMDQMHarvester::createInactiveChannelFracHist(edm::Service<dqm::legacy::DQMStore>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#3 0x00002b598b58f679 in GEMDQMHarvester::drawSummaryHistogram(edm::Service<dqm::legacy::DQMStore>&, int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#4 0x00002b598b58fb4f in GEMDQMHarvester::dqmEndLuminosityBlock(dqm::implementation::IBooker&, dqm::implementation::IGetter&, edm::LuminosityBlock const&, edm::EventSetup const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#5 0x00002b598b592f8d in DQMEDHarvester::endLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#6 0x00002b590c218bad in edm::one::EDProducerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#7 0x00002b590c201c90 in edm::WorkerT<edm::one::EDProducerBase>::implDoEnd(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#8 0x00002b590c10b194 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00002b590c10b547 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00002b590c10bad8 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00002b590c10bbc1 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunMo--Type <RET> for more, q to quit, c to continue without paging--
duleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00002b590befc075 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so
#13 0x00002b590d989b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x2b59901fa400, this=0x2b59115c7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x2b59115c7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#15 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#16 0x00002b590c0d49d8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#17 0x00002b590c0df82b in edm::EventProcessor::runToCompletion() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#19 0x00002b590d97815b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#20 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#21 0x000000000040971c in main ()
assign dqm
New categories assigned: dqm
@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
In a private chat @germanfgv showed me a stack trace from the same job as in https://github.com/cms-sw/cmssw/issues/38044#issuecomment-1134715457 but 20 minutes earlier (showing only thread 1)
Thread 1 (Thread 0x2b5910918b00 (LWP 198351) "cmsRun"):
#0 0x00002b590bd3d96c in _dl_update_slotinfo () from target:/lib64/ld-linux-x86-64.so.2
#1 0x00002b590bd2c098 in update_get_addr () from target:/lib64/ld-linux-x86-64.so.2
#2 0x00002b590bd429f8 in __tls_get_addr () from target:/lib64/ld-linux-x86-64.so.2
#3 0x00002b590d030391 in Format(char const*, __va_list_tag*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libCore.so
#4 0x00002b590d03523c in Form(char const*, ...) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/external/slc7_amd64_gcc10/lib/libCore.so
#5 0x00002b598b58d37a in GEMDQMHarvester::createInactiveChannelFracHist(edm::Service<dqm::legacy::DQMStore>&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#6 0x00002b598b58f679 in GEMDQMHarvester::drawSummaryHistogram(edm::Service<dqm::legacy::DQMStore>&, int) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#7 0x00002b598b58fb4f in GEMDQMHarvester::dqmEndLuminosityBlock(dqm::implementation::IBooker&, dqm::implementation::IGetter&, edm::LuminosityBlock const&, edm::EventSetup const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#8 0x00002b598b592f8d in DQMEDHarvester::endLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_3_patch1/lib/slc7_amd64_gcc10/pluginDQMGEMPlugins.so
#9 0x00002b590c218bad in edm::one::EDProducerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00002b590c201c90 in edm::WorkerT<edm::one::EDProducerBase>::implDoEnd(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#11 0x00002b590c10b194 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00002b590c10b547 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00002b590c10bad8 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlock--Type <RET> for more, q to quit, c to continue without paging--
Principal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#14 0x00002b590c10bbc1 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#15 0x00002b590befc075 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so
#16 0x00002b590d989b8c in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x2b59901fa400, this=0x2b59115c7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#17 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x2b59115c7e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#18 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#19 0x00002b590c0d49d8 in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#20 0x00002b590c0df82b in edm::EventProcessor::runToCompletion() () from target:/cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_3/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#21 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00002b590d97815b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_1-slc7_amd64_gcc10/build/CMSSW_12_3_1-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-d3ee2fc4dbf589032bbf635c7b35f820/tbb-v2021.4.0/src/tbb/arena.cpp:698
#23 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040971c in main ()
It seems to me that the job is making progress, but GEMDQMHarvester::drawSummaryHistogram()
takes a lot of time.
The last traceback posted appear to be here
https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L685-L687
That part of the code was recently modified in https://github.com/cms-sw/cmssw/pull/37852, entered first in CMSSW_12_3_3 Maybe @quark2 can help with debugging it
On an (likely) unrelated note, I noticed this module is using DQMStore
explicitly
https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L174-L181
Shouldn't DQMEDHarvester
s use the IBooker
and IGetter
interfaces instead?
Maybe this won't help, but setBinLabel
reminded me that there was an HLT-related PR that discussed how those calls are very slow (and in some cases, maybe not this case, setBinLabel
was being called unnecessarily): https://github.com/cms-sw/cmssw/pull/36313.
@Dr15Jones @makortel @quark2 anything new to report here? Do you need any extra information form our side?
I ran the job myself with IgProf over 4 files (the original PSet.py
had 237) files, the full profile is here
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_06_perf
The GEMDQMHarvester::dqmEndLuminosityBlock()
takes ~74 % of the time spent in edm::one::EDProducerBase::doEndLuminosityBlock()
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_06_perf/18
Majority of the time is spent in dqm::impl::MonitorElement::getBinContent()
calls within GEMDQMHarvester::createInactiveChannelFracHist()
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_06_perf/21
i.e. likely this loop
https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L705-L713
Blind extrapolation from 4 files to 237 (on this machine which is likely different from Tier0) gives 28.94*237/4 = 1714 seconds, which is much faster than the 20 hours at Tier0. I'll profile again with more files, e.g. in case there is some component that scales with the number of files.
Since the job is DQM harvesting (i.e. uses DQMRootSource
), the error messages from XrootD are likely a red herring, because DQMRootSource
reads all input data before starting the data processing loop.
unassign core
I'll profile again with more files, e.g. in case there is some component that scales with the number of files.
Here is a profile with 10 files. The GEMDQMHarvester::dqmEndLuminosityBlock()
took 183 seconds, which is much more than what would expect with simple extrapolation from 4 files (that would be 10/4*28.94 = 72 seconds).
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_16_perf/19
20 files took 723 seconds https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38044/test_26_perf/19
(I'll stop here, scaling is clearly faster than linear)
@germanfgv I think we have enough evidence towards where the problem lies (and it is reproducible), thanks.
Thanks @makortel ! How can we fix this? I'm guessing we will need a patch. Who can make it?
So do I understand correctly the issue is caused by the call to setBinLabel? Or we are having an issue connected with multiple concurrent files in harvesting and the direct access to DQMStore?
How can we fix this?
I suppose as an emergency fix the GEM harvesting module could be excluded, but that is really for @cms-sw/dqm-l2 to comment.
I'm guessing we will need a patch. Who can make it?
Likely someone from @quark2 @cms-sw/gem-dpg-l2 @cms-sw/dqm-l2
So do I understand correctly the issue is caused by the call to setBinLabel? Or we are having an issue connected with multiple concurrent files in harvesting and the direct access to DQMStore?
My guess is that something causes this loop https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L705-L713 to become long or called many times with more files (or more luminosity blocks).
Is there an efficient way to go through all elements in the TH2? I made a quick search, but I ouldn't find if going through all Ybins in all Xbins is the most efficient way to loop through all them.
createInactiveChannelFracHist is called in a loop.. entirely possible that that loop has an increasing length...
In the loop h2SrcChamberOcc has a MonitorElement type which add several checks in getBinContent function. Probably, could be casted into TH2 before the loop. Anyway, if number of bins nNumBinX/Y is big, maybe, something smarter could be proposed to store a number of active channels.
I added a printout of listLayer_.size()
before the loop
https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L273-L278
and indeed the size increases by 5 for each LuminosityBlock.
The only place adding elements to listLayer_
seems to be in GEMDQMHarvester::getGeometryInfo()
https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L333-L339
which is called for every dqmEndLuminosityBlock()
. The listLayer_
is cleared in the other branch of this if-else.
So the getGeometry()
calls listLayer_.clear()
(and then fills it again) if the argument MonitorElement*
is not nullptr
https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L315-L317
I checked at the caller
https://github.com/cms-sw/cmssw/blob/9dab6a97de58a930407ae847dcd4210b1946807f/DQM/GEM/plugins/GEMDQMHarvester.cc#L216
that the h2SrcStatusEOH
is always nullptr
in this case.
This may be a long shot, but could it be that the input DQMIO files do not contain the histograms this harvester expects to operate on?
Fewer virtual function calls are certainly better... Perhaps the axis ticks can get set in the constructor?
On May 23, 2022 8:42 PM, pmandrik @.***> wrote:
In the loop h2SrcChamberOcc has a MonitorElement type which add several checks in getBinContent function. Probably, could be casted into TH2 before the loop. Anyway, if number of bins nNumBinX/Y is big, maybe, something smarter could be proposed to store a number of active channels.
— Reply to this email directly, view it on GitHubhttps://github.com/cms-sw/cmssw/issues/38044#issuecomment-1135016109, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABGPFQ62IOTCVZG2IDEAN43VLPGPZANCNFSM5WV5XOMQ. You are receiving this because you commented.Message ID: @.***>
So, looking at
void GEMDQMHarvester::getGeometryInfo(edm::Service<DQMStore> &store, MonitorElement *h2Src) {
if (h2Src != nullptr) { // For online and offline
listLayer_.clear();
mapIdxLayer_.clear();
mapNumChPerChamber_.clear();
maybe the calls to clear()
should be moved outside of the if
?
void GEMDQMHarvester::getGeometryInfo(edm::Service<DQMStore> &store, MonitorElement *h2Src) {
listLayer_.clear();
mapIdxLayer_.clear();
mapNumChPerChamber_.clear();
if (h2Src != nullptr) { // For online and offline
This would at least prevent the size of listLayer_
from growing indefinitely ...
I have prepared #38052 which reverts https://github.com/cms-sw/cmssw/pull/37852, to have it ready just in case we decide to merge it.
Operationally I would propose:
- Revert #37852 in 12_3_X, and prepare a patch release with it
- Prepare a fix for the issue discussed here, and merge it in master and 12_4_X
- A later backport of the original PR + fix could be integrated in the DQM 12_3_X release, if really wanted