cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Implement support for concurrent runs in the Framework

Open wddgit opened this issue 3 years ago • 33 comments

PR description:

Implement support for concurrent runs in the Framework.

The design is analogous to what was done to implement concurrent luminosity blocks as much as possible, although there are unavoidable differences.

One can configure how many concurrent runs are allowed. The default is one. With that setting, almost nothing externally visible should change in the behavior of cmsRun. There are significant and complex changes in the Framework implementation to support this new ability.

This pull request does NOT upgrade modules and services outside the Framework to support concurrent runs. We expect many of them will fail if the number of concurrent runs is configured to be more than one in an existing production configuration. We have not surveyed existing code to see which modules and services cannot support concurrent runs. Most should be OK because they do not depend on run transitions. But for example, a module designed to create per run histograms might have problems with concurrent runs.

One configures the "numberOfConcurrentRuns" in the top level options parameter set. If it is 0 or greater than the number of concurrent lumis, then it will be reset to equal the number of concurrent lumis.

If an EventSetup IOV changes at a run boundary, then one also would need to configure concurrent IOVs for that record to two to actually have the runs on both sides of that run boundary process concurrently. Without that cmsRun would execute properly, but the IOVs would block concurrent execution. In addition, it is technically possible for the sequence of transitions beginLumi, endRun, beginRun, and beginLumi to all have different IOVs. An EventSetup record with such IOVs would need to be configured to allow 4 concurrent IOVs to process both runs concurrently across such a run boundary.

It is the design intent that the rest of changes are transparent to the user (beyond what is discussed above).

PR validation:

There are a few new unit tests. Existing unit tests pass. In fact existing unit tests covered most of the features one might be concerned about with this pull request. Of the new tests, these two configurations are the most significant:

FWCore/Integration/test/testConcurrentIOVsAndRuns_cfg.py FWCore/Integration/test/testConcurrentIOVsAndRunsRead_cfg.py

wddgit avatar Jul 20 '22 15:07 wddgit

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31156

  • This PR adds an extra 408KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
    • File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
    • File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
    • File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454

cmsbuild avatar Jul 20 '22 15:07 cmsbuild

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

  • FWCore/Framework (core)
  • FWCore/Integration (core)
  • FWCore/ParameterSet (core)
  • FWCore/TFWLiteSelector (core)
  • FWCore/TestProcessor (core)
  • IOPool/Input (core)
  • Mixing/Base (simulation)

@smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please review it and eventually sign? Thanks. @makortel, @fabiocos this is something you requested to watch as well. @perrotta, @dpiparo, @qliphy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

cmsbuild avatar Jul 20 '22 15:07 cmsbuild

please test

wddgit avatar Jul 20 '22 15:07 wddgit

-1

Failed Tests: ClangBuild Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-555ed7/26346/summary.html COMMIT: bedd1092815d039ac9ebc890aa37478928b0d0a6 CMSSW: CMSSW_12_5_X_2022-07-20-1100/el8_amd64_gcc10 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38801/26346/install.sh to create a dev area with all the needed externals and cmssw changes.

Clang Build

I found compilation warning while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

See details on the summary page.

cmsbuild avatar Jul 20 '22 18:07 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31160

  • This PR adds an extra 240KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
    • File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
    • File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
    • File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454

cmsbuild avatar Jul 20 '22 18:07 cmsbuild

Pull request #38801 was updated. @smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please check and sign again.

cmsbuild avatar Jul 20 '22 18:07 cmsbuild

please test

wddgit avatar Jul 20 '22 18:07 wddgit

-1

Failed Tests: UnitTests Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-555ed7/26354/summary.html COMMIT: dc64873d908feb8cc73b209a32b5c70f1dcef274 CMSSW: CMSSW_12_5_X_2022-07-20-1100/el8_amd64_gcc10 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38801/26354/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test TestFWCoreFrameworkGlobalStreamOne had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 12 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3662417
  • DQMHistoTests: Total failures: 19
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3662375
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Jul 21 '22 01:07 cmsbuild

I figured out the problem. The test module stream::RunSummaryIntAnalyzer is not properly written to support concurrent runs (with serial end run transitions the problem cannot occur). I guess I was just lucky when I ran the test in my private working area that the timing of execution allowed them to pass.

These two lines in streamEndRunSummary:

        gCache->value += runCache()->value;
        runCache()->value = 0;

could run concurrently with

 void analyze(edm::Event const&, edm::EventSetup const&) override {
        ++m_count;
        ++(runCache()->value);
      }

If value is incremented in analyze between the other two lines, then that increment of 1 is lost. I have a fix and this test module needed some additional rewriting to be more appropriate for testing the summary methods. Unfortunately, I also need to check 18 = 3 x 3 x 2 other test modules for the same problem (stream, global, limited) x (filter, analyzer, producer) x (lumi, run) for the same problem. That will keep me busy this afternoon for a while...

wddgit avatar Jul 22 '22 17:07 wddgit

We discussed this a short time ago (Chris, Matti and me) and our thought was that we should make it a goal to merge this PR as early as possible in the 12_6_X release series (assuming everyone else agrees with that and it gets through the review process). There is too much risk to merge it into 12_5_X at this point in time.

wddgit avatar Jul 25 '22 18:07 wddgit

Thanks Chris. I will go through all your comments soon.

An update on the unit test problem I am still fixing: In addition to fixing the test modules, I've found I've missed something in the base classes for modules in the global, limited and one cases. So when I push the commit to fix that there will some more changes. None of those changes should affect any of the files already changed in the pull request. These additional changes are in different files, so you will not be wasting any time reviewing what is already there. There will just be more files to review in this additional commit.

wddgit avatar Jul 25 '22 21:07 wddgit

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31300

  • This PR adds an extra 376KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
    • File FWCore/Framework/interface/global/EDFilterBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/global/EDProducerBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/global/implementors.h modified in PR(s): #38454
    • File FWCore/Framework/interface/limited/EDFilterBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/limited/EDProducerBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/limited/implementors.h modified in PR(s): #38454
    • File FWCore/Framework/interface/one/EDFilterBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/one/EDProducerBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/one/implementors.h modified in PR(s): #38454
    • File FWCore/Framework/interface/stream/ProducingModuleAdaptorBase.h modified in PR(s): #38454
    • File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
    • File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
    • File FWCore/Framework/src/global/EDFilterBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/global/EDProducerBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/limited/EDFilterBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/limited/EDProducerBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/one/EDFilterBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/one/EDProducerBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/stream/ProducingModuleAdaptorBase.cc modified in PR(s): #38454
    • File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454
    • File IOPool/Input/src/PoolSource.cc modified in PR(s): #38806

cmsbuild avatar Jul 27 '22 22:07 cmsbuild

Pull request #38801 was updated. @smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please check and sign again.

cmsbuild avatar Jul 27 '22 22:07 cmsbuild

please test

wddgit avatar Jul 27 '22 22:07 wddgit

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-555ed7/26494/summary.html COMMIT: 38d19912793919a27a01d28b0c4162db6164369b CMSSW: CMSSW_12_5_X_2022-07-27-2300/el8_amd64_gcc10 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38801/26494/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3667670
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3667640
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 50 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 210 log files, 47 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Jul 28 '22 06:07 cmsbuild

I'll rerun the tests tomorrow after I rebase. I think I've fixed everything from the first round of comments from Chris now.

wddgit avatar Jul 28 '22 22:07 wddgit

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31321

  • This PR adds an extra 60KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
    • File FWCore/Framework/interface/global/EDFilterBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/global/EDProducerBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/global/implementors.h modified in PR(s): #38454
    • File FWCore/Framework/interface/limited/EDFilterBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/limited/EDProducerBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/limited/implementors.h modified in PR(s): #38454
    • File FWCore/Framework/interface/one/EDFilterBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/one/EDProducerBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/one/implementors.h modified in PR(s): #38454
    • File FWCore/Framework/interface/stream/ProducingModuleAdaptorBase.h modified in PR(s): #38454
    • File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
    • File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
    • File FWCore/Framework/src/global/EDFilterBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/global/EDProducerBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/limited/EDFilterBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/limited/EDProducerBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/one/EDFilterBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/one/EDProducerBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/stream/ProducingModuleAdaptorBase.cc modified in PR(s): #38454
    • File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454
    • File IOPool/Input/src/PoolSource.cc modified in PR(s): #38806

cmsbuild avatar Jul 28 '22 22:07 cmsbuild

Pull request #38801 was updated. @smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please check and sign again.

cmsbuild avatar Jul 28 '22 22:07 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31332

  • This PR adds an extra 36KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
    • File FWCore/Framework/interface/global/EDFilterBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/global/EDProducerBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/global/implementors.h modified in PR(s): #38454
    • File FWCore/Framework/interface/limited/EDFilterBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/limited/EDProducerBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/limited/implementors.h modified in PR(s): #38454
    • File FWCore/Framework/interface/one/EDFilterBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/one/EDProducerBase.h modified in PR(s): #38454
    • File FWCore/Framework/interface/one/implementors.h modified in PR(s): #38454
    • File FWCore/Framework/interface/stream/ProducingModuleAdaptorBase.h modified in PR(s): #38454
    • File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
    • File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
    • File FWCore/Framework/src/global/EDFilterBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/global/EDProducerBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/limited/EDFilterBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/limited/EDProducerBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/one/EDFilterBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/one/EDProducerBase.cc modified in PR(s): #38454
    • File FWCore/Framework/src/stream/ProducingModuleAdaptorBase.cc modified in PR(s): #38454
    • File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
    • File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454
    • File IOPool/Input/src/PoolSource.cc modified in PR(s): #38806

cmsbuild avatar Jul 29 '22 17:07 cmsbuild

Pull request #38801 was updated. @smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please check and sign again.

cmsbuild avatar Jul 29 '22 17:07 cmsbuild

please test

rebased on top of 7/26/22 IB (last IB all green for IB tests)

base now includes the SubProcess unit test fix so I removed the commit related to that from the PR

1st commit is the main one 2nd commit deals with module base classes in run/lumi/summary caches The rest of the commits are in response to Chris's comments

After these test finish and hopefully pass, I am just waiting on further review comments.

wddgit avatar Jul 29 '22 17:07 wddgit

-1

Failed Tests: UnitTests Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-555ed7/26535/summary.html COMMIT: 0051b36b2a90f99cf1a1079c545c493e5aadaff6 CMSSW: CMSSW_12_5_X_2022-07-29-1100/el8_amd64_gcc10 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38801/26535/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test testSiStripDQM_OfflineTkMap had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3668050
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3668020
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 210 log files, 47 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Jul 29 '22 23:07 cmsbuild

The unit test that fails is also failing in the IBs already. It is unrelated to the changes in this PR. It looks like everything else passed. I'm only waiting for more review comments on this one. As far as I know, everything else is done.

wddgit avatar Aug 01 '22 14:08 wddgit

Matti, Chris and I had a ZOOM conversation where we were discussing this PR. Here are some changes that were suggested that seemed like improvements worth making:

  • Replace PauseStreamQueuesSentry with a SerialQueue
  • Add unit tests for 9 major transitions where exceptions might be thrown. The test should intentionally throw during those transitions. Transitions: Event, (begin/end) x (run/lumi) x (stream/global). Unless we already have such tests somewhere. If they want more for other cases Matti and Chris will let me know which ones. There are almost infinite contexts where an exception might be thrown, so we can just test the most important ones.
  • In handleNextEventForStreamAsync, add a comment in the code handling an exception while processing an event. The fall through to the recursive call to handleNextEventForStreamAsync is intentional after an exception
  • Get rid of "=" capture in processEventAsync
  • We'll use git to squash the little commits with fixes in response to minor review comments when the reviewers are OK with the fixes (Matti will let me know when that is)

Let me know if I missed or misunderstood anything from our discussion. Thanks.

wddgit avatar Aug 03 '22 15:08 wddgit

Apparently I'm not able to comment in the EventProcessor.cc itself, so I'll try here. In EventProcessor::beginRunAsync() I noticed the status and iSync are captured for lambdas by reference here https://github.com/cms-sw/cmssw/blob/0051b36b2a90f99cf1a1079c545c493e5aadaff6/FWCore/Framework/src/EventProcessor.cc#L1146 but by value here and afterwards https://github.com/cms-sw/cmssw/blob/0051b36b2a90f99cf1a1079c545c493e5aadaff6/FWCore/Framework/src/EventProcessor.cc#L1154

Should they be captured consistently for all tasks in the chain? I'd assume iSync could always be captured by reference (but could be wrong), but should status be captured by value?

makortel avatar Aug 03 '22 16:08 makortel

In this comment I'm addressing iSync only. First this follows the pattern in the existing beginLumiAsync.

Partially this works because in a chain the functor passed to the "first" function is run serially. The functors passed to following "then" functions in the chain are run asynchronously. I've been wondering whether taking advantage of that fact was a good idea or not. Could it ever change? So the second capture must be by value.

Further, if you follow the usage of iSync down the stack it is passed by reference multiple times. So we are potentially saving several copies (although it is not that big...). If I remember correctly at the level in the stack when it is actually needed, a by value copy is made.

wddgit avatar Aug 03 '22 17:08 wddgit

For status, the same comment applies as for iSync.

For status, there is the additional question as to whether one could use std::move. std::move does not make sense for IOVSyncValue because it does not have data members for which move makes any sense, but status is a std::shared_ptr which is movable. Even so, we cannot move status because in the captures in beginRunAsync, the status object is used after the capture and so the capture cannot be via std::move.

wddgit avatar Aug 03 '22 18:08 wddgit

Thanks, so the current behavior is as expected.

My next question is if EventProcessor::readAndMergeRunEntries() and EventProcessor::readAndMergeLumiEntries() would better be named as ...Async() because they only push a task into a serial task queue and return immediately?

makortel avatar Aug 03 '22 18:08 makortel

Yes. That is our convention. I'll change the names and add Async to them.

wddgit avatar Aug 03 '22 18:08 wddgit

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31413

  • This PR adds an extra 76KB to repository

cmsbuild avatar Aug 03 '22 22:08 cmsbuild