Implement support for concurrent runs in the Framework
PR description:
Implement support for concurrent runs in the Framework.
The design is analogous to what was done to implement concurrent luminosity blocks as much as possible, although there are unavoidable differences.
One can configure how many concurrent runs are allowed. The default is one. With that setting, almost nothing externally visible should change in the behavior of cmsRun. There are significant and complex changes in the Framework implementation to support this new ability.
This pull request does NOT upgrade modules and services outside the Framework to support concurrent runs. We expect many of them will fail if the number of concurrent runs is configured to be more than one in an existing production configuration. We have not surveyed existing code to see which modules and services cannot support concurrent runs. Most should be OK because they do not depend on run transitions. But for example, a module designed to create per run histograms might have problems with concurrent runs.
One configures the "numberOfConcurrentRuns" in the top level options parameter set. If it is 0 or greater than the number of concurrent lumis, then it will be reset to equal the number of concurrent lumis.
If an EventSetup IOV changes at a run boundary, then one also would need to configure concurrent IOVs for that record to two to actually have the runs on both sides of that run boundary process concurrently. Without that cmsRun would execute properly, but the IOVs would block concurrent execution. In addition, it is technically possible for the sequence of transitions beginLumi, endRun, beginRun, and beginLumi to all have different IOVs. An EventSetup record with such IOVs would need to be configured to allow 4 concurrent IOVs to process both runs concurrently across such a run boundary.
It is the design intent that the rest of changes are transparent to the user (beyond what is discussed above).
PR validation:
There are a few new unit tests. Existing unit tests pass. In fact existing unit tests covered most of the features one might be concerned about with this pull request. Of the new tests, these two configurations are the most significant:
FWCore/Integration/test/testConcurrentIOVsAndRuns_cfg.py FWCore/Integration/test/testConcurrentIOVsAndRunsRead_cfg.py
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31156
-
This PR adds an extra 408KB to repository
-
There are other open Pull requests which might conflict with changes you have proposed:
- File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
- File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
- File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
- File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454
A new Pull Request was created by @wddgit (W. David Dagenhart) for master.
It involves the following packages:
- FWCore/Framework (core)
- FWCore/Integration (core)
- FWCore/ParameterSet (core)
- FWCore/TFWLiteSelector (core)
- FWCore/TestProcessor (core)
- IOPool/Input (core)
- Mixing/Base (simulation)
@smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please review it and eventually sign? Thanks. @makortel, @fabiocos this is something you requested to watch as well. @perrotta, @dpiparo, @qliphy, @rappoccio you are the release manager for this.
cms-bot commands are listed here
please test
-1
Failed Tests: ClangBuild
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-555ed7/26346/summary.html
COMMIT: bedd1092815d039ac9ebc890aa37478928b0d0a6
CMSSW: CMSSW_12_5_X_2022-07-20-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38801/26346/install.sh to create a dev area with all the needed externals and cmssw changes.
Clang Build
I found compilation warning while trying to compile with clang. Command used:
USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'
See details on the summary page.
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31160
-
This PR adds an extra 240KB to repository
-
There are other open Pull requests which might conflict with changes you have proposed:
- File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
- File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
- File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
- File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454
Pull request #38801 was updated. @smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please check and sign again.
please test
-1
Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-555ed7/26354/summary.html
COMMIT: dc64873d908feb8cc73b209a32b5c70f1dcef274
CMSSW: CMSSW_12_5_X_2022-07-20-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38801/26354/install.sh to create a dev area with all the needed externals and cmssw changes.
Unit Tests
I found errors in the following unit tests:
---> test TestFWCoreFrameworkGlobalStreamOne had ERRORS
Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 12 differences found in the comparisons
- DQMHistoTests: Total files compared: 50
- DQMHistoTests: Total histograms compared: 3662417
- DQMHistoTests: Total failures: 19
- DQMHistoTests: Total nulls: 1
- DQMHistoTests: Total successes: 3662375
- DQMHistoTests: Total skipped: 22
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: -0.004 KiB( 49 files compared)
- DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
- Checked 208 log files, 45 edm output root files, 50 DQM output files
- TriggerResults: no differences found
I figured out the problem. The test module stream::RunSummaryIntAnalyzer is not properly written to support concurrent runs (with serial end run transitions the problem cannot occur). I guess I was just lucky when I ran the test in my private working area that the timing of execution allowed them to pass.
These two lines in streamEndRunSummary:
gCache->value += runCache()->value;
runCache()->value = 0;
could run concurrently with
void analyze(edm::Event const&, edm::EventSetup const&) override {
++m_count;
++(runCache()->value);
}
If value is incremented in analyze between the other two lines, then that increment of 1 is lost. I have a fix and this test module needed some additional rewriting to be more appropriate for testing the summary methods. Unfortunately, I also need to check 18 = 3 x 3 x 2 other test modules for the same problem (stream, global, limited) x (filter, analyzer, producer) x (lumi, run) for the same problem. That will keep me busy this afternoon for a while...
We discussed this a short time ago (Chris, Matti and me) and our thought was that we should make it a goal to merge this PR as early as possible in the 12_6_X release series (assuming everyone else agrees with that and it gets through the review process). There is too much risk to merge it into 12_5_X at this point in time.
Thanks Chris. I will go through all your comments soon.
An update on the unit test problem I am still fixing: In addition to fixing the test modules, I've found I've missed something in the base classes for modules in the global, limited and one cases. So when I push the commit to fix that there will some more changes. None of those changes should affect any of the files already changed in the pull request. These additional changes are in different files, so you will not be wasting any time reviewing what is already there. There will just be more files to review in this additional commit.
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31300
-
This PR adds an extra 376KB to repository
-
There are other open Pull requests which might conflict with changes you have proposed:
- File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
- File FWCore/Framework/interface/global/EDFilterBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/global/EDProducerBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/global/implementors.h modified in PR(s): #38454
- File FWCore/Framework/interface/limited/EDFilterBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/limited/EDProducerBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/limited/implementors.h modified in PR(s): #38454
- File FWCore/Framework/interface/one/EDFilterBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/one/EDProducerBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/one/implementors.h modified in PR(s): #38454
- File FWCore/Framework/interface/stream/ProducingModuleAdaptorBase.h modified in PR(s): #38454
- File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
- File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
- File FWCore/Framework/src/global/EDFilterBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/global/EDProducerBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/limited/EDFilterBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/limited/EDProducerBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/one/EDFilterBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/one/EDProducerBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/stream/ProducingModuleAdaptorBase.cc modified in PR(s): #38454
- File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454
- File IOPool/Input/src/PoolSource.cc modified in PR(s): #38806
Pull request #38801 was updated. @smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please check and sign again.
please test
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-555ed7/26494/summary.html
COMMIT: 38d19912793919a27a01d28b0c4162db6164369b
CMSSW: CMSSW_12_5_X_2022-07-27-2300/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38801/26494/install.sh to create a dev area with all the needed externals and cmssw changes.
Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 4 differences found in the comparisons
- DQMHistoTests: Total files compared: 51
- DQMHistoTests: Total histograms compared: 3667670
- DQMHistoTests: Total failures: 7
- DQMHistoTests: Total nulls: 1
- DQMHistoTests: Total successes: 3667640
- DQMHistoTests: Total skipped: 22
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: -0.004 KiB( 50 files compared)
- DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
- Checked 210 log files, 47 edm output root files, 51 DQM output files
- TriggerResults: no differences found
I'll rerun the tests tomorrow after I rebase. I think I've fixed everything from the first round of comments from Chris now.
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31321
-
This PR adds an extra 60KB to repository
-
There are other open Pull requests which might conflict with changes you have proposed:
- File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
- File FWCore/Framework/interface/global/EDFilterBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/global/EDProducerBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/global/implementors.h modified in PR(s): #38454
- File FWCore/Framework/interface/limited/EDFilterBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/limited/EDProducerBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/limited/implementors.h modified in PR(s): #38454
- File FWCore/Framework/interface/one/EDFilterBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/one/EDProducerBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/one/implementors.h modified in PR(s): #38454
- File FWCore/Framework/interface/stream/ProducingModuleAdaptorBase.h modified in PR(s): #38454
- File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
- File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
- File FWCore/Framework/src/global/EDFilterBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/global/EDProducerBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/limited/EDFilterBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/limited/EDProducerBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/one/EDFilterBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/one/EDProducerBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/stream/ProducingModuleAdaptorBase.cc modified in PR(s): #38454
- File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454
- File IOPool/Input/src/PoolSource.cc modified in PR(s): #38806
Pull request #38801 was updated. @smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please check and sign again.
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31332
-
This PR adds an extra 36KB to repository
-
There are other open Pull requests which might conflict with changes you have proposed:
- File FWCore/Framework/interface/GlobalSchedule.h modified in PR(s): #38603
- File FWCore/Framework/interface/global/EDFilterBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/global/EDProducerBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/global/implementors.h modified in PR(s): #38454
- File FWCore/Framework/interface/limited/EDFilterBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/limited/EDProducerBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/limited/implementors.h modified in PR(s): #38454
- File FWCore/Framework/interface/one/EDFilterBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/one/EDProducerBase.h modified in PR(s): #38454
- File FWCore/Framework/interface/one/implementors.h modified in PR(s): #38454
- File FWCore/Framework/interface/stream/ProducingModuleAdaptorBase.h modified in PR(s): #38454
- File FWCore/Framework/src/EventProcessor.cc modified in PR(s): #38603
- File FWCore/Framework/src/GlobalSchedule.cc modified in PR(s): #38603
- File FWCore/Framework/src/global/EDFilterBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/global/EDProducerBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/limited/EDFilterBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/limited/EDProducerBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/one/EDFilterBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/one/EDProducerBase.cc modified in PR(s): #38454
- File FWCore/Framework/src/stream/ProducingModuleAdaptorBase.cc modified in PR(s): #38454
- File FWCore/Framework/test/global_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/global_producer_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_filter_t.cppunit.cc modified in PR(s): #38454
- File FWCore/Framework/test/limited_producer_t.cppunit.cc modified in PR(s): #38454
- File IOPool/Input/src/PoolSource.cc modified in PR(s): #38806
Pull request #38801 was updated. @smuzaffar, @civanch, @Dr15Jones, @makortel, @mdhildreth, @cmsbuild can you please check and sign again.
please test
rebased on top of 7/26/22 IB (last IB all green for IB tests)
base now includes the SubProcess unit test fix so I removed the commit related to that from the PR
1st commit is the main one 2nd commit deals with module base classes in run/lumi/summary caches The rest of the commits are in response to Chris's comments
After these test finish and hopefully pass, I am just waiting on further review comments.
-1
Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-555ed7/26535/summary.html
COMMIT: 0051b36b2a90f99cf1a1079c545c493e5aadaff6
CMSSW: CMSSW_12_5_X_2022-07-29-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38801/26535/install.sh to create a dev area with all the needed externals and cmssw changes.
Unit Tests
I found errors in the following unit tests:
---> test testSiStripDQM_OfflineTkMap had ERRORS
Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 2 differences found in the comparisons
- DQMHistoTests: Total files compared: 51
- DQMHistoTests: Total histograms compared: 3668050
- DQMHistoTests: Total failures: 8
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 3668020
- DQMHistoTests: Total skipped: 22
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
- Checked 210 log files, 47 edm output root files, 51 DQM output files
- TriggerResults: no differences found
The unit test that fails is also failing in the IBs already. It is unrelated to the changes in this PR. It looks like everything else passed. I'm only waiting for more review comments on this one. As far as I know, everything else is done.
Matti, Chris and I had a ZOOM conversation where we were discussing this PR. Here are some changes that were suggested that seemed like improvements worth making:
- Replace PauseStreamQueuesSentry with a SerialQueue
- Add unit tests for 9 major transitions where exceptions might be thrown. The test should intentionally throw during those transitions. Transitions: Event, (begin/end) x (run/lumi) x (stream/global). Unless we already have such tests somewhere. If they want more for other cases Matti and Chris will let me know which ones. There are almost infinite contexts where an exception might be thrown, so we can just test the most important ones.
- In handleNextEventForStreamAsync, add a comment in the code handling an exception while processing an event. The fall through to the recursive call to handleNextEventForStreamAsync is intentional after an exception
- Get rid of "=" capture in processEventAsync
- We'll use git to squash the little commits with fixes in response to minor review comments when the reviewers are OK with the fixes (Matti will let me know when that is)
Let me know if I missed or misunderstood anything from our discussion. Thanks.
Apparently I'm not able to comment in the EventProcessor.cc itself, so I'll try here. In EventProcessor::beginRunAsync() I noticed the status and iSync are captured for lambdas by reference here
https://github.com/cms-sw/cmssw/blob/0051b36b2a90f99cf1a1079c545c493e5aadaff6/FWCore/Framework/src/EventProcessor.cc#L1146
but by value here and afterwards
https://github.com/cms-sw/cmssw/blob/0051b36b2a90f99cf1a1079c545c493e5aadaff6/FWCore/Framework/src/EventProcessor.cc#L1154
Should they be captured consistently for all tasks in the chain? I'd assume iSync could always be captured by reference (but could be wrong), but should status be captured by value?
In this comment I'm addressing iSync only. First this follows the pattern in the existing beginLumiAsync.
Partially this works because in a chain the functor passed to the "first" function is run serially. The functors passed to following "then" functions in the chain are run asynchronously. I've been wondering whether taking advantage of that fact was a good idea or not. Could it ever change? So the second capture must be by value.
Further, if you follow the usage of iSync down the stack it is passed by reference multiple times. So we are potentially saving several copies (although it is not that big...). If I remember correctly at the level in the stack when it is actually needed, a by value copy is made.
For status, the same comment applies as for iSync.
For status, there is the additional question as to whether one could use std::move. std::move does not make sense for IOVSyncValue because it does not have data members for which move makes any sense, but status is a std::shared_ptr which is movable. Even so, we cannot move status because in the captures in beginRunAsync, the status object is used after the capture and so the capture cannot be via std::move.
Thanks, so the current behavior is as expected.
My next question is if EventProcessor::readAndMergeRunEntries() and EventProcessor::readAndMergeLumiEntries() would better be named as ...Async() because they only push a task into a serial task queue and return immediately?
Yes. That is our convention. I'll change the names and add Async to them.
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38801/31413
- This PR adds an extra 76KB to repository