cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Initialize BranchDescription from Dictionary only if the branch is present

Open makortel opened this issue 2 years ago • 46 comments

PR description:

This PR limits the BranchDescription's ROOT dicrionary part be initialized only if the branch is present. This change allows a job to process a file that contains products whose parent product (that is not in the file, was either transient or dropped) dictionary does not exist anymore in the job. Resolves https://github.com/cms-sw/cmssw/issues/38781.

I'm uncertain about the usefulness of the test added in this PR. It is sufficient to demonstrate that the changes of this PR work as intended, but it has some serious flaws for longer term.

PR validation:

Problematic job runs, framework unit tests pass.

makortel avatar Jul 20 '22 21:07 makortel

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38806/31164

  • This PR adds an extra 68KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File DataFormats/Provenance/interface/BranchDescription.h modified in PR(s): #38454
    • File IOPool/Input/src/PoolSource.cc modified in PR(s): #38801

Code check has found code style and quality issues which could be resolved by applying following patch(s)

  • code-format: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38806/31164/code-format.patch e.g. curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38806/31164/code-format.patch | patch -p1 You can also run scram build code-format to apply code format directly

cmsbuild avatar Jul 20 '22 21:07 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38806/31165

  • This PR adds an extra 68KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File DataFormats/Provenance/interface/BranchDescription.h modified in PR(s): #38454
    • File IOPool/Input/src/PoolSource.cc modified in PR(s): #38801

cmsbuild avatar Jul 20 '22 21:07 cmsbuild

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

  • DataFormats/Common (core)
  • DataFormats/Provenance (core)
  • DataFormats/TestObjects (core)
  • FWCore/Framework (core)
  • IOPool/Input (core)

@cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please review it and eventually sign? Thanks. @wddgit, @rovere this is something you requested to watch as well. @perrotta, @dpiparo, @qliphy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

cmsbuild avatar Jul 20 '22 21:07 cmsbuild

@cmsbuild, please test

makortel avatar Jul 20 '22 21:07 makortel

@Dr15Jones @wddgit Can you think of any ill side effects that might not be covered by our unit tests?

makortel avatar Jul 20 '22 21:07 makortel

I'll take a look at it.

wddgit avatar Jul 20 '22 21:07 wddgit

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-384df2/26357/summary.html COMMIT: fa0f54193995534e034258dbd6cda73ec3e69317 CMSSW: CMSSW_12_5_X_2022-07-20-1100/el8_amd64_gcc10 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38806/26357/install.sh to create a dev area with all the needed externals and cmssw changes.

CMS deprecated warnings: 9 CMS deprecated warnings found, see summary page for details.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3662417
  • DQMHistoTests: Total failures: 17
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3662377
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Jul 21 '22 06:07 cmsbuild

We already discussed that we keep BranchDescriptions for kept products and their ancestors. There is one other case where we also keep the BranchDescription. If there is a BranchAlias and we keep the alias BranchDescription, then we also keep the original BranchDescription. I'm not sure that this affects anything... Just mentioning it because we didn't include that when we were discussing it earlier.

wddgit avatar Jul 21 '22 18:07 wddgit

Is there anything that inhibits callWhenNewProductsRegistered from being called on BranchDescriptions not present? Code like this would fail in that case.

https://cmssdt.cern.ch/dxr/CMSSW/source/RecoMuon/L3MuonProducer/src/L3TkMuonProducer.cc#38

wddgit avatar Jul 21 '22 20:07 wddgit

This looks potentially problematic:

https://cmssdt.cern.ch/dxr/CMSSW/source/FWCore/Framework/interface/stream/ThinningProducer.h#163

wddgit avatar Jul 21 '22 21:07 wddgit

That is all I noticed going through this quickly this afternoon. I could have missed something. There are a number of uses of callWhenNewProductsRegistered that might be problematic. (Now back to debugging the unit test that failed in the concurrent run PR)

wddgit avatar Jul 21 '22 21:07 wddgit

Thanks @wddgit. I would think we could either restrict callWhenNewProductsRegistered() callback to be called only for branches that are present, or go through all callWhenNewProductsRegistered() callbacks and add a check for branch presence for cases where the type information is used. I would (naively) think he callbacks to not be very useful for non-present branches.

The ThinningProducer looks potentially more fundamental problem, need to think about it more.

makortel avatar Jul 22 '22 01:07 makortel

I tested that currently indeed callWhenNewProductsRegistered() callbacks are called for non-present branches as well.

I went through all uses of callWhenNewProductsRegistered(), here are notes from that

L3TkMuonProducer

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/RecoMuon/L3MuonProducer/src/L3TkMuonProducer.cc#L36-L42

AFAICT this is used only to get L3MuonTrajectorySeedCollection from the Event in https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/RecoMuon/L3MuonProducer/src/L3TkMuonProducer.cc#L144-L151 I think this could be replaced with the use of RefProd<L3MuonTrajectorySeedCollection>.

EcalDeadCellTriggerPrimitiveFilter

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/RecoMET/METFilters/plugins/EcalDeadCellTriggerPrimitiveFilter.cc#L188-L208

Here the behavior would certainly change if callWhenNewProductsRegistered() would be called only for present branches. On a cursory look the code looks like it wants to get the products whose existence is tested in callWhenNewProductsRegistered() though, so changing the (theoretical) behavior could be fine.

Framework test modules

Code like this https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/FWCore/Framework/test/stubs/TestGlobalAnalyzers.cc#L52-L56 in edmtest::global::StreamIntAnalyzer, edmtest::global::StreamIntProducer, edmtest::limited::StreamIntAnalyzer, edmtest::one::SharedResourcesAnalyzer, and edmtest::stream::GlobalIntAnalyzer. I'd say these would be unaffected.

ManyIntWhenRegisteredProducer

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/FWCore/Framework/test/stubs/ToyIntProducers.cc#L516-L528 effectively assumes all relevant branches to exist

EventContentAnalyzer

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/FWCore/Modules/src/EventContentAnalyzer.cc#L299-L318

Here the callWhenNewProductsRegistered() is used only conjunction of getData_, i.e. it ignoring dropped branches should be fine.

GenericConsumer

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/FWCore/Modules/src/GenericConsumer.cc#L45-L57 etc. This code only declares consumption of the matched branches, so ignoring dropped branches should be fine.

On a separate note, this module does not currently support ProcessBlocks (it will throw an exception if the ProductRegistry has any ProcessBlock branches).

LogErrorHarvester

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/FWCore/Modules/src/LogErrorHarvester.cc#L65-L68

Explicitly limits to the current process, so ignoring dropped branches should be fine.

SwitchProducer

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/FWCore/Modules/src/SwitchProducer.cc#L27-L28

Explicitly limits to the current process, so ignoring dropped branches should be fine.

On a separate note I see now that there is another way to get the current process name there, so I could look into removing the @process_name parameter of the module.

ConcurrentHadronizerFilter, HadronizerFilter

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/GeneratorInterface/Core/interface/ConcurrentHadronizerFilter.h#L139-L149

The call is about consuming the matched branch, so ignoring dropped branches should be fine.

TriggerResultsFilter

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/HLTrigger/HLTfilters/plugins/TriggerResultsFilter.cc#L52-L57

The call is about consuming the matched branch, and non-existing products are ignored, so ignoring dropped branches should be fine.

L1GtUtilsHelper

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/L1Trigger/GlobalTriggerAnalyzer/interface/L1GtUtilsHelper.h#L177 https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/L1Trigger/GlobalTriggerAnalyzer/src/L1GtUtilsHelper.cc#L33-L38

Lot's of complex code, but the comment above suggests that the idea is to find a branch that matches to a specific condition, and then consume that branch. So ignoring dropped branches could be fine.

l1t::L1TGlobalUtilHelper

https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/L1Trigger/L1TGlobal/interface/L1TGlobalUtilHelper.h#L158 https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/L1Trigger/L1TGlobal/src/L1TGlobalUtilHelper.cc#L51-L53

Explicitly ignores dropped branches.

Users of edm::GetterOfProducts

edm::GetterOfProducts explicitly ignores dropped branches https://github.com/cms-sw/cmssw/blob/1a59ba00193440d65339de1c600e01c15993e5de/FWCore/Framework/interface/GetterOfProducts.h#L130-L132

The following modules used only edm::GetterOfProducts: DQMFileSaver, DQMEDHarvester, edmtest::TestFindProduct, edmtest::TestGetterOfProducts, TriggerSummaryProducerAOD, TriggerSummaryProducerRAW, edmtest_thing::StreamThingAnalyzer, PATTriggerEventProducer and PATTriggerProducer.

makortel avatar Jul 22 '22 16:07 makortel

The ThinningProducer looks potentially more fundamental problem, need to think about it more.

After some more thought and poking around the code, I think now that the ThinningProducer is not a fundamental problem. The loop over ProductRegistry's BranchDescription elements contains the following conditionals

  • If product's type matches the collection type
    • If branch is produced and label matches the current module: record BranchID as a thinned collection
    • If label and instance names match those of the ThinningProducer's inputTag parameter
      • If InputTag should skip the current process, and if the branch is not produced, record it as a parent collection
      • Else if InputTag's process name is empty or the current process, record it as a parent collection
  • If product's type matches ThinnedAssociation, branch is produced, label matches the current module: record the branch as an association

The thinned collection and association must be present (branches are produced, by the same module). The ThinningProducer also expects the parent collection to be present, otherwise the module will throw a ProductNotFound exception in its produce().

I think the loop over BranchDescriptions could just ignore all dropped branches.

makortel avatar Jul 25 '22 19:07 makortel

I added a protection for dropped branches in ThinningProducers in the last push. The callWhenNewProductsRegistered() are handled in #38872.

makortel avatar Jul 29 '22 01:07 makortel

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38806/31322

  • This PR adds an extra 76KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File DataFormats/Provenance/interface/BranchDescription.h modified in PR(s): #38454
    • File IOPool/Input/src/PoolSource.cc modified in PR(s): #38801

cmsbuild avatar Jul 29 '22 01:07 cmsbuild

Pull request #38806 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please check and sign again.

cmsbuild avatar Jul 29 '22 01:07 cmsbuild

@cmsbuild, please test

makortel avatar Jul 29 '22 01:07 makortel

It seems tests are stuck because of a failure in one of our nodes. I am going to restart them again. Sorry for the inconveniences!

aandvalenzuela avatar Jul 29 '22 07:07 aandvalenzuela

please abort

aandvalenzuela avatar Jul 29 '22 07:07 aandvalenzuela

please test

aandvalenzuela avatar Jul 29 '22 07:07 aandvalenzuela

I think the failure in unit test is unrelated to this PR (https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-384df2/26522/unitTests/failed.html). It seems it is related to the failure in IBs in DQM/SiStripMonitorClient module already reported in https://github.com/cms-sw/cmssw/issues/38885.

aandvalenzuela avatar Jul 29 '22 12:07 aandvalenzuela

-1

Failed Tests: UnitTests Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-384df2/26522/summary.html COMMIT: a618ee2a0b0b3d14982f7295f708401e4ad39eac CMSSW: CMSSW_12_5_X_2022-07-28-2300/el8_amd64_gcc10 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38806/26522/install.sh to create a dev area with all the needed externals and cmssw changes.

CMS deprecated warnings: 9 CMS deprecated warnings found, see summary page for details.

Unit Tests

I found errors in the following unit tests:

---> test testSiStripDQM_OfflineTkMap had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 10 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3668050
  • DQMHistoTests: Total failures: 29
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3667998
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 50 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 210 log files, 47 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Jul 29 '22 17:07 cmsbuild

4.53 shows the same difference as in #38603, which @Dr15Jones and I have understood to be a deficiency in L1GtUtilsHelper and for which @Dr15Jones is preparing a fix.

The unit test fails already in IB.

makortel avatar Jul 29 '22 17:07 makortel

@cmsbuild, please test

Refresh to see if the comparison differences are gone

makortel avatar Aug 03 '22 13:08 makortel

-1

Failed Tests: UnitTests Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-384df2/26622/summary.html COMMIT: a618ee2a0b0b3d14982f7295f708401e4ad39eac CMSSW: CMSSW_12_5_X_2022-08-03-1100/el8_amd64_gcc10 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38806/26622/install.sh to create a dev area with all the needed externals and cmssw changes.

CMS deprecated warnings: 9 CMS deprecated warnings found, see summary page for details.

Unit Tests

I found errors in the following unit tests:

---> test EcalTPG_updateWeightGroup_test had ERRORS

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3691510
  • DQMHistoTests: Total failures: 14
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3691474
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Aug 03 '22 20:08 cmsbuild

The unit test fails randomly in IBs as well (#38964).

makortel avatar Aug 03 '22 20:08 makortel

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38806/31435

  • This PR adds an extra 80KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File DataFormats/Provenance/interface/BranchDescription.h modified in PR(s): #38454
    • File IOPool/Input/src/PoolSource.cc modified in PR(s): #38801

Code check has found code style and quality issues which could be resolved by applying following patch(s)

  • code-format: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38806/31435/code-format.patch e.g. curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38806/31435/code-format.patch | patch -p1 You can also run scram build code-format to apply code format directly

cmsbuild avatar Aug 06 '22 00:08 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38806/31436

  • This PR adds an extra 80KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File DataFormats/Provenance/interface/BranchDescription.h modified in PR(s): #38454
    • File IOPool/Input/src/PoolSource.cc modified in PR(s): #38801

cmsbuild avatar Aug 06 '22 00:08 cmsbuild

Pull request #38806 was updated. @cmsbuild, @smuzaffar, @Dr15Jones, @makortel can you please check and sign again.

cmsbuild avatar Aug 06 '22 00:08 cmsbuild