cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Crashes in muon HLT reconstruction (`reco::TrackExtra` product not found)

Open missirol opened this issue 2 years ago • 3 comments

Over the last few weeks, HLT suffered 3 online crashes coming the HLT-muon reconstruction; the first 2 errors are almost identical, while the 3rd one is rather similar to the first 2, but comes from a different producer. The error messages are given below [*].

So far, no one has been able to reproduce any of these errors locally with the relevant error-stream files.

I open an issue to keep track of this, and to ask experts if the error messages suggest to them anything about what might be going wrong.

This config file should be representative of the HLT menu used online during the crashes (representative at least for what concerns the sequences that contain the problematic modules).

FYI: @JanFSchulte @khaosmos93 (Muon-HLT contacts), @silviodonato @Martin-Grunewald @fwyzard

[*]

  1. Run-356433 (CMSSW_12_4_3, Jul 29th):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2802'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
  1. Run-356530 (CMSSW_12_4_3, Aug 1st):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2804'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
  1. Run-357442 (CMSSW_12_4_6, Aug 14th):
[2] Calling method for module TSGForOIFromL2/'hltIterL3OISeedsFromL2MuonsNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2782'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

missirol avatar Aug 15 '22 15:08 missirol

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Aug 15 '22 15:08 cmsbuild

assign hlt

makortel avatar Aug 15 '22 15:08 makortel

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Aug 15 '22 15:08 cmsbuild

The new error is at least somewhat illuminating since it comes from a different CMSSW module as before, one I am more familiar with. Looking at the code, the only place where a TrackRef is created in that module is here https://github.com/cms-sw/cmssw/blob/6ba826d71058bcc4456e66ad7dc2c01edc542626/RecoMuon/TrackerSeedGenerator/plugins/TSGForOIFromL2.cc#L118, which suggests that the missing TrackExtra collection is the one associated with L2 muons. However, I have still no idea why that one would be missing, since the L2 muons themselves are clearly there.

JanFSchulte avatar Aug 15 '22 20:08 JanFSchulte

For completeness, below are the details to rerun the relevant HLT menu on the problematic events, even though the 3 HLT crashes do not seem to be reproducible offline.

  1. Run-356433 (CMSSW_12_4_3)

    • Error:
    [2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
    Exception Message:
    RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2802'
    can not be satisfied because the product cannot be found.
    Probably the branch containing the product is not stored in the input file.
    Additional Info:
    [a] If you wish to continue processing events after a ProductNotFound exception,
    add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
    
    • Recipe (not reproducing error):
    cmsrel CMSSW_12_4_3; cd CMSSW_12_4_3/src; cmsenv
    hltConfigFromDB --runNumber 356433 > hlt.py
    cat >> hlt.py <<@EOF
    process.source.fileListMode = True
    process.source.fileNames = ['/afs/cern.ch/work/m/missirol/public/fog/error_stream/run356433/run356433_ls0007_index000039_fu-c2b01-15-01_pid176387.raw']
    @EOF
    cmsRun hlt.py &> hlt.log
    
  2. Run-356530 (CMSSW_12_4_3)

    • Error:
    [2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
    Exception Message:
    RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2804'
    can not be satisfied because the product cannot be found.
    Probably the branch containing the product is not stored in the input file.
    Additional Info:
    [a] If you wish to continue processing events after a ProductNotFound exception,
    add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
    
    • Recipe (not reproducing error):
    cmsrel CMSSW_12_4_3; cd CMSSW_12_4_3/src; cmsenv
    hltConfigFromDB --runNumber 356530 > hlt.py
    cat >> hlt.py <<@EOF
    process.source.fileListMode = True
    process.source.fileNames = ['/afs/cern.ch/work/m/missirol/public/fog/error_stream/run356530/run356530_ls0004_index000064_fu-c2b04-40-01_pid1803760.raw']
    @EOF
    cmsRun hlt.py &> hlt.log
    
  3. Run-357442 (CMSSW_12_4_6)

    • Error:
    [2] Calling method for module TSGForOIFromL2/'hltIterL3OISeedsFromL2MuonsNoVtx'
    Exception Message:
    RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2782'
    can not be satisfied because the product cannot be found.
    Probably the branch containing the product is not stored in the input file.
    Additional Info:
    [a] If you wish to continue processing events after a ProductNotFound exception,
    add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
    
    • Recipe (not reproducing error):
    cmsrel CMSSW_12_4_6; cd CMSSW_12_4_6/src; cmsenv
    hltConfigFromDB --runNumber 357442 > hlt.py
    cat >> hlt.py <<@EOF
    process.source.fileListMode = True
    process.source.fileNames = ['/afs/cern.ch/work/m/missirol/public/fog/error_stream/run357442/run357442_ls0880_index000100_fu-c2b02-39-01_pid416797.raw']
    @EOF
    cmsRun hlt.py &> hlt.log
    

missirol avatar Aug 19 '22 14:08 missirol

I ran the recipe for Run-357442 by adding EventContentAnalyzer, ~~and see that the ProductID 1:2782 corresponds to hltL2CosmicMuons module. In the configuration, I see hltL2CosmicMuons would be consumed by hltL2MuonCandidatesNoVtxMeanTimerCosmicSeed EDProducer and several EDFilters, and hltL2MuonCandidatesNoVtxMeanTimerCosmicSeed only by several EDFilters. Assuming these EDFilters do not produce data products into the event (that anything else would be consuming), I don't see how a TrackExtraRef to hltL2CosmicMuons could propagate to hltIterL3OISeedsFromL2MuonsNoVtx. (but I could easily be missing something)~~

makortel avatar Aug 19 '22 16:08 makortel

Ok, so I ran my earlier test in a flawed way that lead the ProductIDs to shift. Being more careful I see 1:2782 to correspond hltL2Muons which makes more sense (but doesn't really reveal anything new).

makortel avatar Aug 19 '22 18:08 makortel

Thanks for having a look, Matti.

Trying to find where the output collections of hltL2Muons are produced (incl. the TrackExtras), and I think this happens in https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/TrackingTools/src/MuonTrackLoader.cc#L141

A non-expert look inside suggests that the trackExtra ref is always set for the output tracks (I might be wrong): https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/TrackingTools/src/MuonTrackLoader.cc#L262

Also, I don't know enough to really understand the error message. Based on the latter, can one say that the collection std::vector<reco::TrackExtra> was not produced during the job?

missirol avatar Aug 19 '22 19:08 missirol

The exception in RefCore is thrown here https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/src/RefCore.cc#L125-L131 which is called from RefCore::getProductPtr() https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/src/RefCore.cc#L72-L75 or from RefCore::getThinnedProductPtr() https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/src/RefCore.cc#L103-L107 Since the TrackExtraRef is a Ref to a std::vector<reco::TrackExtra>, the getThinnedProductPtr() is actually the one being called (even if there is no thinning going on). The code goes to check the existence of a thinned product if a non-thinned one is not found in https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/interface/RefItemGet.h#L68-L76 The tryToGetProductWithCoreFromRef() ends up calling https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/src/RefCore.cc#L92 (as in RefCore::getProductPtr()).

The prodGetter->getIt() call goes into https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/EventPrincipal.cc#L288 and https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/EventPrincipal.cc#L254 where this case could happen only if there is no product at all with the given ID, or the producing module was run but did not produce the product (both of which should be reproducible, and should not be the case where).

I was first wondering why exactly the code ends up searching the vector<TrackExtra> product by ProductID, but that is because the TrackExtraRef is constructed from a RefProd<...> and index in https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/TrackingTools/src/MuonTrackLoader.cc#L259

The conditions leading to the ProductNotFound exception should be reproducible. The fact that there is a rare failure makes me wonder a (strange) race condition somewhere.

makortel avatar Aug 19 '22 20:08 makortel

assign core

makortel avatar Aug 25 '22 18:08 makortel

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Aug 25 '22 18:08 cmsbuild

@Dr15Jones and I discovered a "logical race condition" in the framework that would cause symptoms like this (but we can't tell if it is really causing these problems).

The L2MuonProducer that produced the Track and TrackExtra collections declares first the production of the Track collection and then the TrackExtra collection https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/L2MuonProducer/plugins/L2MuonProducer.cc#L114-L117 and Track objects hold Refs to the TrackExtra. Downstream modules consume only the Track collection. The order of produces() declarations dictates the order where Event::commit_aux() moves the products into the EventPrincipal (and to the corresponding ProductResolvers) after the EDProducer::produce() has successfully finished. When a scheduled module puts a product in a ProductResolver, the consumers of that product become eligible to run (unless some other product they depend on have not yet been produced) https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/ProductResolvers.cc#L433-L439

This means that the following can happen

  1. the Track collection is put in the ProductResolver (during the end of hltL2Muons module's produce)
  2. a consumer of that Track collection (e.g. hltIterL3OISeedsFromL2MuonsNoVtx) gets run by another thread
    • the other thread must have nothing else to do in order to steal a task from the thread still running hltL2Muons
  3. the operating system pauses the thread running the hltL2Muons
  4. the consumer calls a method Track that de-references the TrackExtraRef, but the TrackExtra collection is not in the corresponding ProductResolver yet
  5. the hltL2Muons puts the TrackExtra collection into the ProductResolver, but the job is going to terminate

A quick workaround (which I'm going to prepare) is to declare first the production of TrackExtra collection and only then the Track collection.

We will need some time to think for a more general solution (I'd guess the TrackExtra is not the only collection type we have that is only referenced-to by other products without explicit consumption).

As far as I can tell, unscheduled modules are not affected, because for them the product insertion into ProductResolver does not impact module scheduling. Instead, upon prefetch a task that releases all the modules consuming that product is inserted into the WaitingTaskList of the Worker producing the product. https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/ProductResolvers.cc#L468-L515

makortel avatar Aug 25 '22 19:08 makortel

A quick workaround (which I'm going to prepare) is to declare first the production of TrackExtra collection and only then the Track collection.

The PR is here #39201. I limited to modules that I easily saw were being used in the HLT.

makortel avatar Aug 25 '22 20:08 makortel

I was able to reproduce the exception with the "Run-357442" by adding a 1-second sleep between the iterations of the loop https://github.com/cms-sw/cmssw/blob/d21a67ba82ceab762e54b1529b8da55c874d528e/FWCore/Framework/src/Event.cc#L220-L229 for the hltL2Muons module (running with 1 stream and 4 threads).

I tested the mitigation in #39201 with this setup, and the test job succeeded. Of course this test does not imply that there couldn't be some other module(s), or other pairs of similar data products that could cause this problem to show up.

makortel avatar Aug 26 '22 17:08 makortel

I think I have a proper fix in #39245. At least it works with the "reproducer" in https://github.com/cms-sw/cmssw/issues/39064#issuecomment-1228755900 .

makortel avatar Aug 29 '22 20:08 makortel

https://github.com/cms-sw/cmssw/pull/39245 converged. I also made backports to 12_4_X and 12_5_X in #39484 and #39485.

makortel avatar Sep 23 '22 01:09 makortel

+hlt

Thanks @makortel

missirol avatar Sep 23 '22 07:09 missirol

+core

makortel avatar Sep 23 '22 13:09 makortel

This issue is fully signed and ready to be closed.

cmsbuild avatar Sep 23 '22 13:09 cmsbuild