cmssw
cmssw copied to clipboard
Crashes in muon HLT reconstruction (`reco::TrackExtra` product not found)
Over the last few weeks, HLT suffered 3 online crashes coming the HLT-muon reconstruction; the first 2 errors are almost identical, while the 3rd one is rather similar to the first 2, but comes from a different producer. The error messages are given below [*].
So far, no one has been able to reproduce any of these errors locally with the relevant error-stream files.
I open an issue to keep track of this, and to ask experts if the error messages suggest to them anything about what might be going wrong.
This config file should be representative of the HLT menu used online during the crashes (representative at least for what concerns the sequences that contain the problematic modules).
FYI: @JanFSchulte @khaosmos93 (Muon-HLT contacts), @silviodonato @Martin-Grunewald @fwyzard
[*]
- Run-356433 (
CMSSW_12_4_3
, Jul 29th):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2802'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
- Run-356530 (
CMSSW_12_4_3
, Aug 1st):
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2804'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
- Run-357442 (
CMSSW_12_4_6
, Aug 14th):
[2] Calling method for module TSGForOIFromL2/'hltIterL3OISeedsFromL2MuonsNoVtx'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2782'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
Additional Info:
[a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
A new Issue was created by @missirol Marino Missiroli.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign hlt
New categories assigned: hlt
@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
The new error is at least somewhat illuminating since it comes from a different CMSSW module as before, one I am more familiar with. Looking at the code, the only place where a TrackRef
is created in that module is here https://github.com/cms-sw/cmssw/blob/6ba826d71058bcc4456e66ad7dc2c01edc542626/RecoMuon/TrackerSeedGenerator/plugins/TSGForOIFromL2.cc#L118, which suggests that the missing TrackExtra
collection is the one associated with L2 muons. However, I have still no idea why that one would be missing, since the L2 muons themselves are clearly there.
For completeness, below are the details to rerun the relevant HLT menu on the problematic events, even though the 3 HLT crashes do not seem to be reproducible offline.
-
Run-356433 (
CMSSW_12_4_3
)- Error:
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx' Exception Message: RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2802' can not be satisfied because the product cannot be found. Probably the branch containing the product is not stored in the input file. Additional Info: [a] If you wish to continue processing events after a ProductNotFound exception, add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
- Recipe (not reproducing error):
cmsrel CMSSW_12_4_3; cd CMSSW_12_4_3/src; cmsenv hltConfigFromDB --runNumber 356433 > hlt.py cat >> hlt.py <<@EOF process.source.fileListMode = True process.source.fileNames = ['/afs/cern.ch/work/m/missirol/public/fog/error_stream/run356433/run356433_ls0007_index000039_fu-c2b01-15-01_pid176387.raw'] @EOF cmsRun hlt.py &> hlt.log
-
Run-356530 (
CMSSW_12_4_3
)- Error:
[2] Calling method for module TSGFromL2Muon/'hltL3TrajSeedOIStateNoVtx' Exception Message: RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2804' can not be satisfied because the product cannot be found. Probably the branch containing the product is not stored in the input file. Additional Info: [a] If you wish to continue processing events after a ProductNotFound exception, add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
- Recipe (not reproducing error):
cmsrel CMSSW_12_4_3; cd CMSSW_12_4_3/src; cmsenv hltConfigFromDB --runNumber 356530 > hlt.py cat >> hlt.py <<@EOF process.source.fileListMode = True process.source.fileNames = ['/afs/cern.ch/work/m/missirol/public/fog/error_stream/run356530/run356530_ls0004_index000064_fu-c2b04-40-01_pid1803760.raw'] @EOF cmsRun hlt.py &> hlt.log
-
Run-357442 (
CMSSW_12_4_6
)- Error:
[2] Calling method for module TSGForOIFromL2/'hltIterL3OISeedsFromL2MuonsNoVtx' Exception Message: RefCore: A request to resolve a reference to a product of type 'std::vector<reco::TrackExtra>' with ProductID '1:2782' can not be satisfied because the product cannot be found. Probably the branch containing the product is not stored in the input file. Additional Info: [a] If you wish to continue processing events after a ProductNotFound exception, add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.
- Recipe (not reproducing error):
cmsrel CMSSW_12_4_6; cd CMSSW_12_4_6/src; cmsenv hltConfigFromDB --runNumber 357442 > hlt.py cat >> hlt.py <<@EOF process.source.fileListMode = True process.source.fileNames = ['/afs/cern.ch/work/m/missirol/public/fog/error_stream/run357442/run357442_ls0880_index000100_fu-c2b02-39-01_pid416797.raw'] @EOF cmsRun hlt.py &> hlt.log
I ran the recipe for Run-357442 by adding EventContentAnalyzer
, ~~and see that the ProductID 1:2782
corresponds to hltL2CosmicMuons
module. In the configuration, I see hltL2CosmicMuons
would be consumed by hltL2MuonCandidatesNoVtxMeanTimerCosmicSeed
EDProducer and several EDFilters, and hltL2MuonCandidatesNoVtxMeanTimerCosmicSeed
only by several EDFilters. Assuming these EDFilters do not produce data products into the event (that anything else would be consuming), I don't see how a TrackExtraRef
to hltL2CosmicMuons
could propagate to hltIterL3OISeedsFromL2MuonsNoVtx
. (but I could easily be missing something)~~
Ok, so I ran my earlier test in a flawed way that lead the ProductIDs to shift. Being more careful I see 1:2782
to correspond hltL2Muons
which makes more sense (but doesn't really reveal anything new).
Thanks for having a look, Matti.
Trying to find where the output collections of hltL2Muons
are produced (incl. the TrackExtra
s), and I think this happens in
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/TrackingTools/src/MuonTrackLoader.cc#L141
A non-expert look inside suggests that the trackExtra ref is always set for the output tracks (I might be wrong): https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/TrackingTools/src/MuonTrackLoader.cc#L262
Also, I don't know enough to really understand the error message. Based on the latter, can one say that the collection std::vector<reco::TrackExtra>
was not produced during the job?
The exception in RefCore
is thrown here
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/src/RefCore.cc#L125-L131
which is called from RefCore::getProductPtr()
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/src/RefCore.cc#L72-L75
or from RefCore::getThinnedProductPtr()
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/src/RefCore.cc#L103-L107
Since the TrackExtraRef
is a Ref
to a std::vector<reco::TrackExtra>
, the getThinnedProductPtr()
is actually the one being called (even if there is no thinning going on). The code goes to check the existence of a thinned product if a non-thinned one is not found in
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/interface/RefItemGet.h#L68-L76
The tryToGetProductWithCoreFromRef()
ends up calling
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/DataFormats/Common/src/RefCore.cc#L92
(as in RefCore::getProductPtr()
).
The prodGetter->getIt()
call goes into
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/EventPrincipal.cc#L288
and
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/EventPrincipal.cc#L254
where this case could happen only if there is no product at all with the given ID, or the producing module was run but did not produce the product (both of which should be reproducible, and should not be the case where).
I was first wondering why exactly the code ends up searching the vector<TrackExtra>
product by ProductID, but that is because the TrackExtraRef
is constructed from a RefProd<...>
and index in
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/TrackingTools/src/MuonTrackLoader.cc#L259
The conditions leading to the ProductNotFound
exception should be reproducible. The fact that there is a rare failure makes me wonder a (strange) race condition somewhere.
assign core
New categories assigned: core
@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
@Dr15Jones and I discovered a "logical race condition" in the framework that would cause symptoms like this (but we can't tell if it is really causing these problems).
The L2MuonProducer
that produced the Track
and TrackExtra
collections declares first the production of the Track
collection and then the TrackExtra
collection
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/RecoMuon/L2MuonProducer/plugins/L2MuonProducer.cc#L114-L117
and Track
objects hold Refs to the TrackExtra
. Downstream modules consume only the Track
collection. The order of produces()
declarations dictates the order where Event::commit_aux()
moves the products into the EventPrincipal
(and to the corresponding ProductResolvers) after the EDProducer::produce()
has successfully finished. When a scheduled module puts a product in a ProductResolver, the consumers of that product become eligible to run (unless some other product they depend on have not yet been produced)
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/ProductResolvers.cc#L433-L439
This means that the following can happen
- the
Track
collection is put in the ProductResolver (during the end ofhltL2Muons
module's produce) - a consumer of that
Track
collection (e.g.hltIterL3OISeedsFromL2MuonsNoVtx
) gets run by another thread- the other thread must have nothing else to do in order to steal a task from the thread still running
hltL2Muons
- the other thread must have nothing else to do in order to steal a task from the thread still running
- the operating system pauses the thread running the
hltL2Muons
- the consumer calls a method
Track
that de-references theTrackExtraRef
, but theTrackExtra
collection is not in the corresponding ProductResolver yet - the
hltL2Muons
puts theTrackExtra
collection into the ProductResolver, but the job is going to terminate
A quick workaround (which I'm going to prepare) is to declare first the production of TrackExtra
collection and only then the Track
collection.
We will need some time to think for a more general solution (I'd guess the TrackExtra
is not the only collection type we have that is only referenced-to by other products without explicit consumption).
As far as I can tell, unscheduled modules are not affected, because for them the product insertion into ProductResolver does not impact module scheduling. Instead, upon prefetch a task that releases all the modules consuming that product is inserted into the WaitingTaskList
of the Worker producing the product.
https://github.com/cms-sw/cmssw/blob/2af4be6bb84338820c8d181f4d5dc4f4a5e61dee/FWCore/Framework/src/ProductResolvers.cc#L468-L515
A quick workaround (which I'm going to prepare) is to declare first the production of
TrackExtra
collection and only then theTrack
collection.
The PR is here #39201. I limited to modules that I easily saw were being used in the HLT.
I was able to reproduce the exception with the "Run-357442" by adding a 1-second sleep between the iterations of the loop
https://github.com/cms-sw/cmssw/blob/d21a67ba82ceab762e54b1529b8da55c874d528e/FWCore/Framework/src/Event.cc#L220-L229
for the hltL2Muons
module (running with 1 stream and 4 threads).
I tested the mitigation in #39201 with this setup, and the test job succeeded. Of course this test does not imply that there couldn't be some other module(s), or other pairs of similar data products that could cause this problem to show up.
I think I have a proper fix in #39245. At least it works with the "reproducer" in https://github.com/cms-sw/cmssw/issues/39064#issuecomment-1228755900 .
https://github.com/cms-sw/cmssw/pull/39245 converged. I also made backports to 12_4_X and 12_5_X in #39484 and #39485.
+hlt
Thanks @makortel
+core
This issue is fully signed and ready to be closed.