Segmentation violation for StandAloneMuonProducer in CMSSW_14_0_7
Dear all, As reported in https://cms-talk.web.cern.ch/t/segmentation-violation-in-promptreco-for-parkingdoublemuonlowmass1-run-380963/41520 we have a segmentation violation for the ParkingDoubleMuonLowMass dataset in Run 380963, with the following stack trace:
Begin processing the 1590th record. Run 380963, Event 73250130, LumiSection 85 on stream 3 at 24-May-2024 02:57:25.663 UTC
%MSG-e KFUpdator: StandAloneMuonProducer:displacedStandAloneMuons 24-May-2024 02:57:26 UTC Run: 380963 Event: 73250130
could not invert martix:
[ 3.0909e+13 4.23366e+13 6.14231e+14 8.06196e+14
4.23366e+13 5.79893e+13 8.41324e+14 1.10426e+15
6.14231e+14 8.41324e+14 1.22062e+16 1.60209e+16
8.06196e+14 1.10426e+15 1.60209e+16 2.10279e+16 ]
%MSG
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
...
Thread 7 (Thread 0x14b2a4ffd700 (LWP 1099) "cmsRun"):
#0 0x000014b2fe1bd301 in poll () from /lib64/libc.so.6
#1 0x000014b2fa18c6af in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2 0x000014b2fa140dbc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3 0x000014b2fa141720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x000014b2c4879cb7 in Propagator::propagateWithPath(TrajectoryStateOnSurface const&, Plane const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libTrackP
ropagationSteppingHelixPropagator.so
#6 0x000014b2f4a1b219 in ForwardDetLayer::compatible(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0
_7/lib/el8_amd64_gcc12/libTrackingToolsDetLayers.so
#7 0x000014b2c47ef36f in MuRingForwardLayer::compatibleDets(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMS
SW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonDetLayers.so
#8 0x000014b2c47efc73 in MuRingForwardDoubleLayer::groupedCompatibleDets(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/
cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonDetLayers.so
#9 0x000014b28ba43c78 in MuonDetLayerMeasurements::groupedMeasurements(DetLayer const*, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, edm::Event const&) () from /cvmfs/
cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonMeasurementDet.so
#10 0x000014b28ba44989 in MuonDetLayerMeasurements::groupedMeasurements(DetLayer const*, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) () from /cvmfs/cms.cern.ch/el8_amd
64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonMeasurementDet.so
#11 0x000014b2683aed74 in StandAloneMuonFilter::findBestMeasurements(DetLayer const*, TrajectoryStateOnSurface const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12
/libRecoMuonStandAloneTrackFinder.so
#12 0x000014b2683b2585 in StandAloneMuonFilter::refit(TrajectoryStateOnSurface const&, DetLayer const*, Trajectory&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/l
ibRecoMuonStandAloneTrackFinder.so
#13 0x000014b2683b46a4 in StandAloneMuonTrajectoryBuilder::trajectories(TrajectorySeed const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonStandAloneTr
ackFinder.so
...
Current Modules:
Module: StandAloneMuonProducer:displacedStandAloneMuons (crashed)
Module: CkfTrackCandidateMaker:lowPtTripletStepTrackCandidates
Module: CkfTrackCandidateMaker:lowPtQuadStepTrackCandidates
Module: TrackingMonitor:TrackSeedMondetachedQuadStep
Module: PoolOutputModule:write_MINIAOD
Module: SiStripMonitorTrack:SiStripMonitorTrackCommon
Module: MkFitProducer:detachedTripletStepTrackCandidatesMkFit
Module: CkfTrackCandidateMaker:pixelPairStepTrackCandidates
A fatal system signal has occurred: segmentation violation
Complete
A minimal recipe to reproduce the error is:
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3221638/job/WMTaskSpace/cmsRun1/PSet.p* .
cat <<EOF >> PSet.py
process.options.numberOfThreads=cms.untracked.uint32(1)
process.options.numberOfStreams=cms.untracked.uint32(1)
process.source.skipEvents = cms.untracked.uint32(1589)
EOF
cmsRun -e PSet.py
cms-bot internal usage
A new Issue was created by @francescobrivio.
@Dr15Jones, @antoniovilela, @rappoccio, @sextonkennedy, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction
New categories assigned: reconstruction
@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
FYI @cms-sw/muon-pog-l2 and @cms-sw/muon-dpg-l2
type muon
For the record with this:
diff --git a/TrackingTools/DetLayers/src/ForwardDetLayer.cc b/TrackingTools/DetLayers/src/ForwardDetLayer.cc
index d6f0afdf22c..b5d2eaa27ac 100644
--- a/TrackingTools/DetLayers/src/ForwardDetLayer.cc
+++ b/TrackingTools/DetLayers/src/ForwardDetLayer.cc
@@ -88,7 +88,13 @@ pair<bool, TrajectoryStateOnSurface> ForwardDetLayer::compatible(const Trajector
edm::LogError("DetLayers") << "ERROR: BarrelDetLayer::compatible() is used before the layer surface is initialized";
// throw an exception? which one?
- TrajectoryStateOnSurface myState = prop.propagate(ts, specificSurface());
+ TrajectoryStateOnSurface myState;
+ if UNLIKELY (!ts.isValid()) {
+ edm::LogError("DetLayers") << "ERROR: trying to propagate invalid TSOS" << std::endl;
+ } else {
+ myState = prop.propagate(ts, specificSurface());
+ }
+
if UNLIKELY (!myState.isValid())
return make_pair(false, myState);
one gets past the error and in the log running the script at https://github.com/cms-sw/cmssw/issues/45035#issue-2314665547 one can find:
%MSG-e KFUpdator: StandAloneMuonProducer:displacedStandAloneMuons 24-May-2024 11:33:49 CEST Run: 380963 Event: 73250130
could not invert martix:
[ 3.0909e+13 4.23366e+13 6.14231e+14 8.06196e+14
4.23366e+13 5.79893e+13 8.41324e+14 1.10426e+15
6.14231e+14 8.41324e+14 1.22062e+16 1.60209e+16
8.06196e+14 1.10426e+15 1.60209e+16 2.10279e+16 ]
%MSG
%MSG-e DetLayers: StandAloneMuonProducer:displacedStandAloneMuons 24-May-2024 11:33:49 CEST Run: 380963 Event: 73250130
ERROR: trying to propage invalid TSOS
%MSG
%MSG-e DetLayers: StandAloneMuonProducer:displacedStandAloneMuons 24-May-2024 11:33:49 CEST Run: 380963 Event: 73250130
ERROR: trying to propage invalid TSOS
%MSG
it seems plausible that then the problem originates here:
https://github.com/cms-sw/cmssw/blob/32dc27774b47d882609be07a885b060b9398d96e/TrackingTools/KalmanUpdators/src/KFUpdator.cc#L170-L173
where upon being unable to invert a matrix, a default constructed TSOS (with invalid state) is returned.
Let me tag the Muon Reco contacts too: @24LopezR @rbhattacharya04
it seems plausible that then the problem originates here:
it's not in the crash stack.
Do you mean the preceding LogError?
For a fix, instead of ForwardDetLayer::compatible, an upstream caller may be a better place to decide what to do with a bad state that originated there.
Do you mean the preceding LogError?
yes
an upstream caller may be a better place to decide what to do with a bad state that originated there.
is:
diff --git a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
index 5253626ab46..0150b405859 100644
--- a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
+++ b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
@@ -299,6 +299,11 @@ std::vector<TrajectoryMeasurement> StandAloneMuonFilter::findBestMeasurements(co
std::vector<TrajectoryMeasurement> result;
std::vector<TrajectoryMeasurement> measurements;
+ if (!tsos.isValid()) {
+ edm::LogError(metname) << "ERROR: trying to propagate invalid TSOS" << std::endl;
+ return result;
+ }
+
if (theOverlappingChambersFlag && layer->hasGroups()) {
std::vector<TrajectoryMeasurementGroup> measurementGroups =
theMeasurementExtractor->groupedMeasurements(layer, tsos, *propagator(), *estimator());
upstream enough ?
@@ -299,6 +299,11 @@ std::vector<TrajectoryMeasurement> StandAloneMuonFilter::findBestMeasurements(co
StandAloneMuonFilter::refit is missing checks of validity, although it does check the empty output from the findBestMeasurements in lieu of that.
In case the invalid TSOS is the StandAloneMuonFilter::refit input, then going further up would be more clear.
on a second thought, I'm not really against the edm::LogError in the ForwardDetLayers; it's still more appropriate than a crash.
The hope from going upstream would be to see if there it could become a case more suitable for a LogInfo
type tracking
In case the invalid TSOS is the
StandAloneMuonFilter::refitinput, then going further up would be more clear. [...] The hope from going upstream would be to see if there it could become a case more suitable for aLogInfo
for this case I tested (successfully) also this variant:
diff --git a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
index 5253626ab46..08deb0200a1 100644
--- a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
+++ b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
@@ -233,7 +233,13 @@ void StandAloneMuonFilter::refit(const TrajectoryStateOnSurface& initialTSOS,
LogTrace(metname) << "search Trajectory Measurement from: " << lastTSOS.globalPosition();
// pick the best measurement from each group
- std::vector<TrajectoryMeasurement> bestMeasurements = findBestMeasurements(*layer, lastTSOS);
+ std::vector<TrajectoryMeasurement> bestMeasurements{};
+
+ if (lastTSOS.isValid()) {
+ bestMeasurements = findBestMeasurements(*layer, lastTSOS);
+ } else {
+ edm::LogInfo(metname) << "Invalid last TSOS, will not find best measurements ";
+ }
// RB: Different ways can be choosen if no bestMeasurement is available:
// 1- check on lastTSOS-initialTSOS eta difference
@@ -248,10 +254,16 @@ void StandAloneMuonFilter::refit(const TrajectoryStateOnSurface& initialTSOS,
// wrt the initial one (i.e. seed), then try to find the measurements
// according to the lastButOne FTS. (1B)
double lastdEta = fabs(lastTSOS.freeTrajectoryState()->momentum().eta() - eta0);
+
if (bestMeasurements.empty() && lastdEta > 0.1) {
LogTrace(metname) << "No measurement and big eta variation wrt seed" << endl
<< "trying with lastButOneUpdatedTSOS";
- bestMeasurements = findBestMeasurements(*layer, theLastButOneUpdatedTSOS);
+
+ if (theLastButOneUpdatedTSOS.isValid()) {
+ bestMeasurements = findBestMeasurements(*layer, theLastButOneUpdatedTSOS);
+ } else {
+ edm::LogInfo(metname) << "Invalid last but one updated TSOS, will not find best measurements ";
+ }
}
//if no measurement found and the current FTS has an eta very different
@@ -259,7 +271,12 @@ void StandAloneMuonFilter::refit(const TrajectoryStateOnSurface& initialTSOS,
//according to the initial FTS. (1A)
if (bestMeasurements.empty() && lastdEta > 0.1) {
LogTrace(metname) << "No measurement and big eta variation wrt seed" << endl << "tryng with seed TSOS";
- bestMeasurements = findBestMeasurements(*layer, initialTSOS);
+
+ if (initialTSOS.isValid()) {
+ bestMeasurements = findBestMeasurements(*layer, initialTSOS);
+ } else {
+ edm::LogInfo(metname) << "Invalid initial TSOS, will not find best measurements ";
+ }
}
// FIXME: uncomment this line!!
(in the specifics of this crash, it was actually theLastButOneUpdatedTSOS not being valid that lead to the crash in StandAloneMuonFilter::findBestMeasurements.).
I'm not really against the edm::LogError in the ForwardDetLayers; it's still more appropriate than a crash.
if either way is fine, I'd let the domain experts to decide / implement a protection.
We currently have other two failures all reporting
Module: StandAloneMuonProducer:displacedStandAloneMuons (crashed)
- PromptReco_Run381065_ParkingDoubleMuonLowMass3
- Tarball in
/eos/cms/tier0/store/unmerged/data/logs/prod/2024/5/25/PromptReco_Run381065_ParkingDoubleMuonLowMass3/Reco/0000/3/3508ad47-e770-4fff-bf82-400ad0f155cd-63-3-logArchive.tar.gz
- Tarball in
- PromptReco_Run381067_ParkingDoubleMuonLowMass2
- Tarball in
/eos/cms/tier0/store/unmerged/data/logs/prod/2024/5/26/PromptReco_Run381067_ParkingDoubleMuonLowMass2/Reco/0000/3/a85151ad-f91a-4493-ac15-756b7c6fba0f-3-3-logArchive.tar.gz
- Tarball in
@cms-sw/reconstruction-l2 @24LopezR @rbhattacharya04 Do you have any update? @mmusich proposed two different possible solutions in the comments above...
I have tested the provided fix on the crashing jobs, and while it solves the first issue we observe, it doesn't help with the crashes observed in the other 2 jobs:
- PromptReco_Run380963_ParkingDoubleMuonLowMass1 --> solved
- PromptReco_Run381065_ParkingDoubleMuonLowMass3 --> still crashing
- PromptReco_Run381067_ParkingDoubleMuonLowMass2 --> still crashing
Recipes to reproduce the second two jobs Run381065
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3544004/job/WMTaskSpace/cmsRun1/PSet* .
cat <<EOF >> PSet.py
process.options.numberOfThreads=cms.untracked.uint32(1)
process.options.numberOfStreams=cms.untracked.uint32(1)
process.source.skipEvents = cms.untracked.uint32(1807)
EOF
cmsRun PSet.py
Run381067
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3585803/job/WMTaskSpace/cmsRun1/PSet* .
cat <<EOF >> PSet.py
process.options.numberOfThreads=cms.untracked.uint32(1)
process.options.numberOfStreams=cms.untracked.uint32(1)
process.source.skipEvents = cms.untracked.uint32(1463)
EOF
cmsRun PSet.py
And Matt's fix can be fetched with:
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
git cms-addpkg RecoMuon/StandAloneTrackFinder
git remote add matt [email protected]:mandrenguyen/cmssw.git
git fetch matt muSegFault_140X
git cherry-pick addd479c0ea87af0b586ed366c64d0de3212e4df
scram b -j 8
PromptReco_Run381065_ParkingDoubleMuonLowMass3 --> still crashing PromptReco_Run381067_ParkingDoubleMuonLowMass2 --> still crashing
if you have reproduced offline, can you update the recipe with the number of events to skip to get faster at the crash?
PromptReco_Run381065_ParkingDoubleMuonLowMass3 --> still crashing PromptReco_Run381067_ParkingDoubleMuonLowMass2 --> still crashing
if you have reproduced offline, can you update the recipe with the number of events to skip to get faster at the crash?
sorry i forgot to add it in the recipe! I've edited the message above.
Run381065
cmsrel CMSSW_14_0_7 cd CMSSW_14_0_7/src cmsenv cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3544004/job/WMTaskSpace/cmsRun1/PSet* . cat <<EOF >> PSet.py process.options.numberOfThreads=cms.untracked.uint32(1) process.options.numberOfStreams=cms.untracked.uint32(1) process.source.skipEvents = cms.untracked.uint32(1807) EOF cmsRun PSet.pyRun381067
cmsrel CMSSW_14_0_7 cd CMSSW_14_0_7/src cmsenv cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3585803/job/WMTaskSpace/cmsRun1/PSet* . cat <<EOF >> PSet.py process.options.numberOfThreads=cms.untracked.uint32(1) process.options.numberOfStreams=cms.untracked.uint32(1) process.source.skipEvents = cms.untracked.uint32(1463) EOF cmsRun PSet.py
the two other crashes happen at a different location:
https://github.com/cms-sw/cmssw/blob/32dc27774b47d882609be07a885b060b9398d96e/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc#L250
in this case lastTSOS is not valid.
By initializing lastdEta only if lastTSOS is valid fixes both crashes, though I don't have any feeling about which value lastdEta should be initialized to (in my example below I initialize to 0. so that the two other clauses triggered by lastdEta > 0.1 are never entered).
diff --git a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
index 5253626ab46..30f1003b9eb 100644
--- a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
+++ b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
@@ -233,7 +233,15 @@ void StandAloneMuonFilter::refit(const TrajectoryStateOnSurface& initialTSOS,
LogTrace(metname) << "search Trajectory Measurement from: " << lastTSOS.globalPosition();
// pick the best measurement from each group
- std::vector<TrajectoryMeasurement> bestMeasurements = findBestMeasurements(*layer, lastTSOS);
+ std::vector<TrajectoryMeasurement> bestMeasurements{};
+
+ double lastdEta{0.f};
+ if (lastTSOS.isValid()) {
+ lastdEta = fabs(lastTSOS.freeTrajectoryState()->momentum().eta() - eta0);
+ bestMeasurements = findBestMeasurements(*layer, lastTSOS);
+ } else {
+ edm::LogInfo(metname) << "Invalid last TSOS, will not find best measurements ";
+ }
Hi all, It looks like the crash is ultimately coming from this part of MuonTrajectoryUpdator.cc (which is one step below StandAloneMuonFilter.cc which Marco is addressing):
https://github.com/cms-sw/cmssw/blob/f1b02548e1dfa1269f96ce117a39857e8d50cd89/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc#L157
which calls KFUpdator::update and returns an invalid TSOS. A possible workaround could be to skip the recHit whenever this occurs.
diff --git a/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc b/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc
index f4afc5ba640..30c3ac95fd3 100644
--- a/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc
+++ b/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc
@@ -156,6 +156,12 @@ pair<bool, TrajectoryStateOnSurface> MuonTrajectoryUpdator::update(const Traject
lastUpdatedTSOS = measurementUpdator()->update(propagatedTSOS, *((*recHit).get()));
+ if (!lastUpdatedTSOS.isValid()) {
+ edm::LogInfo(metname) << "Invalid last TSOS, will skip RecHit ";
+ lastUpdatedTSOS = propagatedTSOS; // Revert update
+ continue;
+ }
+
LogTrace(metname) << " Fit Position : " << lastUpdatedTSOS.globalPosition()
I just tested and this modification avoids the crash in both jobs, but implies a slight modification in muon trajectory updator, which may be sensitive. As this condition (returning invalid TSOS) should never happen, I think we can go with it for the moment.
But I would like to hear a second opinion on it from CMSSW experts, thanks.
Thanks @24LopezR
May I suggest you make a PR with that solution to 14_1_X and a backport to 14_0_X?
People can then give feedback in the PR thread while the tests are running in parallel.
I just tested and this modification avoids the crash in both jobs, but implies a slight modification in muon trajectory updator, which may be sensitive. As this condition (returning invalid TSOS) should never happen, I think we can go with it for the moment.
so this fixes all the 3 instances, (including the one in run-380963 ), right? Then perhaps https://github.com/cms-sw/cmssw/pull/45049 should be reverted.
so this fixes all the 3 instances, (including the one in run-380963 ), right? Then perhaps #45049 should be reverted.
Well, I tested my modification on top of PR #45049. So in principle it can't be safely reverted. Let me do in the meanwhile more standalone tests of my PR.
Well, I tested my modification on top of PR https://github.com/cms-sw/cmssw/pull/45049.
I did test it without and it didn't crash.
We have a have a similar segmentation violation in Run 381115 . Here is the tarball with the log:
/eos/cms/tier0/store/unmerged/data/logs/prod/2024/5/28/PromptReco_Run381115_ParkingSingleMuon0/Reco/0000/3/0609ffe4-6b01-45f9-bb3f-b9263252040e-268-3-logArchive.tar.gz
If anyone wants to test the patch for that too.
Matteo (ORM)
Considering that the fix is merged and already included in 14_0_8 then maybe the issue can be closed?
Giovanni (ORM)
+1
This issue is fully signed and ready to be closed.
Closing as completed