cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Segmentation violation for StandAloneMuonProducer in CMSSW_14_0_7

Open francescobrivio opened this issue 1 year ago • 26 comments

Dear all, As reported in https://cms-talk.web.cern.ch/t/segmentation-violation-in-promptreco-for-parkingdoublemuonlowmass1-run-380963/41520 we have a segmentation violation for the ParkingDoubleMuonLowMass dataset in Run 380963, with the following stack trace:

Begin processing the 1590th record. Run 380963, Event 73250130, LumiSection 85 on stream 3 at 24-May-2024 02:57:25.663 UTC
%MSG-e KFUpdator:  StandAloneMuonProducer:displacedStandAloneMuons  24-May-2024 02:57:26 UTC Run: 380963 Event: 73250130
 could not invert martix:
[   3.0909e+13 4.23366e+13 6.14231e+14 8.06196e+14
   4.23366e+13 5.79893e+13 8.41324e+14 1.10426e+15
   6.14231e+14 8.41324e+14 1.22062e+16 1.60209e+16
   8.06196e+14 1.10426e+15 1.60209e+16 2.10279e+16 ]
%MSG

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

...
Thread 7 (Thread 0x14b2a4ffd700 (LWP 1099) "cmsRun"):
#0  0x000014b2fe1bd301 in poll () from /lib64/libc.so.6
#1  0x000014b2fa18c6af in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x000014b2fa140dbc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x000014b2fa141720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x000014b2c4879cb7 in Propagator::propagateWithPath(TrajectoryStateOnSurface const&, Plane const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libTrackP
ropagationSteppingHelixPropagator.so
#6  0x000014b2f4a1b219 in ForwardDetLayer::compatible(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0
_7/lib/el8_amd64_gcc12/libTrackingToolsDetLayers.so
#7  0x000014b2c47ef36f in MuRingForwardLayer::compatibleDets(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMS
SW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonDetLayers.so
#8  0x000014b2c47efc73 in MuRingForwardDoubleLayer::groupedCompatibleDets(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/
cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonDetLayers.so
#9  0x000014b28ba43c78 in MuonDetLayerMeasurements::groupedMeasurements(DetLayer const*, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, edm::Event const&) () from /cvmfs/
cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonMeasurementDet.so
#10 0x000014b28ba44989 in MuonDetLayerMeasurements::groupedMeasurements(DetLayer const*, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) () from /cvmfs/cms.cern.ch/el8_amd
64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonMeasurementDet.so
#11 0x000014b2683aed74 in StandAloneMuonFilter::findBestMeasurements(DetLayer const*, TrajectoryStateOnSurface const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12
/libRecoMuonStandAloneTrackFinder.so
#12 0x000014b2683b2585 in StandAloneMuonFilter::refit(TrajectoryStateOnSurface const&, DetLayer const*, Trajectory&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/l
ibRecoMuonStandAloneTrackFinder.so
#13 0x000014b2683b46a4 in StandAloneMuonTrajectoryBuilder::trajectories(TrajectorySeed const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoMuonStandAloneTr
ackFinder.so
...
Current Modules:

Module: StandAloneMuonProducer:displacedStandAloneMuons (crashed)
Module: CkfTrackCandidateMaker:lowPtTripletStepTrackCandidates
Module: CkfTrackCandidateMaker:lowPtQuadStepTrackCandidates
Module: TrackingMonitor:TrackSeedMondetachedQuadStep
Module: PoolOutputModule:write_MINIAOD
Module: SiStripMonitorTrack:SiStripMonitorTrackCommon
Module: MkFitProducer:detachedTripletStepTrackCandidatesMkFit
Module: CkfTrackCandidateMaker:pixelPairStepTrackCandidates

A fatal system signal has occurred: segmentation violation
Complete

A minimal recipe to reproduce the error is:

cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3221638/job/WMTaskSpace/cmsRun1/PSet.p* .
cat <<EOF >> PSet.py
    process.options.numberOfThreads=cms.untracked.uint32(1)
    process.options.numberOfStreams=cms.untracked.uint32(1)
    process.source.skipEvents = cms.untracked.uint32(1589)
EOF
cmsRun -e PSet.py

francescobrivio avatar May 24 '24 07:05 francescobrivio

cms-bot internal usage

cmsbuild avatar May 24 '24 07:05 cmsbuild

A new Issue was created by @francescobrivio.

@Dr15Jones, @antoniovilela, @rappoccio, @sextonkennedy, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar May 24 '24 07:05 cmsbuild

assign reconstruction

francescobrivio avatar May 24 '24 07:05 francescobrivio

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar May 24 '24 07:05 cmsbuild

FYI @cms-sw/muon-pog-l2 and @cms-sw/muon-dpg-l2

francescobrivio avatar May 24 '24 07:05 francescobrivio

type muon

francescobrivio avatar May 24 '24 08:05 francescobrivio

For the record with this:

diff --git a/TrackingTools/DetLayers/src/ForwardDetLayer.cc b/TrackingTools/DetLayers/src/ForwardDetLayer.cc
index d6f0afdf22c..b5d2eaa27ac 100644
--- a/TrackingTools/DetLayers/src/ForwardDetLayer.cc
+++ b/TrackingTools/DetLayers/src/ForwardDetLayer.cc
@@ -88,7 +88,13 @@ pair<bool, TrajectoryStateOnSurface> ForwardDetLayer::compatible(const Trajector
     edm::LogError("DetLayers") << "ERROR: BarrelDetLayer::compatible() is used before the layer surface is initialized";
   // throw an exception? which one?
 
-  TrajectoryStateOnSurface myState = prop.propagate(ts, specificSurface());
+  TrajectoryStateOnSurface myState;
+  if UNLIKELY (!ts.isValid()) {
+    edm::LogError("DetLayers") << "ERROR: trying to propagate invalid TSOS" << std::endl;
+  } else {
+    myState = prop.propagate(ts, specificSurface());
+  }
+
   if UNLIKELY (!myState.isValid())
     return make_pair(false, myState);
 

one gets past the error and in the log running the script at https://github.com/cms-sw/cmssw/issues/45035#issue-2314665547 one can find:

%MSG-e KFUpdator:  StandAloneMuonProducer:displacedStandAloneMuons  24-May-2024 11:33:49 CEST Run: 380963 Event: 73250130
 could not invert martix:
[   3.0909e+13 4.23366e+13 6.14231e+14 8.06196e+14
   4.23366e+13 5.79893e+13 8.41324e+14 1.10426e+15
   6.14231e+14 8.41324e+14 1.22062e+16 1.60209e+16
   8.06196e+14 1.10426e+15 1.60209e+16 2.10279e+16 ]
%MSG
%MSG-e DetLayers:  StandAloneMuonProducer:displacedStandAloneMuons  24-May-2024 11:33:49 CEST Run: 380963 Event: 73250130
ERROR: trying to propage invalid TSOS
%MSG
%MSG-e DetLayers:  StandAloneMuonProducer:displacedStandAloneMuons  24-May-2024 11:33:49 CEST Run: 380963 Event: 73250130
ERROR: trying to propage invalid TSOS
%MSG

it seems plausible that then the problem originates here:

https://github.com/cms-sw/cmssw/blob/32dc27774b47d882609be07a885b060b9398d96e/TrackingTools/KalmanUpdators/src/KFUpdator.cc#L170-L173

where upon being unable to invert a matrix, a default constructed TSOS (with invalid state) is returned.

mmusich avatar May 24 '24 09:05 mmusich

Let me tag the Muon Reco contacts too: @24LopezR @rbhattacharya04

jfernan2 avatar May 24 '24 10:05 jfernan2

it seems plausible that then the problem originates here:

cmssw/TrackingTools/KalmanUpdators/src/KFUpdator.cc

it's not in the crash stack. Do you mean the preceding LogError? For a fix, instead of ForwardDetLayer::compatible, an upstream caller may be a better place to decide what to do with a bad state that originated there.

slava77 avatar May 24 '24 12:05 slava77

Do you mean the preceding LogError?

yes

an upstream caller may be a better place to decide what to do with a bad state that originated there.

is:

diff --git a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
index 5253626ab46..0150b405859 100644
--- a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
+++ b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
@@ -299,6 +299,11 @@ std::vector<TrajectoryMeasurement> StandAloneMuonFilter::findBestMeasurements(co
   std::vector<TrajectoryMeasurement> result;
   std::vector<TrajectoryMeasurement> measurements;
 
+  if (!tsos.isValid()) {
+    edm::LogError(metname) << "ERROR: trying to propagate invalid TSOS" << std::endl;
+    return result;
+  }
+
   if (theOverlappingChambersFlag && layer->hasGroups()) {
     std::vector<TrajectoryMeasurementGroup> measurementGroups =
         theMeasurementExtractor->groupedMeasurements(layer, tsos, *propagator(), *estimator());

upstream enough ?

mmusich avatar May 25 '24 09:05 mmusich

@@ -299,6 +299,11 @@ std::vector<TrajectoryMeasurement> StandAloneMuonFilter::findBestMeasurements(co

StandAloneMuonFilter::refit is missing checks of validity, although it does check the empty output from the findBestMeasurements in lieu of that. In case the invalid TSOS is the StandAloneMuonFilter::refit input, then going further up would be more clear.

slava77 avatar May 25 '24 12:05 slava77

on a second thought, I'm not really against the edm::LogError in the ForwardDetLayers; it's still more appropriate than a crash.

The hope from going upstream would be to see if there it could become a case more suitable for a LogInfo

slava77 avatar May 25 '24 12:05 slava77

type tracking

slava77 avatar May 25 '24 12:05 slava77

In case the invalid TSOS is the StandAloneMuonFilter::refit input, then going further up would be more clear. [...] The hope from going upstream would be to see if there it could become a case more suitable for a LogInfo

for this case I tested (successfully) also this variant:

diff --git a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
index 5253626ab46..08deb0200a1 100644
--- a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
+++ b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
@@ -233,7 +233,13 @@ void StandAloneMuonFilter::refit(const TrajectoryStateOnSurface& initialTSOS,
     LogTrace(metname) << "search Trajectory Measurement from: " << lastTSOS.globalPosition();
 
     // pick the best measurement from each group
-    std::vector<TrajectoryMeasurement> bestMeasurements = findBestMeasurements(*layer, lastTSOS);
+    std::vector<TrajectoryMeasurement> bestMeasurements{};
+
+    if (lastTSOS.isValid()) {
+      bestMeasurements = findBestMeasurements(*layer, lastTSOS);
+    } else {
+      edm::LogInfo(metname) << "Invalid last TSOS, will not find best measurements ";
+    }
 
     // RB: Different ways can be choosen if no bestMeasurement is available:
     // 1- check on lastTSOS-initialTSOS eta difference
@@ -248,10 +254,16 @@ void StandAloneMuonFilter::refit(const TrajectoryStateOnSurface& initialTSOS,
     // wrt the initial one (i.e. seed), then try to find the measurements
     // according to the lastButOne FTS. (1B)
     double lastdEta = fabs(lastTSOS.freeTrajectoryState()->momentum().eta() - eta0);
+
     if (bestMeasurements.empty() && lastdEta > 0.1) {
       LogTrace(metname) << "No measurement and big eta variation wrt seed" << endl
                         << "trying with lastButOneUpdatedTSOS";
-      bestMeasurements = findBestMeasurements(*layer, theLastButOneUpdatedTSOS);
+
+      if (theLastButOneUpdatedTSOS.isValid()) {
+        bestMeasurements = findBestMeasurements(*layer, theLastButOneUpdatedTSOS);
+      } else {
+        edm::LogInfo(metname) << "Invalid last but one updated TSOS, will not find best measurements ";
+      }
     }
 
     //if no measurement found and the current FTS has an eta very different
@@ -259,7 +271,12 @@ void StandAloneMuonFilter::refit(const TrajectoryStateOnSurface& initialTSOS,
     //according to the initial FTS. (1A)
     if (bestMeasurements.empty() && lastdEta > 0.1) {
       LogTrace(metname) << "No measurement and big eta variation wrt seed" << endl << "tryng with seed TSOS";
-      bestMeasurements = findBestMeasurements(*layer, initialTSOS);
+
+      if (initialTSOS.isValid()) {
+        bestMeasurements = findBestMeasurements(*layer, initialTSOS);
+      } else {
+        edm::LogInfo(metname) << "Invalid initial TSOS, will not find best measurements ";
+      }
     }
 
     // FIXME: uncomment this line!!

(in the specifics of this crash, it was actually theLastButOneUpdatedTSOS not being valid that lead to the crash in StandAloneMuonFilter::findBestMeasurements.).

I'm not really against the edm::LogError in the ForwardDetLayers; it's still more appropriate than a crash.

if either way is fine, I'd let the domain experts to decide / implement a protection.

mmusich avatar May 25 '24 14:05 mmusich

We currently have other two failures all reporting

Module: StandAloneMuonProducer:displacedStandAloneMuons (crashed)
  • PromptReco_Run381065_ParkingDoubleMuonLowMass3
    • Tarball in
      /eos/cms/tier0/store/unmerged/data/logs/prod/2024/5/25/PromptReco_Run381065_ParkingDoubleMuonLowMass3/Reco/0000/3/3508ad47-e770-4fff-bf82-400ad0f155cd-63-3-logArchive.tar.gz
      
  • PromptReco_Run381067_ParkingDoubleMuonLowMass2
    • Tarball in
      /eos/cms/tier0/store/unmerged/data/logs/prod/2024/5/26/PromptReco_Run381067_ParkingDoubleMuonLowMass2/Reco/0000/3/a85151ad-f91a-4493-ac15-756b7c6fba0f-3-3-logArchive.tar.gz
      

francescobrivio avatar May 26 '24 11:05 francescobrivio

@cms-sw/reconstruction-l2 @24LopezR @rbhattacharya04 Do you have any update? @mmusich proposed two different possible solutions in the comments above...

francescobrivio avatar May 26 '24 11:05 francescobrivio

I have tested the provided fix on the crashing jobs, and while it solves the first issue we observe, it doesn't help with the crashes observed in the other 2 jobs:

  • PromptReco_Run380963_ParkingDoubleMuonLowMass1 --> solved
  • PromptReco_Run381065_ParkingDoubleMuonLowMass3 --> still crashing
  • PromptReco_Run381067_ParkingDoubleMuonLowMass2 --> still crashing

Recipes to reproduce the second two jobs Run381065

cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3544004/job/WMTaskSpace/cmsRun1/PSet* .
cat <<EOF >> PSet.py
    process.options.numberOfThreads=cms.untracked.uint32(1)
    process.options.numberOfStreams=cms.untracked.uint32(1)
    process.source.skipEvents = cms.untracked.uint32(1807)
EOF
cmsRun PSet.py

Run381067

cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3585803/job/WMTaskSpace/cmsRun1/PSet* .
cat <<EOF >> PSet.py
    process.options.numberOfThreads=cms.untracked.uint32(1)
    process.options.numberOfStreams=cms.untracked.uint32(1)
    process.source.skipEvents = cms.untracked.uint32(1463)
EOF
cmsRun PSet.py

And Matt's fix can be fetched with:

cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
git cms-addpkg RecoMuon/StandAloneTrackFinder
git remote add matt [email protected]:mandrenguyen/cmssw.git
git fetch matt muSegFault_140X
git cherry-pick addd479c0ea87af0b586ed366c64d0de3212e4df
scram b -j 8

francescobrivio avatar May 27 '24 07:05 francescobrivio

PromptReco_Run381065_ParkingDoubleMuonLowMass3 --> still crashing PromptReco_Run381067_ParkingDoubleMuonLowMass2 --> still crashing

if you have reproduced offline, can you update the recipe with the number of events to skip to get faster at the crash?

mmusich avatar May 27 '24 09:05 mmusich

PromptReco_Run381065_ParkingDoubleMuonLowMass3 --> still crashing PromptReco_Run381067_ParkingDoubleMuonLowMass2 --> still crashing

if you have reproduced offline, can you update the recipe with the number of events to skip to get faster at the crash?

sorry i forgot to add it in the recipe! I've edited the message above.

francescobrivio avatar May 27 '24 09:05 francescobrivio

Run381065

cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3544004/job/WMTaskSpace/cmsRun1/PSet* .
cat <<EOF >> PSet.py
    process.options.numberOfThreads=cms.untracked.uint32(1)
    process.options.numberOfStreams=cms.untracked.uint32(1)
    process.source.skipEvents = cms.untracked.uint32(1807)
EOF
cmsRun PSet.py

Run381067

cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/displacedStandAloneMuons/job_3585803/job/WMTaskSpace/cmsRun1/PSet* .
cat <<EOF >> PSet.py
    process.options.numberOfThreads=cms.untracked.uint32(1)
    process.options.numberOfStreams=cms.untracked.uint32(1)
    process.source.skipEvents = cms.untracked.uint32(1463)
EOF
cmsRun PSet.py

the two other crashes happen at a different location:

https://github.com/cms-sw/cmssw/blob/32dc27774b47d882609be07a885b060b9398d96e/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc#L250

in this case lastTSOS is not valid. By initializing lastdEta only if lastTSOS is valid fixes both crashes, though I don't have any feeling about which value lastdEta should be initialized to (in my example below I initialize to 0. so that the two other clauses triggered by lastdEta > 0.1 are never entered).

diff --git a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
index 5253626ab46..30f1003b9eb 100644
--- a/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
+++ b/RecoMuon/StandAloneTrackFinder/src/StandAloneMuonFilter.cc
@@ -233,7 +233,15 @@ void StandAloneMuonFilter::refit(const TrajectoryStateOnSurface& initialTSOS,
     LogTrace(metname) << "search Trajectory Measurement from: " << lastTSOS.globalPosition();
 
     // pick the best measurement from each group
-    std::vector<TrajectoryMeasurement> bestMeasurements = findBestMeasurements(*layer, lastTSOS);
+    std::vector<TrajectoryMeasurement> bestMeasurements{};
+
+    double lastdEta{0.f};
+    if (lastTSOS.isValid()) {
+      lastdEta = fabs(lastTSOS.freeTrajectoryState()->momentum().eta() - eta0);
+      bestMeasurements = findBestMeasurements(*layer, lastTSOS);
+    } else {
+      edm::LogInfo(metname) << "Invalid last TSOS, will not find best measurements ";
+    }

mmusich avatar May 27 '24 10:05 mmusich

Hi all, It looks like the crash is ultimately coming from this part of MuonTrajectoryUpdator.cc (which is one step below StandAloneMuonFilter.cc which Marco is addressing):

https://github.com/cms-sw/cmssw/blob/f1b02548e1dfa1269f96ce117a39857e8d50cd89/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc#L157

which calls KFUpdator::update and returns an invalid TSOS. A possible workaround could be to skip the recHit whenever this occurs.

diff --git a/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc b/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc
index f4afc5ba640..30c3ac95fd3 100644
--- a/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc
+++ b/RecoMuon/TrackingTools/src/MuonTrajectoryUpdator.cc
@@ -156,6 +156,12 @@ pair<bool, TrajectoryStateOnSurface> MuonTrajectoryUpdator::update(const Traject
 
             lastUpdatedTSOS = measurementUpdator()->update(propagatedTSOS, *((*recHit).get()));
 
+            if (!lastUpdatedTSOS.isValid()) {
+              edm::LogInfo(metname) << "Invalid last TSOS, will skip RecHit ";
+              lastUpdatedTSOS = propagatedTSOS; // Revert update
+              continue;
+            }
+
             LogTrace(metname) << "  Fit   Position : " << lastUpdatedTSOS.globalPosition()

I just tested and this modification avoids the crash in both jobs, but implies a slight modification in muon trajectory updator, which may be sensitive. As this condition (returning invalid TSOS) should never happen, I think we can go with it for the moment.

But I would like to hear a second opinion on it from CMSSW experts, thanks.

24LopezR avatar May 27 '24 10:05 24LopezR

Thanks @24LopezR May I suggest you make a PR with that solution to 14_1_X and a backport to 14_0_X?
People can then give feedback in the PR thread while the tests are running in parallel.

mandrenguyen avatar May 27 '24 10:05 mandrenguyen

I just tested and this modification avoids the crash in both jobs, but implies a slight modification in muon trajectory updator, which may be sensitive. As this condition (returning invalid TSOS) should never happen, I think we can go with it for the moment.

so this fixes all the 3 instances, (including the one in run-380963 ), right? Then perhaps https://github.com/cms-sw/cmssw/pull/45049 should be reverted.

mmusich avatar May 27 '24 11:05 mmusich

so this fixes all the 3 instances, (including the one in run-380963 ), right? Then perhaps #45049 should be reverted.

Well, I tested my modification on top of PR #45049. So in principle it can't be safely reverted. Let me do in the meanwhile more standalone tests of my PR.

24LopezR avatar May 27 '24 12:05 24LopezR

Well, I tested my modification on top of PR https://github.com/cms-sw/cmssw/pull/45049.

I did test it without and it didn't crash.

mmusich avatar May 27 '24 12:05 mmusich

We have a have a similar segmentation violation in Run 381115 . Here is the tarball with the log: /eos/cms/tier0/store/unmerged/data/logs/prod/2024/5/28/PromptReco_Run381115_ParkingSingleMuon0/Reco/0000/3/0609ffe4-6b01-45f9-bb3f-b9263252040e-268-3-logArchive.tar.gz If anyone wants to test the patch for that too.

Matteo (ORM)

mpresill avatar May 28 '24 16:05 mpresill

Considering that the fix is merged and already included in 14_0_8 then maybe the issue can be closed?

Giovanni (ORM)

gpetruc avatar Jun 11 '24 09:06 gpetruc

+1

mandrenguyen avatar Jun 11 '24 10:06 mandrenguyen

This issue is fully signed and ready to be closed.

cmsbuild avatar Jun 11 '24 10:06 cmsbuild

Closing as completed

francescobrivio avatar Jun 11 '24 10:06 francescobrivio