cmssw GsfTracking crash at HLT in run 359686

This issue is related to https://github.com/cms-sw/cmssw/issues/39026 in the sense that similar crashes happened before and a fix for that was put in https://github.com/cms-sw/cmssw/pull/39074. It is unclear to me why the crash came back despite the fix.

Some details of the crash is here

With the error stream file, I could reproduce the crash in GPU machines.

The error stream file(.raw), which is available in Hilton nodes, is also copied to this location: /eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw

In case it is useful, the file is also available in .root format: /eos/cms/store/group/phys_egamma/swmukher/outputFileGSF.root

Recipe to reproduce the crash in GPU machine is given below:

cmsrel CMSSW_12_4_9 
cd CMSSW_12_4_9/src 
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/store/error_stream/run359686/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py

It crashes with the following messages:

%MSG-i CUDAService:  (NoModuleName) 01-Oct-2022 22:49:54 CEST pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
CUDA device 1: Tesla T4 (sm_75)
%MSG
.....
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:hltEgammaGsfTracks  01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 3 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:hltEgammaGsfTracks  01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 4 is invalid. skipping.
.....
----- Begin Fatal Exception 01-Oct-2022 22:50:29 CEST-----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 359686 lumi: 263 event: 165691107 stream: 0
   [1] Running path 'DST_Run3_PFScoutingPixelTracking_v18'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------

Oct 02 '22 08:10 swagata87

A new Issue was created by @swagata87 Swagata Mukherjee.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Oct 02 '22 08:10 cmsbuild

assign reconstruction, tracking-pog, egamma-pog

Oct 02 '22 08:10 swagata87

New categories assigned: tracking-pog,reconstruction,egamma-pog

@slava77,@mmusich,@mandrenguyen,@lfinco,@clacaputo,@swagata87 you have been requested to review this Pull request/Issue and eventually sign? Thanks

Oct 02 '22 08:10 cmsbuild

the recipe at https://github.com/cms-sw/cmssw/issues/39570#issue-1393682784 when run on lxplus-gpu results in:

Can not stat file (2):-/store/error_stream/run359686/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw
%MSG
----- Begin Fatal Exception 03-Oct-2022 14:00:58 CEST-----------------------
An exception of category 'FedRawDataInputSource:threadError' occurred while
   [0] Calling InputSource::getNextItemType
Exception Message:
 file reader thread error 
----- End Fatal Exception -------------------------------------------------

please create a recipe that can run without having access to hilton nodes.

Oct 03 '22 12:10 mmusich

@mmusich Please use process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']

Oct 03 '22 12:10 bmjoshi

This issue is not reproducible on lxplus-node. However, it is reproducible on GPU cluster at Milan. The difference that I found was the CUDA version. Please find the driver info and version below: LXPLUS-GPU: CUDA runtime version 11.5, driver version 11.7, NVIDIA driver version 515.65.01 MILAN-GPU@P5: CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03

Oct 03 '22 12:10 bmjoshi

with this recipe:

(after ssh -X -Y <user>@lxplus-gpu.cern.ch)

cmsrel CMSSW_12_4_9 
cd CMSSW_12_4_9/src 
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py > & out.out &

I am unable to reproduce the error. You can find the resulting log file here: http://musich.web.cern.ch/musich/Issue-39570/out.out

Oct 03 '22 12:10 mmusich

-1

cannot be reproduced offline

Oct 03 '22 12:10 mmusich

Just to keep track, there was another similar crash in HLT in run 359806. Elog: http://cmsonline.cern.ch/cms-elog/1158449

It was possible to reproduce the crash in GPU Hilton machine using the error stream file. The recipe is same as given in the issue description; but only change is the filename: process.source.fileNames = ['file:/store/error_stream/run359806/run359806_ls0197_index000214_fu-c2b03-36-01_pid4120185.raw'] The crash reproduced in hilton can be viewed here: https://swmukher.web.cern.ch/swmukher/hlt_run359806_gsf_GPU_hilton_c2b01_44_01.log

As before, this crash was also NOT reproduced in lxplus-gpu. To run on lxplus, one can use this file which I copied to /eos: process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359806_ls0197_index000214_fu-c2b03-36-01_pid4120185.raw'] The log file is here, which shows that there is no crash when ran in lxplus-gpu; https://swmukher.web.cern.ch/swmukher/hlt_run359806_gsf_lxplus_gpu_764.log

Oct 05 '22 12:10 swagata87

The crash goes away if we add the full Sylvester's criterion here: https://github.com/cms-sw/cmssw/blob/5fb55fe3444c124a47661df4cc0168279ec55769/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc#L34-L36 ie; with this patch the crash is gone: (ignore the std::cout, which was just for check)

diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index 5e9e6b31389..e3035361a5c 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -32,8 +32,17 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
     TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
 
     double det = 0;
-    if (updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
-        (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0) {
+    double det22 = 0;
+    double det33 = 0;
+    double det44 = 0;
+
+    if ( updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
+        (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det22) && det22 > 0) &&
+        (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det33) && det33 > 0) &&
+        (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det44) && det44 > 0) &&
+        (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0)  {
+      std::cout << "all TSOS condition passed so will do result.addState.. " << std::endl;
+      
       result.addState(TrajectoryStateOnSurface(weights[i],

Oct 05 '22 15:10 swagata87

FYI: @cms-sw/hlt-l2 @cms-sw/heterogeneous-l2

(HLT crashes, seemingly specific to reconstruction on GPUs)

Oct 06 '22 07:10 missirol

There was another HLT crash due to GSF with similar error message, in run 359871. Relevant Elog: http://cmsonline.cern.ch/cms-elog/1158552 The crash was reproducible in hilton machine; this was the error message in hilton; using error stream file /store/error_stream/run359871/run359871_ls0093_index000227_fu-c2b01-39-01_pid3885773.raw

%MSG
----- Begin Fatal Exception 06-Oct-2022 14:46:35 -----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 359871 lumi: 93 event: 193950756 stream: 0
   [1] Running path 'HLT_DoubleEle9_eta1p22_mMax6_v2'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracksForBParking'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------

With the fix mentioned above (https://github.com/cms-sw/cmssw/issues/39570#issuecomment-1268610509), the crash goes away.

I will prepare a PR.

Oct 06 '22 15:10 swagata87

Given the fix PR is merged in master, I'll prepare backport to 12_5_X and 12_4_X.

Similar GSF crashes is expected to happen rarely in long collision runs until the fix is online. Another example crash is here, from Friday's overnight run: http://cmsonline.cern.ch/cms-elog/1159020. This one was also reproducible using error_stream file, and the crash goes away with the fix.

I guess we have enough evidence now that proves that the fix works. So I probably won't report the same crash anymore, as they are expected to happen until the fix is in.

Oct 08 '22 22:10 swagata87

I guess we have enough evidence now that proves that the fix works. So I probably won't report the same crash anymore, as they are expected to happen until the fix is in.

I expect the HLT DOCs will anyway document them ?

Oct 09 '22 13:10 fwyzard

+1

A rare GSF crash was happening in HLT, with the message MultiTrajectoryState mixes states with and without errors.
To the best of my knowledge, this crash happened in long collision runs 359686 (Oct 1), 359806 (Oct 5), 359871 (Oct 5), 359998 (Oct 7), 360019 (Oct 8), 360225 (Oct 12); all taken with 12_4_9.
HLT then switched to 12_4_10 which included a potential fix for this crash. The switch to 12_4_10 happened on Oct 12th, and the first run with this release was 360279.
After HLT switched to 12_4_10, this crash was not seen so far.

Oct 17 '22 08:10 swagata87

This crash happened again in two runs over the weekend (ELOG). Both were taken with 12_4_10_patch2 .

Oct 24 '22 06:10 alintulu

This crash happened again in two runs over the weekend (ELOG). Both were taken with 12_4_10_patch2 .

@alintulu you might want to report a recipe to reproduce (also offline).

Oct 24 '22 06:10 mmusich

Hi Marco, I was checking this a little bit. I could not reproduce the crash in run 360919, but could reproduce the crash in 360895. And what I found is that, your suggestion here , which was related to a different crash, can avoid this crash in 360895. But it prints some messages like this:

[swmukher@gputest-milan-01 src]$ cmsRun hlt.py 
.......
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
......
%MSG-w BasicTrajectoryState:  GsfTrackProducer:hltEgammaGsfTracksForBParking  23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
BasicTrajectoryState: attempt to access errors when none available  accessing local error..
freestate pointer: parameters
x =       37.6946     -22.2219      4.96195
p =       1.71548      -1.1439     0.328893
no error defined.

local error valid/values :0
[         -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan ]
%MSG
%MSG-w BasicTrajectoryState:  GsfTrackProducer:hltEgammaGsfTracksForBParking  23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
BasicTrajectoryState: attempt to access errors when none available  accessing local error..
freestate pointer: parameters
x =       37.6946     -22.2219      4.96195
p =       1.71548      -1.1439     0.328893
no error defined.

local error valid/values :0
[         -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan ]
%MSG
%MSG-w BasicTrajectoryState:  GsfTrackProducer:hltEgammaGsfTracksForBParking  23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
BasicTrajectoryState: attempt to access errors when none available  accessing local error..
freestate pointer: parameters
x =          -nan         -nan         -nan
p =          -nan         -nan         -nan
no error defined.

local error valid/values :0
[          nan        -nan        -nan         nan         nan
          -nan        -nan        -nan         nan        -nan
          -nan        -nan        -nan         nan        -nan
           nan         nan         nan        -nan        -nan
           nan        -nan        -nan        -nan         nan ]
%MSG
%MSG-e PosteriorWeightsCalculator:   GsfTrackProducer:hltEgammaGsfTracksForBParking 23-Oct-2022 22:45:51 CEST  Run: 360895 Event: 783357885
PosteriorWeightsCalculator: sumWeight < DBL_MIN
%MSG
%MSG-w TrackNaN:  GsfTrackProducer:hltEgammaGsfTracksForBParking  23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
Track has NaN or the cov is not pos-definite
%MSG

Oct 24 '22 07:10 swagata87

+heterogeneous

Oct 24 '22 08:10 fwyzard

@swagata87 it looks like the same kind of problem also hit Prompt Reco.

There is a paused job for run 360888 in the ParkingDoubleMuonLowMass2 PD:

2022-10-24 10:15:40,863:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2022-10-24 10:15:40,874:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_4_10', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
CMSSW Return code: 8001

2022-10-24 10:15:40,874:CRITICAL:CMSSW:Error message: An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 360888 lumi: 199 event: 426362613 stream: 1
   [1] Running path 'dqmoffline_6_step'
   [2] Prefetching for module LogMessageMonitor/'TrackFinderLogMessageMonCommon'
   [3] Prefetching for module LogErrorHarvester/'logErrorHarvester'
   [4] Prefetching for module ConversionProducer/'gsfTracksOpenConversions'
   [5] Prefetching for module ConversionTrackProducer/'gsfTracksOpenConversionTrackProducer'
   [6] Calling method for module GsfTrackProducer/'lowPtGsfEleGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors

Tier-0 will post the coordinates to reproduce shorty.

EDIT: details at https://cms-talk.web.cern.ch/t/logic-error-in-reco-job-for-run-360888-dataset-parkingdoublemuonlowmass2/16641

Oct 24 '22 09:10 mmusich

thanks, I've now managed to check the prompt-reco crash also, and I confirm that your patch solves the crash. So I think it makes sense to put in the fix to avoid further crashes.

Your fix was originally meant for a crash with message MultiTrajectoryState mixes states with different signs of local p_z, which happened both offline and HLT, and as far as I know was not possible to reproduce. Thus the fix was not put in back then. The same crash did not happen at HLT recently. But now the other crash MultiTrajectoryState mixes states with and without errors is also fixed with the same patch. So, it looks like these 2 crashes are after all related to each other?

We surely want to fix the crashes urgently, so let's go ahead with the PR, but I've copied the raw file so that it remain accessible, in case we want to do any other check
/eos/cms/store/group/phys_egamma/ec/swmukher/gsfcrashfile/00266a70-1a8f-4101-8391-ca99df9b51ca.root

The HLT error_stream files are also there, but none of them are reproducible in lxplus-gpu or lxplus(cpu). So it becomes generally difficult for people to debug with HLT error files, as one needs access to one of the P5 GPU machines. The prompt reco crash raw file should be easier to work with, as anyone can use it for debugging purpose in any usual lxplus machine.

At some point it would be great for egamma to understand these strange warnings triggered from BasicTrajectoryState here. It's unclear to me right now from where we are having missing error or nan matrix. But given the complexity of the Gsf algo it might take a while to fully understand it.

Oct 24 '22 11:10 swagata87

@mmusich can you explain to me what the suggested changes do ?

diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index c3958e7f0c4..39a8da622be 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -28,8 +28,15 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
   MultiTrajectoryStateAssembler result;
 
   int i = 0;
+  float pzSign = 1.;
   for (auto const& tsosI : predictedComponents) {
     TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
+    if (i > 0 && pzSign * updatedTSOS.localParameters().pzSign() < 0) {
+      continue;
+    } else {
+      pzSign *= updatedTSOS.localParameters().pzSign();
+    }
+
     if (updatedTSOS.isValid() && updatedTSOS.localError().valid()) {
       result.addState(TrajectoryStateOnSurface(weights[i],
                                                updatedTSOS.localParameters(),

If updatedTSOS.localParameters().pzSign() is negative, won't this flip the sign of pzSign for each component ?

Oct 24 '22 12:10 fwyzard

is negative, won't this flip the sign of pzSign for each component ?

pzSign is a "check" variable. It starts a positive and then it gets update for each iteration (after the first) by multiplying it with the sign of the next state. Since, unless the charges of the state do not share the same sign, the sign of the product will be +, it essentially catches the states for which the sign gets changed (and removes them, since they lead to the mismatched pZ sign that than leads to a crash).

Oct 24 '22 12:10 mmusich

Mhm, I'm not sure I follow.

During the first iteration (when i == 0) the if clause is skipped and the else clause is always processed, so effectively pzSign is set to the sign of of KFUpdator().update(predictedComponents.front(), aRecHit).

Then for the following interations

if pzSign has a different sign than updatedTSOS.localParameters().pzSign(), the component is skipped
otherwise, pzSign is multiplied by updatedTSOS.localParameters().pzSign(), which could flip it if the latter is negative.

So, first question: if there are multiple components, why use the sign of the first one? Wouldn't it be better to use the sign that appear more times ?

Then, once a sign is chosen, should it be fixed and not changed any more ?

For example to use the sign of the first component, this loop seems safer:

    if (i == 0) {
      // select the pz sign of the first component
      pzSign = updatedTSOS.localParameters().pzSign();
    } else {
      // skip components with the "wrong" pz sign
      if (pzSign * updatedTSOS.localParameters().pzSign() < 0) {
        continue;
      }
    }

Oct 24 '22 12:10 fwyzard

(of course, maybe I misunderstood the whole logic...)

Oct 24 '22 12:10 fwyzard

Now, all of these checks may prevent the crashes, but I'm not sure if actually there is an underlying issue that one should address, and these are just symptoms ?

Oct 24 '22 12:10 fwyzard

Then for the following interations

if pzSign has a different sign than updatedTSOS.localParameters().pzSign(), the component is skipped

otherwise, pzSign is multiplied by updatedTSOS.localParameters().pzSign(), which could flip it if the latter is negative.

right, I see it can become a problem when there are many states to be added with multiple sign flips. I was wrongly assuming that it would not try to add more than two states...

So, first question: if there are multiple components, why use the sign of the first one? Wouldn't it be better to use the sign that appear more times ?

this is a valid point, though one would need one more loop to collect all the signs.

Now, all of these checks may prevent the crashes, but I'm not sure if actually there is an underlying issue that one should address, and these are just symptoms ?

that's precisely my comment at: https://github.com/cms-sw/cmssw/issues/39234#issuecomment-1232537510, see also https://github.com/cms-sw/cmssw/issues/39234#issuecomment-1279266060.

Oct 24 '22 13:10 mmusich

For example to use the sign of the first component, this loop seems safer:

   if (i == 0) {
     // select the pz sign of the first component
     pzSign = updatedTSOS.localParameters().pzSign();
   } else {
     // skip components with the "wrong" pz sign
     if (pzSign * updatedTSOS.localParameters().pzSign() < 0) {
       continue;
     }
   }

while logically correct, this doesn't solve the issue at https://cms-talk.web.cern.ch/t/logic-error-in-reco-job-for-run-360888-dataset-parkingdoublemuonlowmass2/16641 :(

Oct 24 '22 14:10 mmusich

So I have made a bit of archeology dig. The crash reported in prompt reco (cmsTalk) already appeared once this year (see https://github.com/cms-sw/cmssw/issues/39026) and I solved it by proposing https://github.com/cms-sw/cmssw/pull/39056. With the state of the code as of 2b81038ca2d3d3726589e58d1da24c1b7c8f0adf the newly reported crash doesn't happen either. Then, the same file (TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc) was touched twice by @swagata87:

in https://github.com/cms-sw/cmssw/pull/39256 (to reduce the amount of GSF tracking errors, see https://github.com/cms-sw/cmssw/issues/39234 )
in https://github.com/cms-sw/cmssw/pull/39656 (to solve this very issue).

with the state of the code as of https://github.com/cms-sw/cmssw/commit/f3afc63127880c397839549e52c7877ffd518b52 the crash in prompt reco doesn't happen, so I conclude that the issue has been re-introduce somehow by mistake in PR https://github.com/cms-sw/cmssw/pull/39656.

Oct 25 '22 10:10 mmusich

https://github.com/cms-sw/cmssw/pull/39656 is a rather simple PR. As I understand, all it does is to make sure that the states we are allowing to go further in the reco chain has a positive definite curvilinear error matrix, so that the states are more well behaved. This should not lead to the crash. We are probably missing something elsewhere.

Also, I noticed that the frequency of this crash at HLT has reduced after https://github.com/cms-sw/cmssw/pull/39656, but given that the crash has not completely disappeared, it's clear that there is something more to it which we are missing currently. I am open to suggestions and happy to try out any suggestions people might have on this.

The reco chain of electron and lowpt-electron is a bit different. So I was wondering if that has something to do with it. The problem could also be in lowpt-electron specific code/config. I have to admit I have not figured it out yet.

Oct 25 '22 12:10 swagata87