GsfTracking crash at HLT in run 359686
This issue is related to https://github.com/cms-sw/cmssw/issues/39026 in the sense that similar crashes happened before and a fix for that was put in https://github.com/cms-sw/cmssw/pull/39074. It is unclear to me why the crash came back despite the fix.
Some details of the crash is here
With the error stream file, I could reproduce the crash in GPU machines.
The error stream file(.raw), which is available in Hilton nodes, is also copied to this location: /eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw
In case it is useful, the file is also available in .root format: /eos/cms/store/group/phys_egamma/swmukher/outputFileGSF.root
Recipe to reproduce the crash in GPU machine is given below:
cmsrel CMSSW_12_4_9
cd CMSSW_12_4_9/src
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/store/error_stream/run359686/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py
It crashes with the following messages:
%MSG-i CUDAService: (NoModuleName) 01-Oct-2022 22:49:54 CEST pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
CUDA device 1: Tesla T4 (sm_75)
%MSG
.....
%MSG-e GsfMultiStateUpdator: GsfTrackProducer:hltEgammaGsfTracks 01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 3 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator: GsfTrackProducer:hltEgammaGsfTracks 01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 4 is invalid. skipping.
.....
----- Begin Fatal Exception 01-Oct-2022 22:50:29 CEST-----------------------
An exception of category 'LogicError' occurred while
[0] Processing Event run: 359686 lumi: 263 event: 165691107 stream: 0
[1] Running path 'DST_Run3_PFScoutingPixelTracking_v18'
[2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------
A new Issue was created by @swagata87 Swagata Mukherjee.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction, tracking-pog, egamma-pog
New categories assigned: tracking-pog,reconstruction,egamma-pog
@slava77,@mmusich,@mandrenguyen,@lfinco,@clacaputo,@swagata87 you have been requested to review this Pull request/Issue and eventually sign? Thanks
the recipe at https://github.com/cms-sw/cmssw/issues/39570#issue-1393682784 when run on lxplus-gpu results in:
Can not stat file (2):-/store/error_stream/run359686/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw
%MSG
----- Begin Fatal Exception 03-Oct-2022 14:00:58 CEST-----------------------
An exception of category 'FedRawDataInputSource:threadError' occurred while
[0] Calling InputSource::getNextItemType
Exception Message:
file reader thread error
----- End Fatal Exception -------------------------------------------------
please create a recipe that can run without having access to hilton nodes.
@mmusich Please use
process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
This issue is not reproducible on lxplus-node. However, it is reproducible on GPU cluster at Milan. The difference that I found was the CUDA version. Please find the driver info and version below:
LXPLUS-GPU: CUDA runtime version 11.5, driver version 11.7, NVIDIA driver version 515.65.01
MILAN-GPU@P5: CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
with this recipe:
(after ssh -X -Y <user>@lxplus-gpu.cern.ch)
cmsrel CMSSW_12_4_9
cd CMSSW_12_4_9/src
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py > & out.out &
I am unable to reproduce the error. You can find the resulting log file here: http://musich.web.cern.ch/musich/Issue-39570/out.out
-1
- cannot be reproduced offline
Just to keep track, there was another similar crash in HLT in run 359806.
Elog: http://cmsonline.cern.ch/cms-elog/1158449
It was possible to reproduce the crash in GPU Hilton machine using the error stream file. The recipe is same as given in the issue description; but only change is the filename:
process.source.fileNames = ['file:/store/error_stream/run359806/run359806_ls0197_index000214_fu-c2b03-36-01_pid4120185.raw']
The crash reproduced in hilton can be viewed here: https://swmukher.web.cern.ch/swmukher/hlt_run359806_gsf_GPU_hilton_c2b01_44_01.log
As before, this crash was also NOT reproduced in lxplus-gpu.
To run on lxplus, one can use this file which I copied to /eos:
process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359806_ls0197_index000214_fu-c2b03-36-01_pid4120185.raw']
The log file is here, which shows that there is no crash when ran in lxplus-gpu; https://swmukher.web.cern.ch/swmukher/hlt_run359806_gsf_lxplus_gpu_764.log
The crash goes away if we add the full Sylvester's criterion here: https://github.com/cms-sw/cmssw/blob/5fb55fe3444c124a47661df4cc0168279ec55769/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc#L34-L36 ie; with this patch the crash is gone: (ignore the std::cout, which was just for check)
diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index 5e9e6b31389..e3035361a5c 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -32,8 +32,17 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
double det = 0;
- if (updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
- (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0) {
+ double det22 = 0;
+ double det33 = 0;
+ double det44 = 0;
+
+ if ( updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
+ (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det22) && det22 > 0) &&
+ (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det33) && det33 > 0) &&
+ (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det44) && det44 > 0) &&
+ (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0) {
+ std::cout << "all TSOS condition passed so will do result.addState.. " << std::endl;
+
result.addState(TrajectoryStateOnSurface(weights[i],
FYI: @cms-sw/hlt-l2 @cms-sw/heterogeneous-l2
(HLT crashes, seemingly specific to reconstruction on GPUs)
There was another HLT crash due to GSF with similar error message, in run 359871.
Relevant Elog: http://cmsonline.cern.ch/cms-elog/1158552
The crash was reproducible in hilton machine; this was the error message in hilton; using error stream file /store/error_stream/run359871/run359871_ls0093_index000227_fu-c2b01-39-01_pid3885773.raw
%MSG
----- Begin Fatal Exception 06-Oct-2022 14:46:35 -----------------------
An exception of category 'LogicError' occurred while
[0] Processing Event run: 359871 lumi: 93 event: 193950756 stream: 0
[1] Running path 'HLT_DoubleEle9_eta1p22_mMax6_v2'
[2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracksForBParking'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------
With the fix mentioned above (https://github.com/cms-sw/cmssw/issues/39570#issuecomment-1268610509), the crash goes away.
I will prepare a PR.
Given the fix PR is merged in master, I'll prepare backport to 12_5_X and 12_4_X.
Similar GSF crashes is expected to happen rarely in long collision runs until the fix is online. Another example crash is here, from Friday's overnight run: http://cmsonline.cern.ch/cms-elog/1159020. This one was also reproducible using error_stream file, and the crash goes away with the fix.
I guess we have enough evidence now that proves that the fix works. So I probably won't report the same crash anymore, as they are expected to happen until the fix is in.
I guess we have enough evidence now that proves that the fix works. So I probably won't report the same crash anymore, as they are expected to happen until the fix is in.
I expect the HLT DOCs will anyway document them ?
+1
- A rare GSF crash was happening in HLT, with the message
MultiTrajectoryState mixes states with and without errors. - To the best of my knowledge, this crash happened in long collision runs 359686 (Oct 1), 359806 (Oct 5), 359871 (Oct 5), 359998 (Oct 7), 360019 (Oct 8), 360225 (Oct 12); all taken with
12_4_9. - HLT then switched to
12_4_10which included a potential fix for this crash. The switch to12_4_10happened on Oct 12th, and the first run with this release was 360279. - After HLT switched to 12_4_10, this crash was not seen so far.
This crash happened again in two runs over the weekend (ELOG). Both were taken with 12_4_10_patch2 .
This crash happened again in two runs over the weekend (ELOG). Both were taken with 12_4_10_patch2 .
@alintulu you might want to report a recipe to reproduce (also offline).
Hi Marco, I was checking this a little bit. I could not reproduce the crash in run 360919, but could reproduce the crash in 360895. And what I found is that, your suggestion here , which was related to a different crash, can avoid this crash in 360895. But it prints some messages like this:
[swmukher@gputest-milan-01 src]$ cmsRun hlt.py
.......
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
......
%MSG-w BasicTrajectoryState: GsfTrackProducer:hltEgammaGsfTracksForBParking 23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
BasicTrajectoryState: attempt to access errors when none available accessing local error..
freestate pointer: parameters
x = 37.6946 -22.2219 4.96195
p = 1.71548 -1.1439 0.328893
no error defined.
local error valid/values :0
[ -nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan ]
%MSG
%MSG-w BasicTrajectoryState: GsfTrackProducer:hltEgammaGsfTracksForBParking 23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
BasicTrajectoryState: attempt to access errors when none available accessing local error..
freestate pointer: parameters
x = 37.6946 -22.2219 4.96195
p = 1.71548 -1.1439 0.328893
no error defined.
local error valid/values :0
[ -nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan
-nan -nan -nan -nan -nan ]
%MSG
%MSG-w BasicTrajectoryState: GsfTrackProducer:hltEgammaGsfTracksForBParking 23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
BasicTrajectoryState: attempt to access errors when none available accessing local error..
freestate pointer: parameters
x = -nan -nan -nan
p = -nan -nan -nan
no error defined.
local error valid/values :0
[ nan -nan -nan nan nan
-nan -nan -nan nan -nan
-nan -nan -nan nan -nan
nan nan nan -nan -nan
nan -nan -nan -nan nan ]
%MSG
%MSG-e PosteriorWeightsCalculator: GsfTrackProducer:hltEgammaGsfTracksForBParking 23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
PosteriorWeightsCalculator: sumWeight < DBL_MIN
%MSG
%MSG-w TrackNaN: GsfTrackProducer:hltEgammaGsfTracksForBParking 23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
Track has NaN or the cov is not pos-definite
%MSG
+heterogeneous
@swagata87 it looks like the same kind of problem also hit Prompt Reco.
There is a paused job for run 360888 in the ParkingDoubleMuonLowMass2 PD:
2022-10-24 10:15:40,863:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2022-10-24 10:15:40,874:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_4_10', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
CMSSW Return code: 8001
2022-10-24 10:15:40,874:CRITICAL:CMSSW:Error message: An exception of category 'LogicError' occurred while
[0] Processing Event run: 360888 lumi: 199 event: 426362613 stream: 1
[1] Running path 'dqmoffline_6_step'
[2] Prefetching for module LogMessageMonitor/'TrackFinderLogMessageMonCommon'
[3] Prefetching for module LogErrorHarvester/'logErrorHarvester'
[4] Prefetching for module ConversionProducer/'gsfTracksOpenConversions'
[5] Prefetching for module ConversionTrackProducer/'gsfTracksOpenConversionTrackProducer'
[6] Calling method for module GsfTrackProducer/'lowPtGsfEleGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
Tier-0 will post the coordinates to reproduce shorty.
EDIT: details at https://cms-talk.web.cern.ch/t/logic-error-in-reco-job-for-run-360888-dataset-parkingdoublemuonlowmass2/16641
thanks, I've now managed to check the prompt-reco crash also, and I confirm that your patch solves the crash. So I think it makes sense to put in the fix to avoid further crashes.
Your fix was originally meant for a crash with message MultiTrajectoryState mixes states with different signs of local p_z, which happened both offline and HLT, and as far as I know was not possible to reproduce. Thus the fix was not put in back then. The same crash did not happen at HLT recently. But now the other crash MultiTrajectoryState mixes states with and without errors is also fixed with the same patch. So, it looks like these 2 crashes are after all related to each other?
We surely want to fix the crashes urgently, so let's go ahead with the PR, but I've copied the raw file so that it remain accessible, in case we want to do any other check
/eos/cms/store/group/phys_egamma/ec/swmukher/gsfcrashfile/00266a70-1a8f-4101-8391-ca99df9b51ca.root
The HLT error_stream files are also there, but none of them are reproducible in lxplus-gpu or lxplus(cpu). So it becomes generally difficult for people to debug with HLT error files, as one needs access to one of the P5 GPU machines. The prompt reco crash raw file should be easier to work with, as anyone can use it for debugging purpose in any usual lxplus machine.
At some point it would be great for egamma to understand these strange warnings triggered from BasicTrajectoryState here. It's unclear to me right now from where we are having missing error or nan matrix. But given the complexity of the Gsf algo it might take a while to fully understand it.
@mmusich can you explain to me what the suggested changes do ?
diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc index c3958e7f0c4..39a8da622be 100644 --- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc +++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc @@ -28,8 +28,15 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur MultiTrajectoryStateAssembler result; int i = 0; + float pzSign = 1.; for (auto const& tsosI : predictedComponents) { TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit); + if (i > 0 && pzSign * updatedTSOS.localParameters().pzSign() < 0) { + continue; + } else { + pzSign *= updatedTSOS.localParameters().pzSign(); + } + if (updatedTSOS.isValid() && updatedTSOS.localError().valid()) { result.addState(TrajectoryStateOnSurface(weights[i], updatedTSOS.localParameters(),
If updatedTSOS.localParameters().pzSign() is negative, won't this flip the sign of pzSign for each component ?
is negative, won't this flip the sign of pzSign for each component ?
pzSign is a "check" variable. It starts a positive and then it gets update for each iteration (after the first) by multiplying it with the sign of the next state. Since, unless the charges of the state do not share the same sign, the sign of the product will be +, it essentially catches the states for which the sign gets changed (and removes them, since they lead to the mismatched pZ sign that than leads to a crash).
Mhm, I'm not sure I follow.
During the first iteration (when i == 0) the if clause is skipped and the else clause is always processed, so effectively pzSign is set to the sign of of KFUpdator().update(predictedComponents.front(), aRecHit).
Then for the following interations
- if
pzSignhas a different sign thanupdatedTSOS.localParameters().pzSign(), the component is skipped - otherwise,
pzSignis multiplied byupdatedTSOS.localParameters().pzSign(), which could flip it if the latter is negative.
So, first question: if there are multiple components, why use the sign of the first one? Wouldn't it be better to use the sign that appear more times ?
Then, once a sign is chosen, should it be fixed and not changed any more ?
For example to use the sign of the first component, this loop seems safer:
if (i == 0) {
// select the pz sign of the first component
pzSign = updatedTSOS.localParameters().pzSign();
} else {
// skip components with the "wrong" pz sign
if (pzSign * updatedTSOS.localParameters().pzSign() < 0) {
continue;
}
}
(of course, maybe I misunderstood the whole logic...)
Now, all of these checks may prevent the crashes, but I'm not sure if actually there is an underlying issue that one should address, and these are just symptoms ?
Then for the following interations
- if pzSign has a different sign than updatedTSOS.localParameters().pzSign(), the component is skipped
- otherwise, pzSign is multiplied by updatedTSOS.localParameters().pzSign(), which could flip it if the latter is negative.
right, I see it can become a problem when there are many states to be added with multiple sign flips. I was wrongly assuming that it would not try to add more than two states...
So, first question: if there are multiple components, why use the sign of the first one? Wouldn't it be better to use the sign that appear more times ?
this is a valid point, though one would need one more loop to collect all the signs.
Now, all of these checks may prevent the crashes, but I'm not sure if actually there is an underlying issue that one should address, and these are just symptoms ?
that's precisely my comment at: https://github.com/cms-sw/cmssw/issues/39234#issuecomment-1232537510, see also https://github.com/cms-sw/cmssw/issues/39234#issuecomment-1279266060.
For example to use the sign of the first component, this loop seems safer:
if (i == 0) { // select the pz sign of the first component pzSign = updatedTSOS.localParameters().pzSign(); } else { // skip components with the "wrong" pz sign if (pzSign * updatedTSOS.localParameters().pzSign() < 0) { continue; } }
while logically correct, this doesn't solve the issue at https://cms-talk.web.cern.ch/t/logic-error-in-reco-job-for-run-360888-dataset-parkingdoublemuonlowmass2/16641 :(
So I have made a bit of archeology dig.
The crash reported in prompt reco (cmsTalk) already appeared once this year (see https://github.com/cms-sw/cmssw/issues/39026) and I solved it by proposing https://github.com/cms-sw/cmssw/pull/39056.
With the state of the code as of 2b81038ca2d3d3726589e58d1da24c1b7c8f0adf the newly reported crash doesn't happen either.
Then, the same file (TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc) was touched twice by @swagata87:
- in https://github.com/cms-sw/cmssw/pull/39256 (to reduce the amount of GSF tracking errors, see https://github.com/cms-sw/cmssw/issues/39234 )
- in https://github.com/cms-sw/cmssw/pull/39656 (to solve this very issue).
with the state of the code as of https://github.com/cms-sw/cmssw/commit/f3afc63127880c397839549e52c7877ffd518b52 the crash in prompt reco doesn't happen, so I conclude that the issue has been re-introduce somehow by mistake in PR https://github.com/cms-sw/cmssw/pull/39656.
https://github.com/cms-sw/cmssw/pull/39656 is a rather simple PR. As I understand, all it does is to make sure that the states we are allowing to go further in the reco chain has a positive definite curvilinear error matrix, so that the states are more well behaved. This should not lead to the crash. We are probably missing something elsewhere.
Also, I noticed that the frequency of this crash at HLT has reduced after https://github.com/cms-sw/cmssw/pull/39656, but given that the crash has not completely disappeared, it's clear that there is something more to it which we are missing currently. I am open to suggestions and happy to try out any suggestions people might have on this.
The reco chain of electron and lowpt-electron is a bit different. So I was wondering if that has something to do with it. The problem could also be in lowpt-electron specific code/config. I have to admit I have not figured it out yet.