[DATA ReRECO] Logic error in BParking UL ReRECO
Hi, we are seeing very minor errors (only a couple of jobs exiting with code 8001) in production, concerning a logic error in the BParking UL ReRECO at the AOD step. Two automatic JIRA tickets have been opened in Computing, [1] and [2]. The particular error is described in [3].
We would like advice on how to solve the issue, or on whether we should just bypass it and accept the missing statistics from these jobs.
Thanks, Jordan FYI @slava77 @bainbrid
[1] https://its.cern.ch/jira/browse/CMSCOMPPR-21546
[2] https://its.cern.ch/jira/browse/CMSCOMPPR-21544
[3]
`Fatal Exception (Exit code: 8001)
An exception of category 'LogicError' occurred while
[0] Processing Event run: 316569 lumi: 608 event: 804095228 stream: 2
[1] Running path 'AODoutput_step'
[2] Prefetching for module PoolOutputModule/'AODoutput'
[3] Prefetching for module LogErrorHarvester/'logErrorHarvester'
[4] Prefetching for module ConversionProducer/'gsfTracksOpenConversions'
[5] Prefetching for module ConversionTrackProducer/'gsfTracksOpenConversionTrackProducer'
[6] Prefetching for module GsfTrackProducer/'lowPtGsfEleGsfTracks'
[7] Prefetching for module CkfTrackCandidateMaker/'lowPtGsfEleCkfTrackCandidates'
[8] Calling method for module LowPtGsfElectronSeedProducer/'lowPtGsfElectronSeeds'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z
Fatal Exception (Exit code: 8001)
An exception of category 'LogicError' occurred while
[0] Processing Event run: 316457 lumi: 429 event: 485185703 stream: 5
[1] Running path 'AODoutput_step'
[2] Prefetching for module PoolOutputModule/'AODoutput'
[3] Prefetching for module LogErrorHarvester/'logErrorHarvester'
[4] Prefetching for module ConversionProducer/'gsfTracksOpenConversions'
[5] Prefetching for module ConversionTrackProducer/'gsfTracksOpenConversionTrackProducer'
[6] Calling method for module GsfTrackProducer/'lowPtGsfEleGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z`
A new Issue was created by @jordan-martins Jordan Martins.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
In addition to the errors presented above, we also see:
Fatal Exception (Exit code: 8001)
An exception of category 'GeometryMismatch' occurred while
[0] Processing stream begin Run run: 316766 stream: 2
[1] Calling method for module PFElecTkProducer/'uncleanedOnlyPfTrackElec'
[2] Using EventSetup component TransientTrackBuilderESProducer/'' to make data TransientTrackBuilder/'TransientTrackBuilder' in record TransientTrackRecord
[3] Using EventSetup component GlobalTrackingGeometryESProducer/'' to make data GlobalTrackingGeometry/'' in record GlobalTrackingGeometryRecord
[4] Using EventSetup component CSCGeometryESModule/'' to make data CSCGeometry/'' in record MuonGeometryRecord
Exception Message:
Size mismatch between geometry (size=0) and alignments (size=4284)
and
Fatal Exception (Exit code: 8002)
An exception of category 'StdException' occurred while
[0] Processing stream begin Run run: 316766 stream: 7
[1] Calling method for module PFElecTkProducer/'uncleanedOnlyPfTrackElec'
[2] Using EventSetup component TransientTrackBuilderESProducer/'' to make data TransientTrackBuilder/'TransientTrackBuilder' in record TransientTrackRecord
[3] Using EventSetup component GlobalTrackingGeometryESProducer/'' to make data GlobalTrackingGeometry/'' in record GlobalTrackingGeometryRecord
[4] Using EventSetup component CSCGeometryESModule/'' to make data CSCGeometry/'' in record MuonGeometryRecord
[5] Using EventSetup component PoolDBESSource/'GlobalTag' to make data CSCRecoDigiParameters/'' in record CSCRecoDigiParametersRcd
Exception Message:
A std::exception was thrown.
Can not get data (Additional Information: [frontier.c:1135]: No more proxies. Last error was: Request 1724 on chan 1 failed at Thu Oct 14 04:55:55 2021: -9 [fn-socket.c:147]: connect to 10.29.0.1:3127 timed out after 5 seconds) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )
assign reconstruction
New categories assigned: reconstruction
@slava77,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks
Fatal Exception (Exit code: 8001)
An exception of category 'GeometryMismatch' occurred while
[0] Processing stream begin Run run: 316766 stream: 2
[1] Calling method for module PFElecTkProducer/'uncleanedOnlyPfTrackElec'
[2] Using EventSetup component TransientTrackBuilderESProducer/'' to make data TransientTrackBuilder/'TransientTrackBuilder' in record TransientTrackRecord
[3] Using EventSetup component GlobalTrackingGeometryESProducer/'' to make data GlobalTrackingGeometry/'' in record GlobalTrackingGeometryRecord
[4] Using EventSetup component CSCGeometryESModule/'' to make data CSCGeometry/'' in record MuonGeometryRecord
Exception Message:
Size mismatch between geometry (size=0) and alignments (size=4284)
This error looks like something that @cms-sw/geometry-l2 or @cms-sw/alca-l2 could perhaps comment on?
Fatal Exception (Exit code: 8002)
An exception of category 'StdException' occurred while
[0] Processing stream begin Run run: 316766 stream: 7
[1] Calling method for module PFElecTkProducer/'uncleanedOnlyPfTrackElec'
[2] Using EventSetup component TransientTrackBuilderESProducer/'' to make data TransientTrackBuilder/'TransientTrackBuilder' in record TransientTrackRecord
[3] Using EventSetup component GlobalTrackingGeometryESProducer/'' to make data GlobalTrackingGeometry/'' in record GlobalTrackingGeometryRecord
[4] Using EventSetup component CSCGeometryESModule/'' to make data CSCGeometry/'' in record MuonGeometryRecord
[5] Using EventSetup component PoolDBESSource/'GlobalTag' to make data CSCRecoDigiParameters/'' in record CSCRecoDigiParametersRcd
Exception Message:
A std::exception was thrown.
Can not get data (Additional Information: [frontier.c:1135]: No more proxies. Last error was: Request 1724 on chan 1 failed at Thu Oct 14 04:55:55 2021: -9 [fn-socket.c:147]: connect to 10.29.0.1:3127 timed out after 5 seconds) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )
This looks like a transient error in Frontier access. Or is it a persistent failure?
This looks like a transient error in Frontier access. Or is it a persistent failure?
yeah, I wish that this was not a part of the issue, since it's overtaking the attention from the actual problem.
@makortel Concerning the exception:
MuonGeometryRecord Exception Message: Size mismatch between geometry (size=0) and alignments (size=4284)
that looks like a configuration problem, so I don't see how it could be an error that affects only a few jobs. Could the transient Frontier errors be related to this exception?
Thanks @cvuosalo.
Could the transient Frontier errors be related to this exception?
It probably is. The log I was able to find (https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2018A_ParkingBPH5_20Jun2021_UL2018_210812_201823_8542/8001/DataProcessing/c8a3243a-211b-4b76-a1dc-3d98338f7d4c-289-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log) contains
warn [frontier.c:1025]: Request 286 on chan 2 failed at Thu Oct 14 04:51:15 2021: -6 [fn-socket.c:261]: read from <ip>:3127 timed out after 10 seconds
warn [frontier.c:1103]: Trying next server cmsfrontier.cern.ch with same proxy 10.29.0.1[10.29.0.1:3127]
warn [frontier.c:1025]: Request 1295 on chan 1 failed at Thu Oct 14 04:53:45 2021: -6 [fn-socket.c:261]: read from <ip>:3127 timed out after 10 seconds
warn [frontier.c:1103]: Trying next server cmsfrontier.cern.ch with same proxy 10.29.0.1[10.29.0.1:3127]
warn [frontier.c:1025]: Request 1656 on chan 1 failed at Thu Oct 14 04:54:23 2021: -6 [fn-socket.c:261]: read from <ip>:3127 timed out after 10 seconds
warn [frontier.c:1103]: Trying next server cmsfrontier1.cern.ch with same proxy 10.29.0.1[10.29.0.1:3127]
warn [frontier.c:1025]: Request 1718 on chan 1 failed at Thu Oct 14 04:55:42 2021: -9 [fn-socket.c:147]: connect to <ip>:3127 timed out after 5 seconds
warn [frontier.c:1125]: Trying next proxy 10.29.0.1 with same server cmsfrontier1.cern.ch
warn [frontier.c:1025]: Request 1724 on chan 1 failed at Thu Oct 14 04:55:55 2021: -9 [fn-socket.c:147]: connect to <ip>:3127 timed out after 5 seconds
error [frontier.c:1135]: No more proxies. Last error was: Request 1724 on chan 1 failed at Thu Oct 14 04:55:55 2021: -9 [fn-socket.c:147]: connect to <ip>:3127 timed out after 5 seconds
warn [fn-htclient.c:714]: Resetting the proxy list at Thu Oct 14 04:55:55 2021
warn [frontier.c:1025]: Request 1774 on chan 1 failed at Thu Oct 14 04:57:08 2021: -9 [fn-socket.c:147]: connect to <ip>:3127 timed out after 5 seconds
warn [frontier.c:1125]: Trying next proxy <ip> with same server cmsfrontier.cern.ch
and also
----- Begin Fatal Exception 14-Oct-2021 04:57:18 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing stream begin Run run: 316766 stream: 7
[1] Calling method for module PFElecTkProducer/'uncleanedOnlyPfTrackElec'
[2] Using EventSetup component TransientTrackBuilderESProducer/'' to make data TransientTrackBuilder/'TransientTrackBuilder' in record TransientTrackRecord
[3] Using EventSetup component GlobalTrackingGeometryESProducer/'' to make data GlobalTrackingGeometry/'' in record GlobalTrackingGeometryRecord
[4] Using EventSetup component CSCGeometryESModule/'' to make data CSCGeometry/'' in record MuonGeometryRecord
[5] Using EventSetup component PoolDBESSource/'GlobalTag' to make data CSCRecoDigiParameters/'' in record CSCRecoDigiParametersRcd
Exception Message:
A std::exception was thrown.
Can not get data (Additional Information: [frontier.c:1135]: No more proxies. Last error was: Request 1724 on chan 1 failed at Thu Oct 14 04:55:55 2021: -9 [fn-socket.c:147]: connect to <ip>:3127 timed out after 5 seconds) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )
----- End Fatal Exception -------------------------------------------------
which certainly looks like the root cause. I only wonder why the other EDM streams end up reporting a different exception.
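For triaging logs like the one quoted above, a small script can tally the Frontier client timeouts and flag whether the proxy list was exhausted. This is only a sketch: the regular expression is inferred from the few warning lines quoted in this thread, not from the Frontier client documentation, and the names `FAILURE_RE`, `SAMPLE_LINES` and `summarize_frontier` are made up for the illustration.

```python
import re
from collections import Counter

# Pattern inferred from the warning lines quoted in this thread (assumption:
# other Frontier client versions may format these messages differently).
FAILURE_RE = re.compile(
    r"Request (?P<request>\d+) on chan (?P<chan>\d+) failed at .*?: "
    r"(?P<code>-\d+) \[[^\]]+\]: (?P<op>read from|connect to) (?P<host>\S+) "
    r"timed out after (?P<seconds>\d+) seconds"
)

# A few lines copied from the job log quoted above.
SAMPLE_LINES = [
    "warn [frontier.c:1025]: Request 286 on chan 2 failed at Thu Oct 14 04:51:15 2021: -6 [fn-socket.c:261]: read from <ip>:3127 timed out after 10 seconds",
    "warn [frontier.c:1025]: Request 1718 on chan 1 failed at Thu Oct 14 04:55:42 2021: -9 [fn-socket.c:147]: connect to <ip>:3127 timed out after 5 seconds",
    "warn [frontier.c:1025]: Request 1724 on chan 1 failed at Thu Oct 14 04:55:55 2021: -9 [fn-socket.c:147]: connect to <ip>:3127 timed out after 5 seconds",
    "error [frontier.c:1135]: No more proxies. Last error was: Request 1724 on chan 1 failed at Thu Oct 14 04:55:55 2021: -9 [fn-socket.c:147]: connect to <ip>:3127 timed out after 5 seconds",
]


def summarize_frontier(lines):
    """Count timeouts by (operation, seconds) and report proxy exhaustion."""
    timeouts = Counter()
    exhausted = False
    for line in lines:
        if "No more proxies" in line:
            # This line repeats the last failure, so it is not counted again.
            exhausted = True
            continue
        match = FAILURE_RE.search(line)
        if match:
            timeouts[(match.group("op"), int(match.group("seconds")))] += 1
    return timeouts, exhausted


if __name__ == "__main__":
    counts, no_more_proxies = summarize_frontier(SAMPLE_LINES)
    for (operation, seconds), n in sorted(counts.items()):
        print(f"{operation}: {n} timeout(s) after {seconds} s")
    print("proxy list exhausted:", no_more_proxies)
```

A job whose log shows the proxy list being exhausted shortly before the fatal exception is a candidate for a plain resubmission rather than a code fix.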
This looks like a transient error in Frontier access. Or is it a persistent failure?
yeah, I wish that this was not a part of the issue, since it's overtaking the attention from the actual problem.
@slava77 I presume by "actual problem", you mean this:
[...snip...]
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z`
I can take a look, but not immediately, and most probably not for a few days. How urgent is this? It seems from the comments that it's a rare problem, affecting only a small fraction of jobs?
I note that both examples given here are triggered initially by the gsfTracksOpenConversions, of which @nancymarinelli is an expert.
However, I presume the issue lies within e.g. the lowPtGsfElectronSeeds module, which makes use of PFTrajectoryPoint from an extrapolation of a trajectory to the calo surfaces, here. Is this the likely source?
Exception Message: A std::exception was thrown. Can not get data (Additional Information: [frontier.c:1135]: No more proxies. Last error was: Request 1724 on chan 1 failed at Thu Oct 14 04:55:55 2021: -9 [fn-socket.c:147]: connect to <ip>:3127 timed out after 5 seconds) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )
@makortel Not sure if this helps, but I think we already saw this error in the past while developing the BeamSpotOnline infrastructure (e.g. see the log in https://cms-conddb.cern.ch/cmsDbBrowser/logs/show_O2O_log/Prod/BeamSpotOnlineHLTTest/2020-11-19%2006:07:28.287671). I read some old emails but cannot find the solution; all I remember is that at the time we identified an issue with the rotation of the squids. Maybe @smorovic remembers exactly what was done. Two other people working on this were Dave Dykstra and Barry Blumenfeld (sorry, I can't find their git handles). Also, around the same time @ggovi added a bugfix for coral (https://github.com/cms-sw/cmssw/pull/32503), but I'm not sure whether it is related to this.
type egamma
the main issue is
[8] Calling method for module LowPtGsfElectronSeedProducer/'lowPtGsfElectronSeeds'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z
https://github.com/cms-sw/cmssw/issues/35929#issuecomment-956573741
Hello @bainbrid, coming back to this rather old issue. Was a solution ever found for the crash reported above [1]?
I am asking because this issue reappeared during Run3 data-taking at the HLT level a few days ago, in run number 356615. The exact error message is given in [2]. So we might need to understand the reason for this crash and follow up on it. Let me also tag @VinInn as an expert on Gsf tracking. He might have some insight or suggestions.
[1]
MultiTrajectoryState mixes states with different signs of local p_z
[2]
Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z
Hi all,
(mainly for my benefit) I summarise below the exceptions reported in this thread relevant to low pT electrons:
@jordan-martins originally reported in this message https://github.com/cms-sw/cmssw/issues/35929#issue-1040944414 the following two exceptions seen during the BParking UL RERECO:
`Fatal Exception (Exit code: 8001)
An exception of category 'LogicError' occurred while
[0] Processing Event run: 316569 lumi: 608 event: 804095228 stream: 2
...
[8] Calling method for module LowPtGsfElectronSeedProducer/'lowPtGsfElectronSeeds'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z
and
Fatal Exception (Exit code: 8001)
An exception of category 'LogicError' occurred while
[0] Processing Event run: 316457 lumi: 429 event: 485185703 stream: 5
...
[6] Calling method for module GsfTrackProducer/'lowPtGsfEleGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z`
Separately, @swagata87 also reported in the message above https://github.com/cms-sw/cmssw/issues/35929#issuecomment-1205125152 a further two exceptions seen during "Run3 data-taking at the HLT level". However, the low pT electron code is NOT run as part of the HLT, and the second exception reported below relates to the hltEgammaGsfTracks module, which is not part of my domain. (The source of the first exception is ambiguous.)
[1]
MultiTrajectoryState mixes states with different signs of local p_z
and
[2]
Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z
I mention them here because they show the same exception as those indicated by @jordan-martins but using different modules (e.g. NOT related to low pT electrons ...???) So perhaps the issue is a general problem (related to the GsfTrackProducer module), rather than being specific to low pT electrons? Perhaps @VinInn can comment?
Finally, I ADD a new (similar) exception below, reported in this CMS talk thread, seen when running over Run 3 data (in this case, the ParkingDoubleMuonLowMass5 primary data set).
An exception of category 'LogicError' occurred while
[0] Processing Event run: 356948 lumi: 46 event: 59404076 stream: 5
[6] Calling method for module GsfTrackProducer/'lowPtGsfEleGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
Again, GsfTrackProducer is involved (albeit with a modified config and inputs).
@lfinco
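To keep track of which module and event each report refers to, the fatal-exception blocks can be parsed mechanically. The sketch below assumes the message layout seen in the excerpts in this thread; the function name `parse_fatal_exception` and the returned field names are invented for the illustration and are not part of any CMSSW tool.

```python
import re

# Patterns based on the exception excerpts quoted in this thread (assumption:
# the framework may print additional context lines that are simply ignored).
EVENT_RE = re.compile(r"Processing Event run: (\d+) lumi: (\d+) event: (\d+)")
CALL_RE = re.compile(r"Calling method for module (\w+)/'(\w+)'")
MESSAGE_RE = re.compile(r"Exception Message:\s*(.+)")

# Excerpt copied from the Run 3 report summarised above.
SAMPLE = """An exception of category 'LogicError' occurred while
[0] Processing Event run: 356948 lumi: 46 event: 59404076 stream: 5
[6] Calling method for module GsfTrackProducer/'lowPtGsfEleGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
"""


def parse_fatal_exception(text):
    """Extract event coordinates, failing module and message from one block."""
    event = EVENT_RE.search(text)
    call = CALL_RE.search(text)
    message = MESSAGE_RE.search(text)
    return {
        "run": int(event.group(1)) if event else None,
        "lumi": int(event.group(2)) if event else None,
        "event": int(event.group(3)) if event else None,
        "module_type": call.group(1) if call else None,
        "module_label": call.group(2) if call else None,
        "message": message.group(1).strip() if message else None,
    }


if __name__ == "__main__":
    for key, value in parse_fatal_exception(SAMPLE).items():
        print(f"{key}: {value}")
```

Grouping the parsed records by `module_type` rather than `module_label` makes it easier to see whether the failures share a common producer.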
Hi @jordan-martins @VinInn @lfinco @swagata87, @slava77, all,
In short, I'm going to struggle to debug this, as it seems to be a problem with core code relating to TrajectoryStates (beyond my remit), and I would appreciate some help, or even somebody taking over the task.
Here's a recipe from Jordan to reproduce the "MultiTrajectoryState mixes states with and without errors" error from the Run 3 ParkingDoubleMuonLowMass5 and JetMET processing:
cmsrel CMSSW_12_4_5
cd CMSSW_12_4_5/src
tar -xvf <targz file>
cmsenv
scram b
cd job/WMTaskSpace/cmsRun1/
cmsRun PSet.py
Target files:
- For the ParkingDoubleMuonLowMass: /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022C/DQMLogicError/vocms013.cern.ch-1705406-3-log.tar.gz
- For the JetMET: /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022C/DQMLogicError/vocms013.cern.ch-1700050-3-log.tar.gz
Or you can isolate an event using my test area that generates the exception here:
cd /afs/cern.ch/user/b/bainbrid/public/dev/lowptele-crashes/CMSSW_12_4_5/src
cmsenv
cmsRun TEST.py
The error in full is:
Begin processing the 1st record. Run 356948, Event 59404076, LumiSection 46 on stream 0 at 11-Aug-2022 12:26:46.954 CEST
%MSG-w BasicTrajectoryState: GsfTrackProducer:lowPtGsfEleGsfTracks 11-Aug-2022 12:27:16 CEST Run: 356948 Event: 59404076
local error not pos-def
[ 1.5025e+09-1.30407e+08 4.67017e+07 -54666.4-2.75911e+08
-1.30407e+08 1.13185e+07 -4.0534e+06 4748.96 2.39473e+07
4.67017e+07 -4.0534e+06 1.45161e+06 -1696.55-8.57603e+06
-54666.4 4748.96 -1696.55 -46.349 10020.1
-2.75911e+08 2.39473e+07-8.57603e+06 10020.1 5.06667e+07 ]
pos/mom/mf (-27.5758,21.5712,-35.2811) (-7.23815,5.65459,-8.15019) (0.002197,-0.0017186,3.8095)
%MSG
%MSG-w BasicTrajectoryState: GsfTrackProducer:lowPtGsfEleGsfTracks 11-Aug-2022 12:27:16 CEST Run: 356948 Event: 59404076
BasicTrajectoryState: attempt to access errors when none available accessing local error..
freestate pointer: parameters
x = -59.8614 -26.7489 -601.627
p = 0.000161703 0.000241183 -0.000139573
no error defined.
local error valid/values :0
[ -5.16557e+11 4.48198e+10-1.60121e+10 -8.8316 9.47497e+10
4.48198e+10-3.88885e+09 1.38932e+09 0.766286 -8.2211e+09
-1.60121e+10 1.38932e+09-4.96341e+08 -0.27376 2.93703e+09
-8.8316 0.766286 -0.27376 4.60074e-06 1.61994
9.47497e+10 -8.2211e+09 2.93703e+09 1.61994-1.73795e+10 ]
%MSG
%MSG-w BasicTrajectoryState: GsfTrackProducer:lowPtGsfEleGsfTracks 11-Aug-2022 12:27:16 CEST Run: 356948 Event: 59404076
BasicTrajectoryState: attempt to access errors when none available accessing local error..
freestate pointer: parameters
x = -59.8614 -26.7489 -601.627
p = 0.000161703 0.000241182 -0.000139572
no error defined.
local error valid/values :0
[ -5.16557e+11 4.48198e+10-1.60121e+10 -8.8316 9.47497e+10
4.48198e+10-3.88885e+09 1.38932e+09 0.766286 -8.2211e+09
-1.60121e+10 1.38932e+09-4.96341e+08 -0.27376 2.93703e+09
-8.8316 0.766286 -0.27376 4.60074e-06 1.61994
9.47497e+10 -8.2211e+09 2.93703e+09 1.61994-1.73795e+10 ]
%MSG
----- Begin Fatal Exception 11-Aug-2022 12:27:17 CEST-----------------------
An exception of category 'LogicError' occurred while
[0] Processing Event run: 356948 lumi: 46 event: 59404076 stream: 0
[1] Running path 'dqmoffline_6_step'
[2] Prefetching for module LogMessageMonitor/'TrackFinderLogMessageMonCommon'
[3] Prefetching for module LogErrorHarvester/'logErrorHarvester'
[4] Prefetching for module ConversionProducer/'gsfTracksOpenConversions'
[5] Prefetching for module ConversionTrackProducer/'gsfTracksOpenConversionTrackProducer'
[6] Calling method for module GsfTrackProducer/'lowPtGsfEleGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------
The issue arises in the BasicTrajectoryState class used by the (generic) GsfTrackProducer module (instance: lowPtGsfEleGsfTracks). Perhaps there are issues for very displaced tracks? e.g. the first warning reports pos/mom/mf (-27.5758,21.5712,-35.2811) (-7.23815,5.65459,-8.15019) (0.002197,-0.0017186,3.8095) ....
Any help / suggestions welcome.
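As a cross-check of the first warning ("local error not pos-def"), the 5x5 matrix printed in the log can be tested outside CMSSW. The sketch below is a plain-Python Cholesky test, not the check that BasicTrajectoryState itself performs; the matrix values are transcribed from the log above, with the fused columns separated by hand, and the names `LOCAL_ERROR` and `is_positive_definite` are made up for the example. Note that the fourth diagonal element is negative, which alone rules out positive definiteness.

```python
import math

# Local error matrix from the first warning in the log above, transcribed by
# hand (the log prints some columns without a separating space).
LOCAL_ERROR = [
    [1.5025e+09, -1.30407e+08, 4.67017e+07, -54666.4, -2.75911e+08],
    [-1.30407e+08, 1.13185e+07, -4.0534e+06, 4748.96, 2.39473e+07],
    [4.67017e+07, -4.0534e+06, 1.45161e+06, -1696.55, -8.57603e+06],
    [-54666.4, 4748.96, -1696.55, -46.349, 10020.1],
    [-2.75911e+08, 2.39473e+07, -8.57603e+06, 10020.1, 5.06667e+07],
]


def is_positive_definite(matrix):
    """Return True if a symmetric matrix admits a Cholesky factorisation."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                # A non-positive (or NaN) pivot means the factorisation fails.
                if not pivot > 0.0:
                    return False
                lower[i][i] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


if __name__ == "__main__":
    print("negative diagonal element:", LOCAL_ERROR[3][3] < 0)
    print("positive definite:", is_positive_definite(LOCAL_ERROR))
```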
see also https://github.com/cms-sw/cmssw/issues/39026#issuecomment-1213027768
Hello, I just wanted to mention that we saw this error in the HLT yesterday night. In 357542 (2022-08-15 23:11:11) we got
[2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracksForBParking'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z
I just wanted to mention that we saw this error in the HLT yesterday night.
is there a way to test offline any of these errors? I think something like this
diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index c3958e7f0c4..39a8da622be 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -28,8 +28,15 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
MultiTrajectoryStateAssembler result;
int i = 0;
+ float pzSign = 1.;
for (auto const& tsosI : predictedComponents) {
TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
+ if (i > 0 && pzSign * updatedTSOS.localParameters().pzSign() < 0) {
+ continue;
+ } else {
+ pzSign *= updatedTSOS.localParameters().pzSign();
+ }
+
if (updatedTSOS.isValid() && updatedTSOS.localError().valid()) {
result.addState(TrajectoryStateOnSurface(weights[i],
updatedTSOS.localParameters(),
should work.
is there a way to test offline any of these errors?
The issue with the HLT crashes (2 so far) is that they don't seem to be reproducible offline (even when running on a machine with a GPU, like in the original HLT job). For completeness, below are the details to rerun the relevant HLT menu on the problematic events.
- Run-356615 (CMSSW_12_4_3)
  - Error:
[2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z
  - Recipe (not reproducing error):
cmsrel CMSSW_12_4_3; cd CMSSW_12_4_3/src; cmsenv
hltConfigFromDB --runNumber 356615 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/afs/cern.ch/work/m/missirol/public/fog/error_stream/run356615/run356615_ls0008_index000164_fu-c2b01-34-01_pid1822035.raw']
@EOF
cmsRun hlt.py &> hlt.log
- Run-357542 (CMSSW_12_4_6)
  - Error:
[2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracksForBParking'
Exception Message:
MultiTrajectoryState mixes states with different signs of local p_z
  - Recipe (not reproducing error):
cmsrel CMSSW_12_4_6; cd CMSSW_12_4_6/src; cmsenv
hltConfigFromDB --runNumber 357542 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/afs/cern.ch/work/m/missirol/public/fog/error_stream/run357542/run357542_ls0037_index000136_fu-c2b04-23-01_pid406358.raw']
@EOF
cmsRun hlt.py &> hlt.log
thanks @missirol, what about the errors in the reprocessing @jordan-martins ? Do you have a reproducer?
there was an instance of the same problem in prompt reco for run 360888 in the dataset ParkingDoubleMuonLowMass2 (full details at cmsTalk).
The issue can be reproduced (quickly) by using the pkl file from:
/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022F/LogicError/job_844315
and using the following PSet:
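import FWCore.ParameterSet.Config as cms
import pickle
# Load the pickled process configuration shipped with the paused job
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
# Restrict processing to the event range of interest
process.source.eventsToProcess = cms.untracked.VEventRange('360888:426362612-360888:426362614')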
this patch (as also reported at https://github.com/cms-sw/cmssw/issues/39570#issuecomment-1288562267 for a similar problem observed at HLT)
diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index f3d6d173c10..f03c724ef11 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -28,8 +28,14 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
MultiTrajectoryStateAssembler result;
int i = 0;
+ float pzSign = 1.;
for (auto const& tsosI : predictedComponents) {
TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
+ if (i > 0 && pzSign * updatedTSOS.localParameters().pzSign() < 0) {
+ continue;
+ } else {
+ pzSign *= updatedTSOS.localParameters().pzSign();
+ }
if (double det;
updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
seems to solve the issue.
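The idea behind the patch, dropping updated components whose local p_z sign disagrees with the others before they reach the assembler, can be illustrated outside CMSSW. The following is a simplified stand-in, not a transcription of the diff: it takes the sign of the first component as the reference, represents each component as a hypothetical `(weight, pz_sign)` tuple, and does not renormalise the weights. The function names are invented for the example.

```python
def has_mixed_pz_signs(components):
    """True if the (weight, pz_sign) components do not all share one sign."""
    signs = {1 if pz_sign > 0 else -1 for _, pz_sign in components}
    return len(signs) > 1


def filter_by_pz_sign(components):
    """Keep only components whose pz sign matches that of the first one.

    Assumption for this sketch: the first component defines the reference
    sign. The actual patch tracks a running sign inside the update loop.
    """
    if not components:
        return []
    reference = components[0][1]
    return [c for c in components if c[1] * reference > 0]


if __name__ == "__main__":
    # Hypothetical mixture: the second component has flipped its pz sign.
    mixture = [(0.5, 1.0), (0.3, -1.0), (0.2, 1.0)]
    print("mixed before filtering:", has_mixed_pz_signs(mixture))
    kept = filter_by_pz_sign(mixture)
    print("kept components:", kept)
    print("mixed after filtering:", has_mixed_pz_signs(kept))
```

Using a fixed reference sign keeps the behaviour independent of how many components precede the one being tested; whether the first component is the right reference in the real fit is a physics question this sketch does not answer.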