spurious differences in outputs of wf `12434.7` (was `11634.7`)
Differences in outputs of PR tests for wf 11634.7 were noticed in recent PRs to 12_5_X.
In each of these cases, (1) the PR was purely technical and almost certainly incapable of creating changes to physics outputs, and (2) PR tests ran on IB CMSSW_12_5_X_2022-10-20-1100 (but I don't know if this type of issue had been seen before).
- https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1ff86b/28394/summary.html
- https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e7334d/28397/summary.html
- https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-107837/28401/summary.html
In one case (https://github.com/cms-sw/cmssw/pull/39793#issuecomment-1286081487), PR tests were re-run (using the same IB as base), and after that the bin-by-bin differences for wf 11634.7 disappeared, suggesting some non-reproducibility is at play.
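The re-run argument above can be illustrated with a minimal sketch, assuming each histogram is reduced to a plain list of bin contents. The histogram name, the bin values, and the helper `bin_by_bin_diff` are all hypothetical; this is not the comparison tool used by the PR tests. If the same PR on the same base IB shows differences in one run and none in the next, the PR itself is an unlikely cause.

```python
from typing import Dict, List, Tuple

Histogram = List[float]  # bin contents only (hypothetical, simplified stand-in)
BinDiff = Tuple[int, float, float]  # (bin index, reference content, new content)


def bin_by_bin_diff(
    reference: Dict[str, Histogram], candidate: Dict[str, Histogram]
) -> Dict[str, List[BinDiff]]:
    """Return, per histogram present in both inputs, the bins whose contents differ."""
    diffs: Dict[str, List[BinDiff]] = {}
    for name in sorted(set(reference) & set(candidate)):
        ref, new = reference[name], candidate[name]
        if len(ref) != len(new):
            raise ValueError(f"binning differs for {name!r}")
        changed = [(i, r, n) for i, (r, n) in enumerate(zip(ref, new)) if r != n]
        if changed:
            diffs[name] = changed
    return diffs


# Hypothetical histogram name and bin contents, for illustration only.
NAME = "hypothetical/trackCandidates"
baseline = {NAME: [0.0, 3.0, 5.0, 2.0]}
pr_test_first = {NAME: [0.0, 3.0, 4.0, 3.0]}
pr_test_rerun = {NAME: [0.0, 3.0, 5.0, 2.0]}

first = bin_by_bin_diff(baseline, pr_test_first)
rerun = bin_by_bin_diff(baseline, pr_test_rerun)
print("first run :", first)
print("re-run    :", rerun)
```

An empty result for the re-run, next to a non-empty one for the first run, is the pattern described in the linked comment.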
The corresponding PRs to 12_4_X and 12_6_X (tested just as recently) didn't exhibit this issue.
Edit: originally, these spurious differences were only seen in 12_5_X; later on, they also appeared in the master branch (13_0_X at the time).
Edit (May 24th):
For the record, #41471 (and backports) removed wf 11634.7 (2022 HLT and MC GT) from the 'limited matrix' in CMSSW_13_X_Y, and effectively replaced it with wf 12434.7 (2023 HLT and MC GT).
A new Issue was created by @missirol Marino Missiroli.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction, tracking-pog
11634.7 is a dedicated extended mkFit setup.
$ runTheMatrix.py -nel 11634.7
11634.7 2021_trackingMkFit+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano [1]: cmsDriver.py TTbar_14TeV_TuneCP5_cfi -s GEN,SIM -n 10 --conditions auto:phase1_2022_realistic --beamspot Realistic25ns13p6TeVEarly2022Collision --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Extended --era Run3 --relval 9000,100
[2]: cmsDriver.py step2 -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2022 --conditions auto:phase1_2022_realistic --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --customise RecoTracker/MkFit/customizeHLTIter0ToMkFit.customizeHLTIter0ToMkFit
[3]: cmsDriver.py step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,PAT,NANO,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2022_realistic --datatier GEN-SIM-RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3 --procModifiers trackingMkFitDevel
[4]: cmsDriver.py step4 -s HARVESTING:@standardValidation+@standardDQM+@ExtraHLT+@miniAODValidation+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2022_realistic --mc --geometry DB:Extended --scenario pp --filetype DQM --era Run3 -n 100
1 workflows with 4 steps
--------------------------------------------------------------------------------
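In the listing above, mkFit enters in two places: the step2 `--customise` (HLT) and the step3 `--procModifiers` (offline reconstruction). A minimal sketch of how such a listing could be scanned for the steps that mention mkFit follows. The sample text is an abbreviated copy of the output quoted above (most options are dropped), and the helper names `parse_steps` and `mkfit_steps` are hypothetical, not part of `runTheMatrix.py`.

```python
import re
import shlex

# Abbreviated copy of the `runTheMatrix.py -nel 11634.7` output quoted above;
# most cmsDriver.py options are dropped to keep the example short.
MATRIX_OUTPUT = """\
11634.7 2021_trackingMkFit+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano [1]: cmsDriver.py TTbar_14TeV_TuneCP5_cfi -s GEN,SIM -n 10 --geometry DB:Extended --era Run3
[2]: cmsDriver.py step2 -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2022 -n 10 --geometry DB:Extended --era Run3 --customise RecoTracker/MkFit/customizeHLTIter0ToMkFit.customizeHLTIter0ToMkFit
[3]: cmsDriver.py step3 -n 10 --geometry DB:Extended --era Run3 --procModifiers trackingMkFitDevel
[4]: cmsDriver.py step4 --mc --geometry DB:Extended --scenario pp --filetype DQM --era Run3 -n 100
"""

# Matches "[N]: cmsDriver.py ..." anywhere in a line and captures the command.
STEP_RE = re.compile(r"\[(\d+)\]:\s+(cmsDriver\.py\b.*)")


def parse_steps(text):
    """Map step number -> list of command-line tokens of that cmsDriver.py step."""
    steps = {}
    for line in text.splitlines():
        match = STEP_RE.search(line)
        if match:
            steps[int(match.group(1))] = shlex.split(match.group(2))
    return steps


def mkfit_steps(steps):
    """Return the step numbers whose arguments mention mkFit (case-insensitive)."""
    return [
        number
        for number, tokens in sorted(steps.items())
        if any("mkfit" in token.lower() for token in tokens[1:])
    ]


steps = parse_steps(MATRIX_OUTPUT)
print("steps mentioning mkFit:", mkfit_steps(steps))
```

The workflow name on the first line also contains "MkFit", which is why the match starts at the `[N]:` marker and only the command's arguments are inspected.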
assign reconstruction, tracking-pog
New categories assigned: tracking-pog,reconstruction
@slava77,@mmusich,@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks
#39811 provides another example (again in 12_5_X):
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b4c6cc/28422/summary.html
(but I don't know if this type of issue had been seen before)
I checked recent 12_5_X PRs for which PR-tests are still accessible, and I didn't find other ones affected by this issue. So, I cannot exclude that this issue somehow started only since CMSSW_12_5_X_2022-10-20-1100.
#39814 provides another example, again in 12_5_X (enough examples at this point):
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b1c64e/28423/summary.html
#39811 provides another example (again in 12_5_X): https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b4c6cc/28422/summary.html
there is one more pixelPair step track candidate relative to the baseline https://tinyurl.com/26w5bw6a This iteration is not using mkFit. So, it's not obvious why the difference would be localized in the mkfit wf.
do these diffs show up in 12_5_X only or also in 12_6_X ?
there is one more pixelPair step track candidate relative to the baseline https://tinyurl.com/26w5bw6a
and apparently 2 "existing" track candidates are different (in addition to having one more), based on e.g. chi2 distr
This iteration is not using mkFit. So, it's not obvious why the difference would be localized in the mkfit wf.
uhm, I'm wrong, pixelPair in this setup is using mkfit as well
pixelPair in this setup is using mkfit as well
right
https://github.com/cms-sw/cmssw/blob/bf6b0ccf84c8a1dd2a0c3fa2423ccb54f03c8a10/Configuration/ProcessModifiers/python/trackingMkFitDevel_cff.py#L26
do these diffs show up in 12_5_X only or also in 12_6_X ?
I am a bit surprised it doesn't show (at least there haven't been reports) in 12_6_X as well.
urgent (marking urgent the issues affecting relvals in the IBs)
It looks like this is starting to hit also master, see e.g.:
- https://github.com/cms-sw/cmssw/pull/40051#issuecomment-1336778476
- https://github.com/cms-sw/cmssw/pull/40222#issuecomment-1336212221
Also here https://github.com/cms-sw/cmssw/pull/40133#issuecomment-1336022845
And here https://github.com/cms-sw/cmssw/pull/39953#issuecomment-1338002482
Another one in https://github.com/cms-sw/cmssw/pull/40253#issuecomment-1340260714
@missirol, do you mind changing the title to remove "in 12_5_X" since that doesn't apply anymore (if that's possible at all)?
another one in https://github.com/cms-sw/cmssw/pull/40317#issuecomment-1352433774
Is the 11634.7 workflow still useful to be run in PR tests?
Another occurrence in https://github.com/cms-sw/cmssw/pull/40442
Is the 11634.7 workflow still useful to be run in PR tests?
I'd support removing this from the limited tests
IIUC, this has apparently stopped in recent PR tests - without either an explicit fix or removing the workflow ?
IIUC, this has apparently stopped in recent PR tests - without either an explicit fix or removing the workflow ?
By a strange coincidence, I noticed some differences of that kind in #40679 only a few hours before you posted this comment. They are concentrated in the HLT tracking, but they probably have the same origin as the older ones referenced here: could it be?
PS: maybe those differences are not really "spurious", i.e. not related to this issue:
- they are only in the HLT tracking, not in offline reco;
- PR #40679 does touch mkfit, in fact.
IIUC, this has apparently stopped in recent PR tests - without either an explicit fix or removing the workflow ?
By a strange coincidence, I noticed some differences of that kind in #40679 only a few hours before you posted this comment. They are concentrated in the HLT tracking, but they probably have the same origin as the older ones referenced here: could it be?
PS: maybe those differences are not really "spurious", i.e. not related to this issue:
- they are only in the HLT tracking, not in offline reco;
- PR [[MkFit] Format change for windows in json files #40679](https://github.com/cms-sw/cmssw/pull/40679) does touch mkfit, in fact.
this case is different; some change in HLT context was expected
A difference in one specific histogram in 11634.7, EgammaV/ConversionValidator/ConversionInfo/pConvVtxdRVsEta, has started to appear, e.g. in https://github.com/cms-sw/cmssw/pull/40997#issuecomment-1460743585
Is the 11634.7 workflow still useful to be run in PR tests?
I'd support removing this from the limited tests
Should we consider again removing 11634.7 from limited matrix?
I commented it out in https://github.com/cms-sw/cmssw/pull/41106. Let me know if I should have fully removed it; I was just thinking we may want to add it back later, once the changes in the wf's output are better understood.
For the record, #41471 (and backports) removed wf 11634.7 (2022 HLT and MC GT) from the 'limited matrix' in CMSSW_13_X_Y, and effectively replaced it with wf 12434.7 (2023 HLT and MC GT).
Another example in https://github.com/cms-sw/cmssw/pull/42707#issuecomment-1703882846:
- the baseline tests were run on Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)
- the PR tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
Another example in https://github.com/cms-sw/cmssw/pull/42612#issuecomment-1716403919:
- the baseline tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
- the PR tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
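The two reports above can be tabulated to check whether a baseline/PR CPU mismatch accompanies every occurrence. This is only a bookkeeping sketch: the CPU strings are copied from the reports as quoted, while the record type `TestHosts` and its field names are hypothetical.

```python
from typing import NamedTuple


class TestHosts(NamedTuple):
    pr: str            # PR in which the differences were reported
    baseline_cpu: str  # CPU model of the baseline test node
    pr_cpu: str        # CPU model of the PR test node


# CPU models as quoted in the two reports above.
REPORTS = [
    TestHosts("#42707", "Intel(R) Xeon(R) Silver 4216 CPU", "Intel(R) Xeon(R) CPU E5-2683 v4"),
    TestHosts("#42612", "Intel(R) Xeon(R) CPU E5-2683 v4", "Intel(R) Xeon(R) CPU E5-2683 v4"),
]

# True when baseline and PR tests ran on the same CPU model.
same_cpu = {report.pr: report.baseline_cpu == report.pr_cpu for report in REPORTS}

for pr, same in same_cpu.items():
    print(pr, "same CPU model" if same else "different CPU models")
```

With the values as quoted, the second report has identical CPU models on both sides, so a difference in CPU model alone would not account for every occurrence.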