cmssw spurious differences in outputs of wf `12434.7` (was `11634.7`)

Differences in outputs of PR tests for wf 11634.7 were noticed in recent PRs to 12_5_X.

In each of these cases, (1) the PR was purely technical and almost certainly incapable of changing physics outputs, and (2) PR tests ran on IB CMSSW_12_5_X_2022-10-20-1100 (but I don't know if this type of issue had been seen before).

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1ff86b/28394/summary.html
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e7334d/28397/summary.html
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-107837/28401/summary.html

In one case (https://github.com/cms-sw/cmssw/pull/39793#issuecomment-1286081487), PR tests were re-run (using the same IB as base), and after that bin-by-bin differences for wf 11634.7 disappeared, suggesting some non-reproducibility is at play.

The corresponding PRs to 12_4_X and 12_6_X (tested just as recently) didn't exhibit this issue.

Edit: originally, these spurious differences were only seen in 12_5_X; later on, they also appeared in the master branch (13_0_X at the time).

Edit (May 24th):

For the record, #41471 (and backports) removed wf 11634.7 (2022 HLT and MC GT) from the 'limited matrix' in CMSSW_13_X_Y, and effectively replaced it with wf 12434.7 (2023 HLT and MC GT).

Oct 20 '22 20:10 missirol
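The re-run check described in the issue body (same base IB, differences present in one run and gone in the next) amounts to a bin-by-bin comparison of each PR run against the baseline. Below is a minimal sketch of that comparison, not the actual PR-test comparison code: the function name `bin_by_bin_diffs`, the histogram names and the bin contents are all hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: histogram name -> list of bin contents.
Histograms = Dict[str, List[float]]


def bin_by_bin_diffs(
    baseline: Histograms, other: Histograms
) -> Dict[str, List[Tuple[int, float, float]]]:
    """For histograms present in both inputs, list the bins whose contents differ.

    Each entry is (bin index, baseline content, other content).
    """
    diffs = {}
    for name in sorted(baseline.keys() & other.keys()):
        changed = [
            (i, a, b)
            for i, (a, b) in enumerate(zip(baseline[name], other[name]))
            if a != b
        ]
        if changed:
            diffs[name] = changed
    return diffs


# Made-up numbers: the first PR run differs from the baseline in two bins,
# the re-run reproduces the baseline exactly.
baseline = {"hist_a": [0.0, 3.0, 7.0], "hist_b": [1.0, 4.0, 2.0]}
pr_first_run = {"hist_a": [0.0, 3.0, 8.0], "hist_b": [1.0, 4.0, 2.0]}
pr_rerun = {"hist_a": [0.0, 3.0, 7.0], "hist_b": [1.0, 4.0, 2.0]}

first = bin_by_bin_diffs(baseline, pr_first_run)
second = bin_by_bin_diffs(baseline, pr_rerun)
print(first)
print(second)
```

If the same code on the same base IB gives a non-empty result in one run and an empty one in the next, the differences cannot be attributed to the PR itself, which is what points to non-reproducibility.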

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Oct 20 '22 20:10 cmsbuild

assign reconstruction, tracking-pog

Oct 21 '22 14:10 mmusich

11634.7 is a dedicated extended mkFit setup.

$ runTheMatrix.py -nel  11634.7 

11634.7 2021_trackingMkFit+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano [1]: cmsDriver.py TTbar_14TeV_TuneCP5_cfi  -s GEN,SIM -n 10 --conditions auto:phase1_2022_realistic --beamspot Realistic25ns13p6TeVEarly2022Collision --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Extended --era Run3 --relval 9000,100 
                                           [2]: cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2022 --conditions auto:phase1_2022_realistic --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --customise RecoTracker/MkFit/customizeHLTIter0ToMkFit.customizeHLTIter0ToMkFit
                                           [3]: cmsDriver.py step3  -s RAW2DIGI,L1Reco,RECO,RECOSIM,PAT,NANO,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2022_realistic --datatier GEN-SIM-RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3 --procModifiers trackingMkFitDevel
                                           [4]: cmsDriver.py step4  -s HARVESTING:@standardValidation+@standardDQM+@ExtraHLT+@miniAODValidation+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2022_realistic --mc  --geometry DB:Extended --scenario pp --filetype DQM --era Run3 -n 100 

1 workflows with 4 steps

 --------------------------------------------------------------------------------

Oct 21 '22 14:10 mmusich

assign reconstruction, tracking-pog

Oct 21 '22 14:10 missirol

New categories assigned: tracking-pog,reconstruction

@slava77,@mmusich,@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

Oct 21 '22 14:10 cmsbuild

#39811 provides another example (again in 12_5_X):

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b4c6cc/28422/summary.html

Oct 21 '22 18:10 missirol

> (but I don't know if this type of issue had been seen before)

I checked recent 12_5_X PRs for which PR-tests are still accessible, and I didn't find other ones affected by this issue. So, I cannot exclude that this issue somehow started only since CMSSW_12_5_X_2022-10-20-1100.

Oct 21 '22 18:10 missirol

#39814 provides another example, again in 12_5_X (enough examples at this point):

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b1c64e/28423/summary.html

Oct 21 '22 18:10 missirol

> #39811 provides another example (again in 12_5_X):
>
> https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b4c6cc/28422/summary.html

There is one more pixelPair step track candidate relative to the baseline (https://tinyurl.com/26w5bw6a). This iteration is not using mkFit, so it's not obvious why the difference would be localized in the mkFit wf.

Do these diffs show up in 12_5_X only, or also in 12_6_X?

Oct 21 '22 18:10 slava77

> there is one more pixelPair step track candidate relative to the baseline https://tinyurl.com/26w5bw6a

and apparently 2 "existing" track candidates are different (in addition to there being one more), based on e.g. the chi2 distribution

Oct 21 '22 19:10 slava77

> This iteration is not using mkFit. So, it's not obvious why the difference would be localized in the mkfit wf.

uhm, I'm wrong, pixelPair in this setup is using mkfit as well

Oct 21 '22 19:10 slava77

> pixelPair in this setup is using mkfit as well

right

https://github.com/cms-sw/cmssw/blob/bf6b0ccf84c8a1dd2a0c3fa2423ccb54f03c8a10/Configuration/ProcessModifiers/python/trackingMkFitDevel_cff.py#L26

> do these diffs show up in 12_5_X only or also in 12_6_X ?

I am a bit surprised it doesn't show up in 12_6_X as well (at least there haven't been reports).

Oct 21 '22 19:10 mmusich

urgent (marking urgent the issues affecting relvals in the IBs)

Nov 02 '22 06:11 perrotta

It looks like this is also starting to hit master, see e.g.:

https://github.com/cms-sw/cmssw/pull/40051#issuecomment-1336778476
https://github.com/cms-sw/cmssw/pull/40222#issuecomment-1336212221

Dec 05 '22 09:12 mmusich

Also here https://github.com/cms-sw/cmssw/pull/40133#issuecomment-1336022845

Dec 05 '22 14:12 makortel

And here https://github.com/cms-sw/cmssw/pull/39953#issuecomment-1338002482

Dec 05 '22 19:12 makortel

Another one in https://github.com/cms-sw/cmssw/pull/40253#issuecomment-1340260714

Dec 07 '22 14:12 makortel

@missirol, do you mind changing the title to remove "in 12_5_X" since that doesn't apply anymore (if that's possible at all) ?

Dec 07 '22 15:12 mmusich

another one in https://github.com/cms-sw/cmssw/pull/40317#issuecomment-1352433774

Dec 15 '22 07:12 smuzaffar

Is the 11634.7 workflow still useful to run in PR tests?

Jan 09 '23 19:01 makortel

Another occurrence in https://github.com/cms-sw/cmssw/pull/40442

> Is the 11634.7 workflow still useful to run in PR tests?

I'd support removing this from the limited tests

Jan 10 '23 21:01 tvami

IIUC, this has apparently stopped in recent PR tests - without either an explicit fix or removing the workflow ?

Feb 05 '23 22:02 mmusich

> IIUC, this has apparently stopped in recent PR tests - without either an explicit fix or removing the workflow ?

By a strange coincidence, I noticed some differences of that kind in #40679 only a few hours before you posted this comment. They are concentrated in the HLT tracking, but they probably still have the same origin as the older ones referenced here: could that be?

PS: maybe those differences are not really "spurious", i.e. not related to this issue:

* they are only in the HLT tracking, not in offline reco;
* PR #40679 does touch mkfit, in fact.

Feb 06 '23 06:02 perrotta

> > IIUC, this has apparently stopped in recent PR tests - without either an explicit fix or removing the workflow ?
>
> By a strange coincidence, I noticed some differences of that kind in #40679 only a few hours before you posted this comment. They are concentrated in the HLT tracking, but they probably still have the same origin as the older ones referenced here: could that be?
>
> PS: maybe those differences are not really "spurious", i.e. not related to this issue:
>
> * they are only in the HLT tracking, not in offline reco;
> * PR [[MkFit] Format change for windows in json files #40679](https://github.com/cms-sw/cmssw/pull/40679) does touch mkfit, in fact.

this case is different; some change in HLT context was expected

Feb 06 '23 15:02 slava77

A difference in one specific histogram in 11634.7, EgammaV/ConversionValidator/ConversionInfo/pConvVtxdRVsEta has started to appear, e.g. in https://github.com/cms-sw/cmssw/pull/40997#issuecomment-1460743585

Mar 08 '23 22:03 makortel

> > Is the 11634.7 workflow still useful to run in PR tests?
>
> I'd support removing this from the limited tests

Should we again consider removing 11634.7 from the limited matrix?

Mar 20 '23 14:03 makortel

I commented it out in https://github.com/cms-sw/cmssw/pull/41106; let me know if I should have removed it fully. I was just thinking we may want to add it back later, once the wf's output changes are better understood.

Mar 20 '23 14:03 tvami

For the record, #41471 (and backports) removed wf 11634.7 (2022 HLT and MC GT) from the 'limited matrix' in CMSSW_13_X_Y, and effectively replaced it with wf 12434.7 (2023 HLT and MC GT).

May 24 '23 05:05 missirol

Another example in https://github.com/cms-sw/cmssw/pull/42707#issuecomment-1703882846 :

* the baseline tests were run on Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)
* the PR tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)

Sep 02 '23 17:09 missirol

Another example in https://github.com/cms-sw/cmssw/pull/42612#issuecomment-1716403919

* the baseline tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
* the PR tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)

Sep 13 '23 07:09 mmusich
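The last two reports record which CPU model ran the baseline and which ran the PR tests. As a rough sketch of how such reports could be tallied, the snippet below splits them by whether the two CPU models match. The CPU strings and PR numbers are copied from the two comments above; the dictionary layout and the helper name `split_by_cpu_match` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

BROADWELL = "Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)"
CASCADE_LAKE = "Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)"

# One entry per report of spurious differences (values taken from the thread).
reports = [
    {"pr": "#42707", "baseline_cpu": CASCADE_LAKE, "pr_cpu": BROADWELL},
    {"pr": "#42612", "baseline_cpu": BROADWELL, "pr_cpu": BROADWELL},
]


def split_by_cpu_match(
    entries: List[Dict[str, str]]
) -> Tuple[List[str], List[str]]:
    """Return (PRs tested on the same CPU model as the baseline, PRs tested on a different one)."""
    same = [e["pr"] for e in entries if e["baseline_cpu"] == e["pr_cpu"]]
    different = [e["pr"] for e in entries if e["baseline_cpu"] != e["pr_cpu"]]
    return same, different


same_cpu, different_cpu = split_by_cpu_match(reports)
print("same CPU model:", same_cpu)
print("different CPU model:", different_cpu)
```

With only these two reports, differences show up both with mismatched and with identical CPU models, so a CPU mismatch does not look necessary for the effect. Two data points are far too few to draw any firmer conclusion.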
