spurious differences in outputs of wf `12434.7` (was `11634.7`)

Open missirol opened this issue 2 years ago • 46 comments

Differences in outputs of PR tests for wf 11634.7 were noticed in recent PRs to 12_5_X.

In each of these cases, (1) the PR was purely technical and almost certainly incapable of changing physics outputs, and (2) PR tests ran on IB CMSSW_12_5_X_2022-10-20-1100 (but I don't know if this type of issue had been seen before).

  • https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1ff86b/28394/summary.html
  • https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e7334d/28397/summary.html
  • https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-107837/28401/summary.html

In one case (https://github.com/cms-sw/cmssw/pull/39793#issuecomment-1286081487), PR tests were re-run (using the same IB as base), and after that bin-by-bin differences for wf 11634.7 disappeared, suggesting some non-reproducibility is at play.
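For illustration, the kind of bin-by-bin check referred to here can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual PR-test comparison tool: the function name `differing_bins`, the `rel_tol` parameter, and all bin contents are hypothetical.

```python
def differing_bins(baseline, candidate, rel_tol=0.0):
    """Return (index, baseline_value, candidate_value) for each differing bin.

    Hypothetical helper: histograms are plain lists of bin contents.
    With rel_tol=0.0 any difference at all is reported.
    """
    if len(baseline) != len(candidate):
        raise ValueError("histograms have different binning")
    diffs = []
    for i, (b, c) in enumerate(zip(baseline, candidate)):
        # The allowed difference scales with the larger of the two bin contents.
        if abs(b - c) > rel_tol * max(abs(b), abs(c)):
            diffs.append((i, b, c))
    return diffs


# Hypothetical bin contents: same baseline, two runs of the same PR tests.
baseline = [0.0, 3.0, 7.0, 2.0]
pr_run_1 = [0.0, 3.0, 8.0, 2.0]  # first run: one bin differs
pr_run_2 = [0.0, 3.0, 7.0, 2.0]  # re-run: identical to the baseline

first = differing_bins(baseline, pr_run_1)
second = differing_bins(baseline, pr_run_2)
print(first)
print(second)
```

A difference that is reported in the first run and gone in the re-run, with the same baseline and the same code, is the signature of non-reproducibility described above.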

The corresponding PRs to 12_4_X and 12_6_X (tested just as recently) didn't exhibit this issue.

Edit: originally, these spurious differences were only seen in 12_5_X; later on, they also appeared in the master branch (13_0_X at the time).

Edit (May 24th):

For the record, #41471 (and backports) removed wf 11634.7 (2022 HLT and MC GT) from the 'limited matrix' in CMSSW_13_X_Y, and effectively replaced it with wf 12434.7 (2023 HLT and MC GT).

missirol avatar Oct 20 '22 20:10 missirol

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Oct 20 '22 20:10 cmsbuild

assign reconstruction, tracking-pog

mmusich avatar Oct 21 '22 14:10 mmusich

11634.7 is a dedicated extended mkFit setup.

$ runTheMatrix.py -nel  11634.7 

11634.7 2021_trackingMkFit+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano [1]: cmsDriver.py TTbar_14TeV_TuneCP5_cfi  -s GEN,SIM -n 10 --conditions auto:phase1_2022_realistic --beamspot Realistic25ns13p6TeVEarly2022Collision --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Extended --era Run3 --relval 9000,100 
                                           [2]: cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2022 --conditions auto:phase1_2022_realistic --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --customise RecoTracker/MkFit/customizeHLTIter0ToMkFit.customizeHLTIter0ToMkFit
                                           [3]: cmsDriver.py step3  -s RAW2DIGI,L1Reco,RECO,RECOSIM,PAT,NANO,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2022_realistic --datatier GEN-SIM-RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3 --procModifiers trackingMkFitDevel
                                           [4]: cmsDriver.py step4  -s HARVESTING:@standardValidation+@standardDQM+@ExtraHLT+@miniAODValidation+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2022_realistic --mc  --geometry DB:Extended --scenario pp --filetype DQM --era Run3 -n 100 

1 workflows with 4 steps

 -------------------------------------------------------------------------------- 

mmusich avatar Oct 21 '22 14:10 mmusich

assign reconstruction, tracking-pog

missirol avatar Oct 21 '22 14:10 missirol

New categories assigned: tracking-pog,reconstruction

@slava77,@mmusich,@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Oct 21 '22 14:10 cmsbuild

#39811 provides another example (again in 12_5_X):

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b4c6cc/28422/summary.html

missirol avatar Oct 21 '22 18:10 missirol

> (but I don't know if this type of issue had been seen before)

I checked recent 12_5_X PRs for which PR-tests are still accessible, and I didn't find other ones affected by this issue. So, I cannot exclude that this issue somehow started only since CMSSW_12_5_X_2022-10-20-1100.

missirol avatar Oct 21 '22 18:10 missirol

#39814 provides another example, again in 12_5_X (enough examples at this point):

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b1c64e/28423/summary.html

missirol avatar Oct 21 '22 18:10 missirol

> #39811 provides another example (again in 12_5_X):
>
> https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b4c6cc/28422/summary.html

there is one more pixelPair step track candidate relative to the baseline: https://tinyurl.com/26w5bw6a

This iteration is not using mkFit. So, it's not obvious why the difference would be localized in the mkfit wf.

do these diffs show up in 12_5_X only or also in 12_6_X?

slava77 avatar Oct 21 '22 18:10 slava77

> there is one more pixelPair step track candidate relative to the baseline https://tinyurl.com/26w5bw6a

and apparently 2 "existing" track candidates are different (in addition to having one more), based on e.g. chi2 distr image

slava77 avatar Oct 21 '22 19:10 slava77

> This iteration is not using mkFit. So, it's not obvious why the difference would be localized in the mkfit wf.

uhm, I'm wrong, pixelPair in this setup is using mkfit as well

slava77 avatar Oct 21 '22 19:10 slava77

> pixelPair in this setup is using mkfit as well

right

https://github.com/cms-sw/cmssw/blob/bf6b0ccf84c8a1dd2a0c3fa2423ccb54f03c8a10/Configuration/ProcessModifiers/python/trackingMkFitDevel_cff.py#L26

> do these diffs show up in 12_5_X only or also in 12_6_X ?

I am a bit surprised it doesn't show (at least there haven't been reports) in 12_6_X as well.

mmusich avatar Oct 21 '22 19:10 mmusich

urgent (marking urgent the issues affecting relvals in the IBs)

perrotta avatar Nov 02 '22 06:11 perrotta

It looks like this is starting to hit also master, see e.g.:

  • https://github.com/cms-sw/cmssw/pull/40051#issuecomment-1336778476
  • https://github.com/cms-sw/cmssw/pull/40222#issuecomment-1336212221

mmusich avatar Dec 05 '22 09:12 mmusich

Also here https://github.com/cms-sw/cmssw/pull/40133#issuecomment-1336022845

makortel avatar Dec 05 '22 14:12 makortel

And here https://github.com/cms-sw/cmssw/pull/39953#issuecomment-1338002482

makortel avatar Dec 05 '22 19:12 makortel

Another one in https://github.com/cms-sw/cmssw/pull/40253#issuecomment-1340260714

makortel avatar Dec 07 '22 14:12 makortel

@missirol, do you mind changing the title to remove "in 12_5_X" since that doesn't apply anymore (if that's possible at all) ?

mmusich avatar Dec 07 '22 15:12 mmusich

another one in https://github.com/cms-sw/cmssw/pull/40317#issuecomment-1352433774

smuzaffar avatar Dec 15 '22 07:12 smuzaffar

Is the 11634.7 workflow still useful to be run in PR tests?

makortel avatar Jan 09 '23 19:01 makortel

Another occurrence in https://github.com/cms-sw/cmssw/pull/40442

> Is the 11634.7 workflow still useful to be run in PR tests?

I'd support removing this from the limited tests

tvami avatar Jan 10 '23 21:01 tvami

IIUC, this has apparently stopped in recent PR tests - without either an explicit fix or removing the workflow ?

mmusich avatar Feb 05 '23 22:02 mmusich

> IIUC, this has apparently stopped in recent PR tests - without either an explicit fix or removing the workflow ?

By a strange coincidence, I noticed some differences of that kind in #40679 only a few hours before you posted this comment. They are concentrated in the HLT tracking, but they probably still have the same origin as the older ones referenced here: could it be?

PS: maybe those differences are not really "spurious", i.e. not related to this issue:

  • they are only in the HLT tracking, not in offline reco;
  • PR #40679 does touch mkfit, in fact.

perrotta avatar Feb 06 '23 06:02 perrotta

> > IIUC, this has apparently stopped in recent PR tests - without either an explicit fix or removing the workflow ?
>
> By a strange coincidence, I noticed some differences of that kind in #40679 only a few hours before you posted this comment. They are concentrated in the HLT tracking, but they probably still have the same origin as the older ones referenced here: could it be?
>
> PS: maybe those differences are not really "spurious", i.e. not related to this issue:
>
> * they are only in the HLT tracking, not in offline reco;
> * PR [[MkFit] Format change for windows in json files #40679](https://github.com/cms-sw/cmssw/pull/40679) does touch mkfit, in fact.

this case is different; some change in HLT context was expected

slava77 avatar Feb 06 '23 15:02 slava77

A difference in one specific histogram in 11634.7, EgammaV/ConversionValidator/ConversionInfo/pConvVtxdRVsEta, has started to appear, e.g. in https://github.com/cms-sw/cmssw/pull/40997#issuecomment-1460743585

makortel avatar Mar 08 '23 22:03 makortel

> > Is the 11634.7 workflow still useful to be run in PR tests?
>
> I'd support removing this from the limited tests

Should we consider again removing 11634.7 from the limited matrix?

makortel avatar Mar 20 '23 14:03 makortel

I commented it out in https://github.com/cms-sw/cmssw/pull/41106. Let me know if I should have fully removed it; I was just thinking we may want to add it back later, once the wf's output changes are better understood.

tvami avatar Mar 20 '23 14:03 tvami

For the record, #41471 (and backports) removed wf 11634.7 (2022 HLT and MC GT) from the 'limited matrix' in CMSSW_13_X_Y, and effectively replaced it with wf 12434.7 (2023 HLT and MC GT).

missirol avatar May 24 '23 05:05 missirol

Another example in https://github.com/cms-sw/cmssw/pull/42707#issuecomment-1703882846 :

  • the baseline tests were run on Intel(R) Xeon(R) Silver 4216 CPU (Cascade lake)
  • the PR tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)

missirol avatar Sep 02 '23 17:09 missirol

Another example in https://github.com/cms-sw/cmssw/pull/42612#issuecomment-1716403919

  • the baseline tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
  • the PR tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
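To record which CPU model a given test ran on, something like the following could be used. This is a minimal sketch, assuming Linux-style `/proc/cpuinfo` text; the helper name `cpu_model` and the sample input are hypothetical and not part of the CMSSW test infrastructure.

```python
def cpu_model(cpuinfo_text):
    """Return the first 'model name' value found in /proc/cpuinfo-style text.

    Hypothetical helper; returns None if no 'model name' line is present.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


# Hypothetical sample input, using a model string quoted in this thread.
sample = "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU E5-2683 v4\n"
model = cpu_model(sample)
print(model)
```

On a Linux node, the input would come from reading `/proc/cpuinfo`, e.g. `cpu_model(open("/proc/cpuinfo").read())`.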

mmusich avatar Sep 13 '23 07:09 mmusich