cmssw Speedup (grouped)limitedCandidates by changing the implementation of the candidates' number limitation.

Currenlty, in both (grouped)limitedCandidates, for each "step" candidates are first all collected and then the size is reduced based on some score. This is equivalent to not "pushing" a new candidate if worse than the current worst once the collection has reached the max allowed size. (but for candidates with equal score). The implementation has been therefore modified to this latter mechanism that is more performant.

It should also be noticed that the final sorting is not required and indeed in groupedLimitedCandidates there is NO final sort.

The Size of TempTrajectory has also been reduced to speed up move and swap. Took the opportunity to reduce the size of Trajectory as well and to cleanup in general CkfTrajectoryBuilder code.

In a HLT menu time of limitedCandidates improves of 20% while the one of groupedLimitedCandidates improves of a bit more than 10%

HLT throughput improves of a solid 1.5% at least.

Purely technical. Negligible regressions may appear in case of "candidates" with identical score.

May 05 '24 12:05 VinInn

cms-bot internal usage

May 05 '24 12:05 cmsbuild

@cmsbuild please test

May 05 '24 12:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44904/40174

This PR adds an extra 28KB to repository

May 05 '24 12:05 cmsbuild

A new Pull Request was created by @VinInn for master.

It involves the following packages:

DataFormats/TrackCandidate (reconstruction)
TrackingTools/PatternTools (reconstruction)

@mandrenguyen, @jfernan2 can you please review it and eventually sign? Thanks. @HuguesBrun, @jhgoh, @felicepantaleo, @gpetruc, @abbiendi, @VinInn, @mtosi, @mmusich, @bellan, @dgulhan, @andrea21z, @GiacomoSguazzoni, @CeliaFernandez, @missirol, @Fedespring, @cericeci, @rovere, @trocino, @VourMa, @JanFSchulte this is something you requested to watch as well. @antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

May 05 '24 12:05 cmsbuild

-1

Failed Tests: UnitTests RelVals RelVals-INPUT AddOn Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2dcb23/39240/summary.html COMMIT: 94698036de5f8ec280091aee49d52635970708b5 CMSSW: CMSSW_14_1_X_2024-05-05-0000/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44904/39240/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 6 errors in the following unit tests:

---> test testTauEmbeddingWorkflow2016postVFP had ERRORS
---> test testTauEmbeddingWorkflow2016preVFP had ERRORS
---> test testTauEmbeddingWorkflow2017 had ERRORS
and more ...

RelVals

4.53A fatal system signal has occurred: segmentation violation
139.001A fatal system signal has occurred: segmentation violation
140.023A fatal system signal has occurred: segmentation violation

Expand to see more relval errors ...

RelVals-INPUT

4.64.6_MinimumBias2010A/step2_MinimumBias2010A.log
138.4138.4_PromptCollisions2021/step2_PromptCollisions2021.log
138.5138.5_ExpressCollisions2021/step2_ExpressCollisions2021.log

Expand to see more relval errors ...

AddOn Tests

hlt_mc_Fake

A fatal system signal has occurred: segmentation violation

hlt_mc_Fake1

A fatal system signal has occurred: segmentation violation

hlt_mc_Fake2

A fatal system signal has occurred: segmentation violation

Expand to see more addon errors ...

May 05 '24 14:05 cmsbuild

Took the opportunity to reduce the size of Trajectory as well.

is this from reordering the data members?

May 05 '24 14:05 slava77

Sorting of TempTrajectory in CkfTrajectoryBuilder::limitedCandidates takes almost 1% of HLT time. Reducing it's size to two pointers this timing is reduced by half.

wouldn't it be more practical to update CkfTrajectoryBuilder::limitedCandidates to sort indices to trajectories (or did I misunderstand the comment/code)?

What's the cost of now having more scattered data in memory? Is it negligible? It would be nice to see some profiler or at least the timing piechart.

May 05 '24 14:05 slava77

@slava77 , sorting indices and caching score. maybe yes. Was an option I considered. Not sure it is more practical. The data are not really much more scattered in memory, there is just one more (fast) indirection.

As said the code is now faster at least for HLT. In reco it crashes for reasons to be understood (not obvious at all: it seems that not everything has ben recompiled?) will now try some relval.

May 05 '24 14:05 VinInn

Took the opportunity to reduce the size of Trajectory as well.

is this from reordering the data members?

In part. also moving from short (int16) to uint8 (nhits is definetively less than 255. HitPattern is limited to 76)

May 05 '24 14:05 VinInn

crash reason understood: comes from a test of TempTrajectory validity. A clever fix did not work. Will try another or just revert to a full default constructor...

May 05 '24 17:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44904/40183

This PR adds an extra 52KB to repository

May 06 '24 12:05 cmsbuild

Pull request #44904 was updated. @cmsbuild, @jfernan2, @mandrenguyen can you please check and sign again.

May 06 '24 12:05 cmsbuild

@cmsbuild please test

May 06 '24 16:05 VinInn

-1

Failed Tests: UnitTests RelVals RelVals-INPUT AddOn Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2dcb23/39264/summary.html COMMIT: 3fbdd16e8072eb38b080c333274bd09bc81896f6 CMSSW: CMSSW_14_1_X_2024-05-06-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44904/39264/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 7 errors in the following unit tests:

---> test testPhase2PixelNtuple had ERRORS
---> test testTauEmbeddingWorkflow2016postVFP had ERRORS
---> test testTauEmbeddingWorkflow2016preVFP had ERRORS
and more ...

RelVals

4.53A fatal system signal has occurred: segmentation violation
140.023A fatal system signal has occurred: segmentation violation
136.731A fatal system signal has occurred: segmentation violation

Expand to see more relval errors ...

RelVals-INPUT

136.9136.9_RunDoubleMuon2016C/step2_RunDoubleMuon2016C.log
141.011141.011_RunEGamma2023B/step2_RunEGamma2023B.log
141.033141.033_RunBTagMu2023C/step2_RunBTagMu2023C.log

Expand to see more relval errors ...

AddOn Tests

hlt_mc_Fake

A fatal system signal has occurred: segmentation violation

hlt_mc_Fake1

A fatal system signal has occurred: segmentation violation

hlt_mc_Fake2

A fatal system signal has occurred: segmentation violation

Expand to see more addon errors ...

May 06 '24 18:05 cmsbuild

type performance-improvements

May 07 '24 06:05 mmusich

strange as I run successfully 100 events in 11634.0_TTbar_14TeV+2021 (step2 and 3)

May 07 '24 07:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44904/40194

This PR adds an extra 20KB to repository

May 07 '24 09:05 cmsbuild

Pull request #44904 was updated. @cmsbuild, @mandrenguyen, @jfernan2 can you please check and sign again.

May 07 '24 09:05 cmsbuild

apparently the MULTIARCH (v3) was not crashing while the standard sse3 was....

May 07 '24 09:05 VinInn

@cmsbuild please test

May 07 '24 11:05 VinInn

-1

Failed Tests: RelVals-INPUT Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2dcb23/39275/summary.html COMMIT: fab8ae6bdd519685ac479ab9cf98bbdd461df972 CMSSW: CMSSW_14_1_X_2024-05-07-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44904/39275/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

2500.22500.2_NANOmc124Xrun3/step2_NANOmc124Xrun3.log

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 49
DQMHistoTests: Total histograms compared: 3438984
DQMHistoTests: Total failures: 0
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3438964
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
Checked 206 log files, 170 edm output root files, 49 DQM output files
TriggerResults: no differences found

May 07 '24 15:05 cmsbuild

I suppose 2500.2 is not caused by this PR

May 07 '24 16:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44904/40197

This PR adds an extra 64KB to repository

May 07 '24 16:05 cmsbuild

Pull request #44904 was updated. @jfernan2, @mandrenguyen, @cmsbuild can you please check and sign again.

May 07 '24 16:05 cmsbuild

@cmsbuild please test

May 07 '24 16:05 VinInn

-1

Failed Tests: UnitTests RelVals-INPUT Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2dcb23/39287/summary.html COMMIT: 1c90a1910c460c11b423fe9e4f2ad0d72c56486e CMSSW: CMSSW_14_1_X_2024-05-07-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44904/39287/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 12 errors in the following unit tests:

---> test SubmitPVsplit had ERRORS
---> test SubmitPVrbr had ERRORS
---> test validateAlignments had ERRORS
and more ...

RelVals-INPUT

2500.22500.2_NANOmc124Xrun3/step2_NANOmc124Xrun3.log

Comparison Summary

Summary:

You potentially removed 15 lines from the logs
Reco comparison results: 2036 differences found in the comparisons
DQMHistoTests: Total files compared: 49
DQMHistoTests: Total histograms compared: 3438984
DQMHistoTests: Total failures: 1532
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3437432
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
Checked 206 log files, 170 edm output root files, 49 DQM output files
TriggerResults: no differences found

May 07 '24 20:05 cmsbuild

the failure does not seem to be associated to this PR
the DQM differences are expected. Maybe a high stat validation is required as the logic of limitedCandidates is in principle changed

May 08 '24 09:05 VinInn

assign hlt

May 08 '24 11:05 jfernan2

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

May 08 '24 11:05 cmsbuild

@cmsbuild, please test

failures are transient

May 08 '24 11:05 mmusich