cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Speedup (grouped)limitedCandidates by changing the implementation of the candidates' number limitation.

Open VinInn opened this issue 1 year ago • 56 comments

Currenlty, in both (grouped)limitedCandidates, for each "step" candidates are first all collected and then the size is reduced based on some score. This is equivalent to not "pushing" a new candidate if worse than the current worst once the collection has reached the max allowed size. (but for candidates with equal score). The implementation has been therefore modified to this latter mechanism that is more performant.

It should also be noticed that the final sorting is not required and indeed in groupedLimitedCandidates there is NO final sort.

The Size of TempTrajectory has also been reduced to speed up move and swap. Took the opportunity to reduce the size of Trajectory as well and to cleanup in general CkfTrajectoryBuilder code.

In a HLT menu time of limitedCandidates improves of 20% while the one of groupedLimitedCandidates improves of a bit more than 10%

HLT throughput improves of a solid 1.5% at least.

Purely technical. Negligible regressions may appear in case of "candidates" with identical score.

VinInn avatar May 05 '24 12:05 VinInn

cms-bot internal usage

cmsbuild avatar May 05 '24 12:05 cmsbuild

@cmsbuild please test

VinInn avatar May 05 '24 12:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44904/40174

  • This PR adds an extra 28KB to repository

cmsbuild avatar May 05 '24 12:05 cmsbuild

A new Pull Request was created by @VinInn for master.

It involves the following packages:

  • DataFormats/TrackCandidate (reconstruction)
  • TrackingTools/PatternTools (reconstruction)

@mandrenguyen, @jfernan2 can you please review it and eventually sign? Thanks. @HuguesBrun, @jhgoh, @felicepantaleo, @gpetruc, @abbiendi, @VinInn, @mtosi, @mmusich, @bellan, @dgulhan, @andrea21z, @GiacomoSguazzoni, @CeliaFernandez, @missirol, @Fedespring, @cericeci, @rovere, @trocino, @VourMa, @JanFSchulte this is something you requested to watch as well. @antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

cmsbuild avatar May 05 '24 12:05 cmsbuild

-1

Failed Tests: UnitTests RelVals RelVals-INPUT AddOn Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2dcb23/39240/summary.html COMMIT: 94698036de5f8ec280091aee49d52635970708b5 CMSSW: CMSSW_14_1_X_2024-05-05-0000/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44904/39240/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 6 errors in the following unit tests:

---> test testTauEmbeddingWorkflow2016postVFP had ERRORS
---> test testTauEmbeddingWorkflow2016preVFP had ERRORS
---> test testTauEmbeddingWorkflow2017 had ERRORS
and more ...

RelVals

  • 4.53A fatal system signal has occurred: segmentation violation
  • 139.001A fatal system signal has occurred: segmentation violation
  • 140.023A fatal system signal has occurred: segmentation violation
Expand to see more relval errors ...

RelVals-INPUT

  • 4.64.6_MinimumBias2010A/step2_MinimumBias2010A.log
  • 138.4138.4_PromptCollisions2021/step2_PromptCollisions2021.log
  • 138.5138.5_ExpressCollisions2021/step2_ExpressCollisions2021.log
Expand to see more relval errors ...

AddOn Tests

A fatal system signal has occurred: segmentation violation
A fatal system signal has occurred: segmentation violation
A fatal system signal has occurred: segmentation violation
Expand to see more addon errors ...

cmsbuild avatar May 05 '24 14:05 cmsbuild

Took the opportunity to reduce the size of Trajectory as well.

is this from reordering the data members?

slava77 avatar May 05 '24 14:05 slava77

Sorting of TempTrajectory in CkfTrajectoryBuilder::limitedCandidates takes almost 1% of HLT time. Reducing it's size to two pointers this timing is reduced by half.

wouldn't it be more practical to update CkfTrajectoryBuilder::limitedCandidates to sort indices to trajectories (or did I misunderstand the comment/code)?

What's the cost of now having more scattered data in memory? Is it negligible? It would be nice to see some profiler or at least the timing piechart.

slava77 avatar May 05 '24 14:05 slava77

@slava77 , sorting indices and caching score. maybe yes. Was an option I considered. Not sure it is more practical. The data are not really much more scattered in memory, there is just one more (fast) indirection.

As said the code is now faster at least for HLT. In reco it crashes for reasons to be understood (not obvious at all: it seems that not everything has ben recompiled?) will now try some relval.

VinInn avatar May 05 '24 14:05 VinInn

Took the opportunity to reduce the size of Trajectory as well.

is this from reordering the data members?

In part. also moving from short (int16) to uint8 (nhits is definetively less than 255. HitPattern is limited to 76)

VinInn avatar May 05 '24 14:05 VinInn

crash reason understood: comes from a test of TempTrajectory validity. A clever fix did not work. Will try another or just revert to a full default constructor...

VinInn avatar May 05 '24 17:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44904/40183

  • This PR adds an extra 52KB to repository

cmsbuild avatar May 06 '24 12:05 cmsbuild

Pull request #44904 was updated. @cmsbuild, @jfernan2, @mandrenguyen can you please check and sign again.

cmsbuild avatar May 06 '24 12:05 cmsbuild

@cmsbuild please test

VinInn avatar May 06 '24 16:05 VinInn

-1

Failed Tests: UnitTests RelVals RelVals-INPUT AddOn Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2dcb23/39264/summary.html COMMIT: 3fbdd16e8072eb38b080c333274bd09bc81896f6 CMSSW: CMSSW_14_1_X_2024-05-06-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44904/39264/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 7 errors in the following unit tests:

---> test testPhase2PixelNtuple had ERRORS
---> test testTauEmbeddingWorkflow2016postVFP had ERRORS
---> test testTauEmbeddingWorkflow2016preVFP had ERRORS
and more ...

RelVals

  • 4.53A fatal system signal has occurred: segmentation violation
  • 140.023A fatal system signal has occurred: segmentation violation
  • 136.731A fatal system signal has occurred: segmentation violation
Expand to see more relval errors ...

RelVals-INPUT

  • 136.9136.9_RunDoubleMuon2016C/step2_RunDoubleMuon2016C.log
  • 141.011141.011_RunEGamma2023B/step2_RunEGamma2023B.log
  • 141.033141.033_RunBTagMu2023C/step2_RunBTagMu2023C.log
Expand to see more relval errors ...

AddOn Tests

A fatal system signal has occurred: segmentation violation
A fatal system signal has occurred: segmentation violation
A fatal system signal has occurred: segmentation violation
Expand to see more addon errors ...

cmsbuild avatar May 06 '24 18:05 cmsbuild

type performance-improvements

mmusich avatar May 07 '24 06:05 mmusich

strange as I run successfully 100 events in 11634.0_TTbar_14TeV+2021 (step2 and 3)

VinInn avatar May 07 '24 07:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44904/40194

  • This PR adds an extra 20KB to repository

cmsbuild avatar May 07 '24 09:05 cmsbuild

Pull request #44904 was updated. @cmsbuild, @mandrenguyen, @jfernan2 can you please check and sign again.

cmsbuild avatar May 07 '24 09:05 cmsbuild

apparently the MULTIARCH (v3) was not crashing while the standard sse3 was....

VinInn avatar May 07 '24 09:05 VinInn

@cmsbuild please test

VinInn avatar May 07 '24 11:05 VinInn

-1

Failed Tests: RelVals-INPUT Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2dcb23/39275/summary.html COMMIT: fab8ae6bdd519685ac479ab9cf98bbdd461df972 CMSSW: CMSSW_14_1_X_2024-05-07-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44904/39275/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

  • 2500.22500.2_NANOmc124Xrun3/step2_NANOmc124Xrun3.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3438984
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3438964
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 206 log files, 170 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar May 07 '24 15:05 cmsbuild

I suppose 2500.2 is not caused by this PR

VinInn avatar May 07 '24 16:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44904/40197

  • This PR adds an extra 64KB to repository

cmsbuild avatar May 07 '24 16:05 cmsbuild

Pull request #44904 was updated. @jfernan2, @mandrenguyen, @cmsbuild can you please check and sign again.

cmsbuild avatar May 07 '24 16:05 cmsbuild

@cmsbuild please test

VinInn avatar May 07 '24 16:05 VinInn

-1

Failed Tests: UnitTests RelVals-INPUT Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2dcb23/39287/summary.html COMMIT: 1c90a1910c460c11b423fe9e4f2ad0d72c56486e CMSSW: CMSSW_14_1_X_2024-05-07-1100/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44904/39287/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 12 errors in the following unit tests:

---> test SubmitPVsplit had ERRORS
---> test SubmitPVrbr had ERRORS
---> test validateAlignments had ERRORS
and more ...

RelVals-INPUT

  • 2500.22500.2_NANOmc124Xrun3/step2_NANOmc124Xrun3.log

Comparison Summary

Summary:

cmsbuild avatar May 07 '24 20:05 cmsbuild

  1. the failure does not seem to be associated to this PR
  2. the DQM differences are expected. Maybe a high stat validation is required as the logic of limitedCandidates is in principle changed

VinInn avatar May 08 '24 09:05 VinInn

assign hlt

jfernan2 avatar May 08 '24 11:05 jfernan2

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar May 08 '24 11:05 cmsbuild

@cmsbuild, please test

  • failures are transient

mmusich avatar May 08 '24 11:05 mmusich