Add ParticleFlow Client for Online DQM - GPUvsCPU comparison

Open waredjeb opened this issue 1 year ago • 10 comments

This PR creates a new Online DQM client for Particle Flow, taking inspiration from the HCAL GPU client. Currently the PF client will be used for monitoring PFCluster@Alpaka, comparing the GPU with the CPU version from the DQMGPUvsCPU stream.
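
To make the idea of a GPU-versus-CPU comparison concrete, here is a minimal sketch in plain Python. It is not the code of the client added by this PR: the `Cluster` type, the `seed_id` matching key, the `compare_clusters` helper, and the tolerance are all assumptions chosen for illustration. It only shows the general pattern of matching two cluster collections and recording where they disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical stand-in for a PF cluster (not reco::PFCluster)."""
    seed_id: int    # assumed identifier used to match CPU and GPU clusters
    energy: float   # cluster energy


def compare_clusters(cpu, gpu, rel_tol=1e-4):
    """Match clusters by seed_id and collect the disagreements."""
    gpu_by_seed = {c.seed_id: c for c in gpu}
    cpu_seeds = {c.seed_id for c in cpu}

    mismatched = []     # (seed_id, cpu_energy, gpu_energy)
    unmatched_cpu = []  # seeds present only in the CPU collection
    for c in cpu:
        g = gpu_by_seed.get(c.seed_id)
        if g is None:
            unmatched_cpu.append(c.seed_id)
            continue
        scale = max(abs(c.energy), abs(g.energy))
        if abs(c.energy - g.energy) > rel_tol * scale:
            mismatched.append((c.seed_id, c.energy, g.energy))

    # Seeds present only in the GPU collection.
    unmatched_gpu = [g.seed_id for g in gpu if g.seed_id not in cpu_seeds]
    return {
        "n_cpu": len(cpu),
        "n_gpu": len(gpu),
        "mismatched": mismatched,
        "unmatched_cpu": unmatched_cpu,
        "unmatched_gpu": unmatched_gpu,
    }


cpu = [Cluster(1, 10.0), Cluster(2, 5.0), Cluster(3, 1.0)]
gpu = [Cluster(1, 10.0), Cluster(2, 5.5), Cluster(4, 2.0)]
result = compare_clusters(cpu, gpu)
```

In a real DQM task the quantities collected here would presumably be filled into histograms rather than returned as a dictionary.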

Test

Tested locally on lxplus, following the instructions on the DQM Twiki, with the command:

cmsRun DQM/Integration/python/clients/hcalgpu_dqm_sourceclient-live_cfg.py runInputDir=/eos/cms/store/group/comm_dqm/ runNumber=380649 runkey=pp_run scanOnce=True

Just for completeness, adding some of the plots produced by the client

[Images: four plots produced by the client]

Backport

Probably a backport to CMSSW_14_0_X will be needed

@missirol @swagata87 @stahlleiton @hatakeyamak @jsamudio

waredjeb avatar May 28 '24 15:05 waredjeb

cms-bot internal usage

cmsbuild avatar May 28 '24 15:05 cmsbuild

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40395

  • This PR adds an extra 20KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

  • code-format: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40395/code-format.patch e.g. curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40395/code-format.patch | patch -p1 You can also run scram build code-format to apply code format directly

cmsbuild avatar May 28 '24 15:05 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40397

  • This PR adds an extra 20KB to repository

cmsbuild avatar May 28 '24 15:05 cmsbuild

A new Pull Request was created by @waredjeb for master.

It involves the following packages:

  • DQM/Integration (dqm)
  • DQM/PFTasks (****)

The following packages do not have a category, yet:

DQM/PFTasks Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@syuvivida, @rvenditti, @nothingface0, @cmsbuild, @tjavaid, @antoniovagnerini can you please review it and eventually sign? Thanks. @francescobrivio, @batinkov, @threus this is something you requested to watch as well. @antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

cmsbuild avatar May 28 '24 15:05 cmsbuild

type pf

swagata87 avatar May 29 '24 08:05 swagata87

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40409

  • This PR adds an extra 24KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

  • code-format: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40409/code-format.patch e.g. curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40409/code-format.patch | patch -p1 You can also run scram build code-format to apply code format directly

cmsbuild avatar May 29 '24 08:05 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40410

  • This PR adds an extra 20KB to repository

cmsbuild avatar May 29 '24 09:05 cmsbuild

Pull request #45079 was updated. @cmsbuild, @syuvivida, @rvenditti, @tjavaid, @nothingface0, @antoniovagnerini can you please check and sign again.

cmsbuild avatar May 29 '24 09:05 cmsbuild

@cmsbuild please test

swagata87 avatar May 30 '24 13:05 swagata87

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-47eeef/39629/summary.html
COMMIT: d0d6410c623ff4e03e0b24526a80d768dfc29df3
CMSSW: CMSSW_14_1_X_2024-05-30-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45079/39629/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3338862
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3338839
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found
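
As a quick sanity check on the DQMHistoTests numbers above, the following sketch verifies that the reported categories add up to the total and computes the failure fraction. The numbers are copied from the summary; the assumption that the total equals failures + nulls + successes + skipped is ours (it happens to hold for these numbers) and is not taken from the bot's documentation.

```python
# Numbers copied from the comparison summary above.
summary = {
    "total": 3338862,
    "failures": 3,
    "nulls": 0,
    "successes": 3338839,
    "skipped": 20,
}


def accounted(s):
    """Sum of the per-outcome counts (assumed to partition the total)."""
    return s["failures"] + s["nulls"] + s["successes"] + s["skipped"]


def failure_fraction(s):
    """Fraction of compared histograms that failed the comparison."""
    return s["failures"] / s["total"]


consistent = accounted(summary) == summary["total"]
fraction = failure_fraction(summary)
print(f"consistent: {consistent}, failure fraction: {fraction:.2e}")
```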

cmsbuild avatar May 30 '24 15:05 cmsbuild

kind ping @cms-sw/dqm-l2

waredjeb avatar Jun 03 '24 14:06 waredjeb

Hi @waredjeb,

We have tested this PR against vanilla CMSSW_14_0_7 + PRs 45007 and 45027, using run 381069 on the DQM playback machine. All clients on playback have ended gracefully so far.

For your convenience, you can check the logs at DQM^2 mirror here: https://cmsweb.cern.ch/dqm/dqm-square/?run=514356&db=playback and DQM Online Playback GUI here: https://cmsweb.cern.ch/dqm/online-playback

Best regards, Vichayanun for DQM Core team

vicha-w avatar Jun 04 '24 08:06 vicha-w

Dear Authors, just a note: when we tested with older DQM streamers, from run 379530, we saw ProductNotFound errors (see the error log below) and the pfgpu client crashed:

----- Begin Fatal Exception 04-Jun-2024 09:04:52 CEST-----------------------
An exception of category 'ProductNotFound' occurred while
   [0] Processing Event run: 379866 lumi: 718 event: 826903685 stream: 0
   [1] Running path 'tasksPath'
   [2] Calling method for module PFHcalGPUComparisonTask/'pfHcalGPUComparisonTask'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: std::vector<reco::PFCluster>
Looking for module label: hltParticleFlowClusterHCALSerialSync
Looking for productInstanceName:

If you agree, we suggest adding the following to process.options:

TryToContinue = cms.untracked.vstring( 'ProductNotFound' )

syuvivida avatar Jun 04 '24 09:06 syuvivida

Dear @syuvivida, indeed, the collection was not saved in the event back then. Thanks for checking, I can add the line you suggested!

waredjeb avatar Jun 04 '24 09:06 waredjeb

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40467

  • This PR adds an extra 20KB to repository

cmsbuild avatar Jun 04 '24 10:06 cmsbuild

Pull request #45079 was updated. @antoniovagnerini, @syuvivida, @rvenditti, @nothingface0, @tjavaid, @cmsbuild can you please check and sign again.

cmsbuild avatar Jun 04 '24 10:06 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45079/40468

  • This PR adds an extra 20KB to repository

cmsbuild avatar Jun 04 '24 10:06 cmsbuild

Pull request #45079 was updated. @tjavaid, @syuvivida, @rvenditti, @antoniovagnerini, @nothingface0, @cmsbuild can you please check and sign again.

cmsbuild avatar Jun 04 '24 10:06 cmsbuild

Hi @waredjeb,

Thanks to the new fixes you just made, run 379530 now ended gracefully on playback machines.

Here is the status on DQM^2: https://cmsweb.cern.ch/dqm/dqm-square/?run=514378&db=playback

Best regards, Vichayanun for DQM Core team

vicha-w avatar Jun 04 '24 11:06 vicha-w

please test

tjavaid avatar Jun 04 '24 12:06 tjavaid

If you agree, we suggest adding the following to process.options: TryToContinue = cms.untracked.vstring( 'ProductNotFound' )

I am sorry to chime in, but I have a question about this. If the input collections do not match expectations, it means that the input stream from HLT somehow does not contain the expected event content. If you let the client fail silently, who is going to propagate the information that something is wrong with the HLT stream? For instance, we recently noticed that the pixel GPU client was not using the right collections, and until we submitted https://github.com/cms-sw/cmssw/pull/44933 we had no monitoring. Apparently no one was checking the histograms at P5 (not offline). How do we make sure things like this get noticed?
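
The trade-off discussed here can be illustrated with a toy event loop in plain Python. This is not the CMSSW framework and does not reproduce the exact semantics of TryToContinue; the `run_events` function, the dictionary-based events, and the `missing` counter are assumptions made for illustration. The point is that tolerating a missing product does not have to mean hiding it: the loop keeps going but counts every event where the product was absent, so the count can be surfaced to whoever monitors the client.

```python
class ProductNotFound(Exception):
    """Stand-in for the framework's 'ProductNotFound' exception category."""


def run_events(events, required_label, try_to_continue=()):
    """Process events, optionally tolerating a missing input product.

    Returns (processed, missing): events handled normally, and events
    skipped because the required product was absent.
    """
    processed = 0
    missing = 0
    for event in events:
        try:
            if required_label not in event:
                raise ProductNotFound(required_label)
            processed += 1
        except ProductNotFound:
            if "ProductNotFound" not in try_to_continue:
                raise  # default behaviour in this toy: the job crashes
            missing += 1  # tolerated, but still counted
    return processed, missing


label = "hltParticleFlowClusterHCALSerialSync"
events = [{label: [1.0]}, {}, {label: []}]  # second event lacks the product

processed, missing = run_events(
    events, label, try_to_continue=("ProductNotFound",)
)
print(f"processed={processed}, missing={missing}")
```

Without the `try_to_continue` argument, the same call raises on the second event, which mirrors the crash reported earlier in the thread.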

mmusich avatar Jun 04 '24 13:06 mmusich

Hi @mmusich, if we want the online shifters to check the pixel GPU client's results, the Tracker group needs to update or implement the instruction Twiki page, and also include the plots in the shift page of DQMGUI.

syuvivida avatar Jun 04 '24 14:06 syuvivida

@syuvivida

If we want the online shifters to check the pixel GPU client's results, the Tracker group needs to update or implement the instruction Twiki page, and also include the plots in the shift page of DQMGUI.

Thanks, but this doesn't answer the general question. E.g. for this client, does the PF group plan on providing such instructions, etc.?

mmusich avatar Jun 04 '24 14:06 mmusich

If updating the Twiki is required, we will provide these kinds of instructions. @syuvivida, could you please send us the Twiki page that needs to be updated and maybe instructions for adding the plots in the DQMGUI? Thanks a lot

waredjeb avatar Jun 04 '24 14:06 waredjeb

Hello, I will send you this information by email.

syuvivida avatar Jun 04 '24 14:06 syuvivida

Hello, I will send you this information by email.

can you please keep me in the loop? thx.

mmusich avatar Jun 04 '24 14:06 mmusich

DQM/PFTasks (****)

@waredjeb you also need to make a PR to https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign the (DQM) category.
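
As a rough illustration of what assigning a category involves, the sketch below assumes the mapping is a dictionary from category name to a list of package names. The dictionary contents and the helper functions `assign_category` and `uncategorized` are hypothetical and are not copied from categories_map.py; the actual file should be checked before making the PR.

```python
# Hypothetical excerpt; the real mapping in cms-bot may be laid out differently.
categories = {
    "dqm": ["DQM/Integration"],
    "reconstruction": [],
}


def uncategorized(packages, categories):
    """Return the packages that do not appear under any category."""
    known = {pkg for pkgs in categories.values() for pkg in pkgs}
    return [pkg for pkg in packages if pkg not in known]


def assign_category(categories, package, category):
    """Add a package to a category, keeping the list sorted and unique."""
    pkgs = categories.setdefault(category, [])
    if package not in pkgs:
        pkgs.append(package)
        pkgs.sort()
    return categories


pr_packages = ["DQM/Integration", "DQM/PFTasks"]
before = uncategorized(pr_packages, categories)  # packages lacking a category
assign_category(categories, "DQM/PFTasks", "dqm")
after = uncategorized(pr_packages, categories)
```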

mmusich avatar Jun 04 '24 14:06 mmusich

Sure, I was waiting for the merge of this PR.

waredjeb avatar Jun 04 '24 14:06 waredjeb

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-47eeef/39687/summary.html
COMMIT: b9a0564229c067d9efba398f21731d71cd9c3d6b
CMSSW: CMSSW_14_1_X_2024-06-04-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/45079/39687/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 2 lines from the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3338862
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3338836
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Jun 04 '24 14:06 cmsbuild

Sure, I was waiting for the merge of this PR.

My understanding is that it needs to be done before merging the PR, so that the corresponding L2 maintainers of the new package can sign off on it too.

mmusich avatar Jun 04 '24 15:06 mmusich