Alpaka vs CUDA DQM compare modules for pixel tracks objects

Open borzari opened this issue 1 year ago • 85 comments

PR description:

This implements DQM modules for comparing pixel tracks objects reconstructed using Alpaka and CUDA. These modules can be used for any possible comparisons using pixel tracks: CUDA CPU vs CUDA GPU, Alpaka Serial vs Alpaka Device and Alpaka vs CUDA (both Serial/CPU and Device/GPU). A new process modifier chain, alpakaCUDAValidation, was added to turn on both reconstructions and the Alpaka vs CUDA DQM modules when needed.

A new workflow offset, .5041, was added to run DQM with all the possible comparisons using pixel tracks objects. A new customization to run Alpaka vs CUDA comparisons in HLT was also created.

These changes will be backported to CMSSW_14_0_X for the special validation of the Alpaka vs CUDA reconstruction.

PR validation:

The modifications were tested using wfs 12434.403 (Alpaka validation), 12434.5041 (Alpaka vs CUDA validation), and 12434.503 (CUDA validation). All comparisons follow the expectations.

borzari avatar Feb 14 '24 16:02 borzari

cms-bot internal usage

cmsbuild avatar Feb 14 '24 16:02 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43964/38859

  • This PR adds an extra 1168KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): #33532, #43701, #43761
    • File DQM/SiPixelHeterogeneous/plugins/SiPixelCompareVertexSoAAlpaka.cc modified in PR(s): #43952
    • File RecoLocalTracker/SiPixelRecHits/python/SiPixelRecHits_cfi.py modified in PR(s): #37952
    • File RecoTracker/PixelVertexFinding/plugins/alpaka/vertexFinder.dev.cc modified in PR(s): #43952

cmsbuild avatar Feb 14 '24 16:02 cmsbuild

A new Pull Request was created by @borzari for master.

It involves the following packages:

  • Configuration/ProcessModifiers (operations)
  • Configuration/PyReleaseValidation (upgrade, pdmv)
  • DQM/SiPixelHeterogeneous (dqm)
  • DataFormats/TrackSoA (reconstruction, heterogeneous)
  • DataFormats/TrackingRecHitSoA (reconstruction, heterogeneous)
  • HLTrigger/Configuration (hlt)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)
  • RecoTracker/Configuration (reconstruction)
  • RecoTracker/PixelSeeding (reconstruction)
  • RecoTracker/PixelTrackFitting (reconstruction)
  • RecoTracker/PixelVertexFinding (reconstruction)
  • RecoVertex/BeamSpotProducer (alca, reconstruction)

@mmusich, @syuvivida, @AdrianoDee, @saumyaphor4252, @miquork, @nothingface0, @perrotta, @tjavaid, @antoniovilela, @fabiocos, @cmsbuild, @sunilUIET, @mandrenguyen, @davidlange6, @fwyzard, @Martin-Grunewald, @makortel, @rappoccio, @consuegs, @rvenditti, @jfernan2, @antoniovagnerini, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks.

@idebruyn, @mmusich, @GiacomoSguazzoni, @mroguljic, @yuanchao, @JanFSchulte, @francescobrivio, @jandrea, @rsreds, @missirol, @silviodonato, @VinInn, @dgulhan, @dkotlins, @fabiocos, @tvami, @fioriNTU, @felicepantaleo, @VourMa, @Martin-Grunewald, @makortel, @gpetruc, @mtosi, @slomeo, @tocheng, @threus, @ferencek, @rovere this is something you requested to watch as well.

@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

cmsbuild avatar Feb 14 '24 16:02 cmsbuild

-hlt

mmusich avatar Feb 14 '24 16:02 mmusich

-1 No modification of the HLT menu dumps in HLTrigger/Configuration. AFAIK all Alpaka modifications are to be coded in the Alpaka customisation routine in HLTrigger/Configuration/python/customizeHLTforAlpaka.py.

Martin-Grunewald avatar Feb 14 '24 17:02 Martin-Grunewald

Also, it is unclear why DQM comparisons affect the HLT menu; I think it needs to be discussed how many modules we add and execute for this comparison to the HLT menu. Will this also need to run at P5, or only for offline workflows? If the former, it needs to be assessed for timing. If the latter, it definitely does NOT go in the HLT menus in HLTrigger/Configuration, and has to remain a DQM specific customisation routine called in a cmsDriver DQM step and put in some DQM subsystem.

Martin-Grunewald avatar Feb 14 '24 17:02 Martin-Grunewald

Also, it is unclear why DQM comparisons affect the HLT menu; I think it needs to be discussed how many modules we add and execute for this comparison to the HLT menu. Will this also need to run at P5 or only for offline workflows? If the former, it needs to be assessed for timing. If the latter, it definitely does NOT go in the menus and has to remain a DQM specific customisation routine called in a cmsDriver DQM step.

This is because there is a path for CPU vs GPU checks, DQM_PixelReconstruction_v, that uses the comparison modules. In this case we are not adding anything; we just changed the module so that it can, in general, do CUDA vs CUDA, CUDA vs Alpaka and Alpaka vs Alpaka comparisons.

AdrianoDee avatar Feb 14 '24 17:02 AdrianoDee

Only one path in the HLT menu is modified? In any case, if you modify a path run in the HLT menu at P5, you need to file a CMSHLT JIRA request with the details.

Martin-Grunewald avatar Feb 14 '24 17:02 Martin-Grunewald

Please clarify: you are updating offline relval workflows, so none of this needs to run during data taking at P5? Or is there a (small?) set of changes required at HLT level during data taking such that these DQM comparisons run later can succeed?

Martin-Grunewald avatar Feb 14 '24 17:02 Martin-Grunewald

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43964/38861

  • This PR adds an extra 192KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): #33532, #43701, #43761
    • File RecoLocalTracker/SiPixelRecHits/python/SiPixelRecHits_cfi.py modified in PR(s): #37952
    • File RecoTracker/PixelVertexFinding/plugins/alpaka/vertexFinder.dev.cc modified in PR(s): #43952

cmsbuild avatar Feb 14 '24 17:02 cmsbuild

Pull request #43964 was updated. @subirsarkar, @AdrianoDee, @davidlange6, @makortel, @fwyzard, @cmsbuild, @antoniovilela, @consuegs, @srimanob, @nothingface0, @rappoccio, @sunilUIET, @Martin-Grunewald, @mandrenguyen, @syuvivida, @miquork, @fabiocos, @antoniovagnerini, @saumyaphor4252, @perrotta, @tjavaid, @jfernan2, @rvenditti, @mmusich can you please check and sign again.

cmsbuild avatar Feb 14 '24 17:02 cmsbuild

Old DQM compare modules for pixel rechits, tracks and vertices were removed.

does this still apply?

mmusich avatar Feb 14 '24 17:02 mmusich

Old DQM compare modules for pixel rechits, tracks and vertices were removed.

Overall I am confused by the number of comparisons we need to support. Could someone from the Alpaka migration team outline the foreseen pattern of deployment?

mmusich avatar Feb 14 '24 17:02 mmusich

Old DQM compare modules for pixel rechits, tracks and vertices were removed.

does this still apply?

No, I just removed it. Thanks!

borzari avatar Feb 14 '24 18:02 borzari

Old DQM compare modules for pixel rechits, tracks and vertices were removed.

Overall I am confused by the number of comparisons we need to support. Could someone from the Alpaka migration team outline the foreseen pattern of deployment?

The idea is to add modules that compare Alpaka and CUDA reconstructed objects (pixel hits, tracks and vertices in this case) for validation of the Alpaka pixel tracks. We also used the opportunity to address some comments from the previous PRs of the migration, which noted that the two compare modules, one for CUDA SoAs and one for Alpaka SoAs, could be generalized. Also, the amount of comparisons is temporary, since the CUDA modules will be retired at some point and the Alpaka modules can be replaced by the generalized ones (which are currently not very general: they work for all the CUDA and/or Alpaka cases, but the inputs still refer to CUDA/Alpaka SoAs).
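To illustrate what a "generalized" compare module means here, the following is a minimal Python sketch, not the actual C++ DQM plugin. It assumes that both collections can be viewed through the same column names, so a single matching routine serves the CUDA vs CUDA, Alpaka vs Alpaka and Alpaka vs CUDA cases. The `TrackColumns` type, the field names and the default ΔR² cut are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TrackColumns:
    """Hypothetical structure-of-arrays stand-in: one list per track quantity."""
    pt: List[float]
    eta: List[float]
    phi: List[float]


def delta_r2(eta1: float, phi1: float, eta2: float, phi2: float) -> float:
    """Squared angular distance, with the phi difference wrapped to [-pi, pi]."""
    deta = eta1 - eta2
    dphi = math.remainder(phi1 - phi2, 2.0 * math.pi)
    return deta * deta + dphi * dphi


def match_tracks(ref: TrackColumns, tgt: TrackColumns,
                 dr2_max: float = 0.01) -> List[Tuple[int, Optional[int]]]:
    """For each reference track, return the index of the closest target track
    within dr2_max (an assumed cut), or None if no target is close enough."""
    matches = []
    for i in range(len(ref.pt)):
        best_j, best_dr2 = None, dr2_max
        for j in range(len(tgt.pt)):
            dr2 = delta_r2(ref.eta[i], ref.phi[i], tgt.eta[j], tgt.phi[j])
            if dr2 < best_dr2:
                best_j, best_dr2 = j, dr2
        matches.append((i, best_j))
    return matches
```

Because the routine only relies on the shared column names, the backend that produced each collection does not matter to it; in this sketch only the adapter that fills `TrackColumns` would differ between CUDA and Alpaka inputs.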

borzari avatar Feb 14 '24 18:02 borzari

Please clarify: you are updating offline relval workflows, so none of this needs to run during data taking at P5? Or is there a (small?) set of changes required at HLT level during data taking such that these DQM comparisons run later can succeed?

@Martin-Grunewald I reverted the changes that were removing the DQM modules that are currently being used in the HLT configuration. No change should be touching the HLT menu now, and anything "extra" is added via customizations.

borzari avatar Feb 14 '24 18:02 borzari

Also, the amount of comparisons is temporary, since the CUDA modules will be retired at some point

this is what is not clear to me. Can you or @AdrianoDee elaborate on the plan? E.g. in terms of which menu versions this should be fit in and retired.

mmusich avatar Feb 14 '24 18:02 mmusich

tl;dr

Online DQM

  • within the HLT menu, we should not do any comparisons
  • the HLT menu should run a dedicated CPU-only version of the alpaka modules on 1% of the events, and store the SoA collections¹ in the DQMGPUvsCPU output stream
  • the online DQM running at P5 should run a comparison of the alpaka vs alpaka-cpu results

The goal of this comparison is to ensure that the HLT menu running on GPU or on CPU gives equivalent results. The validation of the legacy modules is outside the scope.
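As a rough picture of the "1% of the events" selection mentioned above, here is a minimal Python sketch of a counter-based prescale. It is only an illustration under the assumption of a fixed prescale of 100; the function name `make_prescaler` is hypothetical, and the real selection is configured in the HLT menu rather than written like this.

```python
def make_prescaler(prescale: int):
    """Return a callable that accepts one event out of every `prescale` events."""
    if prescale < 1:
        raise ValueError("prescale must be a positive integer")
    seen = 0

    def accept() -> bool:
        nonlocal seen
        seen += 1
        # Accept the event whenever the running count is a multiple of the prescale.
        return seen % prescale == 0

    return accept
```

In this picture, only the accepted events would run the extra CPU-only modules and be written to the DQMGPUvsCPU stream, which keeps the added cost roughly proportional to the accepted fraction.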

Regular relvals with 14.1.x and later

Same as above.

In addition, for modules that have a non-alpaka based CPU implementation (ECAL, HCAL, maybe Pixel clustering, not Pixel tracks) we may consider doing also an alpaka-vs-legacy comparison. See the next point for possible options.

Special validation of CMSSW 14.0.0

Only for CMSSW 14.0.0 we plan to do a special validation of the Alpaka vs CUDA reconstruction. This will not be part of the HLT menu.

A simple approach is to run four workflows over the same RAW data (real data or MC)

  • legacy CPU
  • legacy CUDA
  • alpaka CPU
  • alpaka CUDA

and compare the trigger results and the usual (?) DQM plots.
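The trigger-result part of that comparison can be sketched in a few lines of Python. This is a hedged illustration only: it assumes each workflow's output has already been reduced to a mapping from event id to the set of accepted path names, and the `disagreements` function and that reduced format are hypothetical, not an existing CMSSW tool.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# Assumed reduced format: event id -> names of the paths that accepted the event.
TriggerResults = Dict[int, FrozenSet[str]]


def disagreements(
    results: Dict[str, TriggerResults],
) -> Dict[Tuple[str, str], List[int]]:
    """For every pair of workflows, list the common events whose accepted
    paths differ between the two workflows."""
    out = {}
    for a, b in combinations(sorted(results), 2):
        common = results[a].keys() & results[b].keys()
        out[(a, b)] = sorted(e for e in common if results[a][e] != results[b][e])
    return out
```

With the four workflows listed above as keys ("legacy CPU", "legacy CUDA", "alpaka CPU", "alpaka CUDA"), this yields six pairwise lists; an empty list for a pair means the two workflows agree on every common event.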

It's up to the individual DPGs and POGs if they also want to implement a dedicated event-by-event comparison and the corresponding workflows.


¹ this was not possible with the old dataformats; it is possible with the alpaka-based portable collections.

fwyzard avatar Feb 14 '24 19:02 fwyzard

@fwyzard summarized everything before (and better than) I could. Let me add a point/question. For Online DQM, as said, we will run the Alpaka version of what we run now in CUDA on 1% of the events. With what is in customizeHLTForAlpaka (specifically in customizeHLTforDQMGPUvsCPUPixel) there's everything that's needed to "port" that part to Alpaka (at least for the cms-sw menu). But I imagine a JIRA would still be needed from the POG to implement this in the "real" menu.

In this PR we have restructured the various DQM modules to be able to do all the possible comparisons mentioned above (CUDA, Alpaka, CUDA vs Alpaka) with a single module (following also the suggestion https://github.com/cms-sw/cmssw/pull/41288#discussion_r1445426784, listed in https://github.com/cms-sw/cmssw/issues/43796#issuecomment-1912381807). So all the fuss arose because we thought we would need to "fix" the current menu to use the CUDA branch of these new modules. But since

  1. we are keeping the CUDA modules as they are;
  2. the transition will (would) be directly done to the new Alpaka comparison DQM;

I don't think we need to implement a special customizer for HLT to run with this new version of the CUDA modules.

AdrianoDee avatar Feb 14 '24 19:02 AdrianoDee

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43964/38866

  • This PR adds an extra 188KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): #33532, #43701, #43761
    • File DQM/SiPixelHeterogeneous/plugins/SiPixelCompareVertexSoAAlpaka.cc modified in PR(s): #43952
    • File RecoLocalTracker/SiPixelRecHits/python/SiPixelRecHits_cfi.py modified in PR(s): #37952
    • File RecoTracker/PixelVertexFinding/plugins/alpaka/vertexFinder.dev.cc modified in PR(s): #43952

cmsbuild avatar Feb 14 '24 20:02 cmsbuild

Pull request #43964 was updated. @consuegs, @Martin-Grunewald, @cmsbuild, @mandrenguyen, @subirsarkar, @davidlange6, @srimanob, @jfernan2, @antoniovagnerini, @rvenditti, @fabiocos, @sunilUIET, @AdrianoDee, @mmusich, @tjavaid, @syuvivida, @rappoccio, @saumyaphor4252, @miquork, @makortel, @perrotta, @antoniovilela, @fwyzard, @nothingface0 can you please check and sign again.

cmsbuild avatar Feb 14 '24 20:02 cmsbuild

test parameters:

  • workflow_opts_gpu= -w upgrade
  • workflows_gpu= 12434.403, 12434.405, 12434.503

AdrianoDee avatar Feb 14 '24 21:02 AdrianoDee

enable gpu

AdrianoDee avatar Feb 14 '24 21:02 AdrianoDee

please test

AdrianoDee avatar Feb 14 '24 21:02 AdrianoDee

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3adb3d/37466/summary.html
COMMIT: c73f58a37d82122b53700f523bb0cb99b2bb4205
CMSSW: CMSSW_14_1_X_2024-02-14-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/43964/37466/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found compilation error when building:

>> Compiling edm plugin src/RecoTauTag/HLTProducers/src/TauTagFilter.cc
>> Compiling edm plugin src/RecoTauTag/HLTProducers/src/TrackingRegionsFromBeamSpotAndL2Tau.cc
>> Compiling edm plugin src/RecoTauTag/HLTProducers/src/TrackingRegionsFromBeamSpotAndL2TauEDProducer.cc
>> Compiling edm plugin src/RecoTauTag/HLTProducers/src/VertexFromTrackProducer.cc
src/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc: In member function 'void L2TauNNProducerAlpaka::selectGoodTracksAndVertices(const ZVertexHost&, const TracksHost&, std::vector&, std::vector&)':
src/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc:580:28: error: 'TracksUtilities' does not name a type
  580 |   using patatrackHelpers = TracksUtilities<:phase1>;
      |                            ^~~~~~~~~~~~~~~
src/RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc:594:18: error: 'patatrackHelpers' has not been declared
  594 |     auto nHits = patatrackHelpers::nHits(patatracks_tsoa.view(), trk_idx);
      |                  ^~~~~~~~~~~~~~~~

cmsbuild avatar Feb 14 '24 23:02 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43964/38875

  • This PR adds an extra 200KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

    • File Configuration/PyReleaseValidation/python/upgradeWorkflowComponents.py modified in PR(s): #33532, #43761
    • File DQM/SiPixelHeterogeneous/plugins/SiPixelCompareVertexSoAAlpaka.cc modified in PR(s): #43952
    • File RecoLocalTracker/SiPixelRecHits/python/SiPixelRecHits_cfi.py modified in PR(s): #37952
    • File RecoTauTag/HLTProducers/src/L2TauTagNNProducerAlpaka.cc modified in PR(s): #43952
    • File RecoTracker/PixelVertexFinding/plugins/alpaka/vertexFinder.dev.cc modified in PR(s): #43952

cmsbuild avatar Feb 14 '24 23:02 cmsbuild

Pull request #43964 was updated. @sunilUIET, @antoniovilela, @Martin-Grunewald, @tjavaid, @fabiocos, @consuegs, @davidlange6, @mmusich, @rappoccio, @syuvivida, @jfernan2, @rvenditti, @fwyzard, @subirsarkar, @saumyaphor4252, @mandrenguyen, @miquork, @cmsbuild, @srimanob, @antoniovagnerini, @makortel, @AdrianoDee, @nothingface0, @perrotta can you please check and sign again.

cmsbuild avatar Feb 14 '24 23:02 cmsbuild

Hi all,

I find the location and changes to some of the files in this PR problematic. HLTrigger/Configuration should deal with the HLT as run at P5 and offline, not any subsequent steps such as DQM or validation.

Temporarily adding customisation files such as HLTrigger/Configuration/python/customizeHLTforAlpaka.py is OK, as eventually these changes are moved into ConfDb and the menu dumps in CMSSW, and then HLTrigger/Configuration/python/customizeHLTforAlpaka.py will be removed (planned for early March).

However, DQM-related files as in this PR should NOT be in HLTrigger/Configuration, but rather in some DQM/ or DQMOffline/ subsystem/package, i.e., the file HLTrigger/Configuration/python/customizeHLTforAlpakavsCUDA.py added here should be moved elsewhere, and the hooks added in HLTrigger/Configuration/python/customizeHLTforCMSSW.py should be removed and moved elsewhere to some DQM location.

Martin-Grunewald avatar Feb 15 '24 07:02 Martin-Grunewald

How about HLTriggerOffline/Common/python/ ?

fwyzard avatar Feb 15 '24 07:02 fwyzard

Ah yes! Or make HLTriggerOffline/Alpaka/...

Martin-Grunewald avatar Feb 15 '24 08:02 Martin-Grunewald