New Heterogeneous Memory Pool

Open VinInn opened this issue 2 years ago • 42 comments

This PR replaces the old "notcub" cache allocator with a memory pool featuring

  • lock-free operations
  • a backend-agnostic implementation

The data interface is based on a simple Buffer that is completely backend agnostic. The allocation interface (makeBuffer) currently depends on cudaStream_t, which can easily be hidden behind void * or a light opaque struct.

A new feature is a "Bundle deleter": buffers can be bundled together and then freed in just one operation, which reduces the number of CUDA calls.

All previous users of the cache allocator (at least for the Pixel workflow) have been migrated.
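To illustrate the "Bundle deleter" idea, here is a minimal host-only sketch in plain C++. It is not the code of this PR: `ToyPool`, `Bundle`, `BundleDeleter` and this simplified `makeBuffer` signature are hypothetical names chosen for illustration, `std::malloc`/`std::free` stand in for the real backend, and the stream argument is left out entirely. The only point it tries to show is that each buffer's deleter hands its block to a shared bundle, and the bundle releases all blocks in a single backend operation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pool backend; it only counts release operations.
struct ToyPool {
  int releaseCalls = 0;  // how many times the backend was asked to free
  int liveBlocks = 0;    // blocks handed out and not yet freed

  void* allocate(std::size_t bytes) {
    void* p = std::malloc(bytes);
    assert(p != nullptr);
    ++liveBlocks;
    return p;
  }

  // Free a whole group of blocks in one backend operation.
  void release(std::vector<void*> const& blocks) {
    ++releaseCalls;
    for (void* p : blocks) {
      std::free(p);
      --liveBlocks;
    }
  }
};

// Hypothetical bundle: collects blocks and frees them together on destruction.
class Bundle {
public:
  explicit Bundle(ToyPool& pool) : pool_(&pool) {}
  Bundle(Bundle const&) = delete;
  Bundle& operator=(Bundle const&) = delete;
  ~Bundle() {
    if (!blocks_.empty())
      pool_->release(blocks_);
  }
  void add(void* p) { blocks_.push_back(p); }

private:
  ToyPool* pool_;  // must outlive the bundle
  std::vector<void*> blocks_;
};

// Deleter stored in each buffer: it defers the free to the shared bundle.
struct BundleDeleter {
  std::shared_ptr<Bundle> bundle;
  void operator()(void* p) const {
    if (p)
      bundle->add(p);
  }
};

template <typename T>
using Buffer = std::unique_ptr<T, BundleDeleter>;

// Simplified, assumed signature: no stream argument in this toy version.
template <typename T>
Buffer<T> makeBuffer(ToyPool& pool, std::size_t n, std::shared_ptr<Bundle> bundle) {
  T* p = static_cast<T*>(pool.allocate(n * sizeof(T)));
  return Buffer<T>(p, BundleDeleter{std::move(bundle)});
}
```

In this sketch the bundle is kept alive by the deleters themselves, so the single `release` call happens when the last buffer of the bundle goes out of scope. With a GPU backend the same pattern would presumably replace one call per buffer with one call per bundle.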

Tests pass: it is not slower than the previous implementation. A free machine is needed to make definitive tests.

Some cleanup is still required to remove debug statements.

Purely technical: no regression expected.

Draft Slides for a possible presentation available @ https://cernbox.cern.ch/index.php/s/Ax4NHYGLHbG8N1C

VinInn avatar May 15 '22 13:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37952/30020

  • This PR adds an extra 232KB to repository

  • Found files with invalid states:

    • HeterogeneousCore/CUDAUtilities/src/cudaMemoryPool.cu:
      • Added: f37385a552b3045133ef6c42fef34d1f1e7d737a
      • Modified: 75ca0dbe0f56fffb8998d35429c6978f5b461505, c429d13bf12445972e07f1a107b28ee63931fb88, c5e35f0ee22d34bafdf47a70b7e6c010eb52801f
      • Deleted: 21a646e3b8568eaed4d67f66f87c6f866c1754a6
    • CUDADataFormats/TrackingRecHit/interface/TrackingRecHit2DHeterogeneousImpl.h:
      • Added: 5291489f7c512567e2ea7a2c2492ed769113cdbb
      • Modified: 1a43ba74652dfe596df6e440e3d58705d60342c1, e7d86325fee62412599be075901712ea5dd78571, c8d553a8399fbe6335428bd05524b7ca6842c5f2
      • Deleted: 521d4c0ba8cb2fddb313d4aec0448c32b8663780
    • HeterogeneousCore/CUDAUtilities/interface/cudaMemoryPoolImpl.h:
      • Added: 5291489f7c512567e2ea7a2c2492ed769113cdbb
      • Modified: 1a43ba74652dfe596df6e440e3d58705d60342c1, 59bcb2be4ef6e9932face966de0a60c914d5a8ed, 29df6e20a122c8328831f9d2594e8630d9f43a45, 849da8c5c5aeab4d4bf77ecd8daead996dfe4da7, e7d86325fee62412599be075901712ea5dd78571, b4f4d467756c6b8373a46a1d22a654bd54c4e742, 8b149edfa447eec39e9548d1f32f0eb0df384d2c
      • Deleted: 1487b8809a1ec08bea1eb831cb3a0de6d545ee45
    • CUDADataFormats/SiPixelDigi/interface/SiPixelDigisCUDAImpl.h:
      • Added: 3ae45f7de8cd397b8b61b0430e4f402839a7dbbc
      • Modified: b4f4d467756c6b8373a46a1d22a654bd54c4e742, c8d553a8399fbe6335428bd05524b7ca6842c5f2
      • Deleted: 9402cb72c38ad7e988131bdbf3cf8bd1bdfde11b
    • CUDADataFormats/TrackingRecHit/src/TrackingRecHit2DHeterogeneous.cc:
      • Modified: 6b050bde0222f1b9eea7cb60d96e16320b4e9364, 0e49a367337570dd2e867ee929420c4e29b288ea, e8e9c0ff9d0939da0a325e82aea15ed2941a6f02, 88da3bc1c3f19de3c87ee87667d77d6ac7c35e85, b4f4d467756c6b8373a46a1d22a654bd54c4e742, 521d4c0ba8cb2fddb313d4aec0448c32b8663780
      • Deleted: 21a646e3b8568eaed4d67f66f87c6f866c1754a6
      • Added: e7d86325fee62412599be075901712ea5dd78571
  • There are other open Pull requests which might conflict with changes you have proposed:

    • File HeterogeneousCore/CUDAServices/src/CUDAService.cc modified in PR(s): #37831
    • File HeterogeneousCore/CUDAUtilities/test/BuildFile.xml modified in PR(s): #35713
    • File RecoLocalTracker/SiPixelRecHits/plugins/PixelRecHitGPUKernel.cu modified in PR(s): #35713
    • File RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromLegacy.cc modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cc modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.h modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernelsAlloc.cc modified in PR(s): #35713

cmsbuild avatar May 15 '22 14:05 cmsbuild

A new Pull Request was created by @VinInn (Vincenzo Innocente) for master.

It involves the following packages:

  • CUDADataFormats/BeamSpot (heterogeneous, reconstruction)
  • CUDADataFormats/Common (heterogeneous)
  • CUDADataFormats/SiPixelDigi (heterogeneous, reconstruction)
  • CUDADataFormats/Track (heterogeneous, reconstruction)
  • CUDADataFormats/TrackingRecHit (heterogeneous, reconstruction)
  • CUDADataFormats/Vertex (heterogeneous, reconstruction)
  • EventFilter/SiPixelRawToDigi (reconstruction)
  • HeterogeneousCore/CUDACore (heterogeneous)
  • HeterogeneousCore/CUDAServices (heterogeneous)
  • HeterogeneousCore/CUDAUtilities (heterogeneous)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)
  • RecoPixelVertexing/PixelTrackFitting (reconstruction)
  • RecoPixelVertexing/PixelTriplets (reconstruction)
  • RecoPixelVertexing/PixelVertexFinding (reconstruction)
  • RecoVertex/BeamSpotProducer (reconstruction, alca)

@malbouis, @yuanchao, @makortel, @slava77, @clacaputo, @cmsbuild, @fwyzard, @jpata, @tvami, @francescobrivio can you please review it and eventually sign? Thanks. @tvami, @makortel, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @Martin-Grunewald, @missirol, @OzAmram, @tocheng, @ferencek, @mtosi, @gpetruc, @mmusich, @dkotlins, @threus, @dgulhan, @francescobrivio this is something you requested to watch as well. @perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

cmsbuild avatar May 15 '22 14:05 cmsbuild

@cmsbuild , please test

VinInn avatar May 15 '22 15:05 VinInn

enable gpu

VinInn avatar May 15 '22 15:05 VinInn

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-651b42/24728/summary.html
COMMIT: b8d0837f4924cb88f991b367d2ffbec85e631b7f
CMSSW: CMSSW_12_4_X_2022-05-15-0000/slc7_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37952/24728/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test cpuVertexFinderByDensity_t had ERRORS
---> test cpuVertexFinderIterative_t had ERRORS

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19874
  • DQMHistoTests: Total failures: 1171
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 18702
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: found differences in 3 / 3 workflows

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-651b42/11634.301_TTbar_14TeV+2021_Run3FS+TTbar_14TeV_TuneCP5_GenSim+HARVESTNano

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3741432
  • DQMHistoTests: Total failures: 92
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3741318
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar May 15 '22 20:05 cmsbuild

@cmsbuild , please test

VinInn avatar May 16 '22 06:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37952/30028

  • This PR adds an extra 236KB to repository

  • Found files with invalid states:

    • HeterogeneousCore/CUDAUtilities/src/cudaMemoryPool.cu:
      • Added: f37385a552b3045133ef6c42fef34d1f1e7d737a
      • Modified: 75ca0dbe0f56fffb8998d35429c6978f5b461505, c429d13bf12445972e07f1a107b28ee63931fb88, c5e35f0ee22d34bafdf47a70b7e6c010eb52801f
      • Deleted: 21a646e3b8568eaed4d67f66f87c6f866c1754a6
    • CUDADataFormats/TrackingRecHit/interface/TrackingRecHit2DHeterogeneousImpl.h:
      • Added: 5291489f7c512567e2ea7a2c2492ed769113cdbb
      • Modified: 1a43ba74652dfe596df6e440e3d58705d60342c1, e7d86325fee62412599be075901712ea5dd78571, c8d553a8399fbe6335428bd05524b7ca6842c5f2
      • Deleted: 521d4c0ba8cb2fddb313d4aec0448c32b8663780
    • HeterogeneousCore/CUDAUtilities/interface/cudaMemoryPoolImpl.h:
      • Added: 5291489f7c512567e2ea7a2c2492ed769113cdbb
      • Modified: 1a43ba74652dfe596df6e440e3d58705d60342c1, 59bcb2be4ef6e9932face966de0a60c914d5a8ed, 29df6e20a122c8328831f9d2594e8630d9f43a45, 849da8c5c5aeab4d4bf77ecd8daead996dfe4da7, e7d86325fee62412599be075901712ea5dd78571, b4f4d467756c6b8373a46a1d22a654bd54c4e742, 8b149edfa447eec39e9548d1f32f0eb0df384d2c
      • Deleted: 1487b8809a1ec08bea1eb831cb3a0de6d545ee45
    • CUDADataFormats/SiPixelDigi/interface/SiPixelDigisCUDAImpl.h:
      • Added: 3ae45f7de8cd397b8b61b0430e4f402839a7dbbc
      • Modified: b4f4d467756c6b8373a46a1d22a654bd54c4e742, c8d553a8399fbe6335428bd05524b7ca6842c5f2
      • Deleted: 9402cb72c38ad7e988131bdbf3cf8bd1bdfde11b
    • CUDADataFormats/TrackingRecHit/src/TrackingRecHit2DHeterogeneous.cc:
      • Modified: 6b050bde0222f1b9eea7cb60d96e16320b4e9364, 0e49a367337570dd2e867ee929420c4e29b288ea, e8e9c0ff9d0939da0a325e82aea15ed2941a6f02, 88da3bc1c3f19de3c87ee87667d77d6ac7c35e85, b4f4d467756c6b8373a46a1d22a654bd54c4e742, 521d4c0ba8cb2fddb313d4aec0448c32b8663780
      • Deleted: 21a646e3b8568eaed4d67f66f87c6f866c1754a6
      • Added: e7d86325fee62412599be075901712ea5dd78571
  • There are other open Pull requests which might conflict with changes you have proposed:

    • File HeterogeneousCore/CUDAServices/src/CUDAService.cc modified in PR(s): #37831
    • File HeterogeneousCore/CUDAUtilities/test/BuildFile.xml modified in PR(s): #35713
    • File RecoLocalTracker/SiPixelRecHits/plugins/PixelRecHitGPUKernel.cu modified in PR(s): #35713
    • File RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromLegacy.cc modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cc modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.h modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernelsAlloc.cc modified in PR(s): #35713

cmsbuild avatar May 16 '22 07:05 cmsbuild

Pull request #37952 was updated. @malbouis, @yuanchao, @makortel, @slava77, @clacaputo, @fwyzard, @jpata, @tvami, @francescobrivio can you please check and sign again.

cmsbuild avatar May 16 '22 07:05 cmsbuild

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-651b42/24737/summary.html
COMMIT: 574cca4c553978aa1ea6919b2a41ac5c2f69a8bb
CMSSW: CMSSW_12_4_X_2022-05-15-2300/slc7_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37952/24737/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19874
  • DQMHistoTests: Total failures: 1172
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 18701
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: found differences in 3 / 3 workflows

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-651b42/11634.301_TTbar_14TeV+2021_Run3FS+TTbar_14TeV_TuneCP5_GenSim+HARVESTNano

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3741432
  • DQMHistoTests: Total failures: 86
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3741324
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar May 16 '22 13:05 cmsbuild

On May 18, 2022, at 09:30, Tamas Vami @.***> wrote:

 @tvami commented on this pull request.

In CUDADataFormats/BeamSpot/interface/BeamSpotCUDA.h:

class BeamSpotCUDA { public:

  • using Buffer = memoryPool::Buffer<BeamSpotPOD>;

Hi @VinInn, isn't this technically a namespace? According to rule 2.7 those should start with a lowercase letter.

This is a class alias (aka typedef). Will address the other comments later in the week. V.
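For readers following the alias-versus-namespace point, a small C++17 sketch may help. All names in it (`poolSketch`, `BeamSpotPODSketch`, `BeamSpotSketch`) are hypothetical stand-ins rather than the classes touched by this PR; it only shows that a `using X = ...;` declaration inside a class introduces a type name, equivalent to a `typedef`, whereas the namespace on the right-hand side is just a scope.

```cpp
#include <type_traits>

// Hypothetical stand-ins; the names below are illustrative, not the PR's code.
namespace poolSketch {  // a namespace only groups names; it is not a type
  template <typename T>
  struct Buffer {
    T* ptr = nullptr;
  };
}  // namespace poolSketch

struct BeamSpotPODSketch {
  float x, y, z;
};

class BeamSpotSketch {
public:
  // Type alias: a new name for an existing type, equivalent to
  // `typedef poolSketch::Buffer<BeamSpotPODSketch> Buffer;`.
  using Buffer = poolSketch::Buffer<BeamSpotPODSketch>;

  Buffer data;  // the alias is usable wherever a type is expected
};

// The alias and the aliased type are one and the same type.
static_assert(std::is_same_v<BeamSpotSketch::Buffer, poolSketch::Buffer<BeamSpotPODSketch>>,
              "alias must name the same type");
```

Because `Buffer` here names a class type, it can be used to declare members and variables, which a namespace name cannot.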

VinInn avatar May 18 '22 10:05 VinInn

Again in need of a rebase.

VinInn avatar May 22 '22 13:05 VinInn

@cmsbuild , please test

VinInn avatar May 22 '22 14:05 VinInn

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37952/30115

  • This PR adds an extra 224KB to repository

  • Found files with invalid states:

    • HeterogeneousCore/CUDAUtilities/src/cudaMemoryPool.cu:
      • Added: 39b1a121bb3430def21c7930f5f0134c6d946f4e
      • Modified: 26a67001f0ee4157081e10478b523d1474c5d409, 8e8a1abbaf6937065bdfc54ef9bbf04ff6f2c128, ead220d0ed2821d8130af78445408eff8db6392b
      • Deleted: eea72dc3319c5295ef2f54c6a2e71c28a07887be
    • CUDADataFormats/TrackingRecHit/interface/TrackingRecHit2DHeterogeneousImpl.h:
      • Added: f19e812de7b074c55d64d3295fc02eb63cdac1eb
      • Modified: 6f140eb5c98abc7f56646ca6c7f57ed867f06b91, d7aa3b39d17b315adce463edcf74f4cd10b17fbf, c7256ea46632c4c4bfa9ca8e82914e5ada53df29
      • Deleted: 0920241ed10a9d14c410fdc81a73eff10d881cbf
    • HeterogeneousCore/CUDAUtilities/interface/cudaMemoryPoolImpl.h:
      • Added: f19e812de7b074c55d64d3295fc02eb63cdac1eb
      • Modified: 6f140eb5c98abc7f56646ca6c7f57ed867f06b91, 67500afbd34e1ac947db321209885a91d0f989f4, 12d2efc459e81a784718ead1c585b4cba23489ad, d32b122d44b489586fbeb3c727cea2857d26032a, d7aa3b39d17b315adce463edcf74f4cd10b17fbf, b42eeaefe5c0add3ed6fc5fd59fd650f96922914, 1311dc48ac0598e5b50127dd595cf13c10940f0d
      • Deleted: b299bc7fa1b214679723acbd1aed102bbc80eeeb
    • CUDADataFormats/SiPixelDigi/interface/SiPixelDigisCUDAImpl.h:
      • Added: 031e68338eeb90e4504c66e4d97784615ff65e69
      • Modified: b42eeaefe5c0add3ed6fc5fd59fd650f96922914, c7256ea46632c4c4bfa9ca8e82914e5ada53df29
      • Deleted: 22d6c5b3931f7dfb3603528502c5a9b89b500640
    • CUDADataFormats/TrackingRecHit/src/TrackingRecHit2DHeterogeneous.cc:
      • Modified: f1e6ec9518744a417afe8ef6ebc584af3e91cd07, d752dc8fe9b84a27e57fc562eb8c9ff07cbf14cf, 8ddc45e2a327f97d254788392e6baee1f3f8f434, f13550c10493ec04b8486b5c17fea0dff85de9d8, b42eeaefe5c0add3ed6fc5fd59fd650f96922914, 0920241ed10a9d14c410fdc81a73eff10d881cbf
      • Deleted: eea72dc3319c5295ef2f54c6a2e71c28a07887be
      • Added: d7aa3b39d17b315adce463edcf74f4cd10b17fbf
  • There are other open Pull requests which might conflict with changes you have proposed:

    • File HeterogeneousCore/CUDAServices/src/CUDAService.cc modified in PR(s): #37831
    • File HeterogeneousCore/CUDAUtilities/test/BuildFile.xml modified in PR(s): #35713
    • File RecoLocalTracker/SiPixelRecHits/plugins/PixelRecHitGPUKernel.cu modified in PR(s): #35713
    • File RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromLegacy.cc modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cc modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.h modified in PR(s): #35713
    • File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernelsAlloc.cc modified in PR(s): #35713

cmsbuild avatar May 22 '22 14:05 cmsbuild

Pull request #37952 was updated. @malbouis, @yuanchao, @makortel, @slava77, @clacaputo, @fwyzard, @jpata, @tvami, @francescobrivio can you please check and sign again.

cmsbuild avatar May 22 '22 14:05 cmsbuild

-1

Failed Tests: RelVals-GPU
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-651b42/24895/summary.html
COMMIT: 2d0392890d2e9282c03eca5dc21741e1dc3ff091
CMSSW: CMSSW_12_5_X_2022-05-22-0000/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37952/24895/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-GPU

  • 11634.512_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
  • 11634.522_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
  • 11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3650985
  • DQMHistoTests: Total failures: 14
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3650949
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar May 22 '22 18:05 cmsbuild

I am not sure the changes introduced in this PR are the cause of the errors and crash in the RelVals (I mean: how did I manage to mess up ECALOnlyGPU?). Is "el8" now the standard platform for relvals?

VinInn avatar May 23 '22 06:05 VinInn

I am unable to run gpu relvals

[1]    Exit 1                        runTheMatrix.py --gpu=required -e -t 8 -l 11634.506 >& gpu.log
[innocent@patatrack02 matrix]$
[innocent@patatrack02 matrix]$ cat gpu.log
processing relval_standard
processing relval_highstats
processing relval_pileup
processing relval_generator
processing relval_extendedgen
processing relval_production
processing relval_ged
ignoring relval_upgrade from default matrix
ignoring relval_cleanedupgrade from default matrix
ignoring relval_gpu from default matrix
processing relval_2017
processing relval_2026
ignoring relval_identity from default matrix
processing relval_machine
processing relval_premix
Traceback (most recent call last):
  File "/cvmfs/cms-ib.cern.ch/nweek-02733/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_5_X_2022-05-18-1100/bin/slc7_amd64_gcc11/runTheMatrix.py", line 606, in <module>
    ret = runSelected(opt)
  File "/cvmfs/cms-ib.cern.ch/nweek-02733/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_5_X_2022-05-18-1100/bin/slc7_amd64_gcc11/runTheMatrix.py", line 31, in runSelected
    if len(undefSet)>0: raise ValueError('Undefined workflows: '+', '.join(map(str,list(undefSet))))
ValueError: Undefined workflows: 11634.506

[innocent@patatrack02 matrix]$ runTheMatrix.py --requires-gpu -e -n | grep GPU
39434.502 2026D88_Patatrack_PixelOnlyGPU+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal [1]: cmsDriver.py TTbar_14TeV_TuneCP5_cfi  -s GEN,SIM -n 10 --conditions auto:phase2_realistic_T21 --beamspot HLLHC14TeV --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry Extended2026D88 --era Phase2C17I13M9 --relval 9000,100

VinInn avatar May 23 '22 06:05 VinInn

I think you need runTheMatrix.py -w gpu ... or runTheMatrix.py -w upgrade ... to enable the GPU workflows.

fwyzard avatar May 23 '22 10:05 fwyzard

For me it passes:

11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Mon May 23 14:41:24 2022-date Mon May 23 14:34:02 2022; exit: 0 0 0 0
1 1 1 1 tests passed, 0 0 0 0 failed

with this PR and with IB CMSSW_12_5_X_2022-05-23-1100 as well

VinInn avatar May 23 '22 13:05 VinInn

It also works for me. Let's trigger the test again.

clacaputo avatar May 25 '22 11:05 clacaputo

@cmsbuild please test

clacaputo avatar May 25 '22 11:05 clacaputo

enable gpu

clacaputo avatar May 25 '22 11:05 clacaputo

-1

Failed Tests: RelVals-GPU
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-651b42/24984/summary.html
COMMIT: 2d0392890d2e9282c03eca5dc21741e1dc3ff091
CMSSW: CMSSW_12_5_X_2022-05-24-2300/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37952/24984/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-GPU

  • 11634.506_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_PixelOnlyTripletsGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
  • 11634.522_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log
  • 11634.512_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano/step2_TTbar_14TeV+2021_Patatrack_ECALOnlyGPU+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3650985
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3650961
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar May 25 '22 15:05 cmsbuild

it's crashing in the standard (non GPU, say legacy) part

#5  0x00002b345764b344 in RecHitsSortedInPhi::RecHitsSortedInPhi(std::vector<BaseTrackerRecHit const*, std::allocator<BaseTrackerRecHit const*> > const&, Point3DBase<float, GlobalTag> const&, DetLayer const*) () from /cvmfs/cms-ib.cern.ch/nweek-02734/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-05-23-2300/lib/el8_amd64_gcc10/libRecoTrackerTkHitPairs.so
#6  0x00002b345764746c in LayerHitMapCache::operator()(SeedingLayerSetsHits::SeedingLayer const&, TrackingRegion const&) () from /cvmfs/cms-ib.cern.ch/nweek-02734/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-05-23-2300/lib/el8_amd64_gcc10/libRecoTrackerTkHitPairs.so
#7  0x00002b345764540f in HitPairGeneratorFromLayerPair::doublets(TrackingRegion const&, edm::Event const&, edm::EventSetup const&, SeedingLayerSetsHits::SeedingLayer const&, SeedingLayerSetsHits::SeedingLayer const&, LayerHitMapCache&) () from /cvmfs/cms-ib.cern.ch/nweek-02734/el8_amd64_gcc10/cms/cmssw/CMSSW_12_5_X_2022-05-23-2300/lib/el8_amd64_gcc10/libRecoTrackerTkHitPairs.so

VinInn avatar May 26 '22 06:05 VinInn

BTW, it is crashing in step2 (HLT), and that part is not customized to run just pixelTracks, ECAL, or HCAL. It seems to run just the HLT menu; I am not even sure whether with GPU or not.

VinInn avatar May 26 '22 06:05 VinInn

Managed to reproduce the crash (it seems to happen when running single threaded...). It seems that it runs the GPU HLT menu (is that intended?). I will try to understand why...

VinInn avatar May 26 '22 07:05 VinInn

Yes, all GPU-related workflows run the full HLT menu on GPUs (if one is available).

fwyzard avatar May 26 '22 10:05 fwyzard

Why is it scheduling and running both

TimeModule> 6 1 hltSiPixelRecHitsFromLegacy SiPixelRecHitSoAFromLegacy 0.000859022

and

TimeModule> 6 1 hltSiPixelRecHitsFromGPU SiPixelRecHitFromCUDA 0.0004251

?

VinInn avatar May 26 '22 10:05 VinInn

No clue how it was passing the previous test...

VinInn avatar May 26 '22 13:05 VinInn

@cmsbuild , please test

VinInn avatar May 26 '22 13:05 VinInn