cmssw
Extend onnxruntime gpu interface to producers using onnxruntime
Extends #36963 by adding a backend parameter to models, used by
cms::Ort::getSessionOptions(iConfig.getParameter<std::string>("onnx_backend"));
Current options are:
- cpu -> Use CPU backend
- cuda -> Use cuda backend
- default -> Use best available
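The three backend values can be read as a mapping onto onnxruntime execution providers. Below is a minimal sketch of that resolution logic, not the code in this PR: the function name `resolve_providers` and the fallback rule used for `default` are assumptions made for illustration. Only the provider names `CUDAExecutionProvider` and `CPUExecutionProvider` are real onnxruntime identifiers.

```python
CPU = "CPUExecutionProvider"
CUDA = "CUDAExecutionProvider"


def resolve_providers(backend, available):
    """Map a backend name to an ordered list of execution providers.

    `available` is the list of providers the runtime reports, for example
    the result of onnxruntime.get_available_providers().
    """
    if backend == "cpu":
        return [CPU]
    if backend == "cuda":
        if CUDA not in available:
            raise RuntimeError("cuda backend requested but the CUDA provider is unavailable")
        # CPU stays in the list as a fallback for operators without a CUDA kernel.
        return [CUDA, CPU]
    if backend == "default":
        # "Use best available": prefer CUDA when present (assumed rule).
        return [CUDA, CPU] if CUDA in available else [CPU]
    raise ValueError(f"unknown onnx_backend: {backend!r}")
```

With this rule, `default` silently degrades to the CPU on a machine without a GPU, while an explicit `cuda` request fails loudly.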
The model used in BoostedJetONNXJetTagsProducer crashes on GPU if the full optimization is included. I reduced the optimization level when a GPU is used (following recipes found on the web). The sort of error one gets is
Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data);
2022-08-26 13:26:51.964709271 [E:onnxruntime:, sequential_executor.cc:346 Execute] Non-zero status code returned while running FusedConv node. Name:'Conv_98_Add_99_Relu_100'
Status Message: CUDNN error executing cudnnAddTensor(Base::CudnnHandle(), &alpha, Base::s_.z_tensor, Base::s_.z_data, &alpha, Base::s_.y_tensor, Base::s_.y_data)
----- Begin Fatal Exception 26-Aug-2022 14:26:51 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 1 lumi: 1 event: 1 stream: 0
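The workaround described above (lowering the optimization level when a GPU is used) could look roughly like this with the onnxruntime Python API. This is a sketch only: the PR itself is C++, the text does not say which level was chosen, and `ORT_ENABLE_BASIC` is an assumption, picked because the failing `FusedConv` node appears to come from the more aggressive fusion passes. The helper names are hypothetical.

```python
def optimization_level_name(use_gpu):
    """Return the name of the graph optimization level to request."""
    # Assumption: basic optimizations on GPU, full optimizations on CPU.
    return "ORT_ENABLE_BASIC" if use_gpu else "ORT_ENABLE_ALL"


def make_session_options(use_gpu):
    """Build onnxruntime SessionOptions with the chosen optimization level."""
    import onnxruntime as ort  # imported lazily; only needed when a session is built

    opts = ort.SessionOptions()
    opts.graph_optimization_level = getattr(
        ort.GraphOptimizationLevel, optimization_level_name(use_gpu)
    )
    return opts
```

Keeping the level choice in its own function makes it easy to restore full optimization once a fixed onnxruntime release is available.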
So far I see neither a significant performance improvement nor a loss (at least on lxplus-gpu). BoostedJetONNXJetTagsProducer.cc, at least, could be improved to send more than one jet to onnxruntime at a time.
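Sending more than one jet per call amounts to packing the per-jet inputs into one buffer with a leading batch dimension. The sketch below illustrates that idea under simplifying assumptions: a single input with one fixed-length feature vector per jet, and a model exported with a dynamic batch axis. The real producer has its own input layout, and the names `batch_jets` and `chunks` are hypothetical.

```python
def batch_jets(jets):
    """Pack per-jet feature vectors into one flat buffer plus its shape.

    `jets` is a list of equal-length lists of floats (one list per jet).
    Returns (flat_values, [n_jets, n_features]); the first dimension is the
    batch dimension the model would need to accept.
    """
    if not jets:
        return [], [0, 0]
    n_features = len(jets[0])
    flat = []
    for i, features in enumerate(jets):
        if len(features) != n_features:
            raise ValueError(f"jet {i} has {len(features)} features, expected {n_features}")
        flat.extend(features)
    return flat, [len(jets), n_features]


def chunks(items, max_batch):
    """Yield successive slices of at most max_batch items."""
    for start in range(0, len(items), max_batch):
        yield items[start:start + max_batch]
```

A producer could loop over `chunks(jets, max_batch)` and run one inference per chunk, which bounds memory use while still amortizing the per-call overhead.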
-code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39402/32108
- This PR adds an extra 28KB to repository
Code check has found code style and quality issues which could be resolved by applying following patch(s)
- code-format: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39402/32108/code-format.patch
e.g.
curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39402/32108/code-format.patch | patch -p1
You can also run scram build code-format to apply code format directly
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39402/32110
- This PR adds an extra 28KB to repository
A new Pull Request was created by @davidlange6 (David Lange) for master.
It involves the following packages:
- PhysicsTools/ONNXRuntime (reconstruction)
- RecoBTag/ONNXRuntime (reconstruction)
- RecoParticleFlow/PFProducer (reconstruction)
@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks. @AlexDeMoor, @mmarionncern, @JyothsnaKomaragiri, @AnnikaStein, @riga, @emilbols, @lgray, @missirol, @hatakeyamak, @andrzejnovak, @demuller, @seemasharmafnal this is something you requested to watch as well. @perrotta, @dpiparo, @rappoccio you are the release manager for this.
cms-bot commands are listed here
enable gpu
please test
assign heterogenous
assign heterogeneous (helps if you can spell)
New categories assigned: heterogeneous
@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
We also encountered this ONNX issue in SONIC tests. I think it's https://github.com/microsoft/onnxruntime/issues/12321. There's a fix merged, but not in a release yet.
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-edc989/27572/summary.html
COMMIT: 9433eda06107f56bdc6f280b9ec0803350b2b7d5
CMSSW: CMSSW_12_6_X_2022-09-15-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39402/27572/install.sh
to create a dev area with all the needed externals and cmssw changes.
GPU Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 0 differences found in the comparisons
- Reco comparison had 3 failed jobs
- DQMHistoTests: Total files compared: 4
- DQMHistoTests: Total histograms compared: 19876
- DQMHistoTests: Total failures: 8
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 19868
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
- Checked 12 log files, 9 edm output root files, 4 DQM output files
- TriggerResults: found differences in 1 / 3 workflows
Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 5 differences found in the comparisons
- DQMHistoTests: Total files compared: 51
- DQMHistoTests: Total histograms compared: 3618326
- DQMHistoTests: Total failures: 8
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 3618296
- DQMHistoTests: Total skipped: 22
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
- Checked 212 log files, 49 edm output root files, 51 DQM output files
- TriggerResults: no differences found
Given that by default a GPU would be used if it is available, maybe it would be time to make the loading of Configuration.StandardSequences.Accelerators_cff unconditional
https://github.com/cms-sw/cmssw/blob/284681e89ac822d328cf54cfe57866c475fae9e4/Configuration/StandardSequences/python/Services_cff.py#L11-L18
either by an unconditional process.load("Configuration.StandardSequences.Accelerators_cff") there, or by making the ConfigBuilder load the Accelerators_cff in a similar way to Services_cff (in which case the Accelerators_cff would be visible in the generated configuration)?
Try out asynchronous offload (would need e.g. https://github.com/cms-sw/cmssw/issues/29188)
Regardless of the setting, onnxruntime can decide to use the CPU for some model components (or rather, it does in our models and I see no way to disable this). I believe this offload would need to be handled by a change to onnxruntime itself (it's possible that I have misunderstood how this works, or that such a hook already exists).
Thanks, if there is a risk of any significant CPU use, we'd not want to do that in a non-TBB thread.
I think we would have to see that empirically for the models we have. I am not sure how to actually do that. The profiler produces some JSON, but I have not yet managed to relate it to something like the fraction of the time the CPU is doing work vs the GPU doing work (and onnxruntime runs much more slowly in this mode).
of course that depends on what is not "significant"...
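One way to approach the CPU-vs-GPU question is to sum the per-node durations in the profiler JSON by execution provider. The sketch below assumes the Chrome-trace-like layout that onnxruntime's profiler writes (node events with `cat` equal to `"Node"`, names ending in `_kernel_time`, a `dur` field in microseconds, and `args["provider"]`); those field names should be checked against an actual profile file. The function names are hypothetical. Because GPU kernels are launched asynchronously, the host-side durations may under-count GPU work, so the result is a rough estimate only.

```python
import json
from collections import defaultdict


def time_by_provider(events):
    """Sum kernel durations (microseconds) per execution provider."""
    totals = defaultdict(int)
    for ev in events:
        if ev.get("cat") != "Node":
            continue  # skip session-level and other non-node events
        if not ev.get("name", "").endswith("_kernel_time"):
            continue  # skip fence events that carry no kernel duration
        provider = ev.get("args", {}).get("provider", "unknown")
        totals[provider] += ev.get("dur", 0)
    return dict(totals)


def fractions(totals):
    """Convert per-provider totals into fractions of the summed time."""
    total = sum(totals.values())
    return {k: v / total for k, v in totals.items()} if total else {}


def load_profile(path):
    """Read a profiler output file (assumed to be a JSON array of events)."""
    with open(path) as f:
        return json.load(f)
```

For a given profile, `fractions(time_by_provider(load_profile(path)))` would give the share of node time attributed to each provider, which is one possible working definition of "significant" CPU use.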
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39402/32196
- This PR adds an extra 32KB to repository
Pull request #39402 was updated. @cmsbuild, @makortel, @mandrenguyen, @clacaputo, @fwyzard can you please check and sign again.
please test
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-edc989/27770/summary.html
COMMIT: 68fc2b3ef8865796c09898f6a380bebcece52e30
CMSSW: CMSSW_12_6_X_2022-09-26-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39402/27770/install.sh
to create a dev area with all the needed externals and cmssw changes.
GPU Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 0 differences found in the comparisons
- Reco comparison had 3 failed jobs
- DQMHistoTests: Total files compared: 4
- DQMHistoTests: Total histograms compared: 19876
- DQMHistoTests: Total failures: 529
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 19347
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
- Checked 12 log files, 9 edm output root files, 4 DQM output files
- TriggerResults: found differences in 1 / 3 workflows
Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 0 differences found in the comparisons
- DQMHistoTests: Total files compared: 51
- DQMHistoTests: Total histograms compared: 3624368
- DQMHistoTests: Total failures: 2
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 3624344
- DQMHistoTests: Total skipped: 22
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
- Checked 212 log files, 49 edm output root files, 51 DQM output files
- TriggerResults: no differences found
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39402/32301
- This PR adds an extra 32KB to repository
Pull request #39402 was updated. @cmsbuild, @makortel, @mandrenguyen, @clacaputo, @fwyzard can you please check and sign again.
please test
please test
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-edc989/27836/summary.html
COMMIT: 54d730fcebe7add42934384806eb4be3a323af3e
CMSSW: CMSSW_12_6_X_2022-09-28-2300/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39402/27836/install.sh
to create a dev area with all the needed externals and cmssw changes.
Comparison Summary
@slava77 comparisons for the following workflows were not done due to missing matrix map:
- /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-edc989/41834.0_TTbar_14TeV+2026D94+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal
Summary:
- No significant changes to the logs found
- Reco comparison results: 4 differences found in the comparisons
- DQMHistoTests: Total files compared: 49
- DQMHistoTests: Total histograms compared: 3433154
- DQMHistoTests: Total failures: 3
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 3433129
- DQMHistoTests: Total skipped: 22
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
- Checked 204 log files, 49 edm output root files, 49 DQM output files
- TriggerResults: no differences found
GPU Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 0 differences found in the comparisons
- Reco comparison had 3 failed jobs
- DQMHistoTests: Total files compared: 4
- DQMHistoTests: Total histograms compared: 19876
- DQMHistoTests: Total failures: 8
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 19868
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
- Checked 12 log files, 9 edm output root files, 4 DQM output files
- TriggerResults: found differences in 1 / 3 workflows
There are some changes to the tests here, but I think @cms-sw/heterogeneous-l2 is probably better qualified than @cms-sw/reconstruction-l2 to comment
The non-GPU tests show differences only regarding MessageLogger. The differences in 11634.506 in GPU tests look like the usual variation seen in GPU tests.
Note that this PR currently has no impact on workflows that do not enable the gpu (or pixelNtupletFit) modifier.
+reconstruction No changes to CPU-only workflows. Changes to GPU workflows are said to be expected.
Milestone for this pull request has been moved to CMSSW_14_0_X. Please open a backport if it should also go in to CMSSW_13_3_X.
Milestone for this pull request has been moved to CMSSW_14_1_X. Please open a backport if it should also go in to CMSSW_14_0_X.
Pull request #39402 was updated. @wpmccormack, @fwyzard, @valsdav, @makortel can you please check and sign again.