cmssw
cmssw copied to clipboard
Improve handling of TF CUDA tests for 14_1_X
PR description:
This PR improves the handling of CUDA unit tests for the TensorFlow package, using a new tf_cuda_support
tool from scram , which checks if the GPU support is enabled in TensorFlow compilation.
The PR also makes the TF cuda tests more strict by checking explicitely if a CUDA device is visible to TF and not only to cmssw.
The test testTFVisibleDevicesCUDA
is in fact run by the framework as a CUDA device is registered, but then TF does not recognize the device and the test fails. The other testTF*CUDA
tests were passing silently using the CPU to run the test. After this PR all the TF sessions using tf::backend::cuda
, but not finding a GPU will fail explicitly.
- This PR is needed to continue the integration of CUDA 12.4 in https://github.com/cms-sw/cmsdist/pull/9046.
- This PR needs https://github.com/cms-sw/cmsdist/pull/9066 (thanks to @smuzaffar https://github.com/cms-sw/cmssw/pull/44376#issuecomment-1992693414)
cms-bot internal usage
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39431
- This PR adds an extra 12KB to repository
A new Pull Request was created by @valsdav for master.
It involves the following packages:
- PhysicsTools/TensorFlow (ml)
@cmsbuild, @valsdav, @wpmccormack can you please review it and eventually sign? Thanks. @makortel, @riga this is something you requested to watch as well. @rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here
@valsdav , any idea why only testTFVisibleDevicesCUDA
fails for GPU IBs (https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_14_1_GPU_X_2024-03-11-2300/unitTestLogs/PhysicsTools/TensorFlow#/89-89 ) ?
@valsdav , I was thinking to add tf_cuda_support
toolfile so that we can have something like
<iftool name="tf_cuda_support">
all tests which requires TF to be build with GPU support
</iftool>
@valsdav , any idea why only
testTFVisibleDevicesCUDA
fails for GPU IBs (https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_14_1_GPU_X_2024-03-11-2300/unitTestLogs/PhysicsTools/TensorFlow#/89-89 ) ?
Hi @smuzaffar I realized this morning testing locally that only that test is failing because it is the only one explicitly checking the list of devices visible to TF.
The other tests that are run when a CUDA device is visible to the cmssw framework are silently fall-backing to CPU running in TF. I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.
all tests which requires TF to be build with GPU support
That's a good idea! Thanks!
The other
testTF*CUDA
tests are instead silently using the CPU to run the test.
I think this part should be fixed: if we want the tests to run with CUDA, they should not fall back to CPU.
I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.
In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs ?
hold
- As discussed at ORP.
Pull request has been put on hold by @antoniovilela
They need to issue an unhold
command to remove the hold
state or L1 can unhold
it for all
I will add tf_cuda_support
tool and will redo this PR
I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.
In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs ?
I have a solution for this, @smuzaffar do you want me to push it here so that it can be included in the new PR? Thanks
@valsdav , I have opened https://github.com/cms-sw/cmsdist/pull/9066 which adds the suggested new tool tf_cuda_support
. Feel free to update this PR with
- Changes needed for
In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs
- Move TF GPU unit tests in
<iftool name="tf_cuda_support"> </iftool>
block
-code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39443
- This PR adds an extra 16KB to repository
Code check has found code style and quality issues which could be resolved by applying following patch(s)
-
code-format:
https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39443/code-format.patch
e.g.
curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39443/code-format.patch | patch -p1
You can also runscram build code-format
to apply code format directly
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39446
- This PR adds an extra 16KB to repository
Pull request #44376 was updated. @wpmccormack, @cmsbuild, @valsdav can you please check and sign again.
please test with https://github.com/cms-sw/cmsdist/pull/9066
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38092/summary.html
COMMIT: b9b41e4c8166dca668644f4aef63429fb9c4aebf
CMSSW: CMSSW_14_1_X_2024-03-12-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44376/38092/install.sh
to create a dev area with all the needed externals and cmssw changes.
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
- @cms-sw cms-sw/cmsdist#9066
You can see more details here: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38092/git-recent-commits.json https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38092/git-merge-result
Comparison Summary
Summary:
- You potentially removed 89 lines from the logs
- Reco comparison results: 58 differences found in the comparisons
- DQMHistoTests: Total files compared: 48
- DQMHistoTests: Total histograms compared: 3297383
- DQMHistoTests: Total failures: 9
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 3297354
- DQMHistoTests: Total skipped: 20
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
- Checked 202 log files, 165 edm output root files, 48 DQM output files
- TriggerResults: no differences found
enable gpu
please test
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38108/summary.html
COMMIT: b9b41e4c8166dca668644f4aef63429fb9c4aebf
CMSSW: CMSSW_14_1_X_2024-03-13-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44376/38108/install.sh
to create a dev area with all the needed externals and cmssw changes.
Comparison Summary
Summary:
- You potentially added 1 lines to the logs
- Reco comparison results: 50 differences found in the comparisons
- DQMHistoTests: Total files compared: 48
- DQMHistoTests: Total histograms compared: 3297383
- DQMHistoTests: Total failures: 6
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 3297357
- DQMHistoTests: Total skipped: 20
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
- Checked 202 log files, 165 edm output root files, 48 DQM output files
- TriggerResults: no differences found
GPU Comparison Summary
Summary:
- You potentially removed 8 lines from the logs
- Reco comparison results: 35 differences found in the comparisons
- DQMHistoTests: Total files compared: 3
- DQMHistoTests: Total histograms compared: 39740
- DQMHistoTests: Total failures: 1188
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 38552
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
- Checked 8 log files, 10 edm output root files, 3 DQM output files
- TriggerResults: no differences found
@cms-sw/ml-l2 , can you please review this?
@cms-sw/orp-l2 , this now has all the required changes. you can unhold
it now
+1
@valsdav , can you please backport it for 14.0.X ?
unhold
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)
+1