cmssw
Disabling TensorFlow CUDA tests for 14_0_X
PR description:
This PR temporarily disables the CUDA tests for the TensorFlow package, as GPU support is not enabled in the 14_X series due to CUDA 12 incompatibilities (https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/tensorflow-requires.file#L7).
The test testTFVisibleDevicesCUDA is in fact run by the framework, since a CUDA device is registered, but TF then does not recognize the device and the test fails. The other testTF*CUDA tests instead silently run on the CPU.
This PR is needed to continue the integration of CUDA 12.4 in https://github.com/cms-sw/cmsdist/pull/9046
A new Pull Request was created by @valsdav for CMSSW_14_0_X.
It involves the following packages:
- PhysicsTools/TensorFlow (ml)
@valsdav, @wpmccormack, @cmsbuild can you please review it and eventually sign? Thanks. @makortel, @riga this is something you requested to watch as well. @sextonkennedy, @rappoccio, @antoniovilela you are the release manager for this.
cms-bot commands are listed here
hold
- As discussed at ORP.
Pull request has been put on hold by @antoniovilela
They need to issue an unhold command to remove the hold state or L1 can unhold it for all
Pull request #44375 was updated. @valsdav, @cmsbuild, @wpmccormack can you please check and sign again.
enable gpu
please test
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b349d4/38238/summary.html
COMMIT: fcc3e4fbb27693c684d5da32abb136b0be663edf
CMSSW: CMSSW_14_0_X_2024-03-18-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44375/38238/install.sh to create a dev area with all the needed externals and cmssw changes.
Comparison Summary
Summary:
- You potentially added 27 lines to the logs
- Reco comparison results: 39 differences found in the comparisons
- DQMHistoTests: Total files compared: 49
- DQMHistoTests: Total histograms compared: 3346212
- DQMHistoTests: Total failures: 0
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 3346190
- DQMHistoTests: Total skipped: 22
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
- Checked 205 log files, 166 edm output root files, 49 DQM output files
- TriggerResults: no differences found
GPU Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 9 differences found in the comparisons
- DQMHistoTests: Total files compared: 3
- DQMHistoTests: Total histograms compared: 39740
- DQMHistoTests: Total failures: 183
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 39557
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
- Checked 8 log files, 10 edm output root files, 3 DQM output files
- TriggerResults: no differences found
+ml
Technical change: avoid running the CUDA tests if TensorFlow is not compiled with CUDA support, and make sure the CUDA tests fail if TF does not recognize the card.
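The policy described in this comment can be sketched as a small decision function. This is only an illustration in Python, not the code in this PR (the actual tests live in PhysicsTools/TensorFlow and are not shown in this thread). The names `CudaTestAction`, `decide_cuda_test_action` and `probe_tensorflow` are hypothetical, and the rule that a test is skipped when the framework registers no CUDA device is an assumption based on the PR description.

```python
from enum import Enum


class CudaTestAction(Enum):
    """Hypothetical outcomes for a TensorFlow CUDA test."""
    SKIP = "skip"
    RUN = "run"
    FAIL = "fail"


def decide_cuda_test_action(tf_built_with_cuda: bool,
                            framework_sees_gpu: bool,
                            tf_sees_gpu: bool) -> CudaTestAction:
    """Decide what a CUDA test should do, following the policy in the comment.

    Assumption: tests are only scheduled when the framework registers a
    CUDA device, so a missing device means the test is skipped.
    """
    if not tf_built_with_cuda:
        # TensorFlow has no GPU support compiled in: do not run CUDA tests.
        return CudaTestAction.SKIP
    if not framework_sees_gpu:
        # No CUDA device registered by the framework: nothing to test.
        return CudaTestAction.SKIP
    if not tf_sees_gpu:
        # A device exists but TF does not recognize it: fail loudly
        # instead of silently falling back to the CPU.
        return CudaTestAction.FAIL
    return CudaTestAction.RUN


def probe_tensorflow():
    """Return (built_with_cuda, sees_gpu), or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    built_with_cuda = tf.test.is_built_with_cuda()
    sees_gpu = len(tf.config.list_physical_devices("GPU")) > 0
    return built_with_cuda, sees_gpu
```

In a Python environment with TensorFlow installed, the result of `probe_tensorflow()` could be passed to `decide_cuda_test_action` together with whatever the framework reports about CUDA devices. The failure branch is the case described for testTFVisibleDevicesCUDA in the PR description.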
hold
* As discussed at ORP.
@antoniovilela I don't remember anymore why this was put on hold.
Do you think we can proceed with it? It seems to be blocking #45143.
REMINDER @antoniovilela, @rappoccio, @sextonkennedy: This PR was tested with cms-sw/cmssw#45143, please check if they should be merged together
enable gpu
please test
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b349d4/40076/summary.html
COMMIT: fcc3e4fbb27693c684d5da32abb136b0be663edf
CMSSW: CMSSW_14_0_X_2024-06-25-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44375/40076/install.sh to create a dev area with all the needed externals and cmssw changes.
Comparison Summary
Summary:
- You potentially removed 11 lines from the logs
- Reco comparison results: 152 differences found in the comparisons
- DQMHistoTests: Total files compared: 48
- DQMHistoTests: Total histograms compared: 3342528
- DQMHistoTests: Total failures: 2727
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 3339781
- DQMHistoTests: Total skipped: 20
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
- Checked 202 log files, 165 edm output root files, 48 DQM output files
- TriggerResults: no differences found
GPU Comparison Summary
Summary:
- No significant changes to the logs found
- Reco comparison results: 48 differences found in the comparisons
- DQMHistoTests: Total files compared: 3
- DQMHistoTests: Total histograms compared: 39744
- DQMHistoTests: Total failures: 1821
- DQMHistoTests: Total nulls: 0
- DQMHistoTests: Total successes: 37923
- DQMHistoTests: Total skipped: 0
- DQMHistoTests: Total Missing objects: 0
- DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
- Checked 8 log files, 10 edm output root files, 3 DQM output files
- TriggerResults: no differences found
Do you think we can proceed with it? It seems to be blocking #45143.
Just to mention that #45143 was merged a few days ago, but this backport is still on hold.
unhold
This pull request is fully signed and it will be integrated in one of the next CMSSW_14_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_14_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)
+1