
Disabling TensorFlow CUDA tests for 14_0_X

Open valsdav opened this issue 11 months ago • 4 comments

PR description:

This PR temporarily disables the CUDA tests for the TensorFlow package, because GPU support is not enabled in the 14_X series due to CUDA 12 incompatibilities (https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/tensorflow-requires.file#L7).

The test testTFVisibleDevicesCUDA is still run by the framework because a CUDA device is registered, but TF then does not recognize the device and the test fails. The other testTF*CUDA tests instead silently run on the CPU.
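
To illustrate the silent CPU fallback described above, here is a minimal Python sketch (not the CMSSW C++ test code) that inspects a TensorFlow device string and reports whether an op actually landed on a GPU. The helper names `device_type` and `ran_on_gpu` are hypothetical; the sketch assumes the fully qualified device-string form that TensorFlow reports for eager tensors through their `.device` attribute.

```python
def device_type(device_string: str) -> str:
    """Return the device type ('CPU', 'GPU', ...) from a TF device string.

    Assumes strings such as
    '/job:localhost/replica:0/task:0/device:GPU:0' or '/device:CPU:0'.
    """
    for part in device_string.split("/"):
        if part.startswith("device:"):
            # 'device:GPU:0' -> ['device', 'GPU', '0']
            return part.split(":")[1].upper()
    raise ValueError(f"no 'device:' component in {device_string!r}")


def ran_on_gpu(device_string: str) -> bool:
    """True only if the op was placed on a GPU, not silently on the CPU."""
    return device_type(device_string) == "GPU"
```

A CUDA test could assert `ran_on_gpu(result.device)` on an output tensor, so that a fallback to the CPU shows up as a failure instead of passing unnoticed.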

This PR is needed to continue the integration of CUDA 12.4 in https://github.com/cms-sw/cmsdist/pull/9046

valsdav avatar Mar 12 '24 11:03 valsdav

A new Pull Request was created by @valsdav for CMSSW_14_0_X.

It involves the following packages:

  • PhysicsTools/TensorFlow (ml)

@valsdav, @wpmccormack, @cmsbuild can you please review it and eventually sign? Thanks. @makortel, @riga this is something you requested to watch as well. @sextonkennedy, @rappoccio, @antoniovilela you are the release manager for this.

cms-bot commands are listed here

cmsbuild avatar Mar 12 '24 11:03 cmsbuild

cms-bot internal usage

cmsbuild avatar Mar 12 '24 11:03 cmsbuild

hold

  • As discussed at ORP.

antoniovilela avatar Mar 12 '24 16:03 antoniovilela

Pull request has been put on hold by @antoniovilela. They need to issue an unhold command to remove the hold state, or L1 can unhold it for all.

cmsbuild avatar Mar 12 '24 16:03 cmsbuild

Pull request #44375 was updated. @valsdav, @cmsbuild, @wpmccormack can you please check and sign again.

cmsbuild avatar Mar 18 '24 20:03 cmsbuild

enable gpu

valsdav avatar Mar 18 '24 20:03 valsdav

please test

valsdav avatar Mar 18 '24 20:03 valsdav

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b349d4/38238/summary.html
COMMIT: fcc3e4fbb27693c684d5da32abb136b0be663edf
CMSSW: CMSSW_14_0_X_2024-03-18-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44375/38238/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 27 lines to the logs
  • Reco comparison results: 39 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3346212
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3346190
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 205 log files, 166 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 183
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39557
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Mar 19 '24 00:03 cmsbuild

+ml

Technical. Avoid running the CUDA tests if TensorFlow is not compiled for CUDA.
Make sure the CUDA tests fail if TF does not recognize the card.
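
The policy stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual change made in PhysicsTools/TensorFlow: the names `Action`, `cuda_test_action` and `probe_tensorflow` are hypothetical, and the probe relies on the TensorFlow Python calls `tf.test.is_built_with_cuda()` and `tf.config.list_physical_devices("GPU")` rather than on the C++ API used by the CMSSW tests.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"  # TF has no CUDA support: do not run the CUDA test
    FAIL = "fail"  # TF has CUDA support but sees no GPU: report a failure
    RUN = "run"    # TF has CUDA support and sees a GPU: run the test


def cuda_test_action(built_with_cuda: bool, visible_gpus: int) -> Action:
    """Decide what a CUDA test should do (hypothetical helper)."""
    if not built_with_cuda:
        return Action.SKIP
    if visible_gpus == 0:
        return Action.FAIL
    return Action.RUN


def probe_tensorflow() -> tuple[bool, int]:
    """Query the local TensorFlow build; imported lazily so the
    policy function above stays usable without TensorFlow installed."""
    import tensorflow as tf

    built = tf.test.is_built_with_cuda()
    gpus = len(tf.config.list_physical_devices("GPU"))
    return built, gpus
```

With TensorFlow available, `cuda_test_action(*probe_tensorflow())` gives the decision for the current machine. Keeping the decision separate from the probe means the three cases can be checked without a GPU.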

valsdav avatar Mar 22 '24 12:03 valsdav

hold

  • As discussed at ORP.

@antoniovilela I don't remember any more why this was put on hold.

Do you think we can proceed with it? It seems to be blocking #45143.

fwyzard avatar Jun 20 '24 08:06 fwyzard

REMINDER @antoniovilela, @rappoccio, @sextonkennedy: This PR was tested with cms-sw/cmssw#45143, please check if they should be merged together

cmsbuild avatar Jun 20 '24 16:06 cmsbuild

enable gpu

fwyzard avatar Jun 25 '24 15:06 fwyzard

please test

fwyzard avatar Jun 25 '24 15:06 fwyzard

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b349d4/40076/summary.html
COMMIT: fcc3e4fbb27693c684d5da32abb136b0be663edf
CMSSW: CMSSW_14_0_X_2024-06-25-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44375/40076/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 48 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39744
  • DQMHistoTests: Total failures: 1821
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 37923
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

cmsbuild avatar Jun 25 '24 18:06 cmsbuild

Do you think we can proceed with it? It seems to be blocking #45143.

Just to mention that #45143 was merged a few days ago, but this backport is still on hold.

missirol avatar Jun 27 '24 16:06 missirol

unhold

antoniovilela avatar Jul 02 '24 00:07 antoniovilela

This pull request is fully signed and it will be integrated in one of the next CMSSW_14_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_14_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

cmsbuild avatar Jul 02 '24 00:07 cmsbuild

+1

antoniovilela avatar Jul 02 '24 00:07 antoniovilela