cmssw Improve handling of TF CUDA tests for 14_1

PR description:

This PR improves the handling of CUDA unit tests for the TensorFlow package, using a new tf_cuda_support tool from scram , which checks if the GPU support is enabled in TensorFlow compilation.

The PR also makes the TF cuda tests more strict by checking explicitely if a CUDA device is visible to TF and not only to cmssw.

The test testTFVisibleDevicesCUDA is in fact run by the framework as a CUDA device is registered, but then TF does not recognize the device and the test fails. The other testTF*CUDA tests were passing silently using the CPU to run the test. After this PR all the TF sessions using tf::backend::cuda , but not finding a GPU will fail explicitly.

This PR is needed to continue the integration of CUDA 12.4 in https://github.com/cms-sw/cmsdist/pull/9046.
This PR needs https://github.com/cms-sw/cmsdist/pull/9066 (thanks to @smuzaffar https://github.com/cms-sw/cmssw/pull/44376#issuecomment-1992693414)

Mar 12 '24 11:03 valsdav

cms-bot internal usage

Mar 12 '24 11:03 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39431

This PR adds an extra 12KB to repository

Mar 12 '24 11:03 cmsbuild

A new Pull Request was created by @valsdav for master.

It involves the following packages:

PhysicsTools/TensorFlow (ml)

@cmsbuild, @valsdav, @wpmccormack can you please review it and eventually sign? Thanks. @makortel, @riga this is something you requested to watch as well. @rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

Mar 12 '24 11:03 cmsbuild

@valsdav , any idea why only testTFVisibleDevicesCUDA fails for GPU IBs (https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_14_1_GPU_X_2024-03-11-2300/unitTestLogs/PhysicsTools/TensorFlow#/89-89 ) ?

Mar 12 '24 11:03 smuzaffar

@valsdav , I was thinking to add tf_cuda_support toolfile so that we can have something like

<iftool name="tf_cuda_support">
  all tests which requires TF to be build with GPU support
</iftool>

Mar 12 '24 11:03 smuzaffar

@valsdav , any idea why only testTFVisibleDevicesCUDA fails for GPU IBs (https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_14_1_GPU_X_2024-03-11-2300/unitTestLogs/PhysicsTools/TensorFlow#/89-89 ) ?

Hi @smuzaffar I realized this morning testing locally that only that test is failing because it is the only one explicitly checking the list of devices visible to TF.

The other tests that are run when a CUDA device is visible to the cmssw framework are silently fall-backing to CPU running in TF. I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.

Mar 12 '24 12:03 valsdav

all tests which requires TF to be build with GPU support

That's a good idea! Thanks!

Mar 12 '24 12:03 valsdav

The other testTF*CUDA tests are instead silently using the CPU to run the test.

I think this part should be fixed: if we want the tests to run with CUDA, they should not fall back to CPU.

Mar 12 '24 13:03 fwyzard

I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.

In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs ?

Mar 12 '24 13:03 fwyzard

hold

As discussed at ORP.

Mar 12 '24 16:03 antoniovilela

Pull request has been put on hold by @antoniovilela They need to issue an unhold command to remove the hold state or L1 can unhold it for all

Mar 12 '24 16:03 cmsbuild

I will add tf_cuda_support tool and will redo this PR

Mar 12 '24 16:03 smuzaffar

I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.

In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs ?

I have a solution for this, @smuzaffar do you want me to push it here so that it can be included in the new PR? Thanks

Mar 12 '24 17:03 valsdav

@valsdav , I have opened https://github.com/cms-sw/cmsdist/pull/9066 which adds the suggested new tool tf_cuda_support . Feel free to update this PR with

Changes needed for In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs
Move TF GPU unit tests in <iftool name="tf_cuda_support"> </iftool> block

Mar 12 '24 22:03 smuzaffar

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39443

This PR adds an extra 16KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

code-format: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39443/code-format.patch e.g. curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39443/code-format.patch | patch -p1 You can also run scram build code-format to apply code format directly

Mar 13 '24 01:03 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39446

This PR adds an extra 16KB to repository

Mar 13 '24 07:03 cmsbuild

Pull request #44376 was updated. @wpmccormack, @cmsbuild, @valsdav can you please check and sign again.

Mar 13 '24 07:03 cmsbuild

please test with https://github.com/cms-sw/cmsdist/pull/9066

Mar 13 '24 08:03 valsdav

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38092/summary.html COMMIT: b9b41e4c8166dca668644f4aef63429fb9c4aebf CMSSW: CMSSW_14_1_X_2024-03-12-2300/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44376/38092/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

@cms-sw cms-sw/cmsdist#9066

You can see more details here: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38092/git-recent-commits.json https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38092/git-merge-result

Comparison Summary

Summary:

You potentially removed 89 lines from the logs
Reco comparison results: 58 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3297383
DQMHistoTests: Total failures: 9
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3297354
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: no differences found

Mar 13 '24 12:03 cmsbuild

enable gpu

Mar 13 '24 15:03 smuzaffar

please test

Mar 13 '24 15:03 smuzaffar

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38108/summary.html COMMIT: b9b41e4c8166dca668644f4aef63429fb9c4aebf CMSSW: CMSSW_14_1_X_2024-03-13-1100/el8_amd64_gcc12 Additional Tests: GPU User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44376/38108/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially added 1 lines to the logs
Reco comparison results: 50 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3297383
DQMHistoTests: Total failures: 6
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3297357
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

You potentially removed 8 lines from the logs
Reco comparison results: 35 differences found in the comparisons
DQMHistoTests: Total files compared: 3
DQMHistoTests: Total histograms compared: 39740
DQMHistoTests: Total failures: 1188
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 38552
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
Checked 8 log files, 10 edm output root files, 3 DQM output files
TriggerResults: no differences found

Mar 13 '24 18:03 cmsbuild

@cms-sw/ml-l2 , can you please review this? @cms-sw/orp-l2 , this now has all the required changes. you can unhold it now

Mar 18 '24 11:03 smuzaffar

+1

Mar 18 '24 12:03 valsdav

@valsdav , can you please backport it for 14.0.X ?

Mar 18 '24 16:03 smuzaffar

unhold

Mar 18 '24 18:03 antoniovilela

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

Mar 18 '24 18:03 cmsbuild

+1

Mar 18 '24 18:03 antoniovilela

cmssw
cmssw copied to clipboard

Improve handling of TF CUDA tests for 14_1_X

PR description:

Comparison Summary

Comparison Summary

GPU Comparison Summary

cmssw cmssw copied to clipboard

Improve handling of TF CUDA tests for 14_1_X

PR description:

Comparison Summary

Comparison Summary

GPU Comparison Summary

cmssw
cmssw copied to clipboard