cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

Improve handling of TF CUDA tests for 14_1_X

Open valsdav opened this issue 11 months ago • 22 comments

PR description:

This PR improves the handling of CUDA unit tests for the TensorFlow package, using a new tf_cuda_support tool from scram , which checks if the GPU support is enabled in TensorFlow compilation.

The PR also makes the TF cuda tests more strict by checking explicitely if a CUDA device is visible to TF and not only to cmssw.

The test testTFVisibleDevicesCUDA is in fact run by the framework as a CUDA device is registered, but then TF does not recognize the device and the test fails. The other testTF*CUDA tests were passing silently using the CPU to run the test. After this PR all the TF sessions using tf::backend::cuda , but not finding a GPU will fail explicitly.

  • This PR is needed to continue the integration of CUDA 12.4 in https://github.com/cms-sw/cmsdist/pull/9046.
  • This PR needs https://github.com/cms-sw/cmsdist/pull/9066 (thanks to @smuzaffar https://github.com/cms-sw/cmssw/pull/44376#issuecomment-1992693414)

valsdav avatar Mar 12 '24 11:03 valsdav

cms-bot internal usage

cmsbuild avatar Mar 12 '24 11:03 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39431

  • This PR adds an extra 12KB to repository

cmsbuild avatar Mar 12 '24 11:03 cmsbuild

A new Pull Request was created by @valsdav for master.

It involves the following packages:

  • PhysicsTools/TensorFlow (ml)

@cmsbuild, @valsdav, @wpmccormack can you please review it and eventually sign? Thanks. @makortel, @riga this is something you requested to watch as well. @rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

cmsbuild avatar Mar 12 '24 11:03 cmsbuild

@valsdav , any idea why only testTFVisibleDevicesCUDA fails for GPU IBs (https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_14_1_GPU_X_2024-03-11-2300/unitTestLogs/PhysicsTools/TensorFlow#/89-89 ) ?

smuzaffar avatar Mar 12 '24 11:03 smuzaffar

@valsdav , I was thinking to add tf_cuda_support toolfile so that we can have something like

<iftool name="tf_cuda_support">
  all tests which requires TF to be build with GPU support
</iftool>

smuzaffar avatar Mar 12 '24 11:03 smuzaffar

@valsdav , any idea why only testTFVisibleDevicesCUDA fails for GPU IBs (https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_14_1_GPU_X_2024-03-11-2300/unitTestLogs/PhysicsTools/TensorFlow#/89-89 ) ?

Hi @smuzaffar I realized this morning testing locally that only that test is failing because it is the only one explicitly checking the list of devices visible to TF.

The other tests that are run when a CUDA device is visible to the cmssw framework are silently fall-backing to CPU running in TF. I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.

valsdav avatar Mar 12 '24 12:03 valsdav

all tests which requires TF to be build with GPU support

That's a good idea! Thanks!

valsdav avatar Mar 12 '24 12:03 valsdav

The other testTF*CUDA tests are instead silently using the CPU to run the test.

I think this part should be fixed: if we want the tests to run with CUDA, they should not fall back to CPU.

fwyzard avatar Mar 12 '24 13:03 fwyzard

I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.

In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs ?

fwyzard avatar Mar 12 '24 13:03 fwyzard

hold

  • As discussed at ORP.

antoniovilela avatar Mar 12 '24 16:03 antoniovilela

Pull request has been put on hold by @antoniovilela They need to issue an unhold command to remove the hold state or L1 can unhold it for all

cmsbuild avatar Mar 12 '24 16:03 cmsbuild

I will add tf_cuda_support tool and will redo this PR

smuzaffar avatar Mar 12 '24 16:03 smuzaffar

I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.

In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs ?

I have a solution for this, @smuzaffar do you want me to push it here so that it can be included in the new PR? Thanks

valsdav avatar Mar 12 '24 17:03 valsdav

@valsdav , I have opened https://github.com/cms-sw/cmsdist/pull/9066 which adds the suggested new tool tf_cuda_support . Feel free to update this PR with

  • Changes needed for In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs
  • Move TF GPU unit tests in <iftool name="tf_cuda_support"> </iftool> block

smuzaffar avatar Mar 12 '24 22:03 smuzaffar

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39443

  • This PR adds an extra 16KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

  • code-format: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39443/code-format.patch e.g. curl -k https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39443/code-format.patch | patch -p1 You can also run scram build code-format to apply code format directly

cmsbuild avatar Mar 13 '24 01:03 cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39446

  • This PR adds an extra 16KB to repository

cmsbuild avatar Mar 13 '24 07:03 cmsbuild

Pull request #44376 was updated. @wpmccormack, @cmsbuild, @valsdav can you please check and sign again.

cmsbuild avatar Mar 13 '24 07:03 cmsbuild

please test with https://github.com/cms-sw/cmsdist/pull/9066

valsdav avatar Mar 13 '24 08:03 valsdav

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38092/summary.html COMMIT: b9b41e4c8166dca668644f4aef63429fb9c4aebf CMSSW: CMSSW_14_1_X_2024-03-12-2300/el8_amd64_gcc12 User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44376/38092/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

  • @cms-sw cms-sw/cmsdist#9066

You can see more details here: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38092/git-recent-commits.json https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38092/git-merge-result

Comparison Summary

Summary:

cmsbuild avatar Mar 13 '24 12:03 cmsbuild

enable gpu

smuzaffar avatar Mar 13 '24 15:03 smuzaffar

please test

smuzaffar avatar Mar 13 '24 15:03 smuzaffar

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38108/summary.html COMMIT: b9b41e4c8166dca668644f4aef63429fb9c4aebf CMSSW: CMSSW_14_1_X_2024-03-13-1100/el8_amd64_gcc12 Additional Tests: GPU User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44376/38108/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 1 lines to the logs
  • Reco comparison results: 50 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3297383
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3297357
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

cmsbuild avatar Mar 13 '24 18:03 cmsbuild

@cms-sw/ml-l2 , can you please review this? @cms-sw/orp-l2 , this now has all the required changes. you can unhold it now

smuzaffar avatar Mar 18 '24 11:03 smuzaffar

+1

valsdav avatar Mar 18 '24 12:03 valsdav

@valsdav , can you please backport it for 14.0.X ?

smuzaffar avatar Mar 18 '24 16:03 smuzaffar

unhold

antoniovilela avatar Mar 18 '24 18:03 antoniovilela

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

cmsbuild avatar Mar 18 '24 18:03 cmsbuild

+1

antoniovilela avatar Mar 18 '24 18:03 antoniovilela