
Change how CUDA runtime and capabilities are defined in the task and Condor

Open amaltaro opened this issue 1 year ago • 12 comments

Fixes #11595

Status

In development

Description

The following is provided in this PR:

  • the WMTask class now aggregates the GPU requirements of all steps within the same task:
    • GPUMemoryMB: use the largest value among the steps within a task
    • CUDARuntime: changed from a simple string to a list of strings holding the union of the CUDA runtime versions
    • CUDACapabilities: still a list of strings, but now holding the union of the CUDA capability versions
  • BossAir plugin method to:
    • sort the list of version strings, comparing them from left to right (see cudaCapabilityToSingleVersion)
    • select the smallest CUDA capability version required by the job (version string in dot-notation)
    • convert the dot-notation version to a single integer with the formula (1000 * major + 10 * medium + minor), where 1.2.3 would be major=1, medium=2, minor=3. See [1] for further context.
  • [GlideinWMS] Refactor the HTCondor CUDACapability classad to a single integer (represented as a string). For instance, it changes from "1.2,3.2,1.4" to "1020". Matchmaking can then be done with something like: Node CUDACapability >= Job CUDACapability, after converting the node capability version to a single integer to be compatible with these changes.
  • [GlideinWMS] Refactor the HTCondor CUDARuntime classad to be a comma-separated list of CUDA runtimes. For instance, it changes from "1.2" to "1.2,3.2". Matchmaking can then be done with something like: stringListSubsetMatch(job_list_cudaruntime, node_list_cudaruntime). NOTE: this requires an HTCondor upgrade to >= 10.0.6.
  • Create a new HTCondor classad OriginalCUDACapability with the actual comma-separated list of CUDA capabilities
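
The aggregation and version conversion described above can be sketched roughly as follows. This is only an illustration and not the code in this PR: the function names and the shape of the per-step dictionaries are hypothetical, and missing version components are assumed to default to 0.

```python
def version_key(version):
    """Split a dot-notation version into integers so it compares left to right."""
    return [int(part) for part in version.split(".")]


def capability_to_int(version):
    """Apply 1000 * major + 10 * medium + minor; missing parts are assumed to be 0."""
    parts = version_key(version)
    parts += [0] * (3 - len(parts))
    major, medium, minor = parts[:3]
    return 1000 * major + 10 * medium + minor


def smallest_capability(capabilities):
    """Return the smallest version string of a list such as ["1.2", "3.2", "1.4"]."""
    return min(capabilities, key=version_key)


def aggregate_gpu_requirements(steps):
    """Merge per-step GPU requirements (hypothetical dicts) into task-level values."""
    runtimes = {rt for step in steps for rt in step.get("CUDARuntime", [])}
    capabilities = {cap for step in steps for cap in step.get("CUDACapabilities", [])}
    return {
        # largest memory requirement among the steps
        "GPUMemoryMB": max(step.get("GPUMemoryMB", 0) for step in steps),
        # union of the versions, sorted numerically
        "CUDARuntime": sorted(runtimes, key=version_key),
        "CUDACapabilities": sorted(capabilities, key=version_key),
    }


steps = [
    {"GPUMemoryMB": 4000, "CUDARuntime": ["1.2"], "CUDACapabilities": ["1.2", "3.2"]},
    {"GPUMemoryMB": 8000, "CUDARuntime": ["3.2"], "CUDACapabilities": ["1.4"]},
]
task = aggregate_gpu_requirements(steps)
job_capability = str(capability_to_int(smallest_capability(task["CUDACapabilities"])))
```

With the example values from the description, the capabilities "1.2,3.2,1.4" reduce to the smallest version "1.2", which the formula maps to "1020".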

Is it backward compatible (if not, which system it affects?)

NO (in the sense that job matchmaking will have to be updated)

Related PRs

Complement to https://github.com/dmwm/WMCore/pull/11588 such that hybrid GPU workflows can be supported.

External dependencies / deployment changes

Submission Infrastructure needs to update HTCondor to >= 10.0.6 and GlideinWMS needs to update the job matchmaking expression for CUDARuntime and CUDACapabilities.
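
For reviewers who want to reason about the new matchmaking without a pool at hand, the two conditions can be emulated in plain Python. This is a hedged sketch, not the GlideinWMS expression itself: the helper names are hypothetical, and it assumes that stringListSubsetMatch is true when every entry of the first list also appears in the second.

```python
def string_list_subset_match(job_list, node_list):
    """Emulate the assumed stringListSubsetMatch semantics on comma-separated strings."""
    job_items = {item.strip() for item in job_list.split(",") if item.strip()}
    node_items = {item.strip() for item in node_list.split(",") if item.strip()}
    return job_items <= node_items


def node_matches_job(job_capability, node_capability, job_runtimes, node_runtimes):
    """Both classad conditions from the description must hold for a match."""
    # capabilities are single integers stored as strings, e.g. "1020"
    capability_ok = int(node_capability) >= int(job_capability)
    runtime_ok = string_list_subset_match(job_runtimes, node_runtimes)
    return capability_ok and runtime_ok


# A job needing capability "1020" and runtimes "1.2,3.2" against two hypothetical nodes
good_node = node_matches_job("1020", "3020", "1.2,3.2", "1.2, 3.2, 4.0")
bad_node = node_matches_job("1020", "3020", "1.2,3.2", "1.2")
```

Here the second node is rejected because it does not advertise runtime "3.2", even though its capability is high enough.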

SI ticket: https://its.cern.ch/jira/browse/CMSSI-79

[1] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART____VERSION.html

amaltaro avatar Aug 15 '23 15:08 amaltaro

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 4 warnings
    • 79 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 38 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14420/artifact/artifacts/PullRequestReport.html

cmsdmwmbot avatar Aug 15 '23 15:08 cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 2 tests added
  • Python3 Pylint check: failed
    • 28 warnings and errors that must be fixed
    • 7 warnings
    • 229 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 84 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14422/artifact/artifacts/PullRequestReport.html

cmsdmwmbot avatar Aug 16 '23 10:08 cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 29 warnings and errors that must be fixed
    • 7 warnings
    • 259 comments to review
  • Pylint py3k check: failed
    • 2 errors and warnings that should be fixed
  • Pycodestyle check: succeeded
    • 98 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14423/artifact/artifacts/PullRequestReport.html

cmsdmwmbot avatar Aug 16 '23 15:08 cmsdmwmbot

In order to perform full testing of these changes, we need to have changes at the SI level (see initial description). Nonetheless, I'd appreciate any feedback and review.

amaltaro avatar Aug 16 '23 19:08 amaltaro

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 tests added
  • Python3 Pylint check: failed
    • 29 warnings and errors that must be fixed
    • 7 warnings
    • 238 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 84 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14427/artifact/artifacts/PullRequestReport.html

cmsdmwmbot avatar Aug 17 '23 21:08 cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 29 warnings and errors that must be fixed
    • 7 warnings
    • 238 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 84 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14428/artifact/artifacts/PullRequestReport.html

cmsdmwmbot avatar Aug 17 '23 21:08 cmsdmwmbot

@todor-ivanov the last 2 commits provide changes based on your review. Please have another look.

amaltaro avatar Aug 17 '23 21:08 amaltaro

@mapellidario @belforte Hi Dario, Stefano, these are the latest changes that we are trying to commission and deploy in production for GPU jobs. I just updated the initial description, but please let me know if you have any questions.

Note that this is not yet in production and we need to discuss/plan the required changes at the SI layer.

amaltaro avatar Aug 21 '23 13:08 amaltaro

Thanks Alan, @novicecpp will be back on Aug 24 and will be able to look at integrating these changes in CRAB as well. It would be nice to have some examples to test, and of course what we dearly miss is users!

belforte avatar Aug 21 '23 14:08 belforte

test this please

amaltaro avatar Oct 26 '23 09:10 amaltaro

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 35 warnings and errors that must be fixed
    • 7 warnings
    • 245 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 93 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14583/artifact/artifacts/PullRequestReport.html

cmsdmwmbot avatar Oct 26 '23 09:10 cmsdmwmbot

Can one of the admins verify this patch?

cmsdmwmbot avatar Sep 30 '24 20:09 cmsdmwmbot