WMCore
Change how CUDA runtime and capabilities are defined in the task and Condor
Fixes #11595
Status
In development
Description
The following is provided in this PR:
- the WMTask class now aggregates multiple GPU requirements within the same task, such as:
  - `GPUMemoryMB`: use the largest value among the steps within a task
  - `CUDARuntime`: changed from a simple string to a list of strings with the union of the CUDA runtime versions
  - `CUDACapabilities`: still a list of strings, but now with the union of the CUDA capability versions
- BossAir plugin method to:
  - sort the list of string versions, from left to right (see `cudaCapabilityToSingleVersion`)
  - select the smallest CUDA capability version required by the job (version string with dot-notation)
  - convert the dot-notation version to a simple integer with the formula `(1000 * major + 10 * medium + minor)`, where `1.2.3` would be major=1, medium=2, minor=3. See [1] for further context.
- [GlideinWMS] Refactor the HTCondor `CUDACapability` classad to a single integer (actually represented as a string). For instance, it changes from `"1.2,3.2,1.4"` to `"1020"`. Matchmaking can then be done with something like `Node CUDACapability >= Job CUDACapability`, after converting the capability version to a single integer to be compatible with these changes.
- [GlideinWMS] Refactor the HTCondor `CUDARuntime` classad to be a comma-separated list of CUDA runtimes. For instance, it changes from `"1.2"` to `"1.2,3.2"`. Matchmaking can then be done with something like `stringListSubsetMatch(job_list_cudaruntime, node_list_cudaruntime)`. NOTE: this needs an HTCondor upgrade to >= 10.0.6.
- Create a new HTCondor classad `OriginalCUDACapability` with the actual comma-separated list of CUDA capabilities
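For reviewers, the aggregation and version-conversion rules listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the function names (`aggregate_gpu_requirements`, `capability_to_int`, `smallest_capability_as_int`) and the per-step dictionary layout are hypothetical, and versions are compared numerically component by component, which may differ from how `cudaCapabilityToSingleVersion` actually sorts.

```python
def version_key(version):
    """Turn a dot-notation version such as "1.2.3" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def aggregate_gpu_requirements(steps):
    """Merge per-step GPU requirements into task-level values.

    ASSUMPTION: each step is a dict with GPUMemoryMB (int), CUDARuntime (str)
    and CUDACapabilities (list of str). The real WMTask structures may differ.
    """
    memory = 0
    runtimes = set()
    capabilities = set()
    for step in steps:
        memory = max(memory, step.get("GPUMemoryMB", 0))  # largest value wins
        if step.get("CUDARuntime"):
            runtimes.add(step["CUDARuntime"])  # union of runtime versions
        capabilities.update(step.get("CUDACapabilities", []))  # union of capabilities
    return {
        "GPUMemoryMB": memory,
        "CUDARuntime": sorted(runtimes, key=version_key),
        "CUDACapabilities": sorted(capabilities, key=version_key),
    }


def capability_to_int(version):
    """Apply 1000 * major + 10 * medium + minor; missing parts count as 0."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))
    major, medium, minor = parts[:3]
    return 1000 * major + 10 * medium + minor


def smallest_capability_as_int(capabilities):
    """Pick the smallest capability version and convert it to an integer."""
    return capability_to_int(min(capabilities, key=version_key))
```

With the example from the description, `smallest_capability_as_int(["1.2", "3.2", "1.4"])` selects `1.2` and yields `1020`, which would then be stored as the string `"1020"` in the classad.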
Is it backward compatible (if not, which system it affects?)
NO (in the sense that job matchmaking will have to be updated)
Related PRs
Complement to https://github.com/dmwm/WMCore/pull/11588 such that hybrid GPU workflows can be supported.
External dependencies / deployment changes
Submission Infrastructure needs to update HTCondor to >= 10.0.6 and GlideinWMS needs to update the job matchmaking expression for CUDARuntime and CUDACapabilities.
SI ticket: https://its.cern.ch/jira/browse/CMSSI-79
[1] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART____VERSION.html
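To make the intended matchmaking semantics concrete, here is a small Python emulation of the two checks mentioned in the description. It is a sketch only and does not call HTCondor: `string_list_subset_match` and `capability_matches` are hypothetical helpers, and the emulation assumes that `stringListSubsetMatch` is true when every item of the first list also appears in the second, with a plain comma as delimiter.

```python
def split_list(value, delimiter=","):
    """Split a comma-separated classad-style string into a set of items."""
    return {item.strip() for item in value.split(delimiter) if item.strip()}


def string_list_subset_match(list1, list2, delimiter=","):
    """ASSUMED semantics: true when all items of list1 are also in list2."""
    return split_list(list1, delimiter) <= split_list(list2, delimiter)


def capability_matches(node_capability, job_capability):
    """Emulate `Node CUDACapability >= Job CUDACapability`.

    Both values are the string-represented integers described in this PR,
    so they are converted to int before comparing.
    """
    return int(node_capability) >= int(job_capability)
```

For example, `capability_matches("7050", "1020")` is true, while `string_list_subset_match("1.2,3.2", "1.2")` is false because `3.2` is missing from the second list.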
Jenkins results:
- Python3 Unit tests: succeeded
- 1 tests no longer failing
- 2 tests added
- 1 changes in unstable tests
- Python3 Pylint check: failed
- 8 warnings and errors that must be fixed
- 4 warnings
- 79 comments to review
- Pylint py3k check: failed
- 2 errors and warnings that should be fixed
- Pycodestyle check: succeeded
- 38 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14420/artifact/artifacts/PullRequestReport.html
Jenkins results:
- Python3 Unit tests: failed
- 2 new failures
- 1 tests no longer failing
- 2 tests added
- Python3 Pylint check: failed
- 28 warnings and errors that must be fixed
- 7 warnings
- 229 comments to review
- Pylint py3k check: failed
- 2 errors and warnings that should be fixed
- Pycodestyle check: succeeded
- 84 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14422/artifact/artifacts/PullRequestReport.html
Jenkins results:
- Python3 Unit tests: failed
- 1 new failures
- 1 tests no longer failing
- 2 tests added
- 1 changes in unstable tests
- Python3 Pylint check: failed
- 29 warnings and errors that must be fixed
- 7 warnings
- 259 comments to review
- Pylint py3k check: failed
- 2 errors and warnings that should be fixed
- Pycodestyle check: succeeded
- 98 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14423/artifact/artifacts/PullRequestReport.html
In order to perform full testing of these changes, we need to have changes at the SI level (see initial description). Nonetheless, I'd appreciate any feedback and review.
Jenkins results:
- Python3 Unit tests: succeeded
- 1 tests no longer failing
- 1 tests added
- Python3 Pylint check: failed
- 29 warnings and errors that must be fixed
- 7 warnings
- 238 comments to review
- Pylint py3k check: succeeded
- Pycodestyle check: succeeded
- 84 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14427/artifact/artifacts/PullRequestReport.html
Jenkins results:
- Python3 Unit tests: failed
- 1 new failures
- 1 tests added
- 2 changes in unstable tests
- Python3 Pylint check: failed
- 29 warnings and errors that must be fixed
- 7 warnings
- 238 comments to review
- Pylint py3k check: succeeded
- Pycodestyle check: succeeded
- 84 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14428/artifact/artifacts/PullRequestReport.html
@todor-ivanov the last 2 commits provide changes based on your review. Please have another look.
@mapellidario @belforte Hi Dario, Stefano, these are the latest changes that we are trying to commission and deploy in production for GPU jobs. I just updated the initial description, but please let me know if you have any questions.
Note that this is not yet in production and we need to discuss/plan the required changes at the SI layer.
Thanks Alan. @novicecpp will be back on Aug 24 and will be able to look at integrating these changes in CRAB as well. It would be nice to have some examples to test, and of course what we dearly miss is users!
test this please
Jenkins results:
- Python3 Unit tests: failed
- 1 new failures
- 1 tests added
- 1 changes in unstable tests
- Python3 Pylint check: failed
- 35 warnings and errors that must be fixed
- 7 warnings
- 245 comments to review
- Pylint py3k check: succeeded
- Pycodestyle check: succeeded
- 93 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14583/artifact/artifacts/PullRequestReport.html
Can one of the admins verify this patch?