Add API for CDI --devices flag in Docker and Podman for mapping GPUs

Open lukeogg opened this issue 2 years ago • 5 comments

This PR adds support for passing GPU parameters to the NVIDIA Container Toolkit through the Container Device Interface (CDI) specification. All GPUs can already be mapped to a single node with device mounts, but finer granularity is desired. With this PR, different combinations of devices can be mapped to different nodes as needed.

Would resolve Issue 3164

  • Adds API for devices with a list of strings in the format specified in the CDI specification.
  devices:
  - "nvidia.com/gpu=0"
  - "nvidia.com/gpu=1"
  • Uses the CDI validation package to validate values passed in the devices API.
  • Supports both Docker and Podman.
  • Relies on the upstream PR for Docker being merged and released, presumably in Docker v25.
  • Checks the Docker version if devices have been specified.
  • Adds documentation for mapping GPUs.
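To make the expected device format concrete, here is a minimal sketch of the shape check in Go. It is not the CDI validation package the PR uses: `parseDevice` and `validateDevices` are hypothetical helpers that only check the overall `vendor/class=name` shape, whereas the real package also enforces the character rules from the specification.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDevice splits a CDI-style device string of the form
// "vendor/class=name" into its three parts. Hypothetical, simplified
// stand-in for the CDI validation package: it checks shape only.
func parseDevice(s string) (vendor, class, name string, err error) {
	kind, rest, ok := strings.Cut(s, "=")
	if !ok || kind == "" || rest == "" {
		return "", "", "", fmt.Errorf("device %q: want vendor/class=name", s)
	}
	v, c, ok := strings.Cut(kind, "/")
	if !ok || v == "" || c == "" {
		return "", "", "", fmt.Errorf("device %q: kind %q must be vendor/class", s, kind)
	}
	return v, c, rest, nil
}

// validateDevices returns the first shape error found in the list, if any.
func validateDevices(devices []string) error {
	for _, d := range devices {
		if _, _, _, err := parseDevice(d); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	devices := []string{"nvidia.com/gpu=0", "nvidia.com/gpu=1", "gpu0"}
	for _, d := range devices {
		vendor, class, name, err := parseDevice(d)
		if err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Printf("vendor=%s class=%s name=%s\n", vendor, class, name)
	}
	fmt.Println("list valid:", validateDevices(devices) == nil)
}
```

Rejecting malformed entries when the config is loaded, rather than when the container runtime is invoked, would let a bad `devices` entry fail with an error that points at the config.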

All GPUs mapped to a single control-plane node:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  devices:
  - "nvidia.com/gpu=all"

Specific GPUs mapped to specific worker nodes based on index:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  devices:
  - "nvidia.com/gpu=0"
- role: worker
  devices:
  - "nvidia.com/gpu=1"
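As a rough illustration of how a node's `devices` list might reach the container runtime, the sketch below turns each entry into a `--device` argument and applies the Docker version gate described in the PR. It is not the PR's code: `deviceArgs`, `majorVersion`, and `minDockerMajor` are hypothetical names, the minimum of 25 comes only from the PR's "presumably in Docker v25", and it assumes CDI names are passed through the runtime's `--device` flag. Podman would need its own handling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minDockerMajor is an assumption based on the PR text ("presumably in
// Docker v25"); the real minimum depends on the upstream release.
const minDockerMajor = 25

// majorVersion extracts the leading integer from a version such as "25.0.3".
func majorVersion(v string) (int, error) {
	head, _, _ := strings.Cut(strings.TrimPrefix(v, "v"), ".")
	return strconv.Atoi(head)
}

// deviceArgs returns a "--device <name>" pair for each requested device.
// The version is checked only when devices are specified, so configs
// without devices keep working on older Docker releases.
func deviceArgs(dockerVersion string, devices []string) ([]string, error) {
	if len(devices) == 0 {
		return nil, nil
	}
	major, err := majorVersion(dockerVersion)
	if err != nil {
		return nil, fmt.Errorf("parsing docker version %q: %w", dockerVersion, err)
	}
	if major < minDockerMajor {
		return nil, fmt.Errorf("docker %s cannot map CDI devices; need v%d or later",
			dockerVersion, minDockerMajor)
	}
	args := make([]string, 0, 2*len(devices))
	for _, d := range devices {
		args = append(args, "--device", d)
	}
	return args, nil
}

func main() {
	// One worker from the example above, with a single GPU mapped by index.
	args, err := deviceArgs("25.0.3", []string{"nvidia.com/gpu=0"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(strings.Join(args, " "))
}
```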

lukeogg avatar Jun 28 '23 19:06 lukeogg

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lukeogg.

Once this PR has been reviewed and has the lgtm label, please assign bentheelder for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jun 28 '23 19:06 k8s-ci-robot

Hi @lukeogg. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 28 '23 19:06 k8s-ci-robot

@lukeogg: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kind-verify | d2a13ba3ee624ba73cca4de9e78d2bd122154778 | link | true | /test pull-kind-verify |
| pull-kind-conformance-parallel-dual-stack-ipv4-ipv6 | d2a13ba3ee624ba73cca4de9e78d2bd122154778 | link | true | /test pull-kind-conformance-parallel-dual-stack-ipv4-ipv6 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

k8s-ci-robot avatar Jul 17 '23 20:07 k8s-ci-robot

I will get back to this in the next week or so. Been out for a bit.

lukeogg avatar Aug 30 '23 16:08 lukeogg

PR needs rebase.

k8s-ci-robot avatar Apr 02 '24 06:04 k8s-ci-robot