Nvidia container-runtime API for GPU allocation
Co-authored-by: Monirul Islam
Revives: https://github.com/bottlerocket-os/bottlerocket/pull/3994
Description of changes:
This PR exposes two new APIs that allow customers to configure the values of `accept-nvidia-visible-devices-as-volume-mounts` and `accept-nvidia-visible-devices-envvar-when-unprivileged` for the NVIDIA container runtime.

We introduced a default behavior of injecting NVIDIA GPUs using volume mounts (https://github.com/bottlerocket-os/bottlerocket/pull/3718). This PR lets users opt in to the previous behavior, in which unprivileged pods have access to all GPUs when `NVIDIA_VISIBLE_DEVICES=all` is set, and makes both behaviors configurable.
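For context, the new settings could also be applied at launch through Bottlerocket user data; the following is a minimal sketch in the usual user-data TOML form, assuming the opt-in (non-default) values described in the table below:

```toml
# Sketch: opting in to the previous GPU-injection behavior via user data.
# The setting names are the ones introduced in this PR; the values shown
# are the non-default, opt-in values.
[settings.kubernetes.nvidia.container-runtime]
visible-devices-as-volume-mounts = false
visible-devices-envvar-when-unprivileged = true
```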
| Bottlerocket setting | Impact | Value | What it means |
|---|---|---|---|
| `settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts` | Sets the `accept-nvidia-visible-devices-as-volume-mounts` value for the k8s container toolkit | `true` / `false` (default: `true`) | Adjusting `visible-devices-as-volume-mounts` changes how GPUs are detected and integrated into container environments. Setting it to `true` makes the NVIDIA runtime recognize GPU devices listed in the `NVIDIA_VISIBLE_DEVICES` environment variable and mount them as volumes, which lets applications inside the container interact with the GPUs as if they were local resources. |
| `settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged` | Sets the `accept-nvidia-visible-devices-envvar-when-unprivileged` value of the NVIDIA container runtime for k8s variants | `true` / `false` (default: `false`) | When this setting is `false`, unprivileged containers are prevented from accessing all GPU devices on the host by default. If `NVIDIA_VISIBLE_DEVICES=all` is set in the container image and `visible-devices-envvar-when-unprivileged` is `true`, all GPUs on the host become accessible to the container, regardless of the limits set via `nvidia.com/gpu`. This can lead to more GPUs being allocated to a pod than intended, which affects resource scheduling and isolation. |
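These two settings map directly onto keys in the NVIDIA container runtime configuration. As a rough illustration, the relevant fragment of `/etc/nvidia-container-runtime/config.toml` with the default values looks like this (the full rendered file appears in the testing section below):

```toml
# Fragment of the rendered runtime config with the default setting values.
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false
```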
Testing done:
- [x] Functional Test
  - Built an AMI for the NVIDIA variant. Verified the settings get picked up with the default values:
```
$ apiclient get settings.kubernetes.nvidia.container-runtime
{
  "settings": {
    "kubernetes": {
      "nvidia": {
        "container-runtime": {
          "visible-devices-as-volume-mounts": true,
          "visible-devices-envvar-when-unprivileged": false
        }
      }
    }
  }
}
```
  - Opted in to the previous behavior to allow unprivileged NVIDIA device access:
```
$ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-as-volume-mounts=false
$ apiclient set settings.kubernetes.nvidia.container-runtime.visible-devices-envvar-when-unprivileged=true
$ apiclient get settings.kubernetes.nvidia.container-runtime
{
  "settings": {
    "kubernetes": {
      "nvidia": {
        "container-runtime": {
          "visible-devices-as-volume-mounts": false,
          "visible-devices-envvar-when-unprivileged": true
        }
      }
    }
  }
}
```
  - Verified the `nvidia-container-runtime` config exists:
```
$ cat /etc/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"
```
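For comparison, with the opt-in values applied via the apiclient commands above, the same two keys would be expected to render with flipped values; this is a sketch based on the setting-to-key mapping, not captured output:

```toml
# Expected fragment after opting in (sketch, not captured from a node).
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
```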
- [x] Migration Test: Tested migration from 1.20.1 to the new version, and migration back to 1.20.1.
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.