Add Nvidia GPU Time-slicing support
Issue number:
Closes #
Description of changes: NVIDIA GPUs support time-slicing, which lets users share a GPU among a larger number of workloads by dividing the GPU's time into slices. Each workload gets a turn to use the GPU's resources within its allocated time slice. This is similar to how a CPU time-slices between processes, and it helps keep the GPU from sitting idle. This PR contains the changes required for Bottlerocket to enable time-slicing for Kubernetes.
This PR introduces two Bottlerocket API settings:
| Bottlerocket Setting | Impact | Value | What it means |
|---|---|---|---|
| `settings.kubernetes.nvidia.device-plugin.max-sharing-per-gpu` | Sets the `replicas` setting of the device plugin for the time-sliced resources | integer, default: `0` | Time-slicing is enabled when the value is greater than `0`. |
| `settings.kubernetes.nvidia.device-plugin.rename-shared-gpu` | Sets the `renameByDefault` setting of the device plugin for the time-sliced resources | `true` or `false`, default: `false` | When set to `false`, the shared GPU's resource name is unchanged. When set to `true`, `.shared` is appended to the resource name; for example, `nvidia.com/gpu` becomes `nvidia.com/gpu.shared`. |
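
To make the mapping in the table concrete, here is a minimal Python sketch of how the two settings could translate into the `sharing` section of the device plugin configuration. It is an illustration only, not the template used in this PR: the helper name `render_sharing` is hypothetical, and omitting the section when the value is `0` is an assumption based on the "greater than 0" rule above.

```python
def render_sharing(max_sharing_per_gpu: int = 0, rename_shared_gpu: bool = False) -> dict:
    """Hypothetical helper: build the `sharing` section of the plugin config."""
    if max_sharing_per_gpu <= 0:
        # Assumption: with the default of 0, time-slicing stays disabled
        # and no `sharing` section is emitted at all.
        return {}
    return {
        "sharing": {
            "timeSlicing": {
                # rename-shared-gpu maps to renameByDefault
                "renameByDefault": rename_shared_gpu,
                "resources": [
                    # max-sharing-per-gpu maps to replicas
                    {"name": "nvidia.com/gpu", "replicas": max_sharing_per_gpu},
                ],
            }
        }
    }


if __name__ == "__main__":
    print(render_sharing())          # defaults: time-slicing disabled
    print(render_sharing(10, True))  # sharing enabled with renaming
```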
Testing done:
bash-5.1# apiclient set settings.kubernetes.nvidia.device-plugin.max-sharing-per-gpu=10
[root@admin]# cat .bottlerocket/rootfs/etc/nvidia-k8s-device-plugin/settings.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  plugin:
    passDeviceSpecs: true
    deviceListStrategy: "volume-mounts"
    deviceIDStrategy: "index"
sharing:
  timeSlicing:
    renameByDefault: true
    resources:
    - name: "nvidia.com/gpu"
      replicas: 10
$ kubectl describe node ip-192-168-68-216.us-west-2.compute.internal
Name: ip-192-168-68-216.us-west-2.compute.internal
...
Capacity:
  cpu: 8
  ephemeral-storage: 18366Mi
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 32458088Ki
  nvidia.com/gpu.shared: 10
  pods: 58
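
As a sanity check on the capacity shown above, the following Python sketch computes the resource name and count a node would be expected to advertise. It assumes the device plugin advertises `replicas` slots per physical GPU and that the test node has a single physical GPU; the GPU count is not stated in this PR, and the function name `advertised_gpu_resource` is hypothetical.

```python
def advertised_gpu_resource(physical_gpus: int,
                            max_sharing_per_gpu: int,
                            rename_shared_gpu: bool) -> tuple[str, int]:
    """Hypothetical helper: expected (resource name, capacity) for a node."""
    base = "nvidia.com/gpu"
    if max_sharing_per_gpu <= 0:
        # Time-slicing disabled: one resource per physical GPU, name unchanged.
        return base, physical_gpus
    # Renaming appends ".shared" to the resource name.
    name = base + ".shared" if rename_shared_gpu else base
    # Assumption: each physical GPU is advertised `replicas` times.
    return name, physical_gpus * max_sharing_per_gpu


if __name__ == "__main__":
    # Assumed: 1 physical GPU, with the settings from the test above.
    print(advertised_gpu_resource(1, 10, True))  # ('nvidia.com/gpu.shared', 10)
```

Under these assumptions the result matches the `nvidia.com/gpu.shared: 10` entry in the `kubectl describe node` output.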
Note: Migration test is still in progress. I will update once the test is complete.
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.