GPU allocation does not respect NVLink
1. Quick Debug Information
- OS/Version: Ubuntu 22.04
- Kernel Version: 6.5.0-15.15~22.04.1
- Container Runtime: containerd 1.6.28-1
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): k8s 1.29.0
2. Issue or feature description
When a container requests 2 GPUs via nvidia.com/gpu: '2', I expect to get a pair connected with NVLink. Instead, the allocation appears to be completely random.
3. Information to attach (optional if deemed irrelevant)
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS NV12 SYS SYS SYS SYS SYS 0-31 0 N/A
GPU1 SYS X SYS NV12 SYS SYS SYS SYS 0-31 0 N/A
GPU2 NV12 SYS X SYS SYS SYS SYS SYS 0-31 0 N/A
GPU3 SYS NV12 SYS X SYS SYS SYS SYS 0-31 0 N/A
GPU4 SYS SYS SYS SYS X NV12 SYS SYS 32-63 1 N/A
GPU5 SYS SYS SYS SYS NV12 X SYS SYS 32-63 1 N/A
GPU6 SYS SYS SYS SYS SYS SYS X NV12 32-63 1 N/A
GPU7 SYS SYS SYS SYS SYS SYS NV12 X 32-63 1 N/A
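The matrix above shows four NVLink-connected pairs: (GPU0, GPU2), (GPU1, GPU3), (GPU4, GPU5), and (GPU6, GPU7). The behaviour I expected from a 2-GPU request can be sketched as follows (a hand-written illustration with the pairs hard-coded from the matrix, not the plugin's actual code):

```go
package main

import "fmt"

// nvlinkPairs encodes the NV12 links from the `nvidia-smi topo -m` output above.
// Hard-coded for illustration; the real allocator derives this from NVML.
var nvlinkPairs = [][2]int{{0, 2}, {1, 3}, {4, 5}, {6, 7}}

// pickLinkedPair returns an NVLink-connected pair whose members are both still
// available, or ok=false if no such pair remains.
func pickLinkedPair(available map[int]bool) (pair [2]int, ok bool) {
	for _, p := range nvlinkPairs {
		if available[p[0]] && available[p[1]] {
			return p, true
		}
	}
	return [2]int{}, false
}

func main() {
	// GPUs 0 and 1 are already taken; a 2-GPU request should still get a linked pair.
	available := map[int]bool{2: true, 3: true, 4: true, 5: true, 6: true, 7: true}
	if p, ok := pickLinkedPair(available); ok {
		fmt.Printf("allocate GPU%d and GPU%d\n", p[0], p[1]) // prints "allocate GPU4 and GPU5"
	}
}
```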
/etc/containerd/config.toml
version = 2
[metrics]
address = "0.0.0.0:9325"
grpc_histogram = false
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
disable_apparmor = true
sandbox_image = "docker.ops.iszn.cz/upstream/k8s/pause:3.9"
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
Device plugin log
I0220 07:22:07.549235 1 main.go:154] Starting FS watcher.
I0220 07:22:07.549526 1 main.go:161] Starting OS watcher.
I0220 07:22:07.549785 1 main.go:176] Starting Plugins.
I0220 07:22:07.549808 1 main.go:234] Loading configuration.
I0220 07:22:07.549898 1 main.go:242] Updating config with default resource matching patterns.
I0220 07:22:07.550077 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": true,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0220 07:22:07.550085 1 main.go:256] Retrieving plugins.
I0220 07:22:07.551481 1 factory.go:107] Detected NVML platform: found NVML library
I0220 07:22:07.551538 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0220 07:22:09.157899 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0220 07:22:09.160846 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0220 07:22:09.164269 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Additional information
I tried the sample from go-gpuallocator with NewBestEffortAllocator, and it works correctly: it allocates the right pairs (connected with NVLink).
So I dug deeper into k8s-nvidia-device-plugin and saw that the required parameter here came in empty. I assume that is correct, since I don't want a particular GPU; I just want any pair connected with NVLink. But when required is empty, this filters out nothing and fills requiredDevices with all of the GPUs on the machine. All of that is then passed to go-gpuallocator, which fails on this statement and returns an empty set of GPUs for allocation.
So is this a bug, or am I missing something?
Looking at the Filter() call that returns all devices when required is empty, this should only happen if required is nil:
// Filter filters out the selected devices from the list.
// If the supplied list of uuids is nil, no filtering is performed.
// Note that the specified uuids must exist in the list of devices.
func (d DeviceList) Filter(uuids []string) (DeviceList, error) {
if uuids == nil {
return d, nil
}
filtered := []*Device{}
for _, uuid := range uuids {
for _, device := range d {
if device.UUID == uuid {
filtered = append(filtered, device)
break
}
}
if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
return nil, fmt.Errorf("no device with uuid: %v", uuid)
}
}
return filtered, nil
}
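To make the nil-vs-empty behaviour concrete, here is a self-contained repro with Device/DeviceList cut down to the one field Filter uses. A nil uuids argument returns every device, so a nil required from the kubelet turns into "every GPU on the node is required", which a 2-GPU allocation on an 8-GPU box can never satisfy:

```go
package main

import "fmt"

// Minimal stand-ins for the go-gpuallocator types, just enough to run Filter.
type Device struct{ UUID string }
type DeviceList []*Device

// Filter as quoted above: nil means "no filtering", so the whole list comes back.
func (d DeviceList) Filter(uuids []string) (DeviceList, error) {
	if uuids == nil {
		return d, nil
	}
	filtered := DeviceList{}
	for _, uuid := range uuids {
		for _, device := range d {
			if device.UUID == uuid {
				filtered = append(filtered, device)
				break
			}
		}
		if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
			return nil, fmt.Errorf("no device with uuid: %v", uuid)
		}
	}
	return filtered, nil
}

func main() {
	all := DeviceList{{UUID: "GPU-0"}, {UUID: "GPU-1"}}

	byNil, _ := all.Filter(nil)          // nil -> no filtering, every device is kept
	byEmpty, _ := all.Filter([]string{}) // empty slice -> loop runs zero times, nothing is kept
	fmt.Println(len(byNil), len(byEmpty)) // prints "2 0"
}
```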
Why is required == nil in this case?
Looking at the type definition in k8s:
type ContainerPreferredAllocationRequest struct {
// List of available deviceIDs from which to choose a preferred allocation
AvailableDeviceIDs []string `protobuf:"bytes,1,rep,name=available_deviceIDs,json=availableDeviceIDs,proto3" json:"available_deviceIDs,omitempty"`
// List of deviceIDs that must be included in the preferred allocation
MustIncludeDeviceIDs []string `protobuf:"bytes,2,rep,name=must_include_deviceIDs,json=mustIncludeDeviceIDs,proto3" json:"must_include_deviceIDs,omitempty"`
// Number of devices to include in the preferred allocation
AllocationSize int32 `protobuf:"varint,3,opt,name=allocation_size,json=allocationSize,proto3" json:"allocation_size,omitempty"`
XXX_NoUnkeyedLiteral struct{} `json:"-"`
XXX_sizecache int32 `json:"-"`
}
We see that the JSON tags include omitempty. Does this affect the protobuf (de)serialization?
@klueska can you think of any reason why we should treat nil and []string{} differently from the perspective of the device plugin?
Assuming we could treat required == nil as equivalent to required == []string{}, we could apply the following diff:
diff --git a/internal/rm/nvml_manager.go b/internal/rm/nvml_manager.go
index 56f05429..a00d41cf 100644
--- a/internal/rm/nvml_manager.go
+++ b/internal/rm/nvml_manager.go
@@ -73,6 +73,9 @@ func NewNVMLResourceManagers(nvmllib nvml.Interface, config *spec.Config) ([]Res
// GetPreferredAllocation runs an allocation algorithm over the inputs.
// The algorithm chosen is based both on the incoming set of available devices and various config settings.
func (r *nvmlResourceManager) GetPreferredAllocation(available, required []string, size int) ([]string, error) {
+ if required == nil {
+ required = []string{}
+ }
return r.getPreferredAllocation(available, required, size)
}
to address this issue?
Would you be able to confirm this?
It looks like the Filter() function was added as part of:
https://github.com/NVIDIA/go-gpuallocator/pull/13/files#diff-7a10395c66058f91191f5b9ac49321a6e95a8332ea4098d3438e0e19b2b02fdeR151
The old logic that had this loop was:
+// Create a list of Devices from the specific set of GPU uuids passed in.
+func NewDevicesFrom(uuids []string) ([]*Device, error) {
+ devices, err := NewDevices()
+ if err != nil {
+ return nil, err
+ }
+
+ filtered := []*Device{}
+ for _, uuid := range uuids {
+ for _, device := range devices {
+ if device.UUID == uuid {
+ filtered = append(filtered, device)
+ break
+ }
+ }
+ if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
+ return nil, fmt.Errorf("no device with uuid: %v", uuid)
+ }
+ }
+
+ return filtered, nil
+}
In general, it's completely reasonable for required to be nil (in fact, this is the common case), because the kubelet doesn't want to pre-decide which GPUs should be included in the list of those returned.
That said, I think we could safely change:
if uuids == nil {
return d, nil
}
to
if len(uuids) == 0 {
return d, nil
}
without any issues.
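With that one-line change, nil and []string{} collapse into the same "no filtering" case. A quick check of the adjusted function, with Device/DeviceList again reduced to the one field the filter uses:

```go
package main

import "fmt"

type Device struct{ UUID string }
type DeviceList []*Device

// filterFixed is Filter with the proposed len(uuids) == 0 guard in place of the nil check.
func (d DeviceList) filterFixed(uuids []string) (DeviceList, error) {
	if len(uuids) == 0 {
		return d, nil
	}
	filtered := DeviceList{}
	for _, uuid := range uuids {
		for _, device := range d {
			if device.UUID == uuid {
				filtered = append(filtered, device)
				break
			}
		}
		if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
			return nil, fmt.Errorf("no device with uuid: %v", uuid)
		}
	}
	return filtered, nil
}

func main() {
	all := DeviceList{{UUID: "GPU-0"}, {UUID: "GPU-1"}}
	a, _ := all.filterFixed(nil)
	b, _ := all.filterFixed([]string{})
	fmt.Println(len(a), len(b)) // prints "2 2": both inputs now mean "no filtering"
}
```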
Thanks for digging that up. I don't quite recall why I put that nil check there; looking at the existing code, it obviously doesn't make sense.
I have created https://github.com/NVIDIA/go-gpuallocator/pull/23 to address this in go-gpuallocator.