GPU allocation does not respect NVLink
1. Quick Debug Information
- OS/Version: Ubuntu 22.04
- Kernel Version: 6.5.0-15.15~22.04.1
- Container Runtime: containerd 1.6.28-1
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): k8s 1.29.0
2. Issue or feature description
When a container requests 2 GPUs via nvidia.com/gpu: '2', I expect to get a pair connected with NVLink. Instead, the allocation appears to be completely random.
3. Information to attach (optional if deemed irrelevant)
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS NV12 SYS SYS SYS SYS SYS 0-31 0 N/A
GPU1 SYS X SYS NV12 SYS SYS SYS SYS 0-31 0 N/A
GPU2 NV12 SYS X SYS SYS SYS SYS SYS 0-31 0 N/A
GPU3 SYS NV12 SYS X SYS SYS SYS SYS 0-31 0 N/A
GPU4 SYS SYS SYS SYS X NV12 SYS SYS 32-63 1 N/A
GPU5 SYS SYS SYS SYS NV12 X SYS SYS 32-63 1 N/A
GPU6 SYS SYS SYS SYS SYS SYS X NV12 32-63 1 N/A
GPU7 SYS SYS SYS SYS SYS SYS NV12 X 32-63 1 N/A
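The matrix above shows four NVLink-connected pairs: (GPU0, GPU2), (GPU1, GPU3), (GPU4, GPU5), and (GPU6, GPU7). The behaviour I expected from a 2-GPU request can be sketched as follows (a hand-written illustration with the pairs hard-coded from the matrix, not the plugin's actual code):

```go
package main

import "fmt"

// nvlinkPairs encodes the NV12 links from the `nvidia-smi topo -m` output above.
// Hard-coded for illustration; the real allocator derives this from NVML.
var nvlinkPairs = [][2]int{{0, 2}, {1, 3}, {4, 5}, {6, 7}}

// pickLinkedPair returns an NVLink-connected pair whose members are both still
// available, or ok=false if no such pair remains.
func pickLinkedPair(available map[int]bool) (pair [2]int, ok bool) {
	for _, p := range nvlinkPairs {
		if available[p[0]] && available[p[1]] {
			return p, true
		}
	}
	return [2]int{}, false
}

func main() {
	// GPUs 0 and 1 are already taken; a 2-GPU request should still get a linked pair.
	available := map[int]bool{2: true, 3: true, 4: true, 5: true, 6: true, 7: true}
	if p, ok := pickLinkedPair(available); ok {
		fmt.Printf("allocate GPU%d and GPU%d\n", p[0], p[1]) // prints "allocate GPU4 and GPU5"
	}
}
```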
/etc/containerd/config.toml
version = 2
[metrics]
address = "0.0.0.0:9325"
grpc_histogram = false
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
disable_apparmor = true
sandbox_image = "docker.ops.iszn.cz/upstream/k8s/pause:3.9"
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
Device plugin log
I0220 07:22:07.549235 1 main.go:154] Starting FS watcher.
I0220 07:22:07.549526 1 main.go:161] Starting OS watcher.
I0220 07:22:07.549785 1 main.go:176] Starting Plugins.
I0220 07:22:07.549808 1 main.go:234] Loading configuration.
I0220 07:22:07.549898 1 main.go:242] Updating config with default resource matching patterns.
I0220 07:22:07.550077 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": true,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0220 07:22:07.550085 1 main.go:256] Retrieving plugins.
I0220 07:22:07.551481 1 factory.go:107] Detected NVML platform: found NVML library
I0220 07:22:07.551538 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0220 07:22:09.157899 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0220 07:22:09.160846 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0220 07:22:09.164269 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Additional information
I tried the sample from go-gpuallocator with NewBestEffortAllocator, and it works correctly: it allocates the right pairs (connected with NVLink).
So I dug deeper into k8s-nvidia-device-plugin and saw that the required parameter here came in empty. I assume that is correct, since I don't want a particular GPU; I just want any pair connected with NVLink. But when required is empty, this filters out nothing and fills requiredDevices with all of the GPUs on the machine. All of that is then passed to go-gpuallocator, which fails on this statement and returns an empty set of GPUs for allocation.
So is this a bug, or am I missing something?
Looking at the Filter() call that returns all devices when required is empty, this should only happen if required is nil:
// Filter filters out the selected devices from the list.
// If the supplied list of uuids is nil, no filtering is performed.
// Note that the specified uuids must exist in the list of devices.
func (d DeviceList) Filter(uuids []string) (DeviceList, error) {
if uuids == nil {
return d, nil
}
filtered := []*Device{}
for _, uuid := range uuids {
for _, device := range d {
if device.UUID == uuid {
filtered = append(filtered, device)
break
}
}
if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
return nil, fmt.Errorf("no device with uuid: %v", uuid)
}
}
return filtered, nil
}
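To make the nil-vs-empty behaviour concrete, here is a self-contained repro with Device/DeviceList cut down to the one field Filter uses. A nil uuids argument returns every device, so a nil required from the kubelet turns into "every GPU on the node is required", which a 2-GPU allocation on an 8-GPU box can never satisfy:

```go
package main

import "fmt"

// Minimal stand-ins for the go-gpuallocator types, just enough to run Filter.
type Device struct{ UUID string }
type DeviceList []*Device

// Filter as quoted above: nil means "no filtering", so the whole list comes back.
func (d DeviceList) Filter(uuids []string) (DeviceList, error) {
	if uuids == nil {
		return d, nil
	}
	filtered := DeviceList{}
	for _, uuid := range uuids {
		for _, device := range d {
			if device.UUID == uuid {
				filtered = append(filtered, device)
				break
			}
		}
		if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
			return nil, fmt.Errorf("no device with uuid: %v", uuid)
		}
	}
	return filtered, nil
}

func main() {
	all := DeviceList{{UUID: "GPU-0"}, {UUID: "GPU-1"}}

	byNil, _ := all.Filter(nil)          // nil -> no filtering, every device is kept
	byEmpty, _ := all.Filter([]string{}) // empty slice -> loop runs zero times, nothing is kept
	fmt.Println(len(byNil), len(byEmpty)) // prints "2 0"
}
```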
Why is required == nil in this case?
Looking at the type definition in k8s:
type ContainerPreferredAllocationRequest struct {
// List of available deviceIDs from which to choose a preferred allocation
AvailableDeviceIDs []string `protobuf:"bytes,1,rep,name=available_deviceIDs,json=availableDeviceIDs,proto3" json:"available_deviceIDs,omitempty"`
// List of deviceIDs that must be included in the preferred allocation
MustIncludeDeviceIDs []string `protobuf:"bytes,2,rep,name=must_include_deviceIDs,json=mustIncludeDeviceIDs,proto3" json:"must_include_deviceIDs,omitempty"`
// Number of devices to include in the preferred allocation
AllocationSize int32 `protobuf:"varint,3,opt,name=allocation_size,json=allocationSize,proto3" json:"allocation_size,omitempty"`
XXX_NoUnkeyedLiteral struct{} `json:"-"`
XXX_sizecache int32 `json:"-"`
}
We see that the JSON tags include omitempty. Does this affect the protobuf (de)serialization?
@klueska can you think of any reason why we should treat nil and []string{} differently from the perspective of the device plugin?
Assuming we could treat required == nil as equivalent to required == []string{}, we could apply the following diff:
diff --git a/internal/rm/nvml_manager.go b/internal/rm/nvml_manager.go
index 56f05429..a00d41cf 100644
--- a/internal/rm/nvml_manager.go
+++ b/internal/rm/nvml_manager.go
@@ -73,6 +73,9 @@ func NewNVMLResourceManagers(nvmllib nvml.Interface, config *spec.Config) ([]Res
// GetPreferredAllocation runs an allocation algorithm over the inputs.
// The algorithm chosen is based both on the incoming set of available devices and various config settings.
func (r *nvmlResourceManager) GetPreferredAllocation(available, required []string, size int) ([]string, error) {
+ if required == nil {
+ required = []string{}
+ }
return r.getPreferredAllocation(available, required, size)
}
to address this issue?
Would you be able to confirm this?
It looks like the Filter() function was added as part of:
https://github.com/NVIDIA/go-gpuallocator/pull/13/files#diff-7a10395c66058f91191f5b9ac49321a6e95a8332ea4098d3438e0e19b2b02fdeR151
The old logic that had this loop was:
+// Create a list of Devices from the specific set of GPU uuids passed in.
+func NewDevicesFrom(uuids []string) ([]*Device, error) {
+ devices, err := NewDevices()
+ if err != nil {
+ return nil, err
+ }
+
+ filtered := []*Device{}
+ for _, uuid := range uuids {
+ for _, device := range devices {
+ if device.UUID == uuid {
+ filtered = append(filtered, device)
+ break
+ }
+ }
+ if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
+ return nil, fmt.Errorf("no device with uuid: %v", uuid)
+ }
+ }
+
+ return filtered, nil
+}
In general, it's completely reasonable for required to be nil (in fact, this is the common case), because the kubelet doesn't want to pre-decide which GPUs should be included in the list of those returned.
That said, I think we could safely change:
if uuids == nil {
return d, nil
}
to
if len(uuids) == 0 {
return d, nil
}
without any issues.
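With that one-line change, nil and []string{} collapse into the same "no filtering" case. A quick check of the adjusted function, with Device/DeviceList again reduced to the one field the filter uses:

```go
package main

import "fmt"

type Device struct{ UUID string }
type DeviceList []*Device

// filterFixed is Filter with the proposed len(uuids) == 0 guard in place of the nil check.
func (d DeviceList) filterFixed(uuids []string) (DeviceList, error) {
	if len(uuids) == 0 {
		return d, nil
	}
	filtered := DeviceList{}
	for _, uuid := range uuids {
		for _, device := range d {
			if device.UUID == uuid {
				filtered = append(filtered, device)
				break
			}
		}
		if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
			return nil, fmt.Errorf("no device with uuid: %v", uuid)
		}
	}
	return filtered, nil
}

func main() {
	all := DeviceList{{UUID: "GPU-0"}, {UUID: "GPU-1"}}
	a, _ := all.filterFixed(nil)
	b, _ := all.filterFixed([]string{})
	fmt.Println(len(a), len(b)) // prints "2 2": both inputs now mean "no filtering"
}
```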
Thanks for digging that up. I don't quite recall why I put that nil check there; looking at the existing code, it obviously doesn't make sense.
I have created https://github.com/NVIDIA/go-gpuallocator/pull/23 to address this in go-gpuallocator.