Feature: add GPU capabilities to `nodeResources` analyzer
Describe the rationale for the suggested feature.
It would be good to be able to support preflights that want to check for GPU scheduling capability. Off-hand, I don't know if this is visible in node metadata, but maybe it could be detected from the containerd configuration? This might require a new collector, or modifications to the nodeResources collector, to detect whether a node is capable of scheduling GPUs and to provide capacity/allocation similar to CPU, Memory, and Disk.
Describe the feature
Not sure exactly which fields would be required, or if Allocatable makes sense, but at a minimum something like:
gpuCapacity - number of GPUs available on a node
so you can write expressions like:
- nodeResources:
    checkName: Total GPU Cores in the cluster is 4 or greater
    outcomes:
      - fail:
          when: "sum(gpuCapacity) < 4"
          message: The cluster must contain at least 4 GPUs
      - pass:
          message: There are at least 4 GPUs
Thinking through this a little bit, there are a few places we can try to detect GPU support:
1. containerd configuration
2. nvidia-smi output
3. node metadata
4. run a no-op pod requesting GPUs and wait for a successful exit (see the sketch below)
2: This can at least tell us if a GPU is installed, but not whether Kubernetes is configured to use it. 3: I don't know if the information we need is exposed in node metadata; requires research. 1, 4: I think these are the best options since they are the closest to a functional test confirming that GPU workloads can be scheduled.
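For option 4, here is a minimal sketch of what such a probe pod could look like, assuming the NVIDIA device plugin is installed and exposes the nvidia.com/gpu extended resource (the pod name and image are placeholders for illustration, not part of any existing collector):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-schedulability-probe    # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: probe
      image: busybox:1.36           # any small image works; the test is scheduling, not GPU usage
      command: ["true"]             # no-op: exits 0 once the pod has been scheduled and started
      resources:
        limits:
          nvidia.com/gpu: 1         # vendor-specific resource name exposed by the device plugin

If the pod reaches phase Succeeded, GPU workloads are schedulable; if it sits Pending with an unschedulable event, they are not. The resource name would need to be configurable since it varies by vendor (see the Intel example further down).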
Adding some thoughts from a discussion in Slack: on the node metadata angle, we may be able to determine from containerRuntimeVersion, at least, when the nvidia-container-runtime for containerd is being used. Not sure if that will be robust enough, though; I imagine it could work for most cases.
from my local env:
nodeInfo:
  architecture: amd64
  bootID: 81e20091-22da-4866-bfe4-a980057a1adf
  containerRuntimeVersion: containerd://1.5.9-k3s1
  kernelVersion: 5.15.49-linuxkit
  .....
Just chiming in on the number-of-GPUs question. I think this is going to be implementation specific, and I don't know if we can measure it. I know the Intel GPU plugin can be configured to allow sharing GPUs or not. So the question isn't just how many GPUs are present, but whether they are all fully scheduled.
I think we're going to have to be specific about the GPU drivers and providers to make any real attempt at this. Creating a pod seems like the most universal method, but it's going to require the user to define that pod. Again, using the Intel GPU driver there is no containerd configuration to review, and the tracking of the resources is via a resource line that requires the GPU driver be listed explicitly.
Here's an example:
resources:
  limits:
    gpu.intel.com/i915: 1
Example of a node with the Intel GPU plugin. This node has both a Coral TPU and Intel GPUs available. It's not configured to allow GPU sharing, so I'm not sure if the allocatable number would change if that were enabled. You'll notice containerd has no special configs. The Coral TPU doesn't show up as a resource; it's just identified via a label from node-feature-discovery. It is a USB device, but I don't think that changes if it's an integrated device.
Name:               todoroki
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/coral-tpu=true
                    feature.node.kubernetes.io/intel-gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=todoroki
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node.k0sproject.io/role=control-plane
Annotations:        csi.volume.kubernetes.io/nodeid: {"smb.csi.k8s.io":"todoroki"}
                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels: coral-tpu,intel-gpu
                    nfd.node.kubernetes.io/master.version: v0.13.0
                    nfd.node.kubernetes.io/worker.version: v0.13.0
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
...
Addresses:
  InternalIP:
  Hostname:    todoroki
Capacity:
  cpu:                 8
  ephemeral-storage:   489580536Ki
  gpu.intel.com/i915:  1
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              16231060Ki
  pods:                110
Allocatable:
  cpu:                 8
  ephemeral-storage:   451197421231
  gpu.intel.com/i915:  1
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              16128660Ki
  pods:                110
System Info:
  Machine ID:
  System UUID:
  Boot ID:
  Kernel Version:             5.4.0-137-generic
  OS Image:                   Ubuntu 20.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.18
  Kubelet Version:            v1.26.2+k0s
  Kube-Proxy Version:         v1.26.2+k0s
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests     Limits
  --------            --------     ------
  cpu                 1630m (20%)  2 (25%)
  memory              1072Mi (6%)  1936Mi (12%)
  ephemeral-storage   0 (0%)       0 (0%)
  hugepages-1Gi       0 (0%)       0 (0%)
  hugepages-2Mi       0 (0%)       0 (0%)
  gpu.intel.com/i915  1            1
If those nodes are running in the cloud, we can use instance metadata to get GPU information. For example, AWS exposes
elastic-gpus/associations/elastic-gpu-id
However, for on-premises installs, I think we may need to introduce a kURL add-on to add the different GPU device plugins. It has to be pre-defined in the kURL installer.
> However, for on-premises installs, I think we may need to introduce a kURL add-on to add the different GPU device plugins. It has to be pre-defined in the kURL installer.
I'm not sure what this part is referring to. This is about troubleshoot detecting the presence of GPUs, not about kURL installing drivers; that's out of scope for troubleshoot. How the drivers or the GPU get set up is only relevant here as it pertains to detection. As long as troubleshoot has a way to detect a GPU, we don't particularly need to care how it got installed.
I think @chris-sanders has landed on what will be the best approach here after digging into this more and talking with some customers. We'd essentially have one or more collectors that can do feature discovery similar to the projects below, and then let an analyzer run against the collected configuration. See:
https://github.com/kubernetes-sigs/node-feature-discovery https://github.com/NVIDIA/gpu-feature-discovery
Edit: with that being said, I'm not sure if we should start capturing this in a separate issue, since I'm not sure whether what I'm describing makes sense in the nodeResources analyzer 🤔
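For illustration, a rough sketch of how that could hang together, combining the gpuCapacity idea from the top of this issue with a feature-discovery-style collector. None of these fields exist today; the nodeGPU collector and the gpuCapacity property are hypothetical, proposed syntax only:

apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: gpu-preflight
spec:
  collectors:
    - nodeGPU: {}                  # hypothetical collector doing GPU feature discovery per node
  analyzers:
    - nodeResources:
        checkName: Cluster has at least one schedulable GPU
        outcomes:
          - fail:
              when: "sum(gpuCapacity) < 1"   # hypothetical property; extended resource names are vendor-specific
              message: No schedulable GPUs were found on any node
          - pass:
              message: At least one node reports a schedulable GPU

Since the extended resource name varies by vendor (e.g. nvidia.com/gpu, gpu.intel.com/i915), the collector or analyzer would likely need the resource name(s) to be configurable rather than assuming a single gpuCapacity value.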
https://app.shortcut.com/replicated/story/106618/in-cluster-collector-gpu-inventory