Feature: add GPU capabilities to `nodeResources` analyzer
Describe the rationale for the suggested feature.
It would be good to be able to support preflights that want to check for GPU scheduling capability. Off-hand, I don't know if this is visible in node metadata, but maybe it could be detected from the containerd configuration? This might require a new collector, or modifications to the nodeResources collector, to detect whether a node is capable of scheduling GPUs and to provide capacity/allocation similar to CPU, Memory, and Disk.
Describe the feature
Not sure exactly which fields would be required, or if Allocatable makes sense, but at a minimum something like:
gpuCapacity - number of GPUs available on a node
so you can write expressions like:
- nodeResources:
    checkName: Total GPU Cores in the cluster is 4 or greater
    outcomes:
      - fail:
          when: "sum(gpuCapacity) < 4"
          message: The cluster must contain at least 4 GPUs
      - pass:
          message: There are at least 4 GPUs
Thinking through this a little bit, there are a few places we can try to detect GPU support:
1. containerd configuration
2. nvidia-smi output
3. node metadata
4. run a no-op pod requesting GPUs and wait for a successful exit (see the sketch below)
2: This can at least tell us if a GPU is installed, but not whether Kubernetes is configured to use it. 3: I don't know if the information we need is exposed in node metadata; requires research. 1, 4: I think these are the best options since they are the closest to a functional test confirming that GPU workloads can be scheduled.
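For option 4, here is a minimal sketch of what such a probe pod could look like, assuming the NVIDIA device plugin is installed and exposes the nvidia.com/gpu extended resource (the pod name and image are placeholders for illustration, not part of any existing collector):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-schedulability-probe    # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: probe
      image: busybox:1.36           # any small image works; the test is scheduling, not GPU usage
      command: ["true"]             # no-op: exits 0 once the pod has been scheduled and started
      resources:
        limits:
          nvidia.com/gpu: 1         # vendor-specific resource name exposed by the device plugin

If the pod reaches phase Succeeded, GPU workloads are schedulable; if it sits Pending with an unschedulable event, they are not. The resource name would need to be configurable since it varies by vendor (see the Intel example further down).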
Adding some thoughts from a discussion in Slack: on the node metadata angle, we may be able to determine from containerRuntimeVersion, at least, when the nvidia-container-runtime for containerd is being used. Not sure if that will be robust enough, though; I imagine it could work for most cases.
from my local env:
nodeInfo:
  architecture: amd64
  bootID: 81e20091-22da-4866-bfe4-a980057a1adf
  containerRuntimeVersion: containerd://1.5.9-k3s1
  kernelVersion: 5.15.49-linuxkit
  .....
Just chiming in on the number-of-GPUs question. I think this is going to be implementation specific, and I don't know if we can measure it. I know the Intel GPU plugin can be configured to allow sharing GPUs or not. So the question isn't just how many GPUs are present, but whether they are all fully scheduled.
I think we're going to have to be specific about the GPU drivers and providers to make any real attempt at this. Creating a pod seems like the most universal method, but it's going to require the user to define that pod. Again, using the Intel GPU driver there is no containerd configuration to review, and the tracking of the resources is via a resource line that requires the GPU driver be listed explicitly.
Here's an example:
resources:
  limits:
    gpu.intel.com/i915: 1
Example of a node with the Intel GPU plugin. This node has both a Coral TPU and Intel GPUs available. It's not configured to allow GPU sharing, so I'm not sure if the allocatable number would change if that were enabled. You'll notice containerd has no special configs. The Coral TPU doesn't show up as a resource; it's just identified via a label from node-feature-discovery. It is a USB device, but I don't think that changes if it's an integrated device.
Name:               todoroki
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/coral-tpu=true
                    feature.node.kubernetes.io/intel-gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=todoroki
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node.k0sproject.io/role=control-plane
Annotations:        csi.volume.kubernetes.io/nodeid: {"smb.csi.k8s.io":"todoroki"}
                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels: coral-tpu,intel-gpu
                    nfd.node.kubernetes.io/master.version: v0.13.0
                    nfd.node.kubernetes.io/worker.version: v0.13.0
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
...
Addresses:
  InternalIP:
  Hostname:    todoroki
Capacity:
  cpu:                 8
  ephemeral-storage:   489580536Ki
  gpu.intel.com/i915:  1
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              16231060Ki
  pods:                110
Allocatable:
  cpu:                 8
  ephemeral-storage:   451197421231
  gpu.intel.com/i915:  1
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              16128660Ki
  pods:                110
System Info:
  Machine ID:
  System UUID:
  Boot ID:
  Kernel Version:             5.4.0-137-generic
  OS Image:                   Ubuntu 20.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.18
  Kubelet Version:            v1.26.2+k0s
  Kube-Proxy Version:         v1.26.2+k0s
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests     Limits
  --------            --------     ------
  cpu                 1630m (20%)  2 (25%)
  memory              1072Mi (6%)  1936Mi (12%)
  ephemeral-storage   0 (0%)       0 (0%)
  hugepages-1Gi       0 (0%)       0 (0%)
  hugepages-2Mi       0 (0%)       0 (0%)
  gpu.intel.com/i915  1            1
If those nodes are running in the cloud, we can use instance metadata to get GPU information. For example, AWS exposes
elastic-gpus/associations/elastic-gpu-id
However, for on-premises installs, I think we may need to introduce a kURL add-on to add the different GPU device plugins. It has to be pre-defined in the kURL installer.
> However, for on-premises installs, I think we may need to introduce a kURL add-on to add the different GPU device plugins. It has to be pre-defined in the kURL installer.
I'm not sure what this part is referring to. This is about troubleshoot detecting the presence of GPUs, not about kURL installing drivers; that's out of scope for troubleshoot. How the drivers or the GPU get set up is only relevant here as it pertains to detection. As long as troubleshoot has a way to detect a GPU, we don't particularly need to care how it got installed.
I think @chris-sanders has landed on what will be the best approach here after digging into this more and talking with some customers. We'd essentially have one or more collectors that can do feature discovery similar to the projects below, and then let an analyzer run against the collected configuration. See:
https://github.com/kubernetes-sigs/node-feature-discovery https://github.com/NVIDIA/gpu-feature-discovery
Edit: with that being said, I'm not sure if we should start capturing this in a separate issue, since I'm not sure whether what I'm describing makes sense in the nodeResources analyzer 🤔
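For illustration, a rough sketch of how that could hang together, combining the gpuCapacity idea from the top of this issue with a feature-discovery-style collector. None of these fields exist today; the nodeGPU collector and the gpuCapacity property are hypothetical, proposed syntax only:

apiVersion: troubleshoot.sh/v1beta2
kind: Preflight
metadata:
  name: gpu-preflight
spec:
  collectors:
    - nodeGPU: {}                  # hypothetical collector doing GPU feature discovery per node
  analyzers:
    - nodeResources:
        checkName: Cluster has at least one schedulable GPU
        outcomes:
          - fail:
              when: "sum(gpuCapacity) < 1"   # hypothetical property; extended resource names are vendor-specific
              message: No schedulable GPUs were found on any node
          - pass:
              message: At least one node reports a schedulable GPU

Since the extended resource name varies by vendor (e.g. nvidia.com/gpu, gpu.intel.com/i915), the collector or analyzer would likely need the resource name(s) to be configurable rather than assuming a single gpuCapacity value.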
https://app.shortcut.com/replicated/story/106618/in-cluster-collector-gpu-inventory