[Feature Request] Make nvidia-operator-validator add a "validation successful" label or taint to the node
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes?
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
1. Issue or feature description
My workload pattern and issue are pretty much the same as #261, except that my workload pods declare GPU resource limits. When such a pod is scheduled, a fresh GPU instance is provisioned and the gpu-operator starts its work, but the validations (mainly the plugin validation) cannot run because my workload pods, with their GPU resource limits, grab the GPU first.
Things still work anyway because the device plugin DaemonSet is running by that point. Is there any way I can make the validation steps run before my pod is scheduled?
One way I can think of is to give the validation pods system-node-critical priority. Alternatively, they could add a taint or label to the node indicating it was successfully validated, so I could adjust the affinities in my workload spec (for example, as sketched below) to keep it from being scheduled first.
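For illustration, if the operator were to publish such a label once all validations pass, say nvidia.com/gpu.validated=true (a hypothetical name, not an existing gpu-operator label), workloads could gate on it with a required node affinity, roughly:

# Sketch only: "nvidia.com/gpu.validated" is a hypothetical label the operator
# would have to set after validation succeeds; it does not exist today.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.validated
              operator: In
              values:
                - "true"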
2. Steps to reproduce the issue
- Add a GPU node to the cluster (I am using AWS Karpenter on EKS for this)
- Immediately schedule a pod with affinity for that node and an "nvidia.com/gpu" resource limit
3. Information to attach (optional if deemed irrelevant)
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
cj-gpu-operator-node-feature-discovery-master-6b95978f7c-ldttm 1/1 Running 0 5h3m
cj-gpu-operator-node-feature-discovery-worker-wbf45 1/1 Running 0 3m31s
gpu-feature-discovery-zvq8k 1/1 Running 0 2m4s
gpu-operator-6f64c86bc-jmdqm 1/1 Running 0 5h3m
nvidia-container-toolkit-daemonset-tndwn 1/1 Running 0 2m4s
nvidia-cuda-validator-gd77h 0/1 Completed 0 98s
nvidia-dcgm-exporter-5t9kh 1/1 Running 0 2m4s
nvidia-device-plugin-daemonset-qkkcv 1/1 Running 0 2m4s
nvidia-device-plugin-validator-jqt5j 0/1 UnexpectedAdmissionError 0 87s
nvidia-operator-validator-qtbpl 0/1 Init:3/4 0 2m4s
$ kubectl describe pod nvidia-device-plugin-validator-jqt5j -n gpu-operator
Name: nvidia-device-plugin-validator-jqt5j
Namespace: gpu-operator
Priority: 0
Node: ip-10-2-46-0.eu-west-1.compute.internal/
Start Time: Tue, 17 Jan 2023 23:55:19 +0530
Labels: app=nvidia-device-plugin-validator
Annotations: kubernetes.io/psp: eks.privileged
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
$ kubectl logs nvidia-operator-validator-qtbpl -n gpu-operator -c plugin-validation
time="2023-01-17T18:25:14Z" level=info msg="GPU resources are not yet discovered by the node, retry: 1"
time="2023-01-17T18:25:19Z" level=info msg="pod nvidia-device-plugin-validator-jqt5j is curently in Pending phase"
time="2023-01-17T18:25:24Z" level=info msg="pod nvidia-device-plugin-validator-jqt5j is curently in Failed phase"
time="2023-01-17T18:28:29Z" level=info msg="pod nvidia-device-plugin-validator-jqt5j is curently in Failed phase"
The pod that gets scheduled first:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "karpenter.k8s.aws/instance-family"
                operator: "In"
                values:
                  - "g4dn"
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: "In"
                values:
                  - "true"
              - key: "nvidia.com/gpu.deploy.container-toolkit"
                operator: "In"
                values:
                  - "true"
              - key: "nvidia.com/gpu.deploy.device-plugin"
                operator: "In"
                values:
                  - "true"
              - key: "nvidia.com/gpu.deploy.driver"
                operator: "Exists"
              - key: "nvidia.com/gpu.deploy.operator-validator"
                operator: "In"
                values:
                  - "true"
  containers:
    - name: cuda-vectoradd
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      command:
        - "tail"
        - "-f"
        - "/dev/null"
      resources:
        limits:
          cpu: "0.5"
          memory: 1Gi
          nvidia.com/gpu: 1
@chiragjn Thanks for the detailed report. We are aware of this issue and it is something we plan to fix in the future. With v22.9.2, for driver upgrades, we are planning to handle this by cordoning the node until validation succeeds after driver installation. Adding a taint until the validation is complete makes sense as well. We will look into this in upcoming releases after v22.9.2.
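To illustrate the idea, a taint-based flow could look roughly like the sketch below; the taint key is only an illustration (not something the operator applies today), and something would still have to remove it once validation succeeds:

# Hypothetical taint applied to the node while validation is pending, e.g.
#   nvidia.com/gpu.validation-pending=true:NoSchedule
# Only pods that must run before validation completes (such as the operator's
# own operands) would carry the matching toleration:
tolerations:
  - key: nvidia.com/gpu.validation-pending
    operator: Exists
    effect: NoSchedule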
We are also hitting this problem. I was looking to see whether the cluster policy allowed passing a priority class to the validator, but it seems the only option is at the DaemonSet level.
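For reference, the chart values appear to expose a priority class only at the DaemonSet level, which covers the operand DaemonSets but not the short-lived validation workload pods, along these lines:

daemonsets:
  # Applied to the operand DaemonSets; there appears to be no equivalent
  # knob for the validator's workload pods.
  priorityClassName: system-node-critical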
Temporarily, I have settled for disabling the with-workload validation:
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: "false"
and attaching an init container to my workloads to check that the plugin validation stage was reached:
volumes:
  - name: run-nvidia
    hostPath:
      path: /run/nvidia
      type: ''
initContainers:
  - name: validate-gpu-readiness
    image: alpine:3.14
    command:
      - sh
      - '-c'
    args:
      - >-
        until [ -f /run/nvidia/validations/plugin-ready ]; do echo waiting
        for nvidia container stack to be setup; sleep 5; done
    volumeMounts:
      - name: run-nvidia
        mountPath: /run/nvidia
        mountPropagation: HostToContainer
I am also facing this. Is this resolved?
EKS version 1.25, using Karpenter to provision new nodes.
k get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default gpu-demo-64894cf985-ff8sj 1/1 Running 0 21m
default mypod 1/1 Running 0 162m
gpu-operator gpu-feature-discovery-dz6rd 1/1 Running 0 20m
gpu-operator gpu-feature-discovery-wcwdh 1/1 Running 0 19m
gpu-operator gpu-operator-1682360234-node-feature-discovery-master-7fd8szqqk 1/1 Running 0 28m
gpu-operator gpu-operator-1682360234-node-feature-discovery-worker-b9ldl 1/1 Running 0 20m
gpu-operator gpu-operator-1682360234-node-feature-discovery-worker-kmg92 1/1 Running 0 28m
gpu-operator gpu-operator-1682360234-node-feature-discovery-worker-nvspt 1/1 Running 0 21m
gpu-operator gpu-operator-6bf9cfc885-hfwzb 1/1 Running 0 28m
gpu-operator nvidia-container-toolkit-daemonset-s98hn 1/1 Running 0 20m
gpu-operator nvidia-container-toolkit-daemonset-tzq4k 1/1 Running 1 (18m ago) 19m
gpu-operator nvidia-cuda-validator-6mvcj 0/1 Completed 0 19m
gpu-operator nvidia-cuda-validator-fdhl4 0/1 Completed 0 18m
gpu-operator nvidia-dcgm-exporter-bsw68 1/1 Running 0 19m
gpu-operator nvidia-dcgm-exporter-r6cvw 1/1 Running 0 20m
gpu-operator nvidia-device-plugin-daemonset-5z2v2 1/1 Running 0 20m
gpu-operator nvidia-device-plugin-daemonset-pbgx5 1/1 Running 0 19m
gpu-operator nvidia-device-plugin-validator-7nxgm 0/1 UnexpectedAdmissionError 0 2m40s
gpu-operator nvidia-device-plugin-validator-p8pq6 0/1 Init:CrashLoopBackOff 5 (37s ago) 3m42s
gpu-operator nvidia-operator-validator-brw58 0/1 Init:3/4 3 (4m10s ago) 20m
gpu-operator nvidia-operator-validator-sdrsv 0/1 Init:3/4 3 (3m9s ago) 19m
karpenter karpenter-759c8b84fd-jhwx8 1/1 Running 0 162m
karpenter karpenter-759c8b84fd-qzjk9 0/1 Pending 0 162m
kube-system aws-node-9g56j 1/1 Running 0 166m
kube-system aws-node-nrrbs 1/1 Running 0 21m
kube-system aws-node-t786q 1/1 Running 0 20m
kube-system coredns-6866f5c8b4-p7jgx 1/1 Running 0 172m
kube-system coredns-6866f5c8b4-srhhj 1/1 Running 0 172m
kube-system kube-proxy-7skqk 1/1 Running 0 166m
kube-system kube-proxy-fdttc 1/1 Running 0 20m
kube-system kube-proxy-lqv5p 1/1 Running 0 21m
$ kubectl describe pod -n gpu-operator nvidia-device-plugin-validator-7nxgm
Name: nvidia-device-plugin-validator-7nxgm
Namespace: gpu-operator
Priority: 0
Runtime Class Name: nvidia
Service Account: nvidia-operator-validator
Node: ip-10-0-187-255.eu-west-1.compute.internal/
Start Time: Tue, 25 Apr 2023 00:13:11 +0530
Labels: app=nvidia-device-plugin-validator
Annotations: <none>
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
IP:
IPs: <none>
Controlled By: ClusterPolicy/cluster-policy
Init Containers:
plugin-validation:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
vectorAdd
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m86bb (ro)
Containers:
nvidia-device-plugin-validator:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.3.1
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo device-plugin workload validation is successful
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m86bb (ro)
Volumes:
kube-api-access-m86bb:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning UnexpectedAdmissionError 2m47s kubelet Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected
@shivamerla
Is this already fixed?