pod didn't trigger scale-up (it wouldn't fit if a new node is added)
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node? ---> No. Amazon Linux 2
- [x] Are you running Kubernetes v1.13+? ---> v1.21
- [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
I have an EKS cluster with cluster-autoscaler, consisting of 2 non-GPU EC2 instances that are always running and 1 autoscaled g4dn GPU instance. The GPU node is spawned when a GPU workload pod is deployed.
I deployed the gpu-operator chart using:
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
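(This assumes the NVIDIA Helm repository was already added and updated beforehand:)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update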
Somehow, the nvidia-driver-daemonset, nvidia-container-toolkit-daemonset, and nvidia-device-plugin-daemonset are not created.
kubectl get pods -n gpu-operator returns -
NAME READY STATUS RESTARTS AGE
gpu-operator-7bdd8bf555-lw5mx 1/1 Running 0 98m
gpu-operator-helm-node-feature-discovery-master-755dd7bd66mlbjr 1/1 Running 0 98m
gpu-operator-helm-node-feature-discovery-worker-8gqck 1/1 Running 0 98m
gpu-operator-helm-node-feature-discovery-worker-sl7bb 1/1 Running 0 98m
2. Steps to reproduce the issue
When I try to test the GPU workload using the following pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: type
            operator: In
            values:
            - gpu
  tolerations:
  - key: type
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
I get the error: pod didn't trigger scale-up (it wouldn't fit if a new node is added). If I remove resources.limits.nvidia.com/gpu: 1 from the pod spec, then the GPU node is scaled up, but the test pod still fails with Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version).
3. Information to attach (optional if deemed irrelevant)
My cluster-autoscaler config is:
Command:
./cluster-autoscaler
--cloud-provider=aws
--namespace=kube-system
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/dev-mlops
--logtostderr=true
--scale-down-non-empty-candidates-count=1
--scale-down-utilization-threshold=0.8
--skip-nodes-with-local-storage=false
--skip-nodes-with-system-pods=false
--stderrthreshold=info
--v=4
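(Note for anyone else hitting the "pod didn't trigger scale-up" error: when a GPU node group scales from zero, cluster-autoscaler only knows about the node's labels, taints, and extended resources such as nvidia.com/gpu if they are advertised on the ASG with node-template tags. Whether those tags are set here is not shown; they would look roughly like this:)
k8s.io/cluster-autoscaler/node-template/label/type: gpu
k8s.io/cluster-autoscaler/node-template/taint/type: gpu:NoSchedule
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: 1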
@mohittalele you need to pass the same tolerations to the gpu-operator daemonsets so that they tolerate those taints.
values.yaml:
daemonsets:
  tolerations:
  - key: type
    operator: Equal
    value: gpu
    effect: NoSchedule
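(For reference, a redeploy with these values could look like the following; this assumes a fixed release name instead of --generate-name and a values file named custom-values.yaml:)
helm upgrade --install --wait gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator -f custom-values.yaml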
@shivamerla I redeployed the chart. The node was successfully scaled up, so that's progress.
However, there are still no driver or container toolkit pods, and the test pod fails with the same original error.
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-7bdd8bf555-xw4fs 1/1 Running 0 6m18s
gpu-operator-helm-node-feature-discovery-master-755dd7bd66lc285 1/1 Running 0 6m18s
gpu-operator-helm-node-feature-discovery-worker-blsh4 1/1 Running 0 6m18s
gpu-operator-helm-node-feature-discovery-worker-x4n47 1/1 Running 0 6m18s
@mohittalele Can you check if the daemonsets are created under the gpu-operator namespace? The daemonsets are deployed with a nodeSelector so that they only run on nodes with GPUs (labeled by NFD with feature.node.kubernetes.io/pci-10de.present: "true").
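(For example, whether that label is present can be checked with:)
kubectl get nodes -L feature.node.kubernetes.io/pci-10de.present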
@shivamerla I checked the daemonsets. They are not present there.
kubectl get daemonset -n gpu-operator returns -
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-operator-helm-node-feature-discovery-worker 2 2 2 2 2 <none> 13m
The GPU node does not have the feature.node.kubernetes.io/pci-10de.present: "true" label.
Is this label assigned to the node automatically, or does it have to be configured explicitly in AWS?
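(For context, applying the label by hand would look like this, with <gpu-node-name> as a placeholder:)
kubectl label node <gpu-node-name> feature.node.kubernetes.io/pci-10de.present=true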
When I manually apply this label to the GPU node, the GPU operator tries to create the daemonsets, but fails to do so with the following logs:
1.6662113914293282e+09 INFO controllers.ClusterPolicy Couldn't get kernelVersion, did you run the node feature discovery? {"Request.Namespace": "default", "Request.Name": "Node", "Error": "Label.Node \"feature.node.kubernetes.io/kernel-version.full\" not found"}
1.6662113914293776e+09 INFO controllers.ClusterPolicy Failed to apply transformation 'nvidia-driver-daemonset' with error: 'ERROR: Could not find kernel full version: ('', '')' {"Daemonset": "nvidia-driver-daemonset"}
1.6662113914293838e+09 INFO controllers.ClusterPolicy Could not pre-process {"DaemonSet": "nvidia-driver-daemonset", "Namespace": "gpu-operator", "Error": "ERROR: Could not find kernel full version: ('', '')"}
1.6662113914294312e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "ERROR: Could not find kernel full version: ('', '')"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
I think this is because I am using Amazon Linux 2 (which I read is not supported by the GPU operator) and not Ubuntu. I will try again with Ubuntu as the node image.
Looks like NFD is somehow not able to label that GPU node. Can you attach logs of the NFD master/worker pods to debug further? The above error indicates that we depend on other NFD labels as well, which are missing in this case.
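(For example, using the pod names from the output above:)
kubectl logs -n gpu-operator gpu-operator-helm-node-feature-discovery-master-755dd7bd66lc285
kubectl logs -n gpu-operator gpu-operator-helm-node-feature-discovery-worker-blsh4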
Here are the nfd-master logs.
Also, the nfd-worker daemonset did not have the toleration I specified in values.yaml, so there was no nfd-worker pod running on the GPU node.
helm template also shows the absence of the toleration below in the nfd-worker daemonset. Could you check it once on your end? Thanks!
- key: type
  operator: Equal
  value: gpu
  effect: NoSchedule
ah ok, you can pass tolerations to NFD too in values.yaml. Here are the current tolerations:
node-feature-discovery:
  worker:
    tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
Ahh, I see. That should solve the issue. I will test it next week and report back. Thanks for the prompt responses!
@shivamerla it works as expected after adding the toleration for the nfd worker. We can close this issue. :)