
pod didn't trigger scale-up (it wouldn't fit if a new node is added)

mohittalele opened this issue on Oct 19, 2022

Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node? ---> No. Amazon Linux 2
  • [x] Are you running Kubernetes v1.13+? ---> v1.21
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

I have an EKS cluster with cluster-autoscaler, consisting of 2 non-GPU EC2 instances that are always running and 1 autoscaled g4dn GPU instance. The GPU node is spawned when a GPU workload pod is deployed. I deployed the gpu-operator chart using: helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator. Somehow, nvidia-driver-daemonset, nvidia-container-toolkit-daemonset and nvidia-device-plugin-daemonset are not added.

kubectl get pods -n gpu-operator returns -

NAME                                                              READY   STATUS    RESTARTS   AGE
gpu-operator-7bdd8bf555-lw5mx                                     1/1     Running   0          98m
gpu-operator-helm-node-feature-discovery-master-755dd7bd66mlbjr   1/1     Running   0          98m
gpu-operator-helm-node-feature-discovery-worker-8gqck             1/1     Running   0          98m
gpu-operator-helm-node-feature-discovery-worker-sl7bb             1/1     Running   0          98m

2. Steps to reproduce the issue

When I try to test the GPU workload using the pod spec below:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: type
            operator: In
            values:
            - gpu
  tolerations:
    - key: type
      operator: Equal
      value: gpu
      effect: NoSchedule 
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
         nvidia.com/gpu: 1

I get the error: pod didn't trigger scale-up (it wouldn't fit if a new node is added). If I remove resources.limits.nvidia.com/gpu: 1 from the pod spec, then the GPU node is scaled up; however, the test pod still fails with Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version).

3. Information to attach (optional if deemed irrelevant)

My cluster-autoscaler config is:

Command:
       ./cluster-autoscaler                                                                                                              
       --cloud-provider=aws                                                                                                               
       --namespace=kube-system                                                                                                            
       --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/dev-mlops                     
       --logtostderr=true                                                                                                                 
       --scale-down-non-empty-candidates-count=1                                                                                          
       --scale-down-utilization-threshold=0.8                                                                                             
       --skip-nodes-with-local-storage=false                                                                                              
       --skip-nodes-with-system-pods=false                                                                                                
       --stderrthreshold=info                                                                                                             
       --v=4

mohittalele avatar Oct 19 '22 15:10 mohittalele

@mohittalele you need to pass the same tolerations to the gpu-operator daemonsets so they tolerate those taints.

values.yaml:

daemonsets:
  tolerations:
    - key: type
      operator: Equal
      value: gpu
      effect: NoSchedule 
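
For example, assuming the override is saved to values.yaml (the release name below is a placeholder; helm list -n gpu-operator shows the generated one), redeploying would look something like:

helm upgrade <release-name> nvidia/gpu-operator -n gpu-operator -f values.yaml --wait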

shivamerla avatar Oct 19 '22 16:10 shivamerla

@shivamerla I redeployed the chart. The node was successfully scaled up, so that's progress.

However, there are still no driver or container-toolkit pods, and the test pod fails with the same original error.

kubectl get pods -n gpu-operator

NAME                                                              READY   STATUS    RESTARTS   AGE
gpu-operator-7bdd8bf555-xw4fs                                     1/1     Running   0          6m18s
gpu-operator-helm-node-feature-discovery-master-755dd7bd66lc285   1/1     Running   0          6m18s
gpu-operator-helm-node-feature-discovery-worker-blsh4             1/1     Running   0          6m18s
gpu-operator-helm-node-feature-discovery-worker-x4n47             1/1     Running   0          6m18s

mohittalele avatar Oct 19 '22 19:10 mohittalele

@mohittalele Can you check if the daemonsets are created under the gpu-operator namespace? The daemonsets are deployed with a nodeSelector so they only run on nodes with GPUs (as labelled by NFD with feature.node.kubernetes.io/pci-10de.present: "true").
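
For example, to check both the daemonsets and the node labels (a quick sketch; output will vary):

kubectl get daemonset -n gpu-operator
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true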

shivamerla avatar Oct 19 '22 20:10 shivamerla

@shivamerla I checked the daemonsets; they are not present. kubectl get daemonset -n gpu-operator returns -

NAME                                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
gpu-operator-helm-node-feature-discovery-worker   2         2         2       2            2           <none>          13m

The GPU node does not have the label feature.node.kubernetes.io/pci-10de.present: "true".

Is this label assigned to the node automatically, or does it have to be configured explicitly in AWS?
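
For reference, applying it manually looks something like this (the node name is a placeholder):

kubectl label node <gpu-node-name> feature.node.kubernetes.io/pci-10de.present=true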

When I manually apply this label to the GPU node, the GPU operator tries to create the daemonsets, but fails with these logs -

1.6662113914293282e+09    INFO    controllers.ClusterPolicy    Couldn't get kernelVersion, did you run the node feature discovery?    {"Request.Namespace": "default", "Request.Name": "Node", "Error": "Label.Node \"feature.node.kubernetes.io/kernel-version.full\" not found"}
1.6662113914293776e+09    INFO    controllers.ClusterPolicy    Failed to apply transformation 'nvidia-driver-daemonset' with error: 'ERROR: Could not find kernel full version: ('', '')'    {"Daemonset": "nvidia-driver-daemonset"}
1.6662113914293838e+09    INFO    controllers.ClusterPolicy    Could not pre-process    {"DaemonSet": "nvidia-driver-daemonset", "Namespace": "gpu-operator", "Error": "ERROR: Could not find kernel full version: ('', '')"}
1.6662113914294312e+09    ERROR    controller.clusterpolicy-controller    Reconciler error    {"name": "cluster-policy", "namespace": "", "error": "ERROR: Could not find kernel full version: ('', '')"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

I think this is because I am using Amazon Linux 2 (which I read is not supported by the GPU operator) and not Ubuntu. I will try again with Ubuntu as the node image.

mohittalele avatar Oct 20 '22 12:10 mohittalele

Looks like NFD is somehow not able to label that GPU node. Can you attach logs of the NFD master/worker pods to debug further? The above error indicates that we depend on other NFD labels as well, which are missing in this case.
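
For example (pod names taken from your earlier output; they will differ after a redeploy):

kubectl logs -n gpu-operator gpu-operator-helm-node-feature-discovery-master-755dd7bd66lc285
kubectl logs -n gpu-operator gpu-operator-helm-node-feature-discovery-worker-x4n47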

shivamerla avatar Oct 20 '22 17:10 shivamerla

Here are the nfd-master logs.

Also, the NFD daemonset did not have the toleration I specified in values.yaml, so there was no nfd-worker pod running on the GPU node.

helm template also shows that the toleration below is absent from the nfd-worker daemonset. Could you check this once on your end? Thanks!

    - key: type
      operator: Equal
      value: gpu
      effect: NoSchedule 

mohittalele avatar Oct 21 '22 13:10 mohittalele

Ah ok, you can pass tolerations to NFD too in values.yaml. Here are the current default tolerations:

node-feature-discovery:
  worker:
    tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"

shivamerla avatar Oct 21 '22 13:10 shivamerla

Ah, I see. That should solve the issue. I will test it next week and report back. Thanks for the prompt responses!

mohittalele avatar Oct 21 '22 14:10 mohittalele

@shivamerla it works as expected after adding the toleration for the NFD worker. We can close this issue. :)

mohittalele avatar Nov 09 '22 21:11 mohittalele