DaemonSet creation fails on charmed-kubernetes
1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node?
- [x] Are you running Kubernetes v1.13+?
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- [x] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? (A quick check is sketched after this list.)
- [x] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
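For reference, the last two checklist items can be confirmed with something like the following (a sketch; the module check runs on the GPU node, and `clusterpolicies` only exists once the operator's CRD has been applied):

```sh
# Confirm the required kernel modules are loaded on the GPU node
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Confirm the ClusterPolicy CRD is applied and inspect the cluster policy status
kubectl describe clusterpolicies --all-namespaces
```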
1. Issue or feature description
Following the documented install procedure for gpu-operator on a fresh charmed-kubernetes install, I get the following error from the gpu-operator pod running on the node:

`Couldn't create DaemonSet: ... Forbidden: disallowed by cluster policy`

This results in no GPU resources becoming available to the cluster.
2. Steps to reproduce the issue
- Install charmed-kubernetes via juju on bare-metal servers (with MAAS)
- Configure containerd not to use the Ubuntu system drivers (anticipating use of gpu-operator)
- Follow the directions for the proxy install in the gpu-operator documentation (a sketch of the Helm command is shown after this list)
- Check status with `kubectl -n gpu-operator logs <gpu-operator>` to view logs confirming the incomplete operation
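For context, the install step is roughly of this shape (a sketch of the standard gpu-operator Helm install; the proxy-specific `--set` values actually used are not shown here):

```sh
# Add the NVIDIA Helm repository and refresh the index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Install the operator into its own namespace; any proxy-related values
# would be appended here as additional --set flags
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```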
3. Information to attach (optional if deemed irrelevant)
- [x] kubernetes pods status: `kubectl get pods --all-namespaces`
NAMESPACE NAME READY STATUS RESTARTS AGE
abn nginx-37a47647-899f6ff4c-v6mg6 1/1 Running 1 (3h15m ago) 6h7m
default cuda-vectoradd 0/1 Pending 0 3h43m
gpu-operator gpu-operator-1656460182-node-feature-discovery-master-7d6cpjfwd 1/1 Running 0 11m
gpu-operator gpu-operator-1656460182-node-feature-discovery-worker-k2hd7 1/1 Running 0 11m
gpu-operator gpu-operator-77787587cf-57mgn 1/1 Running 0 11m
ingress-nginx-kubernetes-worker default-http-backend-kubernetes-worker-6cd58d8886-h5xjl 1/1 Running 2 (2m38s ago) 6h7m
ingress-nginx-kubernetes-worker nginx-ingress-controller-kubernetes-worker-kpfjm 1/1 Running 1 (3h15m ago) 5h12m
kube-system coredns-5564855696-79vr9 1/1 Running 1 (3h15m ago) 6h7m
kube-system kube-state-metrics-5ccbcf64d5-2tqr7 1/1 Running 1 (3h15m ago) 6h7m
kube-system metrics-server-v0.5.1-79b4746b65-sbbbl 2/2 Running 2 (3h15m ago) 6h7m
kube-system tiller-deploy-74bcf4c66c-2vnlc 1/1 Running 0 141m
kubernetes-dashboard dashboard-metrics-scraper-5cd54464bf-zf8b9 1/1 Running 1 (3h15m ago) 6h7m
kubernetes-dashboard kubernetes-dashboard-55796c99c-vnhlm 1/1 Running 1 (3h15m ago) 6h7m
- [x] kubernetes daemonset status: `kubectl get ds --all-namespaces`
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-operator gpu-operator-1656460182-node-feature-discovery-worker 1 1 1 1 1 <none> 14m
ingress-nginx-kubernetes-worker nginx-ingress-controller-kubernetes-worker 1 1 1 1 1 juju-application=kubernetes-worker 11d
- [x] If a pod/ds is in an error or pending state: `kubectl describe pod -n NAMESPACE POD_NAME`
  The pod cannot get a GPU resource. It does work if I use the system drivers instead.
Name: cuda-vectoradd
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: kubernetes.io/psp: privileged
Status: Pending
IP:
IPs: <none>
Containers:
cuda-vectoradd:
Image: nvidia/samples:vectoradd-cuda11.2.1
Port: <none>
Host Port: <none>
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ggcpf (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-ggcpf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 117s (x216 over 3h47m) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
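The `Insufficient nvidia.com/gpu` event is consistent with the node never advertising the GPU resource (the driver DaemonSet is never created, per the operator logs below). A sketch of how to confirm what the node, named `number1` in those logs, currently advertises:

```sh
# Show the node's allocatable resources; nvidia.com/gpu should appear once the operator is healthy
kubectl describe node number1 | grep -A8 -i allocatable

# Or list the GPU count per node directly (column is empty if the resource is absent)
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```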
- [x] If a pod/ds is in an error or pending state: `kubectl logs -n NAMESPACE POD_NAME`
1.6564608582434602e+09 INFO controllers.ClusterPolicy GPU workload configuration {"NodeName": "number1", "GpuWorkloadConfig": "container"}
1.6564608582435489e+09 INFO controllers.ClusterPolicy Checking GPU state labels on the node {"NodeName": "number1"}
1.6564608582435687e+09 INFO controllers.ClusterPolicy Number of nodes with GPU label {"NodeCount": 1}
1.6564608582436178e+09 INFO controllers.ClusterPolicy Using container runtime: containerd
1.6564608582436502e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"RuntimeClass": "nvidia"}
1.6564608582491097e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "pre-requisites", "status": "ready"}
1.6564608582492526e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"Service": "gpu-operator", "Namespace": "gpu-operator"}
1.6564608582618704e+09 INFO controllers.ClusterPolicy INFO: ClusterPolicy step completed {"state:": "state-operator-metrics", "status": "ready"}
1.6564608582673767e+09 INFO controllers.ClusterPolicy Found Resource, skipping update {"ServiceAccount": "nvidia-driver", "Namespace": "gpu-operator"}
1.6564608582728472e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"Role": "nvidia-driver", "Namespace": "gpu-operator"}
1.6564608582828317e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRole": "nvidia-driver", "Namespace": "gpu-operator"}
1.6564608582917275e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"RoleBinding": "nvidia-driver", "Namespace": "gpu-operator"}
1.6564608583003638e+09 INFO controllers.ClusterPolicy Found Resource, updating... {"ClusterRoleBinding": "nvidia-driver", "Namespace": "gpu-operator"}
1.656460858304446e+09 INFO controllers.ClusterPolicy 5.4.0-121-generic {"Request.Namespace": "default", "Request.Name": "Node"}
1.656460858304628e+09 INFO controllers.ClusterPolicy DaemonSet not found, creating {"DaemonSet": "nvidia-driver-daemonset", "Namespace": "gpu-operator", "Name": "nvidia-driver-daemonset"}
1.656460858309278e+09 INFO controllers.ClusterPolicy Couldn't create DaemonSet {"DaemonSet": "nvidia-driver-daemonset", "Namespace": "gpu-operator", "Name": "nvidia-driver-daemonset", "Error": "DaemonSet.apps \"nvidia-driver-daemonset\" is invalid: [spec.template.spec.containers[0].securityContext.privileged: Forbidden: disallowed by cluster policy, spec.template.spec.initContainers[0].securityContext.privileged: Forbidden: disallowed by cluster policy]"}
1.6564608583093338e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "DaemonSet.apps \"nvidia-driver-daemonset\" is invalid: [spec.template.spec.containers[0].securityContext.privileged: Forbidden: disallowed by cluster policy, spec.template.spec.initContainers[0].securityContext.privileged: Forbidden: disallowed by cluster policy]"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
- [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- [ ] Docker configuration file: `cat /etc/docker/daemon.json`
- [ ] Docker runtime configuration: `docker info | grep runtime`
- [x] NVIDIA shared directory: `ls -la /run/nvidia`
  Does not exist
- [x] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
  Does not exist
- [x] NVIDIA driver directory: `ls -la /run/nvidia/driver`
  Does not exist
- [x] kubelet logs: `journalctl -u kubelet > kubelet.logs`
  -- Logs begin at Tue 2022-06-28 18:42:57 UTC, end at Wed 2022-06-29 00:08:00 UTC. --
  -- No entries --
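One likely reason the kubelet log is empty: Charmed Kubernetes runs the kubelet from a snap, so the journald unit is not named plain `kubelet`. A sketch of how to find and collect the right unit (the `snap.kubelet.daemon` name is an assumption; use whatever the first command reports):

```sh
# Find the kubelet service unit on the worker node
systemctl list-units --type=service | grep -i kubelet

# Collect logs from the snap-managed kubelet unit (unit name assumed)
journalctl -u snap.kubelet.daemon > kubelet.logs
```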
@gschwim Looks like the PodSecurityPolicy admission controller is enabled. You can install with `--set psp.enabled=true` so that we create and use the appropriate PSPs with the required permissions.
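For reference, a sketch of what that looks like with the Helm chart (the release name is inferred from the pod names above and should be confirmed with `helm list -n gpu-operator`):

```sh
# Fresh install with PSP support enabled
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set psp.enabled=true

# Or enable it on the existing release (release name assumed from the pod names above)
helm upgrade gpu-operator-1656460182 nvidia/gpu-operator \
  -n gpu-operator --reuse-values --set psp.enabled=true
```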
Hi @shivamerla - Thanks for the reply. I did try `--set psp.enabled=true` on several of the testing iterations, but it didn't appear to make any difference. Is there something that needs to be done in addition to this to take advantage of it?
@gschwim Can you run `kubectl get psp` and confirm the PSPs are created by the GPU Operator? The `nvidia-driver` ServiceAccount is bound to the `gpu-operator-privileged` PSP, which should allow this. Can you copy the error again with PSP enabled?
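A sketch of those checks (the PSP and ServiceAccount names are taken from the comment above; the `auth can-i` call is an additional way to verify the binding):

```sh
# List PSPs and confirm the GPU Operator ones were created
kubectl get psp

# Verify the nvidia-driver ServiceAccount is allowed to use the privileged PSP
kubectl auth can-i use podsecuritypolicy/gpu-operator-privileged \
  --as=system:serviceaccount:gpu-operator:nvidia-driver
```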