gpu-manager fails to run on microk8s

Open timozerrer opened this issue 3 years ago • 0 comments

Hi all,

I'm unable to run gpu-manager successfully on a single-node microk8s deployment:

E0422 21:57:32.898921 336578 server.go:131] Unable to set Type=notify in systemd service file?
E0422 21:57:37.899785 336578 server.go:152] can't create container runtime manager: context deadline exceeded

Am I missing something here?

Kind regards

Full log:

copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGL.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGL.so.1.7.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libOpenGL.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libOpenGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libOpenGL.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM.so.1.2.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/libGLESv2.so to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/swiftshader/libGLESv2.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2.so.2.1.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2.so.2 to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/libEGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/swiftshader/libEGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL.so.1.1.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLdispatch.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLdispatch.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLdispatch.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX_nvidia.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX_nvidia.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL_nvidia.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL_nvidia.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.2 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-tls.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.460.56 to /usr/local/nvidia/lib64
copy /usr/local/host/bin/nvidia-cuda-mps-control to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-cuda-mps-server to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-debugdump to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-persistenced to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-smi to /usr/local/nvidia/bin/
rebuild ldcache
launch gpu manager
E0422 21:57:32.898921 336578 server.go:131] Unable to set Type=notify in systemd service file?
E0422 21:57:37.899785 336578 server.go:152] can't create container runtime manager: context deadline exceeded

gpu-manager.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run node has gpu device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: localhost:32000/tkestack/gpu-manager:1.1.4
          imagePullPolicy: Always
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "4"
            - name: EXTRA_FLAGS
              value: "--logtostderr=false"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
        # inode change after host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount /usr directory instead of specified library path, because of non-existing
        # problem for different distro
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
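
A possible starting point for narrowing this down: the manifest mounts the host's /var/run so the container can reach docker.sock, and the failing step is the creation of the container runtime manager. The Python sketch below only reports whether candidate runtime socket paths exist on the node. It does not use or reproduce any gpu-manager logic. Both candidate paths and the helper name `socket_status` are assumptions for illustration; in particular, the containerd path is where MicroK8s is commonly reported to keep its socket and should be verified on the node.

```python
import os
import stat

# Candidate runtime sockets to check. Both entries are assumptions:
# docker.sock is the socket named in the manifest comment, and the second
# path is a commonly reported MicroK8s containerd location (unverified here).
CANDIDATE_SOCKETS = [
    "/var/run/docker.sock",
    "/var/snap/microk8s/common/run/containerd.sock",
]


def socket_status(path):
    """Classify a path as 'socket', 'not a socket', or 'unavailable'.

    'unavailable' covers any stat failure: a missing path, a permission
    error, or a broken symlink.
    """
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return "unavailable"
    return "socket" if stat.S_ISSOCK(mode) else "not a socket"


if __name__ == "__main__":
    for path in CANDIDATE_SOCKETS:
        print(f"{path}: {socket_status(path)}")
```

Run on the host (not inside the pod), this shows which of the listed sockets is actually present. If only a socket outside /var/run exists, that would be consistent with the timeout above, though it does not by itself confirm the cause.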

timozerrer, Apr 22 '21 22:04