0/1 nodes are available: 1 Insufficient tencent.com/vcuda-core, 1 Insufficient tencent.com/vcuda-memory.
Hello,
The pod stays Pending and never gets scheduled. I deployed gpu-manager and gpu-admission. Do I need any additional NVIDIA components, or only the CUDA / graphics driver on the host?
Docker: 20.10.6, Kubelet: 1.21.0
- Update: formatting
- Update 2: node describe output added
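For reference, `Insufficient tencent.com/vcuda-core` just means the scheduler sees fewer of those extended resources on the node than the pod requests. I checked what the node actually advertises (one way to read it, using the node name from the describe output below):

```
kubectl get node tke-ubuntu-pc -o jsonpath='{.status.allocatable}'
```

Both resources come back as 0 (see the node description further down), so the pod can never fit regardless of the requested amounts.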
Deployment.yaml:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mnist-test
  labels:
    app: mnist-test
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: mnist-test
  template: # define the pods specifications
    metadata:
      labels:
        app: mnist-test
    spec:
      containers:
        - name: mnist-test
          image: localhost:5000/usecase/mnist:latest
          resources:
            requests:
              tencent.com/vcuda-core: 100
              tencent.com/vcuda-memory: 100
            limits:
              tencent.com/vcuda-core: 100
              tencent.com/vcuda-memory: 100
```
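If I read the gpu-manager docs correctly, `vcuda-core` is counted in hundredths of a physical GPU (100 = one whole card) and `vcuda-memory` in 256MiB units, so the manifest above asks for a full card and roughly 25GiB of GPU memory. A smaller, fractional request would look roughly like this (the numbers are only an illustration, with limits kept equal to requests):

```
resources:
  requests:
    tencent.com/vcuda-core: 50     # half of one card's compute
    tencent.com/vcuda-memory: 16   # 16 * 256MiB = 4GiB of GPU memory
  limits:
    tencent.com/vcuda-core: 50
    tencent.com/vcuda-memory: 16
```

Either way, the scheduling failure here does not seem to be about the amounts, since the node reports 0 for both resources.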
gpu-manager logs:
```
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ml.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libcuda.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opencl.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-compiler.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-encode.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvcuvid.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-fbc.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-ifr.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGL.so.1.7.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGL.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libOpenGL.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libOpenGL.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/swiftshader/libGLESv2.so to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/libGLESv2.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2.so.2.1.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2.so.2 to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/libEGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/share/code/swiftshader/libEGL.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL.so.1.1.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLdispatch.so.0.0.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLdispatch.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX_nvidia.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLX_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL_nvidia.so.0 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libEGL_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.2 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-eglcore.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-glcore.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-tls.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-glsi.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so to /usr/local/nvidia/lib64
copy /usr/local/host/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.465.19.01 to /usr/local/nvidia/lib64
copy /usr/local/host/bin/nvidia-cuda-mps-control to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-cuda-mps-server to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-debugdump to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-persistenced to /usr/local/nvidia/bin/
copy /usr/local/host/bin/nvidia-smi to /usr/local/nvidia/bin/
rebuild ldcache
launch gpu manager
E0508 13:51:05.283586 9794 server.go:132] Unable to set Type=notify in systemd service file?
```
Node description:
```
$ kubectl describe no tke-ubuntu-pc
Name: tke-ubuntu-pc
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=tke-ubuntu-pc
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node-role.kubernetes.io/master=
node.kubernetes.io/exclude-from-external-load-balancers=
nvidia-device-enable=enable
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 10.28.11.59/24
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.4.128
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sat, 08 May 2021 12:35:29 +0000
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: tke-ubuntu-pc
AcquireTime: <unset>
RenewTime: Sat, 08 May 2021 14:15:13 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sat, 08 May 2021 13:50:41 +0000 Sat, 08 May 2021 13:50:41 +0000 CalicoIsUp Calico is running on this node
MemoryPressure False Sat, 08 May 2021 14:15:13 +0000 Sat, 08 May 2021 12:35:27 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sat, 08 May 2021 14:15:13 +0000 Sat, 08 May 2021 12:35:27 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sat, 08 May 2021 14:15:13 +0000 Sat, 08 May 2021 12:35:27 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sat, 08 May 2021 14:15:13 +0000 Sat, 08 May 2021 12:37:28 +0000 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.28.11.59
Hostname: tke-ubuntu-pc
Capacity:
cpu: 16
ephemeral-storage: 118882128Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32851496Ki
pods: 110
tencent.com/vcuda-core: 0
tencent.com/vcuda-memory: 0
Allocatable:
cpu: 16
ephemeral-storage: 109561768984
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32749096Ki
pods: 110
tencent.com/vcuda-core: 0
tencent.com/vcuda-memory: 0
System Info:
Machine ID: b829848e929e405b849ec3a862ad7542
System UUID: eaaa97d6-94eb-b002-1c34-244bfe00f638
Boot ID: 51559bd2-92c8-4536-8fb3-2d4bdbbb10f1
Kernel Version: 5.8.0-50-generic
OS Image: Ubuntu 20.10
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.6
Kubelet Version: v1.21.0
Kube-Proxy Version: v1.21.0
PodCIDR: 192.168.0.0/24
PodCIDRs: 192.168.0.0/24
Non-terminated Pods: (12 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-kube-controllers-b656ddcfc-vbrzw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 98m
kube-system calico-node-jwp2q 250m (1%) 0 (0%) 0 (0%) 0 (0%) 98m
kube-system coredns-558bd4d5db-fmrs4 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 99m
kube-system coredns-558bd4d5db-ksvzf 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 99m
kube-system etcd-tke-ubuntu-pc 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 99m
kube-system gpu-manager-daemonset-fgb67 0 (0%) 0 (0%) 0 (0%) 0 (0%) 48m
kube-system kube-apiserver-tke-ubuntu-pc 250m (1%) 0 (0%) 0 (0%) 0 (0%) 99m
kube-system kube-controller-manager-tke-ubuntu-pc 200m (1%) 0 (0%) 0 (0%) 0 (0%) 99m
kube-system kube-proxy-4tgqq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 99m
kube-system kube-scheduler-tke-ubuntu-pc 100m (0%) 0 (0%) 0 (0%) 0 (0%) 29m
kubernetes-dashboard dashboard-metrics-scraper-5594697f48-8cspl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 87m
kubernetes-dashboard kubernetes-dashboard-57c9bfc8c8-xqjl8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 87m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1100m (6%) 0 (0%)
memory 240Mi (0%) 340Mi (1%)
ephemeral-storage 100Mi (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
tencent.com/vcuda-core 0 0
tencent.com/vcuda-memory 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 26m kubelet Starting kubelet.
Normal NodeHasSufficientPID 25m (x7 over 26m) kubelet Node tke-ubuntu-pc status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 25m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 25m (x8 over 26m) kubelet Node tke-ubuntu-pc status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 25m (x8 over 26m) kubelet Node tke-ubuntu-pc status is now: NodeHasNoDiskPressure
Normal Starting 24m kube-proxy Starting kube-proxy.
```
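Note that both Capacity and Allocatable show `tencent.com/vcuda-core: 0` and `tencent.com/vcuda-memory: 0`: gpu-manager has registered its device plugins, but it is not reporting any usable GPU, which is exactly why the scheduler answers `Insufficient ...`. Two things worth checking (the pod name comes from the pod list above; the in-container path for nvidia-smi is an assumption based on the copy log):

```
# Does the gpu-manager container itself see the card?
kubectl -n kube-system exec gpu-manager-daemonset-fgb67 -- /usr/local/nvidia/bin/nvidia-smi

# Full gpu-manager logs, in case device-discovery errors follow the lines pasted above
kubectl -n kube-system logs gpu-manager-daemonset-fgb67
```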
After upgrading gpu-manager to tkestack/gpu-manager:v1.1.4, pods are allocated but crash with:
Error: device plugin PreStartContainer rpc failed with err: rpc error: code = Unknown desc = PreStartContainer check failed, failed to read from checkpoint file due to json: cannot unmarshal object into Go struct field PodDevicesEntry.Data.PodDeviceEntries.DeviceIDs of type []string
It looks like gpu-manager failed. In my situation, a restart helped.
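A rough sketch of what that restart can look like, assuming kubelet runs as a systemd service and that the checkpoint named in the error is the standard kubelet device-plugin checkpoint (treat the exact path and the need to move it aside as assumptions):

```
# Recreate the gpu-manager pod (name from the node description above; the DaemonSet recreates it)
kubectl -n kube-system delete pod gpu-manager-daemonset-fgb67

# If the unmarshal error persists, move the stale device-plugin checkpoint aside and restart kubelet,
# which rebuilds the checkpoint in the current format
sudo mv /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint /tmp/kubelet_internal_checkpoint.bak
sudo systemctl restart kubelet
```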
Same issue. Have you solved it?