gpu-manager
Using a fractional GPU resource: fail to get response from manager
Please help me solve this problem. The information is as follows. Thank you.
The restricted GPU configuration is as follows:
resources:
  limits:
    tencent.com/vcuda-core: "20"
    tencent.com/vcuda-memory: "20"
  requests:
    tencent.com/vcuda-core: "20"
    tencent.com/vcuda-memory: "20"
env:
- name: LOGGER_LEVEL
  value: "5"
The running algorithm program reports the following error:
/tmp/cuda-control/src/loader.c:1056 config file: /etc/vcuda/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/vcuda.config
/tmp/cuda-control/src/loader.c:1057 pid file: /etc/vcuda/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/pids.config
/tmp/cuda-control/src/loader.c:1061 register to remote: pod uid: tainerd.service, cont id: kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice
F0607 15:56:33.572429 158 client.go:78] fail to get response from manager, error rpc error: code = Unknown desc = can't find kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice from docker
/tmp/cuda-control/src/register.c:87 rpc client exit with 255
gpu-manager.INFO log contents are as follows:
I0607 15:56:33.571262 626706 manager.go:369] UID: tainerd.service, cont: kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice want to registration
I0607 15:56:33.571439 626706 manager.go:455] Write /etc/gpu-manager/vm/tainerd.service/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/pids.config
I0607 15:56:33.573392 626706 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
gpu-manager.WARNING log contents are as follows:
W0607 15:56:44.887813 626706 manager.go:290] Find orphaned pod tainerd.service
gpu-manager.ERROR and gpu-manager.FATAL contain no error entries.
My gpu-manager.yaml is as follows:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager-role
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: tencent.com/vcuda-core
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # Only run on nodes that have a GPU device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
      - image: thomassong/gpu-manager:1.1.5
        imagePullPolicy: IfNotPresent
        name: gpu-manager
        securityContext:
          privileged: true
        ports:
        - containerPort: 5678
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: vdriver
          mountPath: /etc/gpu-manager/vdriver
        - name: vmdata
          mountPath: /etc/gpu-manager/vm
        - name: log
          mountPath: /var/log/gpu-manager
        - name: checkpoint
          mountPath: /etc/gpu-manager/checkpoint
        - name: run-dir
          mountPath: /var/run
        - name: cgroup
          mountPath: /sys/fs/cgroup
          readOnly: true
        - name: usr-directory
          mountPath: /usr/local/host
          readOnly: true
        - name: kube-root
          mountPath: /root/.kube
          readOnly: true
        env:
        - name: LOG_LEVEL
          value: "5"
        - name: EXTRA_FLAGS
          value: "--logtostderr=false --container-runtime-endpoint=/var/run/containerd/containerd.sock --cgroup-driver=systemd"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      volumes:
      - name: device-plugin
        hostPath:
          type: Directory
          path: /var/lib/kubelet/device-plugins
      - name: vmdata
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/vm
      - name: vdriver
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/vdriver
      - name: log
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/log
      - name: checkpoint
        hostPath:
          type: DirectoryOrCreate
          path: /etc/gpu-manager/checkpoint
      # We have to mount the whole /var/run directory into the container, because the
      # bind-mounted docker.sock inode changes after the host docker is restarted.
      - name: run-dir
        hostPath:
          type: Directory
          path: /var/run
      - name: cgroup
        hostPath:
          type: Directory
          path: /sys/fs/cgroup
      # We have to mount the whole /usr directory instead of a specific library path,
      # because the library locations differ between distros.
      - name: usr-directory
        hostPath:
          type: Directory
          path: /usr
      - name: kube-root
        hostPath:
          type: Directory
          path: /root/.kube
Same with me. Did you solve it?
Hi, I downgraded Kubernetes to v1.20 and it works fine. Did you solve it?
We have the same issue on Kubernetes v1.18.6.
Same error here. I think the cause is "--container-runtime-endpoint=/var/run/containerd/containerd.sock --cgroup-driver=systemd": using containerd as the container runtime triggers this problem. I will try to solve it.
Same with me. Did you solve it?
I changed the Kubernetes cgroup driver from systemd to cgroupfs and it works well. Do not use --cgroup-driver=systemd. The config looks like this:
env:
- name: LOG_LEVEL
  value: "5"
- name: EXTRA_FLAGS
  value: "--logtostderr=false --container-runtime-endpoint=/var/run/containerd/containerd.sock"
- name: NODE_NAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
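For anyone trying the same fix: the cgroup driver gpu-manager is told to use has to match what kubelet and containerd actually use on the node. A minimal sketch of the kubelet side, assuming kubelet is configured through a KubeletConfiguration file, could look like this:

# Sketch only, not a drop-in file: switch kubelet to the cgroupfs driver so it
# matches a gpu-manager started without --cgroup-driver=systemd. Containerd
# should then also be configured with SystemdCgroup = false in its runc options.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: cgroupfs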