gpu-manager
empty pids goroutine 1 [running]
What's the version of your deployed gpu-manager? We have fixed this in our latest commit 808ff8c29a361f04499ff62242cd56e4f93089f6.
I use v1.0.4. Which version should I use to get this fix?
Upgrade to v1.1.2
I use this version, but the problem still exists.
Is there any log line showing "Read from ..."?
Docker Server Version: 19.03.8
cgroupfs: /sys/fs/cgroup/memory/kubepods/burstable/pod3ac4a444-6254-4b32-bc26-bd08c9c72fbb/2b8ed585766f39bca9120b9725e7d47d607218993ab8209d7086c5064e81986d
systemd: /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod0dffdd18_155c_4f16_a5cf_3e615a07c264.slice/docker-3a9b4e354a7e35b9c7a25dcb222c19dfed5fb9e00d97c7d11bd21f9ee753f865.scope
The path attempts need to cover this:

```go
attempts := []string{
	filepath.Join(cgroupRoot, cgroupThis, id, "tasks"),
	// With more recent lxc versions use, cgroup will be in lxc/
	filepath.Join(cgroupRoot, cgroupThis, "lxc", id, "tasks"),
	// With more recent docker, cgroup will be in docker/
	filepath.Join(cgroupRoot, cgroupThis, "docker", id, "tasks"),
	// Even more recent docker versions under systemd use docker-<id>.scope/
}
```
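For illustration only, a minimal sketch of what the systemd-driver path from the comment above looks like when built programmatically. This is not gpu-manager's actual code; `buildSystemdCgroupPath` and its parameters are hypothetical, and it only reproduces the burstable/besteffort layout shown here (pod UID dashes become underscores, the container id is wrapped in `docker-<id>.scope`).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// buildSystemdCgroupPath is a hypothetical helper, not gpu-manager code.
// It rebuilds the systemd-driver layout shown above: the pod UID's dashes
// become underscores and the container id is wrapped in docker-<id>.scope.
func buildSystemdCgroupPath(cgroupRoot, qosClass, podUID, containerID string) string {
	uid := strings.ReplaceAll(podUID, "-", "_")
	return filepath.Join(
		cgroupRoot,
		"kubepods.slice",
		fmt.Sprintf("kubepods-%s.slice", qosClass),
		fmt.Sprintf("kubepods-%s-pod%s.slice", qosClass, uid),
		fmt.Sprintf("docker-%s.scope", containerID),
		"tasks",
	)
}

func main() {
	// Reproduces the besteffort path from the comment above.
	fmt.Println(buildSystemdCgroupPath(
		"/sys/fs/cgroup/memory",
		"besteffort",
		"0dffdd18-155c-4f16-a5cf-3e615a07c264",
		"3a9b4e354a7e35b9c7a25dcb222c19dfed5fb9e00d97c7d11bd21f9ee753f865",
	))
}
```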
If your cgroup driver is systemd, you need to add a flag to gpu-manager.
Thanks, it works. But I have another question: in the Ali gpu-share solution, nvidia-smi shows the GPU memory requested in the pod's resource request, but with gpu-manager I see all of the GPU memory inside the pod. Is it working correctly?
pod.yaml:

```yaml
resources:
  limits:
    tencent.com/vcuda-core: "10"
    tencent.com/vcuda-memory: "10"
    memory: "40G"
    cpu: "12"
  requests:
    tencent.com/vcuda-core: "10"
    tencent.com/vcuda-memory: "10"
    memory: "40G"
    cpu: "12"
```
```
➜ gpu-manager git:(master) ✗ kubectl -n hpc-dlc exec -it container-tf-wutong6-7fd85bb484-9m8c4 bash
root@host10307846:/notebooks# nvidia-smi
Thu Jan 21 19:27:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
|   0  Tesla T4           On    | 00000000:18:00.0 Off |                    0 |
| N/A   38C    P8    11W / 70W  |      0MiB / 15079MiB |      0%      Default |
```
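As a side note on what that spec actually requests, assuming the unit conventions documented in the gpu-manager README (vcuda-core in hundredths of a card, vcuda-memory in 256 MiB units), the pod above asks for roughly 10% of one card and 2560 MiB, even though nvidia-smi inside the container still reports the whole physical card. A quick conversion:

```go
package main

import "fmt"

func main() {
	const (
		vcudaCore   = 10 // tencent.com/vcuda-core: "10"
		vcudaMemory = 10 // tencent.com/vcuda-memory: "10"
	)
	// Conventions from the gpu-manager README: 100 vcuda-core units equal one
	// whole GPU, and one vcuda-memory unit equals 256 MiB.
	fmt.Printf("requested compute: %d%% of one card\n", vcudaCore)
	fmt.Printf("requested memory:  %d MiB\n", vcudaMemory*256)
}
```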
The Ali solution modified the kernel, which means you have to use their kernel, not the official one.
If your cgroup driver is systemd, you need to add a flag to gpu-manager.
How do I add the flag?
You can see how to do it in the README; add the parameter in gpu-manager.yaml 😊
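For anyone else hitting this, a rough sketch of where such a parameter would go in gpu-manager.yaml. The flag name below is an assumption borrowed from the kubelet convention, so check the gpu-manager README for the exact parameter.

```yaml
# Hypothetical fragment of the gpu-manager DaemonSet pod template;
# the flag name is assumed, the README documents the real parameter.
containers:
  - name: gpu-manager
    args:
      - --cgroup-driver=systemd   # match the node's kubelet cgroup driver
```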
I use v1.1.2 and have the same problem. Did you ever solve it?