
empty pids goroutine 1 [running]

Open xuguangzhao opened this issue 4 years ago • 12 comments

[screenshot of the gpu-manager error attached]

xuguangzhao avatar Jan 20 '21 03:01 xuguangzhao

What's the version of your deployed gpu-manager? We have fixed this in our latest commit 808ff8c29a361f04499ff62242cd56e4f93089f6.

mYmNeo avatar Jan 20 '21 11:01 mYmNeo

> What's the version of your deployed gpu-manager? We have fixed this in our latest commit 808ff8c29a361f04499ff62242cd56e4f93089f6.

I use v1.0.4. Which version should I upgrade to for this fix?

xuguangzhao avatar Jan 21 '21 03:01 xuguangzhao

Upgrade to v1.1.2

mYmNeo avatar Jan 21 '21 08:01 mYmNeo

> Upgrade to v1.1.2

I use this version; the problem still exists.

phoenixwu0229 avatar Jan 21 '21 09:01 phoenixwu0229

> Upgrade to v1.1.2
>
> I use this version; the problem still exists.

Is there any log line showing `Read from`?

mYmNeo avatar Jan 21 '21 09:01 mYmNeo

Docker Server Version: 19.03.8

cgroupfs: /sys/fs/cgroup/memory/kubepods/burstable/pod3ac4a444-6254-4b32-bc26-bd08c9c72fbb/2b8ed585766f39bca9120b9725e7d47d607218993ab8209d7086c5064e81986d

systemd: /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod0dffdd18_155c_4f16_a5cf_3e615a07c264.slice/docker-3a9b4e354a7e35b9c7a25dcb222c19dfed5fb9e00d97c7d11bd21f9ee753f865.scope

The `attempts` list needs to cover paths like these:

```go
attempts := []string{
    filepath.Join(cgroupRoot, cgroupThis, id, "tasks"),
    // With more recent lxc versions use, cgroup will be in lxc/
    filepath.Join(cgroupRoot, cgroupThis, "lxc", id, "tasks"),
    // With more recent docker, cgroup will be in docker/
    filepath.Join(cgroupRoot, cgroupThis, "docker", id, "tasks"),
    // Even more recent docker versions under systemd use docker-<id>.scope/
    filepath.Join(cgroupRoot, "system.slice", "docker-"+id+".scope", "tasks"),
    // Even more recent docker versions under cgroup/systemd/docker/<id>/
    filepath.Join(cgroupRoot, "..", "systemd", "docker", id, "tasks"),
    // Kubernetes with docker and CNI is even more different
    filepath.Join(cgroupRoot, "..", "systemd", "kubepods", "*", "pod*", id, "tasks"),
    // Another flavor of containers location in recent kubernetes 1.11+
    filepath.Join(cgroupRoot, cgroupThis, "kubepods.slice", "kubepods-besteffort.slice", "*", "docker-"+id+".scope", "tasks"),
    // When runs inside of a container with recent kubernetes 1.11+
    filepath.Join(cgroupRoot, "kubepods.slice", "kubepods-besteffort.slice", "*", "docker-"+id+".scope", "tasks"),
}
```

mqyang56 avatar Jan 21 '21 09:01 mqyang56
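For illustration only, a minimal sketch (not gpu-manager's actual code) of how the two kubelet cgroup drivers lay out a container's memory cgroup; the function and variable names are made up for this example, and the container IDs from the paths above are shortened:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// cgroupfs driver: plain directories, the pod UID keeps its dashes,
// e.g. kubepods/<qos>/pod<uid>/<containerID>.
func cgroupfsPath(root, qosClass, podUID, containerID string) string {
	return filepath.Join(root, "kubepods", qosClass, "pod"+podUID, containerID)
}

// systemd driver: ".slice"/".scope" units, the dashes in the pod UID become underscores,
// e.g. kubepods.slice/kubepods-<qos>.slice/kubepods-<qos>-pod<uid>.slice/docker-<id>.scope.
func systemdPath(root, qosClass, podUID, containerID string) string {
	uid := strings.ReplaceAll(podUID, "-", "_")
	return filepath.Join(root,
		"kubepods.slice",
		"kubepods-"+qosClass+".slice",
		"kubepods-"+qosClass+"-pod"+uid+".slice",
		"docker-"+containerID+".scope")
}

func main() {
	root := "/sys/fs/cgroup/memory"
	// Values taken from the paths posted above (container IDs shortened).
	fmt.Println(cgroupfsPath(root, "burstable", "3ac4a444-6254-4b32-bc26-bd08c9c72fbb", "2b8ed58576..."))
	fmt.Println(systemdPath(root, "besteffort", "0dffdd18-155c-4f16-a5cf-3e615a07c264", "3a9b4e354a..."))
}
```

If only the cgroupfs-style layout is probed on a node whose cgroup driver is systemd, the pid lookup comes back empty, which matches the "empty pids" panic this issue reports.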

If your cgroup driver is systemd, you need to add the corresponding flag to gpu-manager.

mYmNeo avatar Jan 21 '21 10:01 mYmNeo
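For reference, a sketch of what that could look like in the gpu-manager DaemonSet args; the `--cgroup-driver=systemd` flag name here mirrors kubelet's option and is only an assumption, so check the gpu-manager README for the exact argument:

```yaml
# Hypothetical excerpt from gpu-manager.yaml -- the flag name is assumed,
# see the project README for the exact option.
containers:
  - name: gpu-manager
    image: tkestack/gpu-manager:v1.1.2
    args:
      - --cgroup-driver=systemd   # match the node's kubelet/docker cgroup driver
```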

> If your cgroup driver is systemd, you need to add the corresponding flag to gpu-manager.

Thanks, it works.

But I have another question:

In the Ali GPU-share solution, nvidia-smi shows the amount of GPU memory requested in the pod's resource request.

But with gpu-manager, I see the whole GPU memory inside the pod. Is it working correctly?

pod.yaml

      resources:
        limits:
          tencent.com/vcuda-core: "10"
          tencent.com/vcuda-memory: "10"
          memory: "40G"
          cpu: "12"
        requests:
          tencent.com/vcuda-core: "10"
          tencent.com/vcuda-memory: "10"
          memory: "40G"
          cpu: "12"

```
➜  gpu-manager git:(master) ✗ kubectl -n hpc-dlc exec -it container-tf-wutong6-7fd85bb484-9m8c4 bash
root@host10307846:/notebooks# nvidia-smi
Thu Jan 21 19:27:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
|   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
| N/A   38C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
```

phoenixwu0229 avatar Jan 21 '21 11:01 phoenixwu0229

> In the Ali GPU-share solution, nvidia-smi shows the amount of GPU memory requested in the pod's resource request.
>
> But with gpu-manager, I see the whole GPU memory inside the pod. Is it working correctly?

Ali's solution modified the kernel, which means you have to use their kernel instead of the official one.

mYmNeo avatar Jan 22 '21 01:01 mYmNeo
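For context, and assuming the default units from the gpu-manager README: `tencent.com/vcuda-core: 10` requests 10% of one card (100 units equal a whole GPU), and `tencent.com/vcuda-memory: 10` requests 10 × 256 MiB ≈ 2560 MiB. gpu-manager enforces these limits by intercepting CUDA calls in user space rather than patching the kernel, so nvidia-smi inside the container still reports the physical card's full 15079 MiB.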

> If your cgroup driver is systemd, you need to add the corresponding flag to gpu-manager.

How do I add that flag?

zxt620 avatar Apr 30 '21 03:04 zxt620

You can see how to do it in the README; add the parameter to gpu-manager.yaml 😊


ZeoSophia avatar Apr 30 '21 18:04 ZeoSophia

> Upgrade to v1.1.2
>
> I use this version; the problem still exists.

I also use v1.1.2 and have the same problem. Have you solved it?

yu7508 avatar Nov 01 '21 01:11 yu7508