
empty pids goroutine 1 [running]

Open xuguangzhao opened this issue 4 years ago • 12 comments

[screenshot of the gpu-manager error attached]

xuguangzhao avatar Jan 20 '21 03:01 xuguangzhao

What's the version of your deployed gpu-manager? We have fixed this in our latest commit 808ff8c29a361f04499ff62242cd56e4f93089f6.

mYmNeo avatar Jan 20 '21 11:01 mYmNeo

> What's the version of your deployed gpu-manager? We have fixed this in our latest commit 808ff8c29a361f04499ff62242cd56e4f93089f6.

I use v1.0.4. Which version should I upgrade to for this fix?

xuguangzhao avatar Jan 21 '21 03:01 xuguangzhao

Upgrade to v1.1.2

mYmNeo avatar Jan 21 '21 08:01 mYmNeo

> Upgrade to v1.1.2

I use this version; the problem still exists.

phoenixwu0229 avatar Jan 21 '21 09:01 phoenixwu0229

> Upgrade to v1.1.2
>
> I use this version; the problem still exists.

Is there any log line showing `Read from`?

mYmNeo avatar Jan 21 '21 09:01 mYmNeo

Docker Server Version: 19.03.8

cgroupfs: /sys/fs/cgroup/memory/kubepods/burstable/pod3ac4a444-6254-4b32-bc26-bd08c9c72fbb/2b8ed585766f39bca9120b9725e7d47d607218993ab8209d7086c5064e81986d

systemd: /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod0dffdd18_155c_4f16_a5cf_3e615a07c264.slice/docker-3a9b4e354a7e35b9c7a25dcb222c19dfed5fb9e00d97c7d11bd21f9ee753f865.scope

The `attempts` list needs to cover paths like these:

```go
attempts := []string{
    filepath.Join(cgroupRoot, cgroupThis, id, "tasks"),
    // With more recent lxc versions use, cgroup will be in lxc/
    filepath.Join(cgroupRoot, cgroupThis, "lxc", id, "tasks"),
    // With more recent docker, cgroup will be in docker/
    filepath.Join(cgroupRoot, cgroupThis, "docker", id, "tasks"),
    // Even more recent docker versions under systemd use docker-<id>.scope/
    filepath.Join(cgroupRoot, "system.slice", "docker-"+id+".scope", "tasks"),
    // Even more recent docker versions under cgroup/systemd/docker/<id>/
    filepath.Join(cgroupRoot, "..", "systemd", "docker", id, "tasks"),
    // Kubernetes with docker and CNI is even more different
    filepath.Join(cgroupRoot, "..", "systemd", "kubepods", "*", "pod*", id, "tasks"),
    // Another flavor of containers location in recent kubernetes 1.11+
    filepath.Join(cgroupRoot, cgroupThis, "kubepods.slice", "kubepods-besteffort.slice", "*", "docker-"+id+".scope", "tasks"),
    // When runs inside of a container with recent kubernetes 1.11+
    filepath.Join(cgroupRoot, "kubepods.slice", "kubepods-besteffort.slice", "*", "docker-"+id+".scope", "tasks"),
}
```

mqyang56 avatar Jan 21 '21 09:01 mqyang56
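For illustration only, a minimal sketch (not gpu-manager's actual code) of how the two kubelet cgroup drivers lay out a container's memory cgroup; the function and variable names are made up for this example, and the container IDs from the paths above are shortened:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// cgroupfs driver: plain directories, the pod UID keeps its dashes,
// e.g. kubepods/<qos>/pod<uid>/<containerID>.
func cgroupfsPath(root, qosClass, podUID, containerID string) string {
	return filepath.Join(root, "kubepods", qosClass, "pod"+podUID, containerID)
}

// systemd driver: ".slice"/".scope" units, the dashes in the pod UID become underscores,
// e.g. kubepods.slice/kubepods-<qos>.slice/kubepods-<qos>-pod<uid>.slice/docker-<id>.scope.
func systemdPath(root, qosClass, podUID, containerID string) string {
	uid := strings.ReplaceAll(podUID, "-", "_")
	return filepath.Join(root,
		"kubepods.slice",
		"kubepods-"+qosClass+".slice",
		"kubepods-"+qosClass+"-pod"+uid+".slice",
		"docker-"+containerID+".scope")
}

func main() {
	root := "/sys/fs/cgroup/memory"
	// Values taken from the paths posted above (container IDs shortened).
	fmt.Println(cgroupfsPath(root, "burstable", "3ac4a444-6254-4b32-bc26-bd08c9c72fbb", "2b8ed58576..."))
	fmt.Println(systemdPath(root, "besteffort", "0dffdd18-155c-4f16-a5cf-3e615a07c264", "3a9b4e354a..."))
}
```

If only the cgroupfs-style layout is probed on a node whose cgroup driver is systemd, the pid lookup comes back empty, which matches the "empty pids" panic this issue reports.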

If your cgroup driver is systemd, you need to add the corresponding flag to gpu-manager.

mYmNeo avatar Jan 21 '21 10:01 mYmNeo
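For reference, a sketch of what that could look like in the gpu-manager DaemonSet args; the `--cgroup-driver=systemd` flag name here mirrors kubelet's option and is only an assumption, so check the gpu-manager README for the exact argument:

```yaml
# Hypothetical excerpt from gpu-manager.yaml -- the flag name is assumed,
# see the project README for the exact option.
containers:
  - name: gpu-manager
    image: tkestack/gpu-manager:v1.1.2
    args:
      - --cgroup-driver=systemd   # match the node's kubelet/docker cgroup driver
```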

> If your cgroup driver is systemd, you need to add the corresponding flag to gpu-manager.

Thanks, it works.

But I have another question:

In the Ali GPU-share solution, nvidia-smi shows the amount of GPU memory requested in the pod's resource request.

But with gpu-manager, I see the whole GPU memory inside the pod. Is it working correctly?

pod.yaml

      resources:
        limits:
          tencent.com/vcuda-core: "10"
          tencent.com/vcuda-memory: "10"
          memory: "40G"
          cpu: "12"
        requests:
          tencent.com/vcuda-core: "10"
          tencent.com/vcuda-memory: "10"
          memory: "40G"
          cpu: "12"

```
➜  gpu-manager git:(master) ✗ kubectl -n hpc-dlc exec -it container-tf-wutong6-7fd85bb484-9m8c4 bash
root@host10307846:/notebooks# nvidia-smi
Thu Jan 21 19:27:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
|   0  Tesla T4            On   | 00000000:18:00.0 Off |                    0 |
| N/A   38C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
```

phoenixwu0229 avatar Jan 21 '21 11:01 phoenixwu0229

> In the Ali GPU-share solution, nvidia-smi shows the amount of GPU memory requested in the pod's resource request.
>
> But with gpu-manager, I see the whole GPU memory inside the pod. Is it working correctly?

Ali's solution modified the kernel, which means you have to use their kernel instead of the official one.

mYmNeo avatar Jan 22 '21 01:01 mYmNeo
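For context, and assuming the default units from the gpu-manager README: `tencent.com/vcuda-core: 10` requests 10% of one card (100 units equal a whole GPU), and `tencent.com/vcuda-memory: 10` requests 10 × 256 MiB ≈ 2560 MiB. gpu-manager enforces these limits by intercepting CUDA calls in user space rather than patching the kernel, so nvidia-smi inside the container still reports the physical card's full 15079 MiB.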

> If your cgroup driver is systemd, you need to add the corresponding flag to gpu-manager.

How do I add that flag?

zxt620 avatar Apr 30 '21 03:04 zxt620

You can see how to do it in the README; add the parameter to gpu-manager.yaml 😊


ZeoSophia avatar Apr 30 '21 18:04 ZeoSophia

> Upgrade to v1.1.2
>
> I use this version; the problem still exists.

I also use v1.1.2 and have the same problem. Have you solved it?

yu7508 avatar Nov 01 '21 01:11 yu7508