gpushare-scheduler-extender

Some problem with automatic GPU card allocation?

Open guobingithub opened this issue 4 years ago • 6 comments

Hello, my GPU server has 4 GPU cards (7611 MiB each). Right now three containers are running on card gpu0, and together they use 7601 MiB. When I run a new container, I expect it to land on gpu1, gpu2, or gpu3, but it does not run on gpu1/gpu2/gpu3 at all!!! It actually fails to run (**CrashLoopBackOff**)!

```
root@server:~# kubectl get po
NAME                         READY   STATUS             RESTARTS   AGE
binpack-1-5cb847f945-7dp5g   1/1     Running            0          3h33m
binpack-2-7fb6b969f-s2fmh    1/1     Running            0          64m
binpack-3-84d8979f89-d6929   1/1     Running            0          59m
binpack-4-669844dd5f-q9wvm   0/1     CrashLoopBackOff   15         56m
ngx-dep1-69c964c4b5-9d7cp    1/1     Running            0          102m
```

My GPU server info:

```
root@server:~# nvidia-smi
Wed May 20 18:18:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   65C    P0    25W /  75W |   7601MiB /  7611MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   35C    P8     6W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P4            Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   32C    P8     6W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24689      C   python                                      7227MiB |
|    0     45236      C   python                                       151MiB |
|    0     47646      C   python                                       213MiB |
+-----------------------------------------------------------------------------+
```

And my binpack-4.yaml is below:

```
root@server:/home/guobin/gpu-repo# cat binpack-4.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-4
  labels:
    app: binpack-4
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-4
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-4
    spec:
      containers:
      - name: binpack-4
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # MiB
            aliyun.com/gpu-mem: 200
```

As you can see, aliyun.com/gpu-mem is set to 200 MiB.

OK! That is all the important info. Why can't this plugin automatically allocate a GPU card? Or is there something I need to modify?

Thanks for your help!

guobingithub avatar May 20 '20 10:05 guobingithub

@cheyang can you help me? Thanks very much.

guobingithub avatar May 21 '20 09:05 guobingithub

I think 200 MiB is not enough to run the TensorFlow application.
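Something like this should work (the 7200 below is just an example value; set the limit to whatever your application actually needs, and it has to fit within the free memory of a single card):

```sh
# Edit binpack-4.yaml so the limit matches what the app really needs,
# e.g. aliyun.com/gpu-mem: 7200 (the unit is MiB in this setup).
# Then re-apply the Deployment and watch the new pod come up:
kubectl apply -f binpack-4.yaml
kubectl get pods -w
```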

cheyang avatar May 28 '20 11:05 cheyang

@cheyang OK, thank you! As you suggested, I set it to 7200 MiB, but it did not work! binpack-4 still cannot start.

The problem is: I have 4 GPU cards, each with 7611 MiB, and when I run binpack-1/binpack-2/binpack-3/binpack-4, all 4 containers run on card gpu0! And binpack-4 fails to run......

Why can't these 4 containers be placed on the other GPU cards automatically??

guobingithub avatar May 29 '20 02:05 guobingithub

Did you install kubectl-inspect-gpushare? You can check the allocation with that CLI.
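Roughly like this (the `-d` detail flag may depend on the version of the plugin you installed):

```sh
# Show how much aliyun.com/gpu-mem the extender has allocated on each GPU.
# Note this reflects the scheduler's bookkeeping, not the raw nvidia-smi usage.
kubectl inspect gpushare

# Per-pod detail, if your version of the plugin supports it:
kubectl inspect gpushare -d

# Also check the pod's events for scheduling errors:
kubectl describe pod binpack-4-669844dd5f-q9wvm
```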

cheyang avatar May 29 '20 16:05 cheyang

@guobingithub Did you solve the problem? I have met the same one and need help.

lizongnan avatar Aug 12 '20 07:08 lizongnan

Hello @cheyang, I have installed kubectl-inspect-gpushare. Below is the output of `kubectl inspect gpushare` and `nvidia-smi`. As you can see, the pods together request 18960 MiB of GPU memory, which is significantly larger than the memory of a single GPU. Even so, these pods are not deployed to the other GPUs (GPUs 1-3 on master and GPUs 0-3 on node6). So what is the reason? Looking forward to your help!

```
[root@master k8s] kubectl inspect gpushare
NAME    IPADDRESS     GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  PENDING(Allocated)  GPU Memory(MiB)
master  192.168.4.15  0/11178                0/11178                0/11178                0/11178                18960               18960/44712
node6   192.168.4.16  0/11178                0/11178                0/11178                0/11178                                    0/44712
Allocated/Total GPU Memory In Cluster: 18960/89424 (21%)
```

```
[root@master k8s] nvidia-smi
Wed Aug 12 05:26:03 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   32C    P8     8W / 250W |  11114MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   32C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   36C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   35C    P8    10W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

lizongnan avatar Aug 12 '20 09:08 lizongnan