All GPUs on the host are visible to a job submitted with '--gpus=0'

nowenL opened this issue 3 years ago · 5 comments

Env

arena version: v0.8.6+a2bec8c

k8s server version: {Major:"1", Minor:"20+", GitVersion:"v1.20.4-aliyun.1", GitCommit:"7a23884", GitTreeState:"", BuildDate:"2021-05-31T13:47:24Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Problem

  1. Submit a TFJob using a Docker image that sets NVIDIA_VISIBLE_DEVICES=all, with the flag '--gpus=0':
arena submit tf --gpus=0 --name=test --namespace="train-ai" --image="nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04" sleep 99999
  2. Attach to the job and run nvidia-smi; all GPUs are visible to the job:
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0    39W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   50C    P0   261W / 300W |  17706MiB / 32510MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   46C    P0   157W / 300W |  31426MiB / 32510MiB |     84%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   37C    P0   251W / 300W |  30684MiB / 32510MiB |     45%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
  3. Expected: no GPU should be visible to the job, since '--gpus=0' was set on the CLI (see the quick check below).
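
For reference, the stock nvidia/cuda images bake NVIDIA_VISIBLE_DEVICES=all into the image, and since '--gpus=0' requests no nvidia.com/gpu resource, nothing overrides that value, so the NVIDIA container runtime exposes every device on the host. A quick way to confirm the environment from outside the container (a hedged sketch; <job-pod> is a placeholder for the job's actual worker pod):

# <job-pod> is a placeholder; substitute the pod created for the test job
kubectl exec -n train-ai <job-pod> -- env | grep NVIDIA
# values inherited from the nvidia/cuda base image, e.g.:
# NVIDIA_VISIBLE_DEVICES=all
# NVIDIA_DRIVER_CAPABILITIES=compute,utility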

nowenL avatar Sep 03 '21 06:09 nowenL

/assign @happy2048

cheyang avatar Sep 22 '21 04:09 cheyang

@cheyang: GitHub didn't allow me to assign the following users: happy2048.

Note that only kubeflow members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign @happy2048

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-robot avatar Sep 22 '21 04:09 google-oss-robot

@nowenL Basically, we can think of '--gpus=0' as a flag that tells arena not to attach any GPU devices to the job containers explicitly. However, that has slightly different semantics from a job submitted without any GPU parameters at all, even though both mean running the job with CPU only.
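
For concreteness, a minimal contrast of the two forms being discussed (a sketch of the semantics described above, not a confirmed behavior spec):

# explicit CPU-only request: the user expects arena to keep GPU devices away from the container
arena submit tf --gpus=0 --name=job-a --image="nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04" sleep 99999
# no GPU flag at all: GPU exposure is left to whatever the image and container runtime decide
arena submit tf --name=job-b --image="nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04" sleep 99999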

Two quick questions to confirm:

  1. What's the expected behavior when you use --gpus=0? Do you mean you just want to run this job without a GPU?
  2. If yes, then why is a CUDA image used for the job?

wsxiaozhang avatar Sep 23 '21 14:09 wsxiaozhang

@wsxiaozhang Thanks for the reply. To answer your questions:

What's the expected behavior when you use --gpus=0? Do you mean you just want to run this job without a GPU?

As mentioned above, I expect no GPU to be visible to the job. And yes, I want to run the job without a GPU.

If yes, then why is a CUDA image used for the job?

It's a minimal example to reproduce the issue. This can also happen in practice, for example when users reuse their GPU base image for a CPU-only training workload. In any case, the root cause is 'NVIDIA_VISIBLE_DEVICES=all'; the CUDA image is just one way to trigger it.

Another major concern is that, by simply setting an environment variable, any cluster user can gain access to GPUs on the host even when they are assigned to other jobs. This looks like a vulnerability and can become critical in some cases.
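
To illustrate the concern (a hedged sketch; <gpu-node> and <job-pod> are placeholders, not names from this cluster): the scheduler may show a node's GPUs as allocated to other jobs, while this CPU-only pod can still see and drive the same devices:

# scheduler's view: how many nvidia.com/gpu are allocated on the node
kubectl describe node <gpu-node> | grep -A 5 "Allocated resources"
# container's view: the same physical GPUs are fully visible to the CPU-only job
kubectl exec -n train-ai <job-pod> -- nvidia-smi -L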

nowenL avatar Sep 24 '21 08:09 nowenL

@nowenL Got your points now, that's fair. The coming release will fix this by overwriting NVIDIA_VISIBLE_DEVICES with the value 'void', which also disables NVIDIA_DRIVER_CAPABILITIES. That way, as long as you specify --gpus=0 or --worker_gpus=0, arena will disable GPU mounting accordingly.
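
Until that release is available, a workaround in the same spirit is to override the variable explicitly at submit time (a sketch only; it assumes arena's submit command accepts an --env flag for passing environment variables into the job containers):

# explicitly neutralize the value inherited from the CUDA base image;
# 'void' tells the NVIDIA container runtime to expose no GPU devices
arena submit tf --gpus=0 --name=test --namespace="train-ai" --env=NVIDIA_VISIBLE_DEVICES=void --image="nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04" sleep 99999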

wsxiaozhang avatar Sep 27 '21 11:09 wsxiaozhang