arena
All GPUs on the host are visible to a job submitted with '--gpus=0'
Env
arena version: v0.8.6+a2bec8c
k8s server version: {Major:"1", Minor:"20+", GitVersion:"v1.20.4-aliyun.1", GitCommit:"7a23884", GitTreeState:"", BuildDate:"2021-05-31T13:47:24Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Problem
- Submit a TFJob with '--gpus=0', using a Docker image that sets NVIDIA_VISIBLE_DEVICES=all:
arena submit tf --gpus=0 --name=test --namespace="train-ai" --image="nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04" sleep 99999
- Attach to the job and run nvidia-smi; all GPUs on the host are visible to the job:
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0    39W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   50C    P0   261W / 300W |  17706MiB / 32510MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   46C    P0   157W / 300W |  31426MiB / 32510MiB |     84%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   37C    P0   251W / 300W |  30684MiB / 32510MiB |     45%      Default |
+-------------------------------+----------------------+----------------------+
- Expected: no GPU is visible to the job, since '--gpus=0' is set on the CLI (see the verification sketch below).
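For reference, here is a hedged sketch of how the scheduling side of the repro can be cross-checked; the pod name is a placeholder, and the name pattern of the pods arena creates is an assumption, so adjust to whatever the TFJob actually produced.

# Find the pod created for the job (name pattern is an assumption)
kubectl get pods -n train-ai | grep test

# Confirm the container requested no nvidia.com/gpu resources
kubectl get pod <test-pod> -n train-ai -o jsonpath='{.spec.containers[0].resources}'

# Inspect the effective environment; NVIDIA_VISIBLE_DEVICES=all comes from the CUDA image
kubectl exec -n train-ai <test-pod> -- env | grep NVIDIA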
/assign @happy2048
@cheyang: GitHub didn't allow me to assign the following users: happy2048.
@nowenL Basically, we can think of '--gpus=0' as a flag that tells arena not to attach GPU devices to the job containers explicitly. However, that may carry different semantics from a job submitted without any GPU parameters at all, even though both mean running the job on CPU only.
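A rough illustration of the underlying mechanism (not arena's implementation): on a node where the NVIDIA container runtime is the default Docker runtime, the image-level env var alone drives device injection, which is why "not attaching GPUs explicitly" is not the same as hiding them. The commands below assume such a node.

# All GPUs are injected purely because the CUDA image sets NVIDIA_VISIBLE_DEVICES=all
docker run --rm nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04 nvidia-smi -L

# Overriding the env to 'void' stops the injection; this is expected to fail to
# find nvidia-smi or to report no devices, since nothing gets mounted
docker run --rm -e NVIDIA_VISIBLE_DEVICES=void nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04 nvidia-smi -L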
Two quick questions to confirm:
- What is the expected behavior when you use --gpus=0? Do you mean you just want to run this job without a GPU?
- If yes, then why is the CUDA image used for the job?
@wsxiaozhang Thanks for the reply. To answer your questions:
What is the expected behavior when you use --gpus=0? Do you mean you just want to run this job without a GPU?
As mentioned above, I expect no GPU to be visible to the job. And yes, I want to run the job without a GPU.
If yes, then why is the CUDA image used for the job?
It's a minimal example to reproduce the issue. This can happen in practice as well; for example, users may reuse their GPU base image for a CPU training workload. In any case, the root cause is 'NVIDIA_VISIBLE_DEVICES=all', and the CUDA image is just one way to trigger it.
Another major concern is that, by simply setting an environment variable, any cluster user can gain access to GPUs on the host, even ones already assigned to other jobs. This looks like a vulnerability and can become critical in some cases.
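To make that concern concrete, here is a minimal sketch of the escalation, assuming a node that uses the NVIDIA container runtime; the pod name is hypothetical and the pod requests no GPU resources at all, yet the env var alone would expose every device on the node.

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-grab            # hypothetical name, for illustration only
  namespace: train-ai
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04
    command: ["nvidia-smi"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    # note: no resources.limits for nvidia.com/gpu anywhere in this spec
EOF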
@nowenL Got your points now, that's fair. The coming release will fix this by overwriting NVIDIA_VISIBLE_DEVICES with the value 'void', which also disables NVIDIA_DRIVER_CAPABILITIES. That way, as long as you specify --gpus=0 or --worker_gpus=0, arena will disable GPU mounting accordingly.
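Once that release is available, a quick verification might look like the sketch below; the pod name is a placeholder, and the exact failure mode of nvidia-smi depends on the runtime setup.

# Resubmit the same job as in the original report with --gpus=0, then:
kubectl exec -n train-ai <test-pod> -- env | grep NVIDIA_VISIBLE_DEVICES
# expected: NVIDIA_VISIBLE_DEVICES=void

# With 'void', the runtime should inject no devices (and no driver files),
# so nvidia-smi should either be missing or report no GPUs
kubectl exec -n train-ai <test-pod> -- nvidia-smi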