nvidia-docker
MPS Support
Hi,
When I use the CUDA Multi-Process Service (MPS) in an nvidia-docker environment, I run into a couple of issues, so I'm wondering whether MPS is supported in nvidia-docker. Please help me, thanks in advance~
Here are the problems I have met:
- When I run `nvidia-cuda-mps-control -d` to start the MPS daemon inside nvidia-docker, I can't see this process from `nvidia-smi`; however, I can see the process from the host machine. In comparison, when I run the same command, `nvidia-cuda-mps-control -d`, on the host machine (a physical server), I do see it from `nvidia-smi` (you need to run a GPU program first to actually start the MPS server).
- I tried running Caffe training with MPS as an example, with 2 training processes at the same time in the nvidia-docker environment. It showed:
F0703 13:39:15.539633 97 common.cpp:165] Check failed: error == cudaSuccess (46 vs. 0) all CUDA-capable devices are busy or unavailable
In comparison, this works fine on the host (physical machine).
I'm trying this on a P100 GPU with Ubuntu 14:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
Docker version 17.04.0-ce, build 4845c56
I hope this is the right place to ask, thanks again.
Short answer: it is not supported for now. However, we are looking at it for the 2.0 timeframe, but there are a lot of corner cases that need to be investigated.
I'll update this issue with additional information once we are confident it could work properly.
Hi, does 2.0 support CUDA 9 for Volta MPS now? @3XX0, thanks.
This lack of MPS support seems like it would be a blocker for creating service deployments in orchestration. I'll be following the outcome in anticipation of a pull request covering the Swarm or Kubernetes use case.
Any progress? or is there any workaround so I can use CUDA Multi-Process Service in the container?
Shouldn't it be the other way around? I.e., the MPS server should run on the host so it can allocate process time to multiple containers? Is that an already supported architecture?
With 2.0 it should work as long as you run the MPS server on the host and use `--ipc=host`. We're working toward a better integration though, so I'll keep this issue open.
```shell
# Launch two containers on the second GPU device
sudo CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 nvidia-cuda-mps-control -d
docker run -ti --rm -e NVIDIA_VISIBLE_DEVICES=1 --runtime=nvidia --ipc=host nvidia/cuda
docker run -ti --rm -e NVIDIA_VISIBLE_DEVICES=1 --runtime=nvidia --ipc=host nvidia/cuda
echo quit | sudo nvidia-cuda-mps-control
```
@3XX0,
Does it mean that we can set and limit CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for each container? Any examples of usage would really help.
Could you please elaborate what you mean by "better integration"?
Thank you
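For illustration, per-container limits might look roughly like this. This is a sketch, not a confirmed supported setup: it assumes the MPS daemon is already running on the host, the 30/70 split is arbitrary, and `./my_app` is a placeholder for an actual CUDA workload.

```shell
# Hypothetical sketch: two containers sharing GPU 0 through a host-side MPS daemon,
# each limited to a fraction of the SMs. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is
# read by the CUDA client library when the context is created.
docker run -d --runtime=nvidia --ipc=host \
    -e NVIDIA_VISIBLE_DEVICES=0 \
    -e CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=30 \
    nvidia/cuda ./my_app   # placeholder workload

docker run -d --runtime=nvidia --ipc=host \
    -e NVIDIA_VISIBLE_DEVICES=0 \
    -e CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=70 \
    nvidia/cuda ./my_app   # placeholder workload
```

Note that this variable can also be set on the MPS control daemon itself, in which case it acts as a default for all clients; whether a per-client override is honored depends on the driver version.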
@3XX0 How much does `--ipc=host` compromise security? Somebody asked the question on SO but there's no answer yet: https://stackoverflow.com/questions/38907708/docker-ipc-host-and-security
@3XX0 Any update on when nvidia-docker will officially support MPS?
@3XX0 I did some tests and `--ipc=host` does appear to work. But is there anything else we should pay attention to when running the current nvidia-docker 2 under MPS? Would you recommend using it in production? It would be super helpful if you could provide some guidance here.
I've added a wiki page on how to use MPS with Docker Compose: https://github.com/NVIDIA/nvidia-docker/wiki/MPS-(EXPERIMENTAL)
You can look at the `docker-compose.yml` file for implementation details.
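For anyone who can't open the wiki, the relevant part of such a compose file presumably boils down to something like the following. This is a sketch under my own assumptions, not the wiki's exact file; service names and the image are placeholders.

```yaml
version: "2.3"
services:
  mps-daemon:
    image: nvidia/cuda          # placeholder image
    runtime: nvidia
    ipc: host
    cap_add:
      - SYS_ADMIN               # assumed: lets the container set EXCLUSIVE_PROCESS mode
    command: nvidia-cuda-mps-control -f   # -f keeps the daemon in the foreground
  worker:
    image: nvidia/cuda          # placeholder image
    runtime: nvidia
    ipc: host
    depends_on:
      - mps-daemon
```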
Hi @flx42, is it possible to provide a compose file whose format version is 2.1? Lots of companies still use Docker 1.12 in their clusters and cannot upgrade their Docker version to 17.06 in the short term.
@azazhu are you running RHEL/Atomic's fork of Docker? If you do, you can just remove the `runtime:` lines and it should work fine. That's the `docker` package on RHEL/CentOS and probably other derivatives.
If that's not what you are running, you won't be able to make it work, since the `runtime` option requires format 2.3:
https://docs.docker.com/compose/compose-file/compose-versioning/#version-23
Thx @flx42, could you check whether my understanding is correct:
- nvidia-docker can work with Volta MPS even if we don't use the docker-compose file you provided, right?
- We just need to: a) use nvidia-docker2; b) (recommended) set EXCLUSIVE_PROCESS compute mode on the host machine; c) start the MPS daemon (nvidia-cuda-mps-control) on the host machine; d) set CUDA_MPS_PIPE_DIRECTORY on the host machine; e) make sure the container can read the path of CUDA_MPS_PIPE_DIRECTORY by using -v; f) start the container with "--ipc=host". Are my a) through f) right?
- Another question: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE should be set in the container instead of on the host machine, right?
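The steps above, spelled out with the plain docker CLI, might look roughly like this. A sketch under stated assumptions: GPU 0, a pipe directory of /tmp/nvidia-mps, and the base `nvidia/cuda` image; adjust paths and device indices to your setup.

```shell
# a/b) On the host: set EXCLUSIVE_PROCESS compute mode on GPU 0 (needs root)
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# c/d) On the host: pick a pipe directory and start the MPS control daemon
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
nvidia-cuda-mps-control -d

# e/f) Run the container with the pipe directory bind-mounted and host IPC
docker run -ti --rm --runtime=nvidia --ipc=host \
    -e NVIDIA_VISIBLE_DEVICES=0 \
    -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
    -v /tmp/nvidia-mps:/tmp/nvidia-mps \
    nvidia/cuda

# Cleanup: stop the daemon and restore the default compute mode
echo quit | nvidia-cuda-mps-control
sudo nvidia-smi -i 0 -c DEFAULT
```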
Yes, that should work. But you can also containerize the MPS daemon, like in the Docker Compose example. I need to document the steps with the docker CLI too.
> Another question: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE should be set in the container instead of on the host machine, right?
IIRC you can set this value for the MPS daemon, or for all CUDA client apps. I think both work fine.
Thx @flx42, what do you mean by "containerize the MPS daemon"? Launching the MPS daemon (nvidia-cuda-mps-control) on both the host machine and in the container? In my experiment, I only launched nvidia-cuda-mps-control on the host machine (I didn't launch it in the container) and it looks like it works fine.
Yes, you can launch it inside a container or on the host. Both ways will work.
Hi @flx42,
- It would be great if you could document the steps with the docker CLI, as I failed to launch docker-compose. I got "ERROR: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network" in my work env. I tried changing the "bip" to avoid the subnet conflict, but still hit the same error.
- I use the method I mentioned above and it works, but it looks different from https://github.com/NVIDIA/nvidia-docker/wiki/MPS-(EXPERIMENTAL).
In `docker-compose.yml`, it looks like the container has sys admin permission, so the container can set the GPU mode (to EXCLUSIVE_PROCESS) and launch the MPS daemon by itself (please correct me if my understanding is wrong). With the method I used, the GPU mode is set and the MPS daemon is launched by the host machine, and the container doesn't have sys admin permission. Both methods can work, right?
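For reference, the containerized-daemon variant might be sketched like this. These commands are my own assumptions about how the wiki's compose file translates to the docker CLI, not a confirmed recipe; GPU 0 and the `nvidia/cuda` image are placeholders.

```shell
# Hypothetical variant: the MPS daemon runs inside a container instead of on
# the host. CAP_SYS_ADMIN is assumed to be what allows it to change the
# compute mode; -f keeps nvidia-cuda-mps-control in the foreground.
docker run -d --runtime=nvidia --ipc=host --cap-add=SYS_ADMIN \
    -e NVIDIA_VISIBLE_DEVICES=0 \
    nvidia/cuda \
    sh -c "nvidia-smi -c EXCLUSIVE_PROCESS && nvidia-cuda-mps-control -f"

# Client containers then only need --ipc=host, no extra capabilities
docker run -ti --rm --runtime=nvidia --ipc=host \
    -e NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda
```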
@flx42 Does MPS support Pascal GPUs in nvidia-docker containers?
@GoodJoey not with the approach documented above, you would need a Volta GPU.
@flx42 In the MPS wiki, does Volta mean the Volta architecture or a Volta GPU in the sentence 'Only Volta MPS is supported'? Also, does 7.0 mean Compute Capability 7.0 in the sentence 'NVIDIA GPU with Architecture >= Volta (7.0)'? Looking forward to your reply, thanks!
Seems like MPS is not supported on the newest Docker version; in particular, it's not `--runtime=nvidia` but `--gpus=all` now.
Also the missing support for docker-compose is annoying.
This example shows well that the containers have some kind of problem with CUDA:
```shell
sudo CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 nvidia-cuda-mps-control -d  # start daemon
docker run -it --rm -e NVIDIA_VISIBLE_DEVICES=1 --gpus=all --ipc=host tensorflow/tensorflow:2.1.0-gpu-py3 python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
echo quit | sudo nvidia-cuda-mps-control  # shutdown daemon
```
Would really love to see "usable" support of mps with docker
Any update on this issue?
> Seems like MPS is not supported on the newest Docker version; in particular, it's not `--runtime=nvidia` but `--gpus=all` now. Also the missing support for docker-compose is annoying. This example shows well that the containers have some kind of problem with CUDA. Would really love to see "usable" support of MPS with docker.
Hi, have you solved this problem?
> @3XX0, does it mean that we can set and limit CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for each container? Any examples of usage would really help. Could you please elaborate what you mean by "better integration"? Thank you
Hi, have you solved this problem? I want to set a different CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for each container, such as 3×30% and 1×10% on a specific GPU.
any update?
We are working on a DRA driver for NVIDIA GPUs (https://github.com/NVIDIA/k8s-dra-driver), which will include better MPS support.
If there are use cases not covered by this (e.g. outside of K8s), please create an issue describing the use case against https://github.com/NVIDIA/nvidia-container-toolkit.