
When to support GPU Memory isolation

Open zhaogaolong opened this issue 5 years ago • 6 comments

I saw in the docs that there are plans to support this:

https://yq.aliyun.com/articles/690623

"Isolation: what we mainly discuss here is scheduling; the isolation solution will be implemented in the future based on Nvidia's MPS." When do you plan to start supporting it? Is there a roadmap?

zhaogaolong avatar Feb 26 '20 09:02 zhaogaolong

We've been working on GPU memory isolation for some time. It's a somewhat complex feature and needs more testing to make sure it is stable and safe. It's not based on MPS, which can become unstable when some clients crash. Hopefully we'll have our solution ready to deploy this coming April.

wsxiaozhang avatar Mar 09 '20 02:03 wsxiaozhang

Hi @wsxiaozhang ,

we are also very interested in this feature for running CUDA workloads in our JupyterHub infrastructure.

So far, I was planning to implement the approach described here: https://github.com/NVIDIA/nvidia-docker/wiki/MPS-(EXPERIMENTAL)

Do you have any code to share? We would volunteer to test it :)

stv0g avatar Apr 17 '20 15:04 stv0g

Hi there, is there any news about GPU memory isolation?

Mhs-220 avatar Jun 24 '20 01:06 Mhs-220

Here are some duplicate and/or related issues: #51, #76

stv0g avatar Sep 23 '20 16:09 stv0g

Hello, one year has passed; is there any news about GPU CUDA/memory isolation?

MC17 avatar Oct 28 '21 03:10 MC17

I have just released nvshare, a transparent GPU sharing mechanism without memory constraints for bare-metal and Kubernetes.

  • With nvshare you can safely run multiple processes/containers concurrently on the same GPU, with each process/container having the whole physical GPU memory available.
  • nvshare doesn't impose any memory limit on containers and preserves security (unlike, for example, MPS), as each container runs in its own CUDA context, unchanged.
  • GPU memory is fully isolated between processes/containers/Pods, as they use different CUDA contexts and therefore have distinct page tables.
    • nvshare lets each process handle its CUDA context unchanged, so the level of isolation (memory access, errors) is the same as that offered by CUDA by default.

You can find it at https://github.com/grgalex/nvshare.
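For illustration only (this snippet is not taken from the nvshare repo), below is a minimal, ordinary CUDA program of the kind such a transparent sharing mechanism would need to handle unmodified: it queries device memory, allocates a 1 GiB buffer, and runs a kernel in its own CUDA context. Under the model described above, two containers could each run a binary like this on the same GPU concurrently, each within its own context.

```cuda
// Minimal sketch of an unmodified CUDA workload (illustrative, not part of nvshare).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float *buf, size_t n, float v) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] = v;
}

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);   // each process sees the GPU's memory through its own context
    printf("free: %zu MiB, total: %zu MiB\n", free_b >> 20, total_b >> 20);

    const size_t n = 256UL << 20;        // 256M floats = 1 GiB
    float *buf = nullptr;
    if (cudaMalloc(&buf, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    fill<<<(unsigned int)((n + 255) / 256), 256>>>(buf, n, 1.0f);
    cudaDeviceSynchronize();             // the kernel runs entirely inside this process's CUDA context

    cudaFree(buf);
    return 0;
}
```

The point of the sketch is simply that the program uses only the standard CUDA runtime API and knows nothing about sharing; any isolation between two such processes comes from their separate CUDA contexts, as described in the bullets above.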

grgalex avatar Jun 03 '23 17:06 grgalex