gpushare-scheduler-extender
When to support GPU Memory isolation
I saw in the documentation (https://yq.aliyun.com/articles/690623) that you plan to support isolation: "What we mainly discuss here is scheduling; the isolation solution will be implemented based on NVIDIA's MPS in the future." When do you plan to start supporting it? Is there a schedule?
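For context on what "scheduling only" means in practice here: nothing at the CUDA driver level enforces the per-container limit, so a container can allocate well past the GPU-memory amount it requested in its Pod spec. A minimal, hypothetical CUDA sketch (not part of this repository) that illustrates this:

```cuda
// Hypothetical standalone test, not part of this repository: with
// scheduling-only sharing, the CUDA driver itself does not cap a
// container's allocations, so this loop keeps growing until the whole
// physical GPU is exhausted, regardless of what the Pod requested.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t chunk = 256ull << 20;  // 256 MiB per allocation
    size_t allocated = 0;
    while (true) {
        void *p = nullptr;
        if (cudaMalloc(&p, chunk) != cudaSuccess)
            break;  // stops only when the GPU itself runs out of memory
        allocated += chunk;
    }
    printf("allocated %zu MiB with no per-container limit enforced\n",
           allocated >> 20);
    return 0;
}
```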
We've been working on GPU memory isolation for some days now. It is a somewhat complex feature and needs more testing to make sure it is stable and safe. It is not based on MPS, which can become unstable when a client crashes. Hopefully we will have our solution ready to deploy in the coming April.
Hi @wsxiaozhang,
we are also very interested in this feature for running CUDA workloads in our JupyterHub infrastructure.
So far, I was planning to implement the approach described here: https://github.com/NVIDIA/nvidia-docker/wiki/MPS-(EXPERIMENTAL)
Do you have any code to share? We would volunteer to test it :)
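For anyone else trying that wiki approach, here is a rough, hypothetical sketch of the container side (the daemon command and pipe directory are the standard MPS defaults, not something from this repo): with nvidia-cuda-mps-control -d running on the host and the MPS pipe directory (CUDA_MPS_PIPE_DIRECTORY, /tmp/nvidia-mps by default) mounted into the container, an unmodified CUDA program attaches to the MPS server when it creates its context:

```cuda
// Hypothetical smoke test for the MPS setup linked above. If
// nvidia-cuda-mps-control is running on the host and its pipe directory
// (CUDA_MPS_PIPE_DIRECTORY, /tmp/nvidia-mps by default) is visible inside
// the container, the context created here is an MPS client; otherwise the
// same unmodified binary simply gets a regular private context.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *p = nullptr;
    // Force context creation plus a small allocation so the attach actually happens.
    cudaError_t err = cudaMalloc(&p, 1 << 20);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA initialization failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    printf("context up: %zu MiB free of %zu MiB total\n",
           freeMem >> 20, totalMem >> 20);
    cudaFree(p);
    return 0;
}
```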
Hi there, is there any news about GPU memory isolation?
Here are some duplicate and/or related issues: #51, #76.
Hello, one year has passed; is there any news about GPU CUDA/memory isolation?
I have just released nvshare, a transparent GPU sharing mechanism without memory constraints for bare-metal and Kubernetes.

- With nvshare you can concurrently run multiple processes/containers on the same GPU safely, each process/container having the whole physical GPU memory available.
- nvshare doesn't impose any memory limit on containers and respects security (unlike, for example, MPS), as each container runs in its own CUDA context, unchanged.
- The GPU memory between processes/containers/Pods is fully isolated, as they have distinct page tables, because they use different CUDA contexts.
- nvshare lets each process handle its CUDA context unchanged, so the level of isolation (memory access, errors) is the same offered by default by CUDA.

You can find it at https://github.com/grgalex/nvshare.
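A quick, hypothetical way to sanity-check the "whole physical GPU memory available to each process" claim is to run the same query from two Pods sharing one GPU; since each process keeps its own CUDA context, each should report the full device rather than a hard-partitioned slice:

```cuda
// Hypothetical check, not from the nvshare repository: run this binary in
// two containers that share one GPU; each process holds its own CUDA
// context, and each should report the full physical device memory instead
// of a carved-out fraction.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeMem = 0, totalMem = 0;
    if (cudaMemGetInfo(&freeMem, &totalMem) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    printf("this process sees %zu MiB total (%zu MiB currently free)\n",
           totalMem >> 20, freeMem >> 20);
    return 0;
}
```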