criu
CRIU CUDA support
Any plans for supporting C/R for cuda applications?
No. Also see #527. There is no way to extract the state of the program on the GPU. If there were even a theoretical way to extract that state from the GPU, then it would at least be theoretically possible to implement something like this.
With CRIU plugins it could be possible with a lot of help from the hardware vendor, but right now I would say it is not possible.
Isn't cuda-gdb supposed to be able to access all info in the GPU?
Also, there's this: https://www.nvidia.com/en-us/design-visualization/solutions/vgpu-migration/
Interesting. Good to hear. I was not aware. No idea what information is available but maybe this would make a GPU plugin possible for CRIU. Is this something you are working on?
I might need to start working on it in the near future, yes.
@montekki I was wondering if you were ever able to get started on this? Do you have any other information that you can share that can hopefully make this possible?
This is a complex project, but there is some relevant work that has been done in the past:
(2018) CRUM: Checkpoint-Restart Support for CUDA’s Unified Memory
(2013) A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States
(2009) CheCUDA: A Checkpoint/Restart Tool for CUDA Applications
https://github.com/tbrand/CRCUDA
Hi, just wanted to ask if there have been any updates on this?
No updates. We are still looking for volunteers who will implement this.
A EuroSys '20 paper named "Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning" says they implemented this using CRIU.
@AHEADer thank you for sharing this. (2020) Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning
Indeed, thanks for pointing out that paper.
I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.
Seems to be an unmodified CRIU without any GPU support.
So does this mean that if we somehow copy the GPU memory and state back, we can resume by recomputing from some checkpointed state?
You should read the paper. They write that they use some kind of proxy to decouple the CPU process from the GPU process.
As long as you are able to close the connection to the GPU before checkpointing it should be doable, but the application needs to be checkpoint aware.
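To illustrate the point above about closing the GPU connection before checkpointing, here is a minimal Python sketch of what a "checkpoint aware" application could look like. It is only an illustration under stated assumptions: `FakeDevice` is a hypothetical stand-in for a real GPU connection (it is not a CUDA or CRIU API), and the hook names `before_checkpoint` and `after_restore` are made up for this example. How the application learns that a dump or restore is happening is not covered here.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU connection; not a real CUDA API."""

    def __init__(self):
        self.memory = {}      # buffer name -> values held "on the device"
        self.is_open = True

    def upload(self, name, values):
        self.memory[name] = list(values)

    def download(self, name):
        return list(self.memory[name])

    def close(self):
        # Closing drops all device-side state, as releasing a real
        # device context would.
        self.memory.clear()
        self.is_open = False


class CheckpointAwareApp:
    """Keeps a host-side copy of device state around a checkpoint."""

    def __init__(self):
        self.device = FakeDevice()
        self.host_copy = {}   # ordinary process memory, visible to a CPU-only dump

    def before_checkpoint(self):
        # 1. Copy every device buffer into host memory.
        self.host_copy = {
            name: self.device.download(name) for name in self.device.memory
        }
        # 2. Close the device connection so only CPU state remains.
        self.device.close()
        self.device = None

    def after_restore(self):
        # 3. Open a fresh connection and re-upload the saved buffers.
        self.device = FakeDevice()
        for name, values in self.host_copy.items():
            self.device.upload(name, values)
        self.host_copy = {}


app = CheckpointAwareApp()
app.device.upload("weights", [1.0, 2.0, 3.0])
app.before_checkpoint()   # a dump of the CPU process would happen after this
app.after_restore()       # and this would run once the process is restored
print(app.device.download("weights"))
```

The idea is that at dump time the process holds no open device connection, so an unmodified CRIU only has to deal with normal process memory. A real application would also have to recreate anything else tied to the device (streams, handles, loaded kernels), which this sketch leaves out.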
Many thanks for your reply. Now I understand what they do. It seems worth trying their approach.
A friendly reminder that this issue had no activity for 30 days.
Do you have any updates?
JFYI there is also some more recent research on this topic here: 2020 CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM.
BTW, I don't think this issue should be closed.
You are right. The issue has the correct label to not be closed automatically, but it seems it didn't work as expected. Let's see if it works better now.
The following paper from 2022 describes in more detail the "device proxy" approach proposed in Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning mentioned above: Singularity: Planet-Scale, Preemptible and Elastic Scheduling of AI Workloads
Are there any open-source resources on this project, or do we have to implement it from scratch ourselves?
I am not aware of open-source resources on this project. However, there are a few patents related to this work.
[…] something that will help make that possible is our work with our hardware partners, AMD and NVIDIA, which helped implementing CRIU, or Checkpoint/Restore in Usermode for their GPUs. (What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich | YouTube, minute 9:39)
It is not really clear if they actually implemented CRIU support for Nvidia GPUs or if they are just using the device proxy which was discussed here. If there is real CRIU support, I am not aware of any discussions with upstream CRIU.
Hey @adrianreber, I’m also interested in GPU checkpointing. Can you share any details on how a device proxy can be used? Thanks
https://github.com/checkpoint-restore/criu/issues/534#issuecomment-1438561823
Thank you for pointing to the link. I did read that Microsoft paper. Any plans to add support to CRIU itself? Thanks
Support for what? Nvidia GPUs can only be supported if Nvidia steps up and implements CRIU support as AMD did.