CRIU CUDA support

Are there any plans to support C/R for CUDA applications?

Jul 31 '18 15:07 montekki

No. Also see #527. There is currently no way to extract the state of the program on the GPU. If there were a way to extract that state from the GPU, it would be theoretically possible to implement something like this.

With CRIU plugins it could be possible with a lot of help from the hardware vendor, but right now I would say it is not possible.

Jul 31 '18 16:07 adrianreber

Isn't cuda-gdb supposed to be able to access all the info on the GPU?

Jul 31 '18 16:07 montekki

Also, there's this: https://www.nvidia.com/en-us/design-visualization/solutions/vgpu-migration/

Jul 31 '18 16:07 montekki

Interesting, good to hear. I was not aware of that. I have no idea what information is available, but maybe this would make a GPU plugin for CRIU possible. Is this something you are working on?

Jul 31 '18 21:07 adrianreber

I might need to start working on it in the near future, yes.

Aug 01 '18 09:08 montekki

@montekki I was wondering if you were ever able to get started on this? Do you have any other information that you can share that can hopefully make this possible?

Jan 02 '19 23:01 pavanagrawal123

This is a complex project, but there is some relevant work that has been done in the past:

(2018) CRUM: Checkpoint-Restart Support for CUDA’s Unified Memory
(2013) A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States
(2009) CheCUDA: A Checkpoint/Restart Tool for CUDA Applications

https://github.com/tbrand/CRCUDA

Jan 03 '19 00:01 rst0git

Hi, just wanted to ask if there have been any updates on this?

Aug 07 '20 09:08 Muks14x

No updates. We are still looking for volunteers who will implement this.

Aug 26 '20 17:08 avagin

At EuroSys '20, a paper titled "Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning" said they implemented this using CRIU.

Oct 17 '20 07:10 AHEADer

@AHEADer thank you for sharing this. (2020) Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning

Oct 17 '20 08:10 rst0git

Indeed, thanks for pointing out that paper.

I just had a look, and they write that they do not checkpoint the GPU part, only the CPU part.

Seems to be an unmodified CRIU without any GPU support.

Oct 17 '20 08:10 adrianreber

> Indeed, thanks for pointing out that paper.
>
> I just had a look and they write that they do not checkpoint the GPU part only the CPU part.
>
> Seems to be an unmodified CRIU without any GPU support.

So does it mean if we somehow copy the GPU memory & status back then we can resume it by recomputing based on some checkpointed status?

Oct 17 '20 09:10 AHEADer

>> Indeed, thanks for pointing out that paper.
>>
>> I just had a look and they write that they do not checkpoint the GPU part only the CPU part.
>>
>> Seems to be an unmodified CRIU without any GPU support.
>
> So does it mean if we somehow copy the GPU memory & status back then we can resume it by recomputing based on some checkpointed status?

You should read the paper. They write that they use some kind of proxy to decouple the CPU process from the GPU process.

As long as you are able to close the connection to the GPU before checkpointing it should be doable, but the application needs to be checkpoint aware.

Oct 17 '20 10:10 adrianreber
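
The "checkpoint aware" pattern from the comment above (close the connection to the GPU before checkpointing, reopen it afterwards) could look roughly like the sketch below. This is only an illustration under loud assumptions: `FakeDevice` is a hypothetical in-memory stand-in for a GPU context, not a real CUDA binding, and `pre_checkpoint` / `post_restore` are made-up hook names, not part of CRIU or CUDA.

```python
from typing import Dict, List, Optional


class FakeDevice:
    """Hypothetical stand-in for a GPU context (assumption, not a CUDA API)."""

    def __init__(self) -> None:
        self.buffers: Dict[str, List[float]] = {}
        self.is_open = True

    def _require_open(self) -> None:
        if not self.is_open:
            raise RuntimeError("device connection is closed")

    def upload(self, name: str, data: List[float]) -> None:
        # Copy host data into a named device buffer.
        self._require_open()
        self.buffers[name] = list(data)

    def download(self, name: str) -> List[float]:
        # Copy a named device buffer back to the host.
        self._require_open()
        return list(self.buffers[name])

    def scale(self, name: str, factor: float) -> None:
        # Stand-in for launching a kernel that modifies device state.
        self._require_open()
        self.buffers[name] = [x * factor for x in self.buffers[name]]

    def close(self) -> None:
        # Dropping the connection discards all device-side state.
        self.buffers.clear()
        self.is_open = False


class CheckpointAwareApp:
    """Application that can detach from the device around a checkpoint."""

    def __init__(self) -> None:
        self.device: Optional[FakeDevice] = FakeDevice()
        self.host_copy: Dict[str, List[float]] = {}

    def pre_checkpoint(self) -> None:
        # Stage every device buffer in host memory, then close the device,
        # so that only CPU-side state remains in the process.
        assert self.device is not None
        for name in list(self.device.buffers):
            self.host_copy[name] = self.device.download(name)
        self.device.close()
        self.device = None

    def post_restore(self) -> None:
        # Open a fresh device connection and push the staged buffers back.
        self.device = FakeDevice()
        for name, data in self.host_copy.items():
            self.device.upload(name, data)
        self.host_copy.clear()
```

Between `pre_checkpoint()` and `post_restore()` the process holds no device state in this sketch, which is the window in which a CPU-only checkpoint would be taken. How a real application would learn that a checkpoint is about to happen is left open here.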

>>> Indeed, thanks for pointing out that paper. I just had a look and they write that they do not checkpoint the GPU part only the CPU part. Seems to be an unmodified CRIU without any GPU support.
>>
>> So does it mean if we somehow copy the GPU memory & status back then we can resume it by recomputing based on some checkpointed status?
>
> You should read the paper. They write that they use some kind of proxy to decouple the CPU process from the GPU process.
>
> As long as you are able to close the connection to the GPU before checkpointing it should be doable, but the application needs to be checkpoint aware.

Many thanks for your reply. Now I understand what they do. It seems worth trying their approach.

Oct 17 '20 13:10 AHEADer

A friendly reminder that this issue had no activity for 30 days.

Jan 15 '21 00:01 github-actions[bot]

Do you have any updates?

Apr 16 '22 02:04 LuYilei

JFYI, there is also some more recent research on this topic: (2020) CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM.

BTW, I don't think this issue should be closed.

Feb 21 '23 13:02 andronat

> BTW, I don't think this issue should be closed.

You are right. The issue has the correct label to not be closed automatically, but it seems it didn't work as expected. Let's see if it works better now.

Feb 21 '23 14:02 adrianreber

The following paper from 2022 describes in more detail the "device proxy" approach proposed in Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning mentioned above: Singularity: Planet-Scale, Preemptible and Elastic Scheduling of AI Workloads

Feb 21 '23 14:02 rst0git

> The following paper from 2022 describes in more detail the "device proxy" approach proposed in Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning mentioned above: Singularity: Planet-Scale, Preemptible and Elastic Scheduling of AI Workloads

Are there any open-source resources on this project, or do we have to implement it from scratch ourselves?

Feb 21 '23 23:02 jsun-m

> Are there any open-source resources on this project or do we have to implement it from scratch ourselves

I am not aware of open-source resources on this project. However, there are a few patents related to this work.

Feb 22 '23 08:02 rst0git

[…] something that will help make that possible is our work with our hardware partners, AMD and NVIDIA, which helped implementing CRIU, or Checkpoint/Restore in Usermode for their GPUs. (What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich | YouTube, minute 9:39)

Jun 02 '23 14:06 0x2b3bfa0

> […] something that will help make that possible is our work with our hardware partners, AMD and NVIDIA, which helped implementing CRIU, or Checkpoint/Restore in Usermode for their GPUs. (What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich | YouTube, minute 9:39)

It is not really clear if they actually implemented CRIU support for Nvidia GPUs or if they are just using the device proxy which was discussed here. If there is real CRIU support, I am not aware of any discussions with upstream CRIU.

Jun 02 '23 15:06 adrianreber

Hey @adrianreber, I’m also interested in GPU checkpoint. Can you share any details on how device proxy can be used? Thanks

Jun 04 '23 08:06 Sharathmk99

> Hey @adrianreber, I’m also interested in GPU checkpoint. Can you share any details on how device proxy can be used? Thanks

https://github.com/checkpoint-restore/criu/issues/534#issuecomment-1438561823

Jun 04 '23 08:06 adrianreber

Thank you for pointing to the link. I did read that Microsoft paper. Any plans to add support to CRIU itself? Thanks

Jun 04 '23 08:06 Sharathmk99

> Any plans to add support to CRIU itself?

Support for what? NVIDIA GPUs can only be supported if NVIDIA steps up and implements CRIU support as AMD did.

Jun 04 '23 09:06 adrianreber
