CRIU Cuda support

Open montekki opened this issue 6 years ago • 34 comments

Any plans for supporting C/R for cuda applications?

montekki avatar Jul 31 '18 15:07 montekki

No. Also see #527. There is no way to extract the state of the program on the GPU. If there were even a theoretical way to extract the state from the GPU, it would be theoretically possible to implement something like this.

With CRIU plugins it could be possible with a lot of help from the hardware vendor, but right now I would say it is not possible.

adrianreber avatar Jul 31 '18 16:07 adrianreber

isn't cuda-gdb supposed to be able to access all info in the GPU?

montekki avatar Jul 31 '18 16:07 montekki

also, there's this https://www.nvidia.com/en-us/design-visualization/solutions/vgpu-migration/

montekki avatar Jul 31 '18 16:07 montekki

Interesting. Good to hear. I was not aware. No idea what information is available but maybe this would make a GPU plugin possible for CRIU. Is this something you are working on?

adrianreber avatar Jul 31 '18 21:07 adrianreber

I might need to start working on it in the near future, yes.

montekki avatar Aug 01 '18 09:08 montekki

@montekki I was wondering if you were ever able to get started on this? Do you have any other information that you can share that can hopefully make this possible?

pavanagrawal123 avatar Jan 02 '19 23:01 pavanagrawal123

Hi, just wanted to ask if there have been any updates on this?

Muks14x avatar Aug 07 '20 09:08 Muks14x

No updates. We are still looking for volunteers who will implement this.

avagin avatar Aug 26 '20 17:08 avagin

At EuroSys '20, a paper titled "Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning" said they implemented this using CRIU.

AHEADer avatar Oct 17 '20 07:10 AHEADer

Indeed, thanks for pointing out that paper.

I just had a look and they write that they do not checkpoint the GPU part, only the CPU part.

Seems to be an unmodified CRIU without any GPU support.

adrianreber avatar Oct 17 '20 08:10 adrianreber

So does that mean that if we somehow copy the GPU memory and state back, we can resume by recomputing from some checkpointed state?

AHEADer avatar Oct 17 '20 09:10 AHEADer

You should read the paper. They write that they use some kind of proxy to decouple the CPU process from the GPU process.

As long as you are able to close the connection to the GPU before checkpointing it should be doable, but the application needs to be checkpoint aware.
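
A minimal sketch of what "checkpoint aware" could mean here, in Python. Everything in it is hypothetical: `FakeGpu` is a stand-in for a real device context, not a CUDA or CRIU API, and `CheckpointAwareApp` only illustrates the order of operations (copy device state to host, close the device, checkpoint, reopen, copy back). In a real CUDA program these steps would roughly correspond to copying buffers back with `cudaMemcpy` and tearing the context down with `cudaDeviceReset`, but that mapping is an assumption, not something the paper or CRIU prescribes.

```python
class FakeGpu:
    """Hypothetical stand-in for a GPU context; not a real CUDA API."""

    def __init__(self):
        self.memory = {}
        self.is_open = True

    def _check_open(self):
        if not self.is_open:
            raise RuntimeError("device connection is closed")

    def upload(self, name, data):
        self._check_open()
        self.memory[name] = list(data)

    def download(self, name):
        self._check_open()
        return list(self.memory[name])

    def close(self):
        # Device memory is lost once the connection is closed.
        self.memory.clear()
        self.is_open = False


class CheckpointAwareApp:
    """Application that can drop its GPU connection on request."""

    def __init__(self):
        self.gpu = FakeGpu()
        self.host_copy = {}

    def load(self, name, data):
        self.gpu.upload(name, data)

    def prepare_for_checkpoint(self):
        # 1. Copy every device buffer into ordinary host memory.
        self.host_copy = {n: self.gpu.download(n) for n in list(self.gpu.memory)}
        # 2. Close the device so the process holds no GPU state.
        self.gpu.close()
        self.gpu = None

    def resume_after_restore(self):
        # 3. Open a fresh device connection and copy the buffers back.
        self.gpu = FakeGpu()
        for name, data in self.host_copy.items():
            self.gpu.upload(name, data)
        self.host_copy = {}


app = CheckpointAwareApp()
app.load("weights", [1, 2, 3])
app.prepare_for_checkpoint()   # safe point: only CPU-side state remains
app.resume_after_restore()     # would run after the process is restored
```

Between `prepare_for_checkpoint()` and `resume_after_restore()` the process has no open device, so an unmodified CRIU would only see host memory. How the application learns that a checkpoint is about to happen (a signal, a socket, a proxy process) is left open in this sketch.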

adrianreber avatar Oct 17 '20 10:10 adrianreber

Many thanks for your reply. Now I understand what they do. It is worth trying their approach.

AHEADer avatar Oct 17 '20 13:10 AHEADer

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Jan 15 '21 00:01 github-actions[bot]

Do you have any updates?

LuYilei avatar Apr 16 '22 02:04 LuYilei

JFYI there is also some more recent research on this topic here: 2020 CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM.

BTW, I don't think this issue should be closed.

andronat avatar Feb 21 '23 13:02 andronat

You are right. The issue has the correct label to not be closed automatically, but it seems it didn't work as expected. Let's see if it works better now.

adrianreber avatar Feb 21 '23 14:02 adrianreber

The following paper from 2022 describes in more detail the "device proxy" approach proposed in Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning mentioned above: Singularity: Planet-Scale, Preemptible and Elastic Scheduling of AI Workloads

rst0git avatar Feb 21 '23 14:02 rst0git

Are there any open-source resources on this project, or do we have to implement it from scratch ourselves?

jsun-m avatar Feb 21 '23 23:02 jsun-m

I am not aware of open-source resources on this project. However, there are a few patents related to this work.

rst0git avatar Feb 22 '23 08:02 rst0git

[…] something that will help make that possible is our work with our hardware partners, AMD and NVIDIA, which helped implementing CRIU, or Checkpoint/Restore in Usermode for their GPUs. (What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich | YouTube, minute 9:39)

0x2b3bfa0 avatar Jun 02 '23 14:06 0x2b3bfa0

It is not really clear if they actually implemented CRIU support for Nvidia GPUs or if they are just using the device proxy which was discussed here. If there is real CRIU support, I am not aware of any discussions with upstream CRIU.

adrianreber avatar Jun 02 '23 15:06 adrianreber

Hey @adrianreber, I’m also interested in GPU checkpoint. Can you share any details on how device proxy can be used? Thanks

Sharathmk99 avatar Jun 04 '23 08:06 Sharathmk99

https://github.com/checkpoint-restore/criu/issues/534#issuecomment-1438561823

adrianreber avatar Jun 04 '23 08:06 adrianreber

Thank you for pointing to the link. I did read that Microsoft paper. Any plans to add support to CRIU itself? Thanks

Sharathmk99 avatar Jun 04 '23 08:06 Sharathmk99

Support for what? Nvidia GPUs can only be supported if Nvidia steps up and implements CRIU support as AMD did.

adrianreber avatar Jun 04 '23 09:06 adrianreber