Radostin Stoyanov
@jatin-jangir CRIU pre-dump is not currently supported with the CUDA Plugin (https://github.com/checkpoint-restore/criu/commit/fc1dbc4915ddafc1f53b66f62d4df29e209976f5). We plan to enable support for this functionality soon.
@andreyvelich Thank you so much for the heads-up! We are also working on enabling support for transparent checkpointing of distributed training workloads and would be happy to collaborate on this...
/remove-lifecycle stale
@deveshgoyal1000 Thank you for working on this! In addition to Adrian's comments, it would be great if you could add more detailed commit messages. The following contributor guide provides more...
> Error (criu/page-xfer.c:299): page-xfer: Missing 7fff903b9000 in parent pagemap

@kolyshkin Would it be possible to confirm that the parent images have not been modified by another test?
> During the demo, did Podman use pure CDI implementation in managing access to external devices and libraries

Yes, checkpoint/restore of GPU workloads with Podman should work out of the box when it...
> I was considering the external mapping config, but it does not look portable for larger scale in my initial point of view.

@ZeroExistence We discussed this issue with @avagin...
For reference, the following patch adds a guard region bit to the pagemap that we can use in CRIU: https://lore.kernel.org/all/[email protected]/T/#u
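To illustrate what a guard region bit in the pagemap could look like from user space, here is a minimal Python sketch that decodes a 64-bit `/proc/<pid>/pagemap` entry. It is not CRIU code. The helper names `decode_pagemap_entry` and `read_pagemap_entry` are hypothetical, and the guard region flag position (bit 58) is an assumption based on the kernel's pagemap documentation, so please verify it against the kernel you are running.

```python
import struct

# Flag bits of a 64-bit /proc/<pid>/pagemap entry.
PM_SOFT_DIRTY = 1 << 55
PM_GUARD_REGION = 1 << 58  # assumption: bit used by the guard region patch
PM_SWAP = 1 << 62
PM_PRESENT = 1 << 63


def decode_pagemap_entry(entry: int) -> dict:
    """Decode the flag bits of one pagemap entry into booleans."""
    return {
        "present": bool(entry & PM_PRESENT),
        "swapped": bool(entry & PM_SWAP),
        "soft_dirty": bool(entry & PM_SOFT_DIRTY),
        "guard_region": bool(entry & PM_GUARD_REGION),
    }


def read_pagemap_entry(pid: int, vaddr: int, page_size: int = 4096) -> dict:
    """Read and decode the pagemap entry covering vaddr (hypothetical helper)."""
    # Each virtual page has one 8-byte entry, indexed by virtual page number.
    offset = (vaddr // page_size) * 8
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(offset)
        data = f.read(8)
    if len(data) != 8:
        raise ValueError(f"no pagemap entry for address {vaddr:#x}")
    (entry,) = struct.unpack("=Q", data)  # native byte order, unsigned 64-bit
    return decode_pagemap_entry(entry)
```

With a bit like this, a checkpointing tool could skip guard pages when dumping memory instead of trying to read them.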
Thanks @mihalicyn! Please feel free to open a draft pull request!
@hanwen-flow I was able to replicate these results locally. It looks like `docker checkpoint create` is very slow because it [uses containerd](https://github.com/moby/moby/blob/02a2f649866292b0dcc69bdf9e15827fa027e0d8/libcontainerd/remote/client.go#L427) to [create an OCI image](https://github.com/containerd/containerd/pull/1652)....