criu icon indicating copy to clipboard operation
criu copied to clipboard

criu/plugin: Add NVIDIA CUDA plugin

Open jesus-ramos opened this issue 1 year ago • 8 comments

Adding support for the NVIDIA cuda-checkpoint utility, requires the use of an r555 or higher driver along with the cuda-checkpoint binary.

jesus-ramos avatar Jun 03 '24 17:06 jesus-ramos

If you rebase the "CentOS Stream 8" test error should disappear as we removed it (it is EOL).

A README.md with some information about what is going on and how to use it would be helpful in the plugin directory. Maybe a link to the cuda-checkpoint tool.

I would also recommend to at least have two commits here. One with the changes to CRIU and a more detailed commit message why you need the changes and one commits with actual plugin.

adrianreber avatar Jun 05 '24 13:06 adrianreber

For the CRIU changes commit would you just like the 2 changes to cr-dump.c and cr-restore.c or should I include the new plugin hooks/calls in that one as well?

jesus-ramos avatar Jun 05 '24 18:06 jesus-ramos

For the CRIU changes commit would you just like the 2 changes to cr-dump.c and cr-restore.c or should I include the new plugin hooks/calls in that one as well?

One goal is to have each commit still compile. If your changes are not directly dependent on each other moving it into extra commits sounds correct. It really depends and is not always a yes or no decision. But if you already ask the question then it would probably make sense to split it, because that seems to indicate to me you also see it as two independent things.

adrianreber avatar Jun 06 '24 06:06 adrianreber

@jesus-ramos It would be great if we could add tests for this functionality. We currently don't have GPUs available in CI, but we could add tests to verify that the cuda-checkpoint tool is called with the correct options during checkpoint/restore.

rst0git avatar Jun 11 '24 12:06 rst0git

Thank you for contributing to the project. Overall, this PR looks good to me.

avagin avatar Jun 14 '24 01:06 avagin

Here is a POC patch how we can test the cuda plugin with a mocked cuda-checkpoint tool: https://github.com/avagin/criu/commit/89df45dd2ca17b019117ef0e1f46b7eeffbbecc7

avagin avatar Jun 14 '24 16:06 avagin

#11 17.04 In file included from ../../compel/include/uapi/compel/infect.h:6, #11 17.04 from ../../criu/include/proc_parse.h:6, #11 17.04 from cuda_plugin.c:6: #11 17.04 ../../compel/include/uapi/compel/asm/sigframe.h:5:10: fatal error: asm/elf.h: No such file or directory #11 17.04 5 | #include <asm/elf.h> #11 17.04 | ^~~~~~~~~~~ #11 17.04 compilation terminated.

For the ppc and arm builds do I need to add anything to the plugin Makefile to include the proper directory for this?

jesus-ramos avatar Jun 27 '24 23:06 jesus-ramos

For the ppc and arm builds do I need to add anything to the plugin Makefile to include the proper directory for this?

This error does not seem to be related to the changes in this pull request. It also occurs on the criu-dev branch.

rst0git avatar Jun 28 '24 10:06 rst0git

@ jesus-ramos I think we can merge this pr and do other improvements in separate pr-s. Does it sound good to you?

avagin avatar Jul 01 '24 18:07 avagin

That works for me.

jesus-ramos avatar Jul 01 '24 18:07 jesus-ramos

It breaks build in CI:

#11 17.14 In file included from ../../compel/include/uapi/compel/asm/sigframe.h:4,
#11 17.14                  from ../../compel/include/uapi/compel/infect.h:6,
#11 17.14                  from ../../criu/include/proc_parse.h:6,
#11 17.14                  from cuda_plugin.c:6:
#11 17.14 /usr/include/x86_64-linux-gnu/asm/sigcontext.h:40:8: error: redefinition of 'struct _fpx_sw_bytes'
#11 17.14    40 | struct _fpx_sw_bytes {
#11 17.14       |        ^~~~~~~~~~~~~
#11 17.14 In file included from /usr/include/signal.h:301,
#11 17.14                  from ../../criu/include/util.h:7,
#11 17.14                  from cuda_plugin.c:3:

https://github.com/checkpoint-restore/criu/actions/runs/9754988212/job/26922763707?pr=2427

Snorch avatar Jul 02 '24 04:07 Snorch

Is it an issue with the include order or something else? I can see about getting some time on an aarch64 machine to check it out.

jesus-ramos avatar Jul 02 '24 04:07 jesus-ramos

Is it an issue with the include order or something else? I can see about getting some time on an aarch64 machine to check it out.

Likely, in ./plugins/cuda/cuda_plugin.c:

#include "criu-log.h"
#include "plugin.h"
#include "util.h"
#include "cr_options.h"
#include "pid.h"
#include "proc_parse.h"

#include <common/list.h>
#include <compel/infect.h>

#include <ctype.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

criu headers should probably go AFTER system headers, but I'm not fully sure how it causes the problem.

Snorch avatar Jul 02 '24 08:07 Snorch

It breaks build in CI:

Andrei opened a pull request to fix this problem: https://github.com/checkpoint-restore/criu/pull/2428

rst0git avatar Jul 02 '24 08:07 rst0git