criu/plugin: Add NVIDIA CUDA plugin
Adding support for the NVIDIA cuda-checkpoint utility, requires the use of an r555 or higher driver along with the cuda-checkpoint binary.
If you rebase the "CentOS Stream 8" test error should disappear as we removed it (it is EOL).
A README.md with some information about what is going on and how to use it would be helpful in the plugin directory. Maybe a link to the cuda-checkpoint tool.
I would also recommend to at least have two commits here. One with the changes to CRIU and a more detailed commit message why you need the changes and one commits with actual plugin.
For the CRIU changes commit would you just like the 2 changes to cr-dump.c and cr-restore.c or should I include the new plugin hooks/calls in that one as well?
For the CRIU changes commit would you just like the 2 changes to cr-dump.c and cr-restore.c or should I include the new plugin hooks/calls in that one as well?
One goal is to have each commit still compile. If your changes are not directly dependent on each other moving it into extra commits sounds correct. It really depends and is not always a yes or no decision. But if you already ask the question then it would probably make sense to split it, because that seems to indicate to me you also see it as two independent things.
@jesus-ramos It would be great if we could add tests for this functionality. We currently don't have GPUs available in CI, but we could add tests to verify that the cuda-checkpoint tool is called with the correct options during checkpoint/restore.
Thank you for contributing to the project. Overall, this PR looks good to me.
Here is a POC patch how we can test the cuda plugin with a mocked cuda-checkpoint tool: https://github.com/avagin/criu/commit/89df45dd2ca17b019117ef0e1f46b7eeffbbecc7
#11 17.04 In file included from ../../compel/include/uapi/compel/infect.h:6, #11 17.04 from ../../criu/include/proc_parse.h:6, #11 17.04 from cuda_plugin.c:6: #11 17.04 ../../compel/include/uapi/compel/asm/sigframe.h:5:10: fatal error: asm/elf.h: No such file or directory #11 17.04 5 | #include <asm/elf.h> #11 17.04 | ^~~~~~~~~~~ #11 17.04 compilation terminated.
For the ppc and arm builds do I need to add anything to the plugin Makefile to include the proper directory for this?
For the ppc and arm builds do I need to add anything to the plugin Makefile to include the proper directory for this?
This error does not seem to be related to the changes in this pull request. It also occurs on the criu-dev branch.
@ jesus-ramos I think we can merge this pr and do other improvements in separate pr-s. Does it sound good to you?
That works for me.
It breaks build in CI:
#11 17.14 In file included from ../../compel/include/uapi/compel/asm/sigframe.h:4,
#11 17.14 from ../../compel/include/uapi/compel/infect.h:6,
#11 17.14 from ../../criu/include/proc_parse.h:6,
#11 17.14 from cuda_plugin.c:6:
#11 17.14 /usr/include/x86_64-linux-gnu/asm/sigcontext.h:40:8: error: redefinition of 'struct _fpx_sw_bytes'
#11 17.14 40 | struct _fpx_sw_bytes {
#11 17.14 | ^~~~~~~~~~~~~
#11 17.14 In file included from /usr/include/signal.h:301,
#11 17.14 from ../../criu/include/util.h:7,
#11 17.14 from cuda_plugin.c:3:
https://github.com/checkpoint-restore/criu/actions/runs/9754988212/job/26922763707?pr=2427
Is it an issue with the include order or something else? I can see about getting some time on an aarch64 machine to check it out.
Is it an issue with the include order or something else? I can see about getting some time on an aarch64 machine to check it out.
Likely, in ./plugins/cuda/cuda_plugin.c:
#include "criu-log.h"
#include "plugin.h"
#include "util.h"
#include "cr_options.h"
#include "pid.h"
#include "proc_parse.h"
#include <common/list.h>
#include <compel/infect.h>
#include <ctype.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
criu headers should probably go AFTER system headers, but I'm not fully sure how it causes the problem.
It breaks build in CI:
Andrei opened a pull request to fix this problem: https://github.com/checkpoint-restore/criu/pull/2428