rr
Is it possible to run CRIU and rr together?
Hi, I am wondering whether it is possible to take a checkpoint during recording. This would be useful when testing or fuzzing a long-running program, since it allows quickly resuming from a running state.
I am already using CRIU in my project, and it works well. Recently I have wanted to use rr alongside it for record and replay (RnR).
I tried two methods:
- Try to use criu to checkpoint rr during recording. This fails with the following traces:
  Warn (compel/src/lib/infect.c:126): Unable to interrupt task: 141817 (Operation not permitted)
  Error (compel/src/lib/infect.c:233): Unseizable non-zombie 141817 found, state S, err -1/10
  Warn (compel/src/lib/infect.c:126): Unable to interrupt task: 141817 (Operation not permitted)
  Error (compel/src/lib/infect.c:233): Unseizable non-zombie 141817 found, state S, err -1/10
  Warn (compel/src/lib/infect.c:126): Unable to interrupt task: 141817 (Operation not permitted)
  Error (compel/src/lib/infect.c:233): Unseizable non-zombie 141817 found, state S, err -1/10
  Warn (compel/src/lib/infect.c:126): Unable to interrupt task: 141817 (Operation not permitted)
  Error (compel/src/lib/infect.c:233): Unseizable non-zombie 141817 found, state S, err -1/10
  Warn (compel/src/lib/infect.c:126): Unable to interrupt task: 141817 (Operation not permitted)
  Error (compel/src/lib/infect.c:233): Unseizable non-zombie 141817 found, state S, err -1/10
  Warn (compel/src/lib/infect.c:126): Unable to interrupt task: 141817 (Operation not permitted)
  Error (compel/src/lib/infect.c:233): Unseizable non-zombie 141817 found, state S, err -1/10
  Error (criu/cr-dump.c:1788): Dumping FAILED.
- Try to use rr to record the criu restore process (rr record /home/zyh/criu/criu/criu restore ...). This also fails with a segmentation fault:
  $ sudo /home/zyh/rr/build/bin/rr record /home/zyh/criu/criu/criu restore -D checkpoint_folder/0 -v4 --tcp-established
  rr: Saving execution to trace directory `/root/.local/share/rr/criu-7'.
  (00.000000) Will dump/restore TCP connections
  (00.000557) Version: 3.16.1 (gitid v3.16.1-139-g9326c1233)
  (00.000992) Running on pc Linux 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64
  .......
  task_args->pid: 140536
  task_args->nr_threads: 16
  task_args->clone_restore_fn: 0x11d10
  task_args->thread_args: 0xb4540
  (00.219692) pie: 140536: Switched to the restorer 140536
  (00.316538) Error (criu/cr-restore.c:1480): 140536 killed by signal 11: Segmentation fault
  (00.317157) Error (criu/cr-restore.c:2470): Restoring FAILED.
And if I try to replay, rr
fails with following error trace[FATAL /home/zyh/rr/src/Task.cc:847:enter_syscall()] (task 141368 (rec:140536) at time 4763) -> Assertion `session().is_recording() && !is_deterministic_signal(this)' failed to hold. got unexpected signal SIGSEGV Tail of trace dump: { real_time:2938712.703628 global_time:4743, event:`SYSCALL: umask' (state:EXITING_SYSCALL) tid:140536, ticks:787009 rax:0x12 rbx:0x5f rcx:0xffffffffffffffff rdx:0x29 rsi:0x55f5a2646bc0 rdi:0x2 rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x0 r9:0x7fff54228c30 r10:0x55f5a25cdaa7 r11:0x246 r12:0x23 r13:0x0 r14:0xb4000 r15:0x55f5a306d770 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x5f fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.703748 global_time:4744, event:`SYSCALLBUF_FLUSH' tid:140536, ticks:787884 { syscall:'close', ret:0x0, size:0x10 } { syscall:'close', ret:0x0, size:0x10 } { syscall:'close', ret:0x0, size:0x10 } { syscall:'close', ret:0x0, size:0x10 } { syscall:'close', ret:0x0, size:0x10 } { syscall:'close', ret:0x0, size:0x10 } { syscall:'close', ret:0x0, size:0x10 } { syscall:'close', ret:0x0, size:0x10 } { syscall:'close', ret:0x0, size:0x10 } { syscall:'close', ret:0x0, size:0x10 } { syscall:'gettimeofday', ret:0x0, size:0x20 } } { real_time:2938712.703757 global_time:4745, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:140536, ticks:787884 rax:0xffffffffffffffda rbx:0x7f rcx:0xffffffffffffffff rdx:0x9d rsi:0x55f5a2646bc0 rdi:0x7f rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x0 r9:0x7fff54228c50 r10:0x55f5a25d7b85 r11:0x246 r12:0x0 r13:0x55f5a2646bc0 r14:0x9d r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.703764 global_time:4746, event:`SYSCALLBUF_RESET' tid:140536, ticks:787884 } { real_time:2938712.703828 global_time:4747, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:140536, ticks:787884 rax:0x9d rbx:0x7f rcx:0xffffffffffffffff rdx:0x9d 
rsi:0x55f5a2646bc0 rdi:0x7f rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x0 r9:0x7fff54228c50 r10:0x55f5a25d7b85 r11:0x246 r12:0x0 r13:0x55f5a2646bc0 r14:0x9d r15:0x0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.703989 global_time:4748, event:`PATCH_SYSCALL' tid:140536, ticks:787893 rax:0x27 rbx:0xb4000 rcx:0xffffffffffffffff rdx:0x12250 rsi:0x55f5a2646bc0 rdi:0xb4000 rbp:0x23070 rsp:0x22ed8 r8:0x0 r9:0x7fff54228c50 r10:0xffffffffffffffff r11:0x246 r12:0xc1000 r13:0x0 r14:0xb4000 r15:0x55f5a306d770 rip:0x16b37 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f6c845f3b80 gs_base:0x0 { tid:140536, addr:0x7000309e, length:0x4f } { tid:140536, addr:0x16b37, length:0x5 } { tid:140536, addr:0x16b3c, length:(nil) } } { real_time:2938712.704065 global_time:4749, event:`SYSCALLBUF_FLUSH' tid:140536, ticks:787917 { syscall:'getpid', ret:0x224f8, size:0x10 } } { real_time:2938712.704084 global_time:4750, event:`SYSCALL: rt_sigaction' (state:ENTERING_SYSCALL) tid:140536, ticks:787917 rax:0xffffffffffffffda rbx:0xd rcx:0xffffffffffffffff rdx:0x0 rsi:0x22f50 rdi:0x11 rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x0 r9:0x7fff54228c50 r10:0x8 r11:0x246 r12:0xc1000 r13:0x0 r14:0xb4000 r15:0x55f5a306d770 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xd fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.704091 global_time:4751, event:`SYSCALLBUF_RESET' tid:140536, ticks:787917 } { real_time:2938712.704132 global_time:4752, event:`SYSCALL: rt_sigaction' (state:EXITING_SYSCALL) tid:140536, ticks:787917 rax:0x0 rbx:0xd rcx:0xffffffffffffffff rdx:0x0 rsi:0x22f50 rdi:0x11 rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x0 r9:0x7fff54228c50 r10:0x8 r11:0x246 r12:0xc1000 r13:0x0 r14:0xb4000 r15:0x55f5a306d770 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xd fs_base:0x7f6c845f3b80 
gs_base:0x0 } { real_time:2938712.704235 global_time:4753, event:`SYSCALLBUF_FLUSH' tid:140536, ticks:788193 { syscall:'rt_sigprocmask', ret:0x0, size:0x18 } { syscall:'close', ret:0xfffffffffffffff7, size:0x10 } { syscall:'gettimeofday', ret:0x0, size:0x20 } { syscall:'gettid', ret:0x224f8, size:0x10 } } { real_time:2938712.704244 global_time:4754, event:`SYSCALL: write' (state:ENTERING_SYSCALL) tid:140536, ticks:788193 rax:0xffffffffffffffda rbx:0x7f rcx:0xffffffffffffffff rdx:0x39 rsi:0x22cd0 rdi:0x7f rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x5 r9:0x22c24 r10:0x22c18 r11:0x246 r12:0x0 r13:0x22cd0 r14:0x39 r15:0x22cb0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.704250 global_time:4755, event:`SYSCALLBUF_RESET' tid:140536, ticks:788193 } { real_time:2938712.704285 global_time:4756, event:`SYSCALL: write' (state:EXITING_SYSCALL) tid:140536, ticks:788193 rax:0x39 rbx:0x7f rcx:0xffffffffffffffff rdx:0x39 rsi:0x22cd0 rdi:0x7f rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x5 r9:0x22c24 r10:0x22c18 r11:0x246 r12:0x0 r13:0x22cd0 r14:0x39 r15:0x22cb0 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x1 fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.704388 global_time:4757, event:`SYSCALL: munmap' (state:ENTERING_SYSCALL) tid:140536, ticks:788205 rax:0xffffffffffffffda rbx:0xb rcx:0xffffffffffffffff rdx:0xffffffff rsi:0x10000 rdi:0x0 rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x5 r9:0x22c24 r10:0xffffffffffffffff r11:0x246 r12:0x10000 r13:0x0 r14:0x22f38 r15:0xb5000 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.704424 global_time:4758, event:`SYSCALL: munmap' (state:EXITING_SYSCALL) tid:140536, ticks:788205 rax:0x0 rbx:0xb rcx:0xffffffffffffffff rdx:0xffffffff rsi:0x10000 rdi:0x0 rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x5 r9:0x22c24 
r10:0xffffffffffffffff r11:0x246 r12:0x10000 r13:0x0 r14:0x22f38 r15:0xb5000 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.704560 global_time:4759, event:`SYSCALL: munmap' (state:ENTERING_SYSCALL) tid:140536, ticks:788215 rax:0xffffffffffffffda rbx:0xb rcx:0xffffffffffffffff rdx:0x16b3c rsi:0x7f6c3d3cf000 rdi:0xc5000 rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x5 r9:0x22c24 r10:0xffffffffffffffff r11:0x246 r12:0xc5000 r13:0x0 r14:0x22f38 r15:0xb5000 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.704693 global_time:4760, event:`SYSCALL: munmap' (state:EXITING_SYSCALL) tid:140536, ticks:788215 rax:0x0 rbx:0xb rcx:0xffffffffffffffff rdx:0x16b3c rsi:0x7f6c3d3cf000 rdi:0xc5000 rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x5 r9:0x22c24 r10:0xffffffffffffffff r11:0x246 r12:0xc5000 r13:0x0 r14:0x22f38 r15:0xb5000 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.704805 global_time:4761, event:`SIGNAL: SIGSEGV(det)' tid:140536, ticks:788215 rax:0x0 rbx:0xb rcx:0xffffffffffffffff rdx:0x16b3c rsi:0x7f6c3d3cf000 rdi:0xc5000 rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x5 r9:0x22c24 r10:0xffffffffffffffff r11:0x246 r12:0xc5000 r13:0x0 r14:0x22f38 r15:0xb5000 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f6c845f3b80 gs_base:0x0 } { real_time:2938712.704860 global_time:4762, event:`SIGNAL_DELIVERY: SIGSEGV(det)' tid:140536, ticks:788215 rax:0x0 rbx:0xb rcx:0xffffffffffffffff rdx:0x16b3c rsi:0x7f6c3d3cf000 rdi:0xc5000 rbp:0x7f6c844f2fa0 rsp:0x7f6c844f2d90 r8:0x5 r9:0x22c24 r10:0xffffffffffffffff r11:0x246 r12:0xc5000 r13:0x0 r14:0x22f38 r15:0xb5000 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff 
fs_base:0x7f6c845f3b80 gs_base:0x0 { tid:140536, addr:0x7f6c844f2d90, length:(nil) } } { real_time:2938712.800120 global_time:4763, event:`EXIT' tid:140536, ticks:788215 } { real_time:2938712.800429 global_time:4764, event:`SYSCALL: futex' (state:EXITING_SYSCALL) tid:140559, ticks:250351 rax:0xfffffffffffffdfc rbx:0xca rcx:0xffffffffffffffff rdx:0x1 rsi:0x0 rdi:0x7f6c8488600c rbp:0x681fffa0 rsp:0x681ffd90 r8:0x0 r9:0x0 r10:0x7fff54229670 r11:0x246 r12:0x7fff54229670 r13:0x7f6c8488600c r14:0x5 r15:0x28 rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xca fs_base:0x7f6c845f3b80 gs_base:0x0 } === Start rr backtrace: /home/zyh/rr/build/bin/rr(_ZN2rr13dump_rr_stackEv+0x5d)[0x558b9bec720c] /home/zyh/rr/build/bin/rr(_ZN2rr9GdbServer15emergency_debugEPNS_4TaskE+0x1a6)[0x558b9bcce01e] /home/zyh/rr/build/bin/rr(+0x3c3ac4)[0x558b9bd06ac4] /home/zyh/rr/build/bin/rr(_ZN2rr21EmergencyDebugOstreamD1Ev+0x66)[0x558b9bd06d4c] /home/zyh/rr/build/bin/rr(_ZN2rr4Task13enter_syscallEv+0x319)[0x558b9be7c573] /home/zyh/rr/build/bin/rr(_ZN2rr18AutoRemoteSyscalls12syscall_baseEiRNS_9RegistersE+0x4dc)[0x558b9bc5601a] /home/zyh/rr/build/bin/rr(_ZN2rr18AutoRemoteSyscalls14syscall_helperILi3EEEliRNS_9RegistersE+0x2c)[0x558b9bc4ca1e] /home/zyh/rr/build/bin/rr(_ZN2rr18AutoRemoteSyscalls14syscall_helperILi2ElJEEEliRNS_9RegistersET0_DpT1_+0x43)[0x558b9be97fbb] /home/zyh/rr/build/bin/rr(_ZN2rr18AutoRemoteSyscalls14syscall_helperILi1ENS_10remote_ptrIvEEJlEEEliRNS_9RegistersET0_DpT1_+0x49)[0x558b9be9669b] /home/zyh/rr/build/bin/rr(_ZN2rr18AutoRemoteSyscalls18infallible_syscallIJNS_10remote_ptrIvEElEEEliDpT_+0x90)[0x558b9be8f886] /home/zyh/rr/build/bin/rr(_ZN2rr4Task17unmap_buffers_forERNS_18AutoRemoteSyscallsEPS0_NS_10remote_ptrI14syscallbuf_hdrEE+0x84)[0x558b9be7ba86] /home/zyh/rr/build/bin/rr(_ZN2rr4Task15destroy_buffersEPS0_S1_+0xc2)[0x558b9be7b8be] /home/zyh/rr/build/bin/rr(_ZN2rr4Task15destroy_buffersEv+0x27)[0x558b9bda43b9] 
/home/zyh/rr/build/bin/rr(+0x4c1c4f)[0x558b9be04c4f] /home/zyh/rr/build/bin/rr(_ZN2rr13ReplaySession9exit_taskEPNS_10ReplayTaskE+0x8c)[0x558b9be04e9e] /home/zyh/rr/build/bin/rr(_ZN2rr13ReplaySession18try_one_trace_stepEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x3d4)[0x558b9be04b10] /home/zyh/rr/build/bin/rr(_ZN2rr13ReplaySession11replay_stepERKNS0_15StepConstraintsE+0x1d7)[0x558b9be0607b] /home/zyh/rr/build/bin/rr(_ZN2rr14ReplayTimeline19replay_step_forwardENS_10RunCommandEl+0x10a)[0x558b9be2615e] /home/zyh/rr/build/bin/rr(_ZN2rr9GdbServer14debug_one_stepERNS_10GdbRequestE+0x5a2)[0x558b9bccae62] /home/zyh/rr/build/bin/rr(_ZN2rr9GdbServer12serve_replayERKNS0_15ConnectionFlagsE+0x54f)[0x558b9bccd0c5] /home/zyh/rr/build/bin/rr(+0x4b77dc)[0x558b9bdfa7dc] /home/zyh/rr/build/bin/rr(_ZN2rr13ReplayCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x407)[0x558b9bdfb2a1] /home/zyh/rr/build/bin/rr(main+0x27d)[0x558b9bee306f] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f8e41f0b0b3] /home/zyh/rr/build/bin/rr(_start+0x2e)[0x558b9bc236ee] === End rr backtrace Launch gdb with gdb '-l' '10000' '-ex' 'set sysroot /' '-ex' 'target extended-remote 127.0.0.1:10296' /root/.local/share/rr/criu-7/mmap_hardlink_4_criu
So maybe these two techniques cannot be combined, and I need to find another way to achieve this goal?
Thanks!
Additional info:
If I run rr record -n --no-file-cloning --no-read-cloning criu ..., I get:
task_args->pid: 140536
task_args->nr_threads: 16
task_args->clone_restore_fn: 0x11d10
task_args->thread_args: 0xb4540
(00.393776) pie: 140536: Switched to the restorer 140536
(00.394881) pie: 140536: Mapping native vDSO at 0xc1000
[FATAL /home/zyh/rr/src/record_syscall.cc:6119:rec_process_syscall_arch()]
(task 140878 (rec:140878) at time 8403)
-> Assertion `t->regs().syscall_result_signed() == -syscall_state.expect_errno' failed to hold. Expected EINVAL for 'arch_prctl' but got result 4096 (errno errno(-4096)); unknown arch_prctl(0x2003)
After some investigation, I figured out the arch_prctl issue and added some code to rr:
case Arch::arch_prctl:
  switch ((int)regs.arg1_signed()) {
    case ARCH_SET_FS:
    case ARCH_SET_GS:
    // Added: raw codes that should no longer be expected to fail with EINVAL
    case 0x3001:
    case 0x2003:
    case 0x2002:
    case 0x2001:
      break;
With this change the error disappears, but I still cannot record the criu restore process. Something related to the vDSO or munmap seems to fail:
(00.395709) pie: 140536: Mapping native vDSO at 0xc1000
(00.396282) pie: 140536: vdso: Using gettimeofday() on vdso at 0xc4840
[FATAL /home/zyh/rr/src/Task.cc:847:enter_syscall()]
(task 140536 (rec:140536) at time 5330)
-> Assertion `session().is_recording() && !is_deterministic_signal(this)' failed to hold. got unexpected signal SIGSEGV
Tail of trace dump:
{
real_time:2952753.345947 global_time:5310, event:`SYSCALL: munmap' (state:ENTERING_SYSCALL) tid:140536, ticks:715202
rax:0xffffffffffffffda rbx:0xb4000 rcx:0xffffffffffffffff rdx:0xffffffff rsi:0x10000 rdi:0x0 rbp:0x7f1eebafa000 rsp:0x22ed8 r8:0x5 r9:0x22c24 r10:0xffffffffffffffff r11:0x246 r12:0x10000 r13:0x0 r14:0x22f38 r15:0xb5000 rip:0x16b39 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xb fs_base:0x7f1f32a59b80 gs_base:0x0
}
CRIU does a lot of weird stuff. It's probably possible to get it working under rr but it would be some work.
Using CRIU to checkpoint rr would require fixing things on the CRIU side. I have no idea if that's possible, e.g. I don't know if CRIU can support checkpoint/restore of perf_event_open state.
I guess one important question is what you actually want to do with CRIU+rr.
Hi @rocallahan
I am working on a research project to test and fuzz concurrent and multi-process programs, using checkpoints to speed up long-running programs and reproduction to debug the bugs found. So I am investigating ways to achieve these two goals. So far I have integrated CRIU into my project to checkpoint and restore the target during a test run, and now I want to try rr for the reproduction goal.
I tried disabling the vDSO to simplify the CRIU C/R process by passing vdso=0 at boot, but it seems rr cannot run with this option:
[FATAL /home/zyh/rr/src/AddressSpace.cc:266:find_syscall_instruction()]
(task 44558 (rec:44558) at time 14)
-> Assertion `has_vdso()' failed to hold. Kernel with vDSO disabled?
Yeah. You could probably change rr to search more memory regions for a usable syscall instruction.
Hi @rocallahan. After debugging for a long time, I still get a segfault like the one in the first post and still don't know why. Any suggestions on how to debug this? Thanks~
I guess one important question is what you actually want to do with CRIU+rr.
I would love this crossover to be possible, because it would allow record+replay of bugs which manifest long after a program starts. Specifically, MMTk runs garbage collection "stress tests" on CI where we collect after every allocation of a long program. Some bugs seem to only pop up after many hours of such testing, but we don't necessarily have the disk space to be able to record such runs. Being able to checkpoint every once in a while and to replay the program from those checkpoints would drastically reduce disk space requirements.
On further investigation, I see that #2184 addresses the issue of checkpointing much more thoroughly, which would serve us just as well as this.
After some investigations, I figure out this issue of arch_prctl, and add some code to rr
case Arch::arch_prctl:
  switch ((int)regs.arg1_signed()) {
    case ARCH_SET_FS:
    case ARCH_SET_GS:
    case 0x3001:
    case 0x2003:
    case 0x2002:
    case 0x2001:
      break;
Then this error disappears
I wonder if it is reasonable to send a PR for this change, and either change the issue title to be explicit about the combination of CRIU+rr or close it as a duplicate of the checkpoint feature issue.
@GitMensch
The code snippet is not well tested, as I have only limited knowledge of prctl and rr. It needs more investigation by someone who is familiar with them.