criu process hangs on system startup after CRIU restore

Description

My service launches CRIU for each process (it could be the same binary with different arguments) simultaneously. When all dumps are complete, the service ends. The system gets rebooted. The system boots, the service starts again and goes through each dump folder. It runs CRIU on every image simultaneously.

Service did dump /usr/sbin/criu-ns dump --images-dir /var/lib/dumps/images/zeb2ft-s5kpj2-64gu7z --shell-job --ext-unix-sk -v4 --log-file ../../logs/dump/2022-06-07T15:31:40Z_zeb2ft-s5kpj2-64gu7z.log --action-script /usr/local/sbin/criu_action_script.sh --tree 385732 --ghost-limit 1G --tcp-established

and restore /usr/sbin/criu-ns restore --images-dir /var/lib/dumps/images/zeb2ft-s5kpj2-64gu7z --shell-job --ext-unix-sk -v4 --log-file ../../logs/restore/2022-06-07T15:36:06Z_zeb2ft-s5kpj2-64gu7z.log --action-script /usr/local/sbin/criu_action_script.sh --restore-detached --tcp-close

Dumps and Restores are always successful, but the process hangs. It is only reproducible after the system gets rebooted. While the system is on everything works perfectly.

the process stuck here (not CRIU process, CRIU finished successfully)

Thread 30 (Thread 0x7f01b27fc700 (LWP 2799)):
#0  __libc_read (nbytes=5, buf=0x11bd0f73, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=6, buf=0x11bd0f73, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f01cad0e3d9 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007f01cad0967e in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007f01cad084d4 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007f01cad08aa7 in BIO_read () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007f01cabf1b91 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007f01cabf5e1e in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007f01cabf36d0 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007f01cabfac45 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007f01cac05a3f in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007f01cac05b47 in SSL_read () from target:/lib/x86_64-linux-gnu/libssl.so.1.1

How to solve this hanging problem? Maybe the connections were restored somehow wrong? Is CRIU doing something bad or is my system causing such behavior (these reboots, etc)?

criu --version Version: 3.16.1

Jun 07 '22 20:06 PavloMykhailyshyn

@Snorch and @rst0git Could you be so kind to suggest something about this?

Jun 07 '22 20:06 PavloMykhailyshyn

@PavloMykhailyshyn could you try adding --tcp-close to the dump command as well? (We recently enabled this option for dump as well.)

Jun 08 '22 06:06 rst0git

#11 0x00007f01cac05b47 in SSL_read () from target:/lib/x86_64-linux-gnu/libssl.so.1.1

It looks like the process is stuck on SSL_read() reading from closed TCP socket. https://www.openssl.org/docs/man1.1.1/man3/SSL_read.html

Jun 08 '22 06:06 rst0git

still the same

Jun 08 '22 14:06 PavloMykhailyshyn

I will try to run several tests and get back to you with criu logs and callstacks

Jun 09 '22 05:06 PavloMykhailyshyn

/usr/sbin/criu-ns dump --images-dir /var/lib/myservice/agent-dumps/images/image-id --shell-job --ext-unix-sk -v4 --log-file ../../logs/dump/2022-06-16T09:37:55Z_image-id.log --action-script /usr/local/sbin/criu_action_script.sh --tree 3469854 --ghost-limit 1G --tcp-close dump.txt

/usr/sbin/criu-ns restore --images-dir /var/lib/myservice/agent-dumps/images/image-id --shell-job --ext-unix-sk -v4 --log-file ../../logs/restore/2022-06-16T09:45:38Z_image-id.log --action-script /usr/local/sbin/criu_action_script.sh --restore-detached --tcp-close restore.txt

Thread 28 (Thread 0x7fed2ffe7700 (LWP 2550)):
#0  __libc_read (nbytes=5, buf=0x7fed3c058f43, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __libc_read (fd=6, buf=0x7fed3c058f43, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007fed48ade3d9 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007fed48ad967e in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007fed48ad84d4 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007fed48ad8aa7 in BIO_read () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007fed489c1b91 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007fed489c5e1e in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007fed489c36d0 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007fed489cac45 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007fed489d5a3f in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007fed489d5b47 in SSL_read () from target:/lib/x86_64-linux-gnu/libssl.so.1.1

Thread 10 (Thread 0x7fed42521700 (LWP 2532)):
#0  __libc_recv (flags=<optimised out>, len=1, buf=0x7fed4251eca0, fd=5) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  __libc_recv (fd=5, buf=0x7fed4251eca0, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23

everything else waits on futex_wait_cancelable

Jun 16 '22 11:06 PavloMykhailyshyn

@rst0git please take a look

Jun 16 '22 11:06 PavloMykhailyshyn

but I noticed that after some period of time (perhaps 10-50 min, could be more, it depends) everything becomes normal and the process continues to work

Jun 16 '22 12:06 PavloMykhailyshyn

I attempted to dump and restore the restored process. Now it is truly stuck. I did it for two processes:

first one

#0  __libc_recv (flags=<optimised out>, len=1, buf=0x7f8b3effaca0, fd=5) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  __libc_recv (fd=5, buf=0x7f8b3effaca0, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23

second

#0  __libc_recv (flags=<optimised out>, len=1, buf=0x7fa30d7f7ca0, fd=5) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  __libc_recv (fd=5, buf=0x7fa30d7f7ca0, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23

Jun 16 '22 15:06 PavloMykhailyshyn

processes call such fucntion recv(m_sock, buf, block_read, 0);

Jun 16 '22 15:06 PavloMykhailyshyn

@Snorch & @rst0git Please suggest something here Does the bug exist in criu or is it a problem on my end?

Jun 16 '22 15:06 PavloMykhailyshyn

after attaching and detaching gdb from the process everything magically started to work

Jun 16 '22 15:06 PavloMykhailyshyn

but the process takes much more time to finish its work than before dump and restore (e.g. in a normal situation it lasts 40 min, after dump and restore almost 5 hours)

Jun 16 '22 16:06 PavloMykhailyshyn

any thoughts? @rst0git @Snorch @avagin @adrianreber The problem is crucial, I need a suggestion here

Jun 18 '22 20:06 PavloMykhailyshyn

What kernel do you use? Does it support time namespaces? Could you strace (strace -fo strace.log -s 1024 -p PID) the target process after restore? If it starts to work after gdb, it has to start work after strace too. I want to see first a few hundred lines in the log.

Jun 20 '22 14:06 avagin

@avagin

5.4.0-99-generic I do not know, where can I find it? From Wikipedia The time namespace allows processes to see different system times in a way similar to the UTS namespace. It was proposed in 2018 and landed on Linux 5.6, which was released in March 2020 It may not be supported by my kernel.

strace after the restore strace.txt

one more time after previous strace strace1.txt

the process still doesn't write any logs and probably hangs

Jun 21 '22 10:06 PavloMykhailyshyn

@PavloMykhailyshyn you can check timens supported with:

criu check --feature timens

Jun 21 '22 15:06 rst0git

Error (criu/cr-check.c:1315): Time namespaces are not supported

Jun 21 '22 16:06 PavloMykhailyshyn

@PavloMykhailyshyn I think time namespaces should fix this issue. What distro do you use? It looks like Ubuntu or Debian. Could you install the 5.10 kernel and check that the issue isn't reproduced in the new environment.

Jun 21 '22 17:06 avagin

everything works fine when I updated the kernel to the 5.10 version is there a way to fix the problem on the 5.4 version? (or is this the only way to solve the hanging problem?)

Jun 21 '22 20:06 PavloMykhailyshyn

@PavloMykhailyshyn I don't know any simple way to solve this issue on old kernels.

Jun 21 '22 21:06 avagin

@avagin could you elaborate more on these time-namespaces? Can you explain why this solves my problem?

Jun 22 '22 10:06 PavloMykhailyshyn

does it affect my processes in any way?

Jun 22 '22 11:06 PavloMykhailyshyn

For example, it will make our processes run with completely different timestamps

Jun 22 '22 11:06 PavloMykhailyshyn

@PavloMykhailyshyn Security protocols are often designed with the assumption that the system clock doesn't jump backwards or forwards unexpectedly (e.g., RFC 3161). However, from the perspective of an application that is transparently checkpointed and then restored at later time or on a system with different system clock this assumption doesn't hold. Thus, time namespace was introduced to address this problem. The following wiki page contains more information and links to articles about the time namespace: https://criu.org/Time_namespace

Jun 22 '22 15:06 rst0git

In Linux, we have two clocks CLOCK_MONOTONIC and CLOCK_BOOTTIME that cannot be set and represents monotonic time since some unspecified starting point: https://linux.die.net/man/2/clock_gettime

These clocks are reset on reboot. After C/R, we have to guarantee that these clocks will be monotonic and will not jump to far for restored processes.

Jun 22 '22 15:06 avagin

so to be clear enough there is no way around fixing something on my end (probably fix code in restored processes) the only way is to upgrade the kernel am I right? or we can workaround somehow this?

Jun 23 '22 14:06 PavloMykhailyshyn

I only followed this issue loosely, but upgrading Kernel seems the only solution.

Jun 23 '22 14:06 adrianreber

I faced this problem during the restore

(00.166452) 137041: Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
(00.166468) 137041: Error (criu/files.c:1200): Unable to open fd=2 id=0x17

criu_restore.txt

why did this happen? 78 processes were successfully restored, and only two failed with this error. @adrianreber @avagin @rst0git

Jul 01 '22 07:07 PavloMykhailyshyn

(00.166424) 137041: tcp: Restoring TCP connection
(00.166431) 137041: tcp: Restoring TCP connection id 17 ino 9ac87
(00.166442) 137041: No tcp-stream-9ac87.img image
(00.166452) 137041: Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)

@PavloMykhailyshyn It looks like this error appears because tcp-stream-9ac87.img is missing.

This problem should be fixed in https://github.com/checkpoint-restore/criu/pull/1579.

Would you be able to test with a more recent version of CRIU?

Jul 01 '22 15:07 rst0git

criu criu copied to clipboard

process hangs on system startup after CRIU restore

criu
criu copied to clipboard