criu
criu copied to clipboard
process hangs on system startup after CRIU restore
Description
My service launches CRIU for each process (it could be the same binary with different arguments) simultaneously. When all dumps are complete, the service ends. The system gets rebooted. The system boots, the service starts again and goes through each dump folder. It runs CRIU on every image simultaneously.
Service did dump
/usr/sbin/criu-ns dump --images-dir /var/lib/dumps/images/zeb2ft-s5kpj2-64gu7z --shell-job --ext-unix-sk -v4 --log-file ../../logs/dump/2022-06-07T15:31:40Z_zeb2ft-s5kpj2-64gu7z.log --action-script /usr/local/sbin/criu_action_script.sh --tree 385732 --ghost-limit 1G --tcp-established
and restore
/usr/sbin/criu-ns restore --images-dir /var/lib/dumps/images/zeb2ft-s5kpj2-64gu7z --shell-job --ext-unix-sk -v4 --log-file ../../logs/restore/2022-06-07T15:36:06Z_zeb2ft-s5kpj2-64gu7z.log --action-script /usr/local/sbin/criu_action_script.sh --restore-detached --tcp-close
Dumps and Restores are always successful, but the process hangs. It is only reproducible after the system gets rebooted. While the system is on everything works perfectly.
the process stuck here (not CRIU process, CRIU finished successfully)
Thread 30 (Thread 0x7f01b27fc700 (LWP 2799)):
#0 __libc_read (nbytes=5, buf=0x11bd0f73, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __libc_read (fd=6, buf=0x11bd0f73, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007f01cad0e3d9 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#3 0x00007f01cad0967e in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4 0x00007f01cad084d4 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#5 0x00007f01cad08aa7 in BIO_read () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#6 0x00007f01cabf1b91 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#7 0x00007f01cabf5e1e in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#8 0x00007f01cabf36d0 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#9 0x00007f01cabfac45 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007f01cac05a3f in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007f01cac05b47 in SSL_read () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
How to solve this hanging problem? Maybe the connections were restored somehow wrong? Is CRIU doing something bad or is my system causing such behavior (these reboots, etc)?
criu --version Version: 3.16.1
@Snorch and @rst0git Could you be so kind to suggest something about this?
@PavloMykhailyshyn could you try adding --tcp-close to the dump command as well?
(We recently enabled this option for dump as well.)
#11 0x00007f01cac05b47 in SSL_read () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
It looks like the process is stuck on SSL_read() reading from closed TCP socket.
https://www.openssl.org/docs/man1.1.1/man3/SSL_read.html
still the same
I will try to run several tests and get back to you with criu logs and callstacks
/usr/sbin/criu-ns dump --images-dir /var/lib/myservice/agent-dumps/images/image-id --shell-job --ext-unix-sk -v4 --log-file ../../logs/dump/2022-06-16T09:37:55Z_image-id.log --action-script /usr/local/sbin/criu_action_script.sh --tree 3469854 --ghost-limit 1G --tcp-close
dump.txt
/usr/sbin/criu-ns restore --images-dir /var/lib/myservice/agent-dumps/images/image-id --shell-job --ext-unix-sk -v4 --log-file ../../logs/restore/2022-06-16T09:45:38Z_image-id.log --action-script /usr/local/sbin/criu_action_script.sh --restore-detached --tcp-close
restore.txt
Thread 28 (Thread 0x7fed2ffe7700 (LWP 2550)):
#0 __libc_read (nbytes=5, buf=0x7fed3c058f43, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __libc_read (fd=6, buf=0x7fed3c058f43, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007fed48ade3d9 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#3 0x00007fed48ad967e in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4 0x00007fed48ad84d4 in ?? () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#5 0x00007fed48ad8aa7 in BIO_read () from target:/lib/x86_64-linux-gnu/libcrypto.so.1.1
#6 0x00007fed489c1b91 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#7 0x00007fed489c5e1e in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#8 0x00007fed489c36d0 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#9 0x00007fed489cac45 in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007fed489d5a3f in ?? () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007fed489d5b47 in SSL_read () from target:/lib/x86_64-linux-gnu/libssl.so.1.1
Thread 10 (Thread 0x7fed42521700 (LWP 2532)):
#0 __libc_recv (flags=<optimised out>, len=1, buf=0x7fed4251eca0, fd=5) at ../sysdeps/unix/sysv/linux/recv.c:28
#1 __libc_recv (fd=5, buf=0x7fed4251eca0, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
everything else waits on futex_wait_cancelable
@rst0git please take a look
but I noticed that after some period of time (perhaps 10-50 min, could be more, it depends) everything becomes normal and the process continues to work
I attempted to dump and restore the restored process. Now it is truly stuck. I did it for two processes:
- first one
#0 __libc_recv (flags=<optimised out>, len=1, buf=0x7f8b3effaca0, fd=5) at ../sysdeps/unix/sysv/linux/recv.c:28
#1 __libc_recv (fd=5, buf=0x7f8b3effaca0, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
- second
#0 __libc_recv (flags=<optimised out>, len=1, buf=0x7fa30d7f7ca0, fd=5) at ../sysdeps/unix/sysv/linux/recv.c:28
#1 __libc_recv (fd=5, buf=0x7fa30d7f7ca0, len=1, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:23
processes call such fucntion
recv(m_sock, buf, block_read, 0);
@Snorch & @rst0git Please suggest something here Does the bug exist in criu or is it a problem on my end?
after attaching and detaching gdb from the process everything magically started to work
but the process takes much more time to finish its work than before dump and restore (e.g. in a normal situation it lasts 40 min, after dump and restore almost 5 hours)
any thoughts? @rst0git @Snorch @avagin @adrianreber The problem is crucial, I need a suggestion here
What kernel do you use? Does it support time namespaces? Could you strace (strace -fo strace.log -s 1024 -p PID) the target process after restore? If it starts to work after gdb, it has to start work after strace too. I want to see first a few hundred lines in the log.
@avagin
5.4.0-99-generic
I do not know, where can I find it? From Wikipedia The time namespace allows processes to see different system times in a way similar to the UTS namespace. It was proposed in 2018 and landed on Linux 5.6, which was released in March 2020
It may not be supported by my kernel.
strace after the restore strace.txt
one more time after previous strace strace1.txt
the process still doesn't write any logs and probably hangs
@PavloMykhailyshyn you can check timens supported with:
criu check --feature timens
Error (criu/cr-check.c:1315): Time namespaces are not supported
@PavloMykhailyshyn I think time namespaces should fix this issue. What distro do you use? It looks like Ubuntu or Debian. Could you install the 5.10 kernel and check that the issue isn't reproduced in the new environment.
everything works fine when I updated the kernel to the 5.10 version is there a way to fix the problem on the 5.4 version? (or is this the only way to solve the hanging problem?)
@PavloMykhailyshyn I don't know any simple way to solve this issue on old kernels.
@avagin could you elaborate more on these time-namespaces? Can you explain why this solves my problem?
does it affect my processes in any way?
For example, it will make our processes run with completely different timestamps
@PavloMykhailyshyn Security protocols are often designed with the assumption that the system clock doesn't jump backwards or forwards unexpectedly (e.g., RFC 3161). However, from the perspective of an application that is transparently checkpointed and then restored at later time or on a system with different system clock this assumption doesn't hold. Thus, time namespace was introduced to address this problem. The following wiki page contains more information and links to articles about the time namespace: https://criu.org/Time_namespace
In Linux, we have two clocks CLOCK_MONOTONIC and CLOCK_BOOTTIME that cannot be set and represents monotonic time since some unspecified starting point: https://linux.die.net/man/2/clock_gettime
These clocks are reset on reboot. After C/R, we have to guarantee that these clocks will be monotonic and will not jump to far for restored processes.
so to be clear enough there is no way around fixing something on my end (probably fix code in restored processes) the only way is to upgrade the kernel am I right? or we can workaround somehow this?
I only followed this issue loosely, but upgrading Kernel seems the only solution.
I faced this problem during the restore
(00.166452) 137041: Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
(00.166468) 137041: Error (criu/files.c:1200): Unable to open fd=2 id=0x17
why did this happen? 78 processes were successfully restored, and only two failed with this error. @adrianreber @avagin @rst0git
(00.166424) 137041: tcp: Restoring TCP connection
(00.166431) 137041: tcp: Restoring TCP connection id 17 ino 9ac87
(00.166442) 137041: No tcp-stream-9ac87.img image
(00.166452) 137041: Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
@PavloMykhailyshyn It looks like this error appears because tcp-stream-9ac87.img is missing.
This problem should be fixed in https://github.com/checkpoint-restore/criu/pull/1579.
Would you be able to test with a more recent version of CRIU?