"Unable to detach <pid>: No such process" on restore
Description
I am not sure whether this is a bug, but I could not find any information about it by searching, so I am hoping I can get help here.
Earlier I posted #2071. Using the pycriu workaround I described there (removing the closing of FDs 0, 1, and 2 when starting CRIU in SWRK mode) seemed to fix the tty inheritance issue, but now I face a different one:
If I checkpoint and restore a custom executable from a python REPL, everything works fine. However, if I create a python script that:
1. Is set as the container command to be run on start
2. Starts my custom executable via psutil.Popen
3. Configures the patched pycriu
4. Dumps the process
5. Tries to restore the process

then the script fails on step 5. This only happens when I checkpoint and restore the custom executable. Replacing the Popen command with a simple bash loop works as expected (no failures on dump or restore).
Steps to reproduce the issue: This is kind of tricky as it only seems to fail when running the custom executable, which I am unable to provide here. But to describe what it does:
- Runs about 120 different threads that do different things
- Connects over web sockets to different services running inside other containers (using Nanomsg-NG)
- Connects to, or listens on, a mix of several other socket types
Best chance of reproducing is:
- Run a similarly complex executable
- Try to checkpoint and restore by using the test scripts below (see additional environment details)
- Observe it fail with the complex executable, but not with the simple bash loop
Describe the results you received: I receive a restore failure (full log also provided below):
```
(00.065255) Run late stage hook from criu master for external devices
(00.065259) restore late stage hook for external plugin failed
(00.065262) Running pre-resume scripts
(00.065571) Error (criu/cr-restore.c:2135): Unable to detach 114: No such process
(00.065584) Error (criu/cr-restore.c:2509): Killing processes because of failure on restore.
	The Network was unlocked so some data or a connection may have been lost.
(00.065596) Error (criu/cr-restore.c:2536): Restoring FAILED.
```
Describe the results you expected: Restore succeeds
Additional information you deem important (e.g. issue happens only occasionally): It only seems to fail in certain circumstances, but I cannot determine what those are from the restore log.
CRIU logs and information:
Output of `criu --version`:
```
Version: 3.17.1
GitID: v3.17.1
```
Output of `criu check --all`:
```
# criu check --all
Error (criu/util.c:641): exited, status=3
Warn  (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
Warn  (criu/cr-check.c:1231): clone3() with set_tid not supported
Error (criu/cr-check.c:1273): Time namespaces are not supported
Error (criu/cr-check.c:1283): IFLA_NEW_IFINDEX isn't supported
Warn  (criu/cr-check.c:1305): Pidfd store requires pidfd_getfd syscall which is not supported
Warn  (criu/cr-check.c:804): ptrace(PTRACE_GET_RSEQ_CONFIGURATION) isn't supported. C/R of processes which are using rseq() won't work.
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.
```
Additional environment details:
Running inside a docker container in privileged mode and as the root user.
Pycriu patch (my_custom_pycriu.py):
```python
import errno
import fcntl
import os
import socket
import struct

from pycriu import *  # type: ignore
from pycriu import criu as _orig_criu  # type: ignore
from pycriu.criu import _criu_comm_bin as _orig_criu_comm_bin  # type: ignore


class _criu_comm_bin(_orig_criu_comm_bin):
    def connect(self, daemon):
        # Kind of the same thing we do in libcriu
        css = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
        flags = fcntl.fcntl(css[1], fcntl.F_GETFD)
        fcntl.fcntl(css[1], fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        flags = fcntl.fcntl(css[0], fcntl.F_GETFD)
        fcntl.fcntl(css[0], fcntl.F_SETFD, flags & ~fcntl.FD_CLOEXEC)

        self.daemon = daemon

        p = os.fork()
        if p == 0:
            def exec_criu():
                # Don't close FDs here
                css[0].send(struct.pack("i", os.getpid()))
                os.execv(self.comm, [self.comm, "swrk", "%d" % css[0].fileno()])
                os._exit(1)

            if daemon:
                # Python has no daemon(3) alternative,
                # so we need to mimic it ourselves.
                p = os.fork()
                if p == 0:
                    os.setsid()
                    # Only close if running as daemon
                    os.close(0)
                    os.close(1)
                    os.close(2)
                    exec_criu()
                else:
                    os._exit(0)
            else:
                exec_criu()
        else:
            if daemon:
                os.waitpid(p, 0)
            css[0].close()
            self.swrk = struct.unpack("i", css[1].recv(4))[0]
            self.sk = css[1]

        return self.sk


class criu(_orig_criu):
    def use_binary(self, bin_name):
        """
        Access criu by execing it using provided path to criu binary.
        """
        self._comm = _criu_comm_bin(bin_name)
```
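As a side note for anyone reading along, the heart of the patch is the pid handshake over the `SOCK_SEQPACKET` pair; it can be exercised standalone without CRIU. This sketch is mine, not part of the patch:

```python
import os
import socket
import struct

# Same handshake the patched connect() performs: the side that would
# exec CRIU in swrk mode packs its pid as a native int and sends it.
caller_end, swrk_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
swrk_end.send(struct.pack("i", os.getpid()))

# The caller unpacks 4 bytes to learn the swrk pid, exactly like
# `self.swrk = struct.unpack("i", css[1].recv(4))[0]` in the patch.
pid = struct.unpack("i", caller_end.recv(4))[0]
print(pid == os.getpid())  # True
```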
Test script:
```python
import os
from time import sleep

import psutil

import my_custom_pycriu as pycriu

criu = pycriu.criu()
criu.use_binary("/usr/local/sbin/criu")  # or wherever it is
criu.opts.tcp_established = True
criu.opts.shell_job = True
criu.opts.log_level = 4
criu.opts.images_dir_fd = os.open("/path/to/checkpoint/dir", os.O_DIRECTORY)

env = {...}

sleep(30)  # Arbitrary wait to allow other services to come online (testing with docker-compose)

p = psutil.Popen(
    # ["sh", "-c", 'while true; do echo "foo"; sleep 0.5; done'],  # simple loop - restore works
    ["/path/to/custom/binary"],  # fails restore with this
    env=env,
    cwd="/path/to/cwd",
    start_new_session=True,
)
criu.opts.pid = p.pid

sleep(3)  # Wait a bit before killing it
criu.dump()
p.wait()

sleep(3)  # Wait a bit before restoring
criu.restore()

p = psutil.Process(p.pid)
p.terminate()
print(p.pid, p.status(), flush=True)  # Mostly useless, here to easily validate restore() has not raised an exception

while True:
    pass  # Hang here to not stop the container
```
Today I also tried to reproduce the problem as a shell script:
```sh
#!/bin/sh

sleep 30

{EXTRA_ENVS} /path/to/binary/binary.bin &
PID=$!

sleep 5
criu dump -t ${PID} --tcp-established --shell-job -D /path/to/checkpoint/dir

sleep 5
criu restore -d --tcp-established --shell-job -D /path/to/checkpoint/dir

sleep 2
kill ${PID}
```
Again, this works when executed from an interactive shell inside the container, but not when set as the container start command. In the latter case, the same restore failure occurs:
```
Warn  (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
Warn  (compel/arch/x86/src/lib/infect.c:352): Will restore 76 with interrupted system call
Warn  (compel/arch/x86/src/lib/infect.c:352): Will restore 105 with interrupted system call
Warn  (compel/arch/x86/src/lib/infect.c:352): Will restore 106 with interrupted system call
Warn  (criu/kerndat.c:1117): Can't keep kdat cache on non-tempfs
Error (criu/cr-restore.c:2135): Unable to detach 95: No such process
Error (criu/cr-restore.c:2509): Killing processes because of failure on restore.
	The Network was unlocked so some data or a connection may have been lost.
Error (criu/cr-restore.c:2536): Restoring FAILED.
```
The results of `criu --version` and `criu check --all`, as well as the environment details, are the same as originally posted.
Do you see anything suspicious in dmesg or other log files?
I am not sure what exactly I am looking for, but in several places there are criu segfaults. For example:
```
....
[275655.817563] br-de6afa129f06: port 6(vethb5bad44) entered forwarding state
[275655.850233] criu[18178]: segfault at 200 ip 000055b447faffd0 sp 00007fffd2474a00 error 4 in criu[55b447f81000+ad000]
[275655.850239] Code: 00 48 8d 74 24 30 89 ef e8 4d 20 fd ff 85 c0 0f 85 4d 0c 00 00 48 8b 44 24 38 48 83 f8 ff 0f 84 92 0f 00 00 89 05 98 f1 10 00 <48> 8b 83 00 02 00 00 48 85 c0 0f 84 f0 01 00 00 4c 8b 0d b9 70 11
....
```
I can post the full dmesg log here if it is helpful.
Something crashed. Good to know at least.
Did you build CRIU yourself or are you using the version from your distribution?
I built it from the 3.17.1 tag. Here are the relevant parts of the Dockerfile:
```dockerfile
FROM python:3.9-alpine3.15 as builder

############################### CRIU Compilation ###############################
ARG SUPPORT_CHECKPOINT_RESTORE=1
ARG CRIU_VERSION=v3.17.1

RUN if [[ "$SUPPORT_CHECKPOINT_RESTORE" -ne 0 ]] ; then \
        apk update --no-cache && \
        apk upgrade --no-cache && \
        apk add --update --no-cache gcc build-base coreutils procps git gnutls-dev libaio-dev \
            libcap-dev libnet-dev libnl3-dev nftables nftables-dev pkgconfig protobuf-c-dev \
            protobuf-dev py3-pip py3-protobuf python3 sudo libbsd-dev asciidoc xmlto libcap && \
        git clone https://github.com/checkpoint-restore/criu && \
        cd criu && \
        git checkout ${CRIU_VERSION} && \
        make criu install DESTDIR=./install PREFIX=/usr \
    ; else \
        mkdir -p /criu/install/dummy/ \
    ; fi
################################################################################

....

FROM python:3.9-alpine3.15

############################### CRIU Installation ##############################
ARG SUPPORT_CHECKPOINT_RESTORE=1

COPY --from=builder /criu/install/* /criu/

RUN if [[ "$SUPPORT_CHECKPOINT_RESTORE" -ne 0 ]] ; then \
        apk update --no-cache && \
        apk upgrade --no-cache && \
        apk add --update --no-cache gnutls nftables protobuf-c libnl3 libnet \
            ip6tables iptables nftables iproute2 && \
        apk add --update --no-cache libcap && \
        cp -r /criu/* $(python3 -c "import sys; print(sys.executable.split('bin')[0])") && \
        setcap cap_sys_time,cap_dac_override,cap_chown,cap_setpcap,cap_setgid,cap_audit_control,cap_dac_read_search,cap_net_admin,cap_sys_chroot,cap_sys_ptrace,cap_fowner,cap_kill,cap_fsetid,cap_sys_resource,cap_setuid,cap_sys_admin=eip $(python3 -c "import sys; print(sys.executable.split('bin')[0])")/sbin/criu && \
        apk del libcap && \
        rm -rf /var/cache/apk/* \
    ; fi && \
    rm -rf /criu
################################################################################

....
```
So, something crashes, but I am not sure why. Can you try with glibc instead of musl? I am not sure the combination of CRIU and musl is as well tested as with glibc, although I am not sure this is the problem.
At first I thought it might be rseq (restartable sequences) related, because we have seen crashes when running without rseq support. Musl, however, does not use rseq, so it might not be rseq related at all; at the same time, I do not know how many users run musl. So could you try it in a glibc-based container to see if that helps?
Just modified the docker image to use python:3.9-bullseye as the base for all stages. Everything else is the same: building CRIU from the same tag and running the container with the same docker-compose setup.
It fails with the same error (I can attach the logs and CRIU info if that is helpful).
I feel it must be something related to the way CRIU is called. In all cases the test script is invoked as the root user in a container started with the --privileged option, but it only works when I invoke it from an interactive shell inside the container.
EDIT: For completeness, I also built criu from the head of the criu-dev branch - same result.
If it is helpful, I can also build CRIU with extra debugging information. For example, I can check out a particular commit that adds extra debug logs around where the restore failure happens.
Can you get a backtrace from a core dump with debug symbols?
I can build it with debug symbols (if I understand correctly, it is as simple as passing DEBUG=1 to make), but I will need a short guide on how to do the rest.
I built it with debug symbols, set `ulimit -c` to unlimited, and pointed core dumps at an appropriate directory, but CRIU does not dump core on restore. How can I force it to do so before exiting?
Could you show the content of /proc/sys/kernel/core_pattern?
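For reference (not an official CRIU procedure; paths are examples): core files only land on disk if the core limit is nonzero and `core_pattern` points at a file rather than a pipe handler. Something along these lines can be used to check the setup and get a backtrace from the DEBUG=1 binary:

```shell
# Allow core files in this shell and its children
ulimit -c unlimited
ulimit -c

# See where the kernel sends cores; a leading '|' means they are piped
# to a handler (systemd-coredump, apport, ...) instead of written to disk
cat /proc/sys/kernel/core_pattern

# To write plain files instead (example pattern, needs root):
#   echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
# Then reproduce the failing restore and open the core with gdb:
#   gdb /usr/local/sbin/criu /tmp/core.criu.<pid> -ex bt
```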
A friendly reminder that this issue had no activity for 30 days.
@sdimovv are you able to dump the python process successfully? I am using the code below to dump the process but am not sure how to pass the opts.images_dir_fd option to criu.
Can you please share how you implemented it? Below is my code.
```python
import os
from pycriu.criu import *

c = criu()
f = os.open("dummy", os.O_DIRECTORY)
c.opts.images_dir_fd = f
c.use_binary("/home/ec2-user/criu/criu/criu/criu")
c.dump()
```
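Not the original poster, but for what it is worth: `images_dir_fd` just takes an open directory file descriptor, so the pattern in the snippet above looks right as long as "dummy" exists and is a directory relative to the current working directory. The fd-passing part can be checked in isolation (temporary directory used as a stand-in for the images directory):

```python
import os
import tempfile

# images_dir_fd expects a descriptor opened with O_DIRECTORY on the
# directory where CRIU should write/read its image files.
images_dir = tempfile.mkdtemp()  # stand-in for the "dummy" directory
fd = os.open(images_dir, os.O_DIRECTORY)
print(fd >= 0, os.path.isdir(images_dir))  # True True
os.close(fd)
```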