Checkpoint fails when some processes are running
Environment
criu-3.19-1.el9.x86_64
criu-libs-3.19-1.el9.x86_64
RHEL 9.6
Kernel 5.14.0-570.12.1.el9_6.x86_64
podman-5.4.0-1.el9.x86_64
Process dbus-daemon
(00.148866) irmap: Scanning /no-such-path hint
(00.148869) irmap: Refresh stat for /no-such-path
(00.148891) Warn (criu/irmap.c:104): irmap: Can't stat /no-such-path: No such file or directory
(00.148895) Error (criu/fsnotify.c:284): fsnotify: Can't dump that handle
(00.148914) ----------------------------------------
(00.148949) Error (criu/cr-dump.c:1674): Dump files (pid: 203743) failed with -1
(00.148961) Waiting for 203743 to trap
(00.148984) Daemon 203743 exited trapping
(00.148998) Sent msg to daemon 3 0 0
pie: 35: __fetched msg: 3 0 0
pie: 35: 35: new_sp=0x7ff5c88e9bc8 ip 0x7ff5c73e144b
(00.149149) 203743 was trapped
(00.149184) 203743 was trapped
(00.149191) 203743 (native) is going to execute the syscall 15, required is 15
(00.149217) 203743 was stopped
# ps -f -p 203743
UID PID PPID C STIME TTY TIME CMD
dbus 203743 203677 0 23:42 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
Process nscd
(00.102710) fsnotify: link as .
(00.102718) fsnotify: openable (inode don't match) as .
(00.102788) Error (criu/fsnotify.c:263): fsnotify: Can't find suitable path for handle (dev 0x18 ino 0x6b01): -2
(00.102803) ----------------------------------------
(00.102827) Error (criu/cr-dump.c:1674): Dump files (pid: 203747) failed with -1
(00.102836) Waiting for 203747 to trap
(00.102857) Daemon 203747 exited trapping
(00.102867) Sent msg to daemon 3 0 0
pie: 39: __fetched msg: 3 0 0
pie: 39: 39: new_sp=0x7f8acf3f7d88 ip 0x7f8ace725487
(00.102971) 203747 was trapped
(00.103001) 203747 was trapped
(00.103006) 203747 (native) is going to execute the syscall 15, required is 15
(00.103029) 203747 was stopped
# ps -ef | grep 203747
nscd 203747 203677 0 23:42 ? 00:00:00 /usr/sbin/nscd
# find / -inum 27393
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:17/device:3a6/device:3a7/uevent
# stat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:17/device:3a6/device:3a7/uevent
File: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:17/device:3a6/device:3a7/uevent
Size: 4096 Blocks: 0 IO Block: 4096 regular file
Device: 3eh/62d Inode: 27393 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2025-09-06 23:49:40.592170585 -0700
Modify: 2025-09-06 23:49:40.592170585 -0700
Change: 2025-09-06 23:49:40.592170585 -0700
Birth: -
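The inode passed to find above is the decimal form of the hex value in the fsnotify error. A minimal Python sketch of that conversion follows; the helper name parse_handle and the regular expression are illustrative assumptions, not part of CRIU.

```python
import re

# Matches the "(dev 0x.. ino 0x..)" part of the fsnotify error line quoted above.
HANDLE_RE = re.compile(r"handle \(dev (0x[0-9a-fA-F]+) ino (0x[0-9a-fA-F]+)\)")


def parse_handle(line):
    """Return (dev, ino) as integers, or None if the line has no handle."""
    match = HANDLE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


line = ("(00.102788) Error (criu/fsnotify.c:263): fsnotify: Can't find "
        "suitable path for handle (dev 0x18 ino 0x6b01): -2")
dev, ino = parse_handle(line)
print(f"dev={dev} ino={ino}")  # prints "dev=24 ino=27393"
```

Note that the stat output reports Device 3eh (62 decimal), while the log reports dev 0x18 (24 decimal). Inode numbers are unique only within one filesystem, so the sysfs file found by inode number alone may not be the file the handle refers to.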
Process gssproxy
(00.227518) sockets: Searching for socket 0x130993 family 1
(00.227521) sockets: Searching for socket 0x3df4 family 1
(00.227525) Error (criu/sk-unix.c:418): unix: Unix socket 1247635 without peer 15860
(00.227536) ----------------------------------------
(00.227598) Error (criu/cr-dump.c:1674): Dump files (pid: 203752) failed with -1
(00.227636) Waiting for 203752 to trap
(00.227656) Daemon 203752 exited trapping
(00.227672) Sent msg to daemon 3 0 0
pie: 44: __fetched msg: 3 0 0
pie: 44: 44: new_sp=0x7f281ac67988 ip 0x7f2818c17487
(00.227771) 203752 was trapped
(00.227800) 203752 was trapped
(00.227803) 203752 (native) is going to execute the syscall 15, required is 15
(00.227824) 203752 was stopped
# ps -ef | grep 203752
root 203752 203677 0 23:42 ? 00:00:00 /usr/sbin/gssproxy -D
Process sssd_be
(00.136289) fsnotify: link as .
(00.136293) fsnotify: openable (inode don't match) as .
(00.136343) Error (criu/fsnotify.c:263): fsnotify: Can't find suitable path for handle (dev 0x18 ino 0x6b01): -2
(00.136356) ----------------------------------------
(00.136370) Error (criu/cr-dump.c:1674): Dump files (pid: 203844) failed with -1
(00.136375) Waiting for 203844 to trap
(00.136393) Daemon 203844 exited trapping
(00.136399) Sent msg to daemon 3 0 0
pie: 134: __fetched msg: 3 0 0
pie: 134: 134: new_sp=0x7fd12be63188 ip 0x7fd132b9144b
(00.136492) 203844 was trapped
(00.136520) 203844 was trapped
(00.136524) 203844 (native) is going to execute the syscall 15, required is 15
(00.136543) 203844 was stopped
# ps -ef | grep 203844
root 203844 203825 0 23:42 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0 --logger=files
Process crond
(00.146395) sockets: Searching for socket 0x1192cb family 1
(00.146403) sockets: Searching for socket 0x3df4 family 1
(00.146411) Error (criu/sk-unix.c:418): unix: Unix socket 1151691 without peer 15860
(00.146424) ----------------------------------------
(00.146452) Error (criu/cr-dump.c:1674): Dump files (pid: 203849) failed with -1
(00.146462) Waiting for 203849 to trap
(00.146480) Daemon 203849 exited trapping
(00.146490) Sent msg to daemon 3 0 0
pie: 139: __fetched msg: 3 0 0
pie: 139: 139: new_sp=0x7f00c8105c88 ip 0x7f00c740c148
(00.146628) 203849 was trapped
(00.146660) 203849 was trapped
(00.146667) 203849 (native) is going to execute the syscall 15, required is 15
(00.146692) 203849 was stopped
# ps -ef | grep 203849
root 203849 203677 0 23:42 ? 00:00:00 /usr/sbin/crond -n
Can you tell us more about the workload you are dumping? Is it a container?
It’s a ubi-initd based image running an NFS server and gssproxy, with the ability to connect to an AD/LDAP domain (using SSSD). Users can SSH into the container.
@vikas-goel how do you start this container? How do you call criu? If you don't use standard tools like runc or podman, I recommend looking at how C/R is implemented in one of these projects. Basically, CRIU needs some help to properly handle container environments.
The podman command is used both to start the container and to create the checkpoint.
Could you please attach the full CRIU log?
For future reference, please provide as much information as possible. It will save us time and help us fix the problem faster.
podman container checkpoint --leave-running --ignore-volumes --file-locks --export abcd.tar.gz abcd-main
gssproxy-dump.log.txt nscd-dump.log.txt sssd_be-dump.log.txt crond-dump.log.txt dbus-dump.log.txt
(00.112481) Error (criu/sk-unix.c:418): unix: Unix socket 1321771 without peer 15860
There are a few sockets with this peer. @vikas-goel could you try to find out where this socket is created? Are you dumping a few containers? Are they all completely separate, or do they have any inter-container dependencies? Maybe you can create a reproducer that I can run in my environment?
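To see which sockets share a missing peer, the "without peer" errors can be grouped by peer inode. This is a rough Python sketch over the three error lines quoted in this thread; the helper name group_by_peer is an assumption made for illustration.

```python
import re
from collections import defaultdict

# Matches the sk-unix error message quoted in the logs in this thread.
PEER_RE = re.compile(r"Unix socket (\d+) without peer (\d+)")


def group_by_peer(lines):
    """Map each missing peer inode to the socket inodes that reference it."""
    peers = defaultdict(list)
    for line in lines:
        match = PEER_RE.search(line)
        if match:
            peers[int(match.group(2))].append(int(match.group(1)))
    return dict(peers)


log_lines = [
    "(00.227525) Error (criu/sk-unix.c:418): unix: Unix socket 1247635 without peer 15860",
    "(00.146411) Error (criu/sk-unix.c:418): unix: Unix socket 1151691 without peer 15860",
    "(00.112481) Error (criu/sk-unix.c:418): unix: Unix socket 1321771 without peer 15860",
]

for peer, sockets in group_by_peer(log_lines).items():
    print(f"peer {peer} (0x{peer:x}): sockets {sockets}")
# prints "peer 15860 (0x3df4): sockets [1247635, 1151691, 1321771]"
```

The hex form 0x3df4 is the same value that appears in the "Searching for socket 0x3df4 family 1" lines of the gssproxy and crond logs.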
There is only one container running on the system, in a pod. Podman runs an infra container for every pod. I believe the two containers are independent. I am trying to create a checkpoint for one container.
# podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e55f6a4b85e3 localhost/podman-pause:5.4.0-1739375653 16 hours ago Up 16 hours fa390e625f00-infra
228b1ec500df flex.io/netbackup/main:11.0.1 /sbin/init 16 hours ago Up 16 hours (healthy) abcd-main
# podman exec -it abcd-main ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Sep07 ? 00:00:08 /sbin/init
root 22 1 0 Sep07 ? 00:00:00 /usr/lib/systemd/systemd-journald
rpc 33 1 0 Sep07 ? 00:00:00 /usr/bin/rpcbind -w -f
dbus 35 1 0 Sep07 ? 00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork -
nscd 40 1 0 Sep07 ? 00:00:00 /usr/sbin/nscd
root 41 1 0 Sep07 ? 00:00:00 /usr/sbin/oddjobd -n -p /run/oddjobd.pid -t 300
root 48 1 0 Sep07 ? 00:00:00 /usr/sbin/gssproxy -D
root 116 1 0 Sep07 ? 00:00:00 /usr/bin/lsyncd -nodaemon /etc/lsyncd.conf
root 117 1 0 Sep07 ? 00:00:00 /usr/sbin/sssd -i --logger=files
root 118 1 0 Sep07 ? 00:00:00 /usr/sbin/sshd -D -oCiphers=aes256-ctr,aes192-ctr,aes128-ct
root 134 117 0 Sep07 ? 00:00:00 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0
root 135 117 0 Sep07 ? 00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root 136 117 0 Sep07 ? 00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root 137 117 0 Sep07 ? 00:00:00 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --logger=file
root 139 1 0 Sep07 ? 00:00:00 /usr/sbin/crond -n
root 4216 0 0 14:08 pts/1 00:00:00 ps -ef
Regarding the gssproxy "socket without peer" error, here is the output of the ss command, run inside the container, for the socket number printed in the dump.log file:
# ss -emrOT | grep 1937494
u_dgr ESTAB 0 0 * 1937494 * 0 users:(("gssproxy",pid=48,tid=58,fd=3),("gssproxy",pid=48,tid=57,fd=3),("gssproxy",pid=48,tid=56,fd=3),("gssproxy",pid=48,tid=55,fd=3),("gssproxy",pid=48,tid=54,fd=3),("gssproxy",pid=48,tid=48,fd=3))
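One possible way to look for the owner of the peer inode is to search /proc/net/unix, which lists Unix sockets with their inode and bound path. The Python sketch below assumes the usual column layout (Num, RefCount, Protocol, Flags, Type, St, Inode, Path). The function names are hypothetical, and because /proc/net/unix is per network namespace, the peer may only be visible from the namespace that owns it (for example the host rather than the container).

```python
def find_unix_socket(proc_net_unix_text, inode):
    """Return the /proc/net/unix rows whose Inode column equals `inode`."""
    rows = []
    # The first line is the column header, so skip it.
    for line in proc_net_unix_text.splitlines()[1:]:
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        if len(fields) >= 7 and fields[6] == str(inode):
            rows.append({
                "type": fields[4],
                "state": fields[5],
                "inode": int(fields[6]),
                # Unbound sockets have no Path column.
                "path": fields[7] if len(fields) > 7 else "",
            })
    return rows


def lookup(inode, path="/proc/net/unix"):
    """Search the given /proc/net/unix file (current namespace by default)."""
    with open(path) as handle:
        return find_unix_socket(handle.read(), inode)
```

Calling lookup(15860) on the host and again inside the container would show whether either namespace knows a bound path for the peer reported in the errors.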
A friendly reminder that this issue had no activity for 30 days.
I would appreciate an update, @avagin.
@vikas-goel Would it be possible to provide the podman commands used to create these containers or Pods? This will help us to replicate the error locally and investigate the issue.