Checkpoint fails when some processes are running

Open vikas-goel opened this issue 3 months ago • 12 comments

Environment

criu-3.19-1.el9.x86_64
criu-libs-3.19-1.el9.x86_64

RHEL 9.6
Kernel 5.14.0-570.12.1.el9_6.x86_64
podman-5.4.0-1.el9.x86_64

Process dbus-daemon

(00.148866) irmap: Scanning /no-such-path hint
(00.148869) irmap: Refresh stat for /no-such-path
(00.148891) Warn  (criu/irmap.c:104): irmap: Can't stat /no-such-path: No such file or directory
(00.148895) Error (criu/fsnotify.c:284): fsnotify:      Can't dump that handle
(00.148914) ----------------------------------------
(00.148949) Error (criu/cr-dump.c:1674): Dump files (pid: 203743) failed with -1
(00.148961) Waiting for 203743 to trap
(00.148984) Daemon 203743 exited trapping
(00.148998) Sent msg to daemon 3 0 0
pie: 35: __fetched msg: 3 0 0
pie: 35: 35: new_sp=0x7ff5c88e9bc8 ip 0x7ff5c73e144b
(00.149149) 203743 was trapped
(00.149184) 203743 was trapped
(00.149191) 203743 (native) is going to execute the syscall 15, required is 15
(00.149217) 203743 was stopped

# ps -f -p 203743
UID          PID    PPID  C STIME TTY          TIME CMD
dbus      203743  203677  0 23:42 ?        00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only

Process nscd

(00.102710) fsnotify:                   link as .
(00.102718) fsnotify:                   openable (inode don't match) as .
(00.102788) Error (criu/fsnotify.c:263): fsnotify: Can't find suitable path for handle (dev 0x18 ino 0x6b01): -2
(00.102803) ----------------------------------------
(00.102827) Error (criu/cr-dump.c:1674): Dump files (pid: 203747) failed with -1
(00.102836) Waiting for 203747 to trap
(00.102857) Daemon 203747 exited trapping
(00.102867) Sent msg to daemon 3 0 0
pie: 39: __fetched msg: 3 0 0
pie: 39: 39: new_sp=0x7f8acf3f7d88 ip 0x7f8ace725487
(00.102971) 203747 was trapped
(00.103001) 203747 was trapped
(00.103006) 203747 (native) is going to execute the syscall 15, required is 15
(00.103029) 203747 was stopped

# ps -ef | grep 203747
nscd        203747  203677  0 23:42 ?        00:00:00 /usr/sbin/nscd

# find / -inum 27393
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:17/device:3a6/device:3a7/uevent
# stat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:17/device:3a6/device:3a7/uevent
  File: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:17/device:3a6/device:3a7/uevent
  Size: 4096            Blocks: 0          IO Block: 4096   regular file
Device: 3eh/62d Inode: 27393       Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-09-06 23:49:40.592170585 -0700
Modify: 2025-09-06 23:49:40.592170585 -0700
Change: 2025-09-06 23:49:40.592170585 -0700
 Birth: -
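
A note for anyone cross-checking these numbers: the CRIU log reports the handle in hex (dev 0x18 ino 0x6b01), while find -inum takes a decimal inode and stat prints the device as 3eh/62d. The sketch below is only an illustration of that conversion, using values copied from the outputs above; the helper name parse_hex is hypothetical, and the comparison assumes the log and stat report the same kind of device ID.

```python
def parse_hex(value: str) -> int:
    """Convert a hex string such as '0x6b01' or '3eh' to an integer."""
    text = value.strip().lower()
    if text.endswith("h"):  # stat prints the device as e.g. '3eh'
        text = text[:-1]
    return int(text, 16)


# Values copied from the nscd log line and the stat output in this post.
log_ino = parse_hex("0x6b01")
log_dev = parse_hex("0x18")
stat_dev = parse_hex("3eh")

print(log_ino)             # 27393, the value passed to find -inum
print(log_dev, stat_dev)   # 24 62
print(log_dev == stat_dev) # False
```

Inode numbers are unique only within one filesystem, so a file found by inode number alone on a different device is not necessarily the file the fsnotify handle refers to.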

Process gssproxy

(00.227518) sockets: Searching for socket 0x130993 family 1
(00.227521) sockets: Searching for socket 0x3df4 family 1
(00.227525) Error (criu/sk-unix.c:418): unix: Unix socket 1247635 without peer 15860
(00.227536) ----------------------------------------
(00.227598) Error (criu/cr-dump.c:1674): Dump files (pid: 203752) failed with -1
(00.227636) Waiting for 203752 to trap
(00.227656) Daemon 203752 exited trapping
(00.227672) Sent msg to daemon 3 0 0
pie: 44: __fetched msg: 3 0 0
pie: 44: 44: new_sp=0x7f281ac67988 ip 0x7f2818c17487
(00.227771) 203752 was trapped
(00.227800) 203752 was trapped
(00.227803) 203752 (native) is going to execute the syscall 15, required is 15
(00.227824) 203752 was stopped

# ps -ef | grep 203752
root      203752  203677  0 23:42 ?        00:00:00 /usr/sbin/gssproxy -D

Process sssd_be

(00.136289) fsnotify:                   link as .
(00.136293) fsnotify:                   openable (inode don't match) as .
(00.136343) Error (criu/fsnotify.c:263): fsnotify: Can't find suitable path for handle (dev 0x18 ino 0x6b01): -2
(00.136356) ----------------------------------------
(00.136370) Error (criu/cr-dump.c:1674): Dump files (pid: 203844) failed with -1
(00.136375) Waiting for 203844 to trap
(00.136393) Daemon 203844 exited trapping
(00.136399) Sent msg to daemon 3 0 0
pie: 134: __fetched msg: 3 0 0
pie: 134: 134: new_sp=0x7fd12be63188 ip 0x7fd132b9144b
(00.136492) 203844 was trapped
(00.136520) 203844 was trapped
(00.136524) 203844 (native) is going to execute the syscall 15, required is 15
(00.136543) 203844 was stopped

# ps -ef | grep 203844
root      203844  203825  0 23:42 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0 --logger=files

Process crond

(00.146395) sockets: Searching for socket 0x1192cb family 1
(00.146403) sockets: Searching for socket 0x3df4 family 1
(00.146411) Error (criu/sk-unix.c:418): unix: Unix socket 1151691 without peer 15860
(00.146424) ----------------------------------------
(00.146452) Error (criu/cr-dump.c:1674): Dump files (pid: 203849) failed with -1
(00.146462) Waiting for 203849 to trap
(00.146480) Daemon 203849 exited trapping
(00.146490) Sent msg to daemon 3 0 0
pie: 139: __fetched msg: 3 0 0
pie: 139: 139: new_sp=0x7f00c8105c88 ip 0x7f00c740c148
(00.146628) 203849 was trapped
(00.146660) 203849 was trapped
(00.146667) 203849 (native) is going to execute the syscall 15, required is 15
(00.146692) 203849 was stopped

# ps -ef | grep 203849
root      203849  203677  0 23:42 ?        00:00:00 /usr/sbin/crond -n
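
Each excerpt above ends with the same generic "Dump files ... failed with -1" line, and the more specific cause is the Error line just before it. As a rough sketch (not part of CRIU; the regular expression and the name extract_errors are assumptions based only on the log lines quoted in this issue), the Error lines could be pulled out of a dump log like this:

```python
import re

# Matches lines such as:
# (00.227525) Error (criu/sk-unix.c:418): unix: Unix socket ... without peer ...
ERROR_RE = re.compile(r"\((\d+\.\d+)\) Error \(([^():]+):(\d+)\): (.*)")


def extract_errors(lines):
    """Return (timestamp, source_file, source_line, message) for each Error line."""
    errors = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            ts, src, lineno, msg = match.groups()
            errors.append((float(ts), src, int(lineno), msg.strip()))
    return errors


# Sample lines copied from the gssproxy excerpt above.
sample = [
    "(00.227525) Error (criu/sk-unix.c:418): unix: Unix socket 1247635 without peer 15860",
    "(00.227598) Error (criu/cr-dump.c:1674): Dump files (pid: 203752) failed with -1",
    "(00.227636) Waiting for 203752 to trap",
]

for err in extract_errors(sample):
    print(err)
```

To run it against a full log, pass an open file object instead of the sample list.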

vikas-goel avatar Sep 07 '25 07:09 vikas-goel

Can you tell us more about the workload you are dumping? Is it a container?

avagin avatar Sep 07 '25 15:09 avagin

It’s a ubi-initd based image running an NFS server and gssproxy, with the ability to connect to an AD/LDAP domain (using SSSD). Users can SSH into the container.

vikas-goel avatar Sep 07 '25 19:09 vikas-goel

@vikas-goel how do you start this container? How do you call criu? If you don't use standard tools like runc or podman, I recommend looking at how C/R is implemented in one of these projects. Basically, CRIU needs some help to properly handle container environments.

avagin avatar Sep 07 '25 19:09 avagin

Podman is used both to start the container and to create the checkpoint.

vikas-goel avatar Sep 07 '25 19:09 vikas-goel

Could you please attach the full CRIU log?

For future reference, please provide as much information as possible. It will save us time and help us fix the problem faster.

avagin avatar Sep 07 '25 20:09 avagin

podman container checkpoint --leave-running --ignore-volumes --file-locks --export abcd.tar.gz abcd-main

gssproxy-dump.log.txt nscd-dump.log.txt sssd_be-dump.log.txt crond-dump.log.txt dbus-dump.log.txt

vikas-goel avatar Sep 08 '25 05:09 vikas-goel

(00.112481) Error (criu/sk-unix.c:418): unix: Unix socket 1321771 without peer 15860

There are a few sockets with this peer. @vikas-goel could you try to find out where this socket is created? Are you dumping a few containers? Are they all completely separate, or do they have any inter-container dependencies? Maybe you can create a reproducer that I can run in my environment?
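
One possible way to look for the owner of the peer socket, sketched under assumptions: on Linux, each open socket appears under /proc/PID/fd as a symlink whose target is socket:[inode], so scanning those links for the peer inode from the log (15860 here) lists the processes holding it. The function name find_socket_holders is hypothetical, and the scan only sees processes visible to the caller, so it would normally be run as root on the host.

```python
import os


def find_socket_holders(inode, proc_root="/proc"):
    """Return sorted (pid, fd) pairs whose fd link points at socket:[inode]."""
    target = f"socket:[{inode}]"
    holders = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return holders  # no procfs available at proc_root
    for entry in entries:
        if not entry.isdigit():  # skip non-PID entries such as 'self'
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed while scanning
            if link == target:
                holders.append((int(entry), int(fd)))
    return sorted(holders)


if __name__ == "__main__":
    # 15860 is the peer inode reported in the dump logs above.
    for pid, fd in find_socket_holders(15860):
        print(f"pid={pid} fd={fd}")
```

An empty result would suggest that the peer is held by a process the caller cannot see, or that it no longer exists.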

avagin avatar Sep 08 '25 15:09 avagin

There is only one container running in the system, in a pod. Podman runs an infra container for every pod. I believe the two containers are independent. I am trying to create a checkpoint for one container.

# podman ps
CONTAINER ID  IMAGE                                    COMMAND     CREATED       STATUS                 PORTS       NAMES
e55f6a4b85e3  localhost/podman-pause:5.4.0-1739375653              16 hours ago  Up 16 hours                        fa390e625f00-infra
228b1ec500df  flex.io/netbackup/main:11.0.1            /sbin/init  16 hours ago  Up 16 hours (healthy)              abcd-main

# podman exec -it abcd-main ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 Sep07 ?        00:00:08 /sbin/init
root          22       1  0 Sep07 ?        00:00:00 /usr/lib/systemd/systemd-journald
rpc           33       1  0 Sep07 ?        00:00:00 /usr/bin/rpcbind -w -f
dbus          35       1  0 Sep07 ?        00:00:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork -
nscd          40       1  0 Sep07 ?        00:00:00 /usr/sbin/nscd
root          41       1  0 Sep07 ?        00:00:00 /usr/sbin/oddjobd -n -p /run/oddjobd.pid -t 300
root          48       1  0 Sep07 ?        00:00:00 /usr/sbin/gssproxy -D
root         116       1  0 Sep07 ?        00:00:00 /usr/bin/lsyncd -nodaemon /etc/lsyncd.conf
root         117       1  0 Sep07 ?        00:00:00 /usr/sbin/sssd -i --logger=files
root         118       1  0 Sep07 ?        00:00:00 /usr/sbin/sshd -D -oCiphers=aes256-ctr,aes192-ctr,aes128-ct
root         134     117  0 Sep07 ?        00:00:00 /usr/libexec/sssd/sssd_be --domain default --uid 0 --gid 0
root         135     117  0 Sep07 ?        00:00:00 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
root         136     117  0 Sep07 ?        00:00:00 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
root         137     117  0 Sep07 ?        00:00:00 /usr/libexec/sssd/sssd_autofs --uid 0 --gid 0 --logger=file
root         139       1  0 Sep07 ?        00:00:00 /usr/sbin/crond -n
root        4216       0  0 14:08 pts/1    00:00:00 ps -ef

Regarding the gssproxy "socket without peer" error, here is the output of the ss command, run inside the container, for the socket number printed in the dump.log file.

# ss -emrOT  | grep 1937494
u_dgr ESTAB 0      0                                                   * 1937494            * 0    users:(("gssproxy",pid=48,tid=58,fd=3),("gssproxy",pid=48,tid=57,fd=3),("gssproxy",pid=48,tid=56,fd=3),("gssproxy",pid=48,tid=55,fd=3),("gssproxy",pid=48,tid=54,fd=3),("gssproxy",pid=48,tid=48,fd=3))                                                                    

vikas-goel avatar Sep 08 '25 21:09 vikas-goel

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Oct 09 '25 00:10 github-actions[bot]

I would appreciate an update, @avagin.

vikas-goel avatar Oct 09 '25 17:10 vikas-goel

@vikas-goel Would it be possible to provide the podman commands used to create these containers or Pods? This will help us to replicate the error locally and investigate the issue.

rst0git avatar Oct 10 '25 10:10 rst0git

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Nov 10 '25 00:11 github-actions[bot]