Restoring PACG on older ARM64 CPU hangs

Open hanwen-flow opened this issue 1 month ago • 10 comments

I see occasional failures restoring a set of processes in podman.

The symptom is a timeout. Some debugging shows that it is hanging here:

(gdb) bt
#0  0x0000ffff95607be4 in syscall () from /lib/aarch64-linux-gnu/libc.so.6
#1  0x0000aaaae8a1fc64 in sys_futex (addr2=0x0, val3=0, timeout=0xffffc4786258, val1=<optimized out>, op=0, 
    addr1=0xffff9595b00c) at include/common/lock.h:29
#2  __restore_wait_inprogress_tasks (participants=participants@entry=0) at criu/cr-restore.c:182
#3  0x0000aaaae8a21078 in restore_wait_inprogress_tasks () at criu/cr-restore.c:194
#4  restore_switch_stage (next_stage=5) at criu/cr-restore.c:224
#5  restore_root_task (init=<optimized out>) at criu/cr-restore.c:2213
#6  0x0000aaaae8a220fc in cr_restore_tasks () at criu/cr-restore.c:2417
#7  0x0000aaaae8a27554 in restore_using_req (req=<optimized out>, sk=3) at criu/cr-service.c:889
#8  cr_service_work (sk=3) at criu/cr-service.c:1365
#9  0x0000aaaae89f5f3c in main (argc=3, argv=0xffffc4786758, envp=<optimized out>) at criu/crtools.c:191
(gdb) up
#1  0x0000aaaae8a1fc64 in sys_futex (addr2=0x0, val3=0, timeout=0xffffc4786258, val1=<optimized out>, op=0, 
    addr1=0xffff9595b00c) at include/common/lock.h:29
29	include/common/lock.h: No such file or directory.
(gdb) 
#2  __restore_wait_inprogress_tasks (participants=participants@entry=0) at criu/cr-restore.c:182
182	criu/cr-restore.c: No such file or directory.
(gdb) p task_entries->nr_in_progress
Cannot access memory at address 0xaaaae8b5d1b0
(gdb) p &task_entries->nr_in_progress
Cannot access memory at address 0xaaaae8b5d1b0

The last lines in restore.log are:

(05.342893) pie: 134: restoring lsm profile (current) changeprofile containers-default-engflow
(05.343043) pie: 132: seccomp: Restored mode 2 on tid 132
(05.343086) pie: 132: restoring lsm profile (current) changeprofile containers-default-engflow

(I changed the profile name from its default.)

This happens occasionally on AWS ARM64 machines. We run several machine types; the machine with the above hang was a c6gd.2xlarge. Its cpuinfo:

processor	: 0
BogoMIPS	: 243.75
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x3
CPU part	: 0xd0c
CPU revision	: 1

The problem is machine-specific: the exact same snapshot restores correctly on a different machine, but on the affected machine the hang reproduces.

I am using a locally modified version of

commit c61329b30387aa50634e794a4781dde64cb2a6c3
Author: Radostin Stoyanov <[email protected]>
Date:   Sun May 11 11:33:29 2025 +0100

    seize: fix pause devices for frozen containers

(The modification is a minor tweak to symlink the lazy pages socket and should be unrelated.) The same version has been working reliably on x64.

hanwen-flow avatar Nov 14 '25 11:11 hanwen-flow

It seems to relate to the machine type. The machine type that seems to work is c7gd; the one that hangs is c6gd. They are "AWS Graviton3" (working) and "AWS Graviton2" (broken), respectively.

hanwen-flow avatar Nov 14 '25 11:11 hanwen-flow

We have another similar issue: https://github.com/checkpoint-restore/criu/issues/2720

@hanwen-flow could you attach the full log? If you see restore_wait_inprogress_tasks in the backtrace, it means one of the restored tasks hasn't completed the restore process. Could you try to look at the child processes?
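The comment does not say what to look for in the children. One possible starting point (an assumption, not CRIU guidance) is to walk the process tree under the hung criu process and print each task's state and kernel wait channel from /proc. The sketch below assumes a Linux /proc with the per-thread `children` file available; the helper names `parse_status`, `children_of`, and `dump_tree` are made up for illustration.

```python
import os


def parse_status(status_text):
    """Pick Name, State and PPid out of the text of /proc/<pid>/status."""
    wanted = {"Name", "State", "PPid"}
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            fields[key] = value.strip()
    return fields


def read_text(path):
    """Read a small /proc file, returning '' if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


def children_of(pid):
    """Direct children of pid, gathered from every thread's children file."""
    kids = []
    task_dir = f"/proc/{pid}/task"
    try:
        tids = os.listdir(task_dir)
    except OSError:
        return kids
    for tid in tids:
        kids += [int(c) for c in read_text(f"{task_dir}/{tid}/children").split()]
    return kids


def dump_tree(pid, depth=0):
    """Print pid and its descendants with state and wait channel."""
    info = parse_status(read_text(f"/proc/{pid}/status"))
    wchan = read_text(f"/proc/{pid}/wchan").strip()
    print("  " * depth + f"{pid} {info.get('Name', '?')} "
          f"state={info.get('State', '?')} wchan={wchan or '-'}")
    for child in children_of(pid):
        dump_tree(child, depth + 1)
```

Calling `dump_tree(pid_of_criu)` as root on the affected machine would list the restored tasks; a child that is a zombie or stuck in an unexpected state is a candidate for the task that never finished its restore stage.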

avagin avatar Nov 14 '25 19:11 avagin

We have another similar issue: #2720

Actually, issue #2720 is not like this one. However, since commit c61329b30387aa50634e794a4781dde64cb2a6c3, there have been a few ARM fixes that might be related to this issue:

64276874d89825452baee6c756046e1277a41c48 restore: flush caches during restore
95d5e2e59b1b83ba5400e7eea6db57f77424fb80 compel: flush caches after parasite injection
dcee5bd6ff2d632bd4e1d4d09d2ffb2bf683d6a2 make: Disable branch-protection for PIE code on ARM64

avagin avatar Nov 15 '25 16:11 avagin

A full log from a similar hang is attached here:

restore (1).log

hanwen-flow avatar Nov 15 '25 22:11 hanwen-flow

I looked at the changes, but they looked like they had different symptoms. But yes, I can upgrade and see if it helps.

Could you try to look at child processes?

What should I be looking for?

hanwen-flow avatar Nov 15 '25 22:11 hanwen-flow

(05.176328) Error (criu/arch/aarch64/crtools.c:285): PACG support is required from the source system.

The issue is that a dumped process used the Pointer Authentication Code (PAC) CPU extension (specifically PACG), which was enabled on the source platform. The target platform lacks support for this extension. This should be a fatal error, but CRIU did not abort the restore process.
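A quick way to check whether a host advertises PAC is to look at the Features line of /proc/cpuinfo, where the arm64 kernel lists the flags as `paca` and `pacg`. This is a minimal sketch assuming that format; the helper names are hypothetical, and the sample string is the Features line posted for the c6gd.2xlarge earlier in this issue.

```python
def cpu_features(cpuinfo_text):
    """Return the set of flags from the first 'Features' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def has_pac(features):
    """Report which of the two PAC flags (paca, pacg) are present."""
    return {name: name in features for name in ("paca", "pacg")}


# Features line copied from the affected machine in this issue.
GRAVITON2_SAMPLE = (
    "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp "
    "asimdhp cpuid asimdrdm lrcpc dcpop asimddp\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = GRAVITON2_SAMPLE  # fall back to the sample when /proc is unavailable
    print(has_pac(cpu_features(text)))
```

For the sample line, both flags come back as absent, which matches the "PACG support is required from the source system" error in the log.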

avagin avatar Nov 16 '25 00:11 avagin

Interesting, so I guess we created the snapshot on Graviton3, and restoring it on Graviton2 failed. That makes sense, because we saw other cases where Graviton2 worked (those must have been snapshots created on the same platform).

Can similar problems occur on x86?

The binaries involved are identical across platforms, so this is a runtime decision. Do you know of a generic way to restrict features that affect CRIU operation?

hanwen-flow avatar Nov 16 '25 07:11 hanwen-flow

Can similar problems occur on x86?

@hanwen-flow PAC is an AArch64 architecture feature. The error "PACG support is required" was introduced with https://github.com/checkpoint-restore/criu/pull/2609 and indicates that PAC was used during checkpointing.

The binaries involved are identical across platforms, so this is a runtime decision. Do you know of a generic way to restrict features that affect CRIU operation?

There are some compiler options that can be used to disable branch protection: https://developer.arm.com/documentation/109576/0100/Tools-and-software-support/Compiler-options

rst0git avatar Nov 17 '25 09:11 rst0git

Can similar problems occur on x86?

I would say yes, if you have code that detects certain features at runtime and uses instructions that the destination CPU does not have. You cannot really migrate to an older CPU on any architecture if some features the code uses are missing. The same problem also exists, more or less, with VMs: if you limit your VM to not use all features of the host CPU, it can be migrated to older CPUs. I am not sure how to disable newer CPU features in a process. Depending on the application, there might be some setting to not use all of the latest CPU features.
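As a rough pre-flight check before moving a snapshot between hosts, the advertised feature lists of the two machines can be compared. This is a sketch only: the target list is the Features line from the c6gd.2xlarge in this issue, while the source list is hypothetical (the same flags plus `paca` and `pacg`), and the function names are made up.

```python
def feature_set(features_line):
    """Split a space-separated feature list into a set."""
    return set(features_line.split())


def missing_on_target(source_features, target_features):
    """Features the checkpoint host had that the restore host lacks."""
    return sorted(feature_set(source_features) - feature_set(target_features))


# Restore host: the Features list posted for the c6gd.2xlarge in this issue.
TARGET = ("fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp "
          "cpuid asimdrdm lrcpc dcpop asimddp")
# Checkpoint host: HYPOTHETICAL list (the target's flags plus paca/pacg),
# not taken from a real machine.
SOURCE = TARGET + " paca pacg"

print(missing_on_target(SOURCE, TARGET))  # prints ['paca', 'pacg']
```

An empty result does not prove that a restore is safe, since a process may have probed the CPU through other channels, but a non-empty one flags a migration that is likely to fail.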

adrianreber avatar Nov 17 '25 12:11 adrianreber

Can similar problems occur on x86? The binaries involved are identical across platforms, so this is a runtime decision. Do you know of a generic way to restrict features that affect CRIU operation?

@hanwen-flow Yes, this can happen on x86, but it has become less critical in the last few years because no new features of this type have been added. We are very close to the moment when the shadow stack will be enabled by default, and this question will be raised again on x86 as well.

As for solutions, we've discussed this problem many times, and some of our users have out-of-tree solutions. OpenVZ had custom changes in their kernel. Google solved this problem in their libraries. However, no one has yet suggested a valuable upstream solution. We always considered filtering CPUID and adjusting all related kernel mechanisms. That approach looks too intrusive.

Yesterday, I started thinking that we can introduce the ability to mask some features from AT_HWCAP vectors. This is a much simpler feature and should work for most users. Here is my draft implementation: https://github.com/avagin/linux-task-diag/commit/ca32ef4c5edee82f4f06f98d6760d1a58c0af345
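To make the masking idea concrete from the application's side: code that picks its instruction set from the AT_HWCAP bits would see a feature as absent once its bit is cleared. The sketch below only illustrates that bit arithmetic and is not the linked draft implementation. The two constants follow the arm64 uapi header asm/hwcap.h; the function names and the example hwcap value are hypothetical.

```python
# Bit values as defined in the arm64 uapi header asm/hwcap.h.
HWCAP_PACA = 1 << 30
HWCAP_PACG = 1 << 31


def mask_hwcap(hwcap, hidden):
    """Return hwcap with every bit in `hidden` cleared."""
    return hwcap & ~hidden


def advertises(hwcap, bit):
    """True if the given feature bit is set in hwcap."""
    return bool(hwcap & bit)


# Made-up hwcap value: both PAC bits plus two unrelated low bits.
hwcap = HWCAP_PACA | HWCAP_PACG | 0b11
masked = mask_hwcap(hwcap, HWCAP_PACA | HWCAP_PACG)

print(advertises(hwcap, HWCAP_PACG), advertises(masked, HWCAP_PACG))
```

A process started with the masked value would not opt into PAC at runtime, provided it consults the hwcap bits rather than probing the CPU directly.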

avagin avatar Nov 18 '25 21:11 avagin

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Dec 19 '25 00:12 github-actions[bot]