criu
criu copied to clipboard
page-xfer error during TestUsernsCheckpoint in runc CI
Description
While testing https://github.com/checkpoint-restore/criu/pull/2545 in runc CI (https://github.com/opencontainers/runc/pull/4559), I saw this error three times for the last 36 or so hours. I don't think this is something I've seen before.
The error is always happens on AlmaLinux-8 in cirrus-ci. Here's a copy-paste of all three occurences.
- From https://cirrus-ci.com/task/6207980225429504
=== RUN TestUsernsCheckpoint/0
=== RUN TestUsernsCheckpoint/1
time="2024-12-17T06:57:27Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint12863306293/003/criu/dump.log\""
time="2024-12-17T06:57:27Z" level=warning msg="841:(00.111681) page-xfer: Transferring pages:"
time="2024-12-17T06:57:27Z" level=warning msg="842:(00.111682) page-xfer: \tbuf 1/1"
time="2024-12-17T06:57:27Z" level=warning msg="843:(00.111684) page-xfer: \tp 0x7fff903b8000 [1]"
time="2024-12-17T06:57:27Z" level=warning msg="844:(00.111689) page-xfer: \th 0x7fff903b9000 [1]"
time="2024-12-17T06:57:27Z" level=warning msg="845:(00.111691) page-xfer: Checking 0x7fff903b9000/4096 hole"
time="2024-12-17T06:57:27Z" level=warning msg="846:(00.111693) Error (criu/page-xfer.c:299): page-xfer: Missing 7fff903b9000 in parent pagemap"
time="2024-12-17T06:57:27Z" level=warning msg="847:(00.111697) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7fff903b9000/4096 not found in parent"
time="2024-12-17T06:57:27Z" level=warning msg="848:(00.111716) page-pipe: Killing page pipe"
time="2024-12-17T06:57:27Z" level=warning msg="849:(00.111760) ----------------------------------------"
time="2024-12-17T06:57:27Z" level=warning msg="850:(00.111764) Error (criu/mem.c:672): Can't dump page with parasite"
time="2024-12-17T06:57:27Z" level=warning msg=...
time="2024-12-17T06:57:27Z" level=warning msg="860:(00.112043) net: Unlock network"
time="2024-12-17T06:57:27Z" level=warning msg="861:(00.112046) Running network-unlock scripts"
time="2024-12-17T06:57:27Z" level=warning msg="862:(00.112048) \tRPC"
time="2024-12-17T06:57:27Z" level=warning msg="863:(00.133784) Unfreezing tasks into 1"
time="2024-12-17T06:57:27Z" level=warning msg="864:(00.133799) \tUnseizing 97673 into 1"
time="2024-12-17T06:57:27Z" level=warning msg="865:(00.133822) Error (criu/cr-dump.c:2111): Dumping FAILED."
time="2024-12-17T06:57:27Z" level=warning msg=---
checkpoint_test.go:118: criu failed: type DUMP errno 0
=== RUN TestUsernsCheckpoint/2
=== RUN TestUsernsCheckpoint/3
- From https://cirrus-ci.com/task/6751216950050816:
=== RUN TestUsernsCheckpoint
=== RUN TestUsernsCheckpoint/0
time="2024-12-17T07:10:49Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint02954809155/003/criu/dump.log\""
time="2024-12-17T07:10:49Z" level=warning msg="842:(00.186487) page-xfer: Transferring pages:"
time="2024-12-17T07:10:49Z" level=warning msg="843:(00.186489) page-xfer: \tbuf 1/1"
time="2024-12-17T07:10:49Z" level=warning msg="844:(00.186491) page-xfer: \tp 0x7ffde716c000 [1]"
time="2024-12-17T07:10:49Z" level=warning msg="845:(00.186498) page-xfer: \th 0x7ffde716d000 [1]"
time="2024-12-17T07:10:49Z" level=warning msg="846:(00.186499) page-xfer: Checking 0x7ffde716d000/4096 hole"
time="2024-12-17T07:10:49Z" level=warning msg="847:(00.186502) Error (criu/page-xfer.c:299): page-xfer: Missing 7ffde716d000 in parent pagemap"
time="2024-12-17T07:10:49Z" level=warning msg="848:(00.186506) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7ffde716d000/4096 not found in parent"
time="2024-12-17T07:10:49Z" level=warning msg="849:(00.186529) page-pipe: Killing page pipe"
time="2024-12-17T07:10:49Z" level=warning msg="850:(00.186561) ----------------------------------------"
time="2024-12-17T07:10:49Z" level=warning msg="851:(00.186563) Error (criu/mem.c:672): Can't dump page with parasite"
time="2024-12-17T07:10:49Z" level=warning msg=...
time="2024-12-17T07:10:49Z" level=warning msg="861:(00.186977) net: Unlock network"
time="2024-12-17T07:10:49Z" level=warning msg="862:(00.186981) Running network-unlock scripts"
time="2024-12-17T07:10:49Z" level=warning msg="863:(00.186983) \tRPC"
time="2024-12-17T07:10:49Z" level=warning msg="864:(00.204552) Unfreezing tasks into 1"
time="2024-12-17T07:10:49Z" level=warning msg="865:(00.204578) \tUnseizing 95994 into 1"
time="2024-12-17T07:10:49Z" level=warning msg="866:(00.204602) Error (criu/cr-dump.c:2111): Dumping FAILED."
time="2024-12-17T07:10:49Z" level=warning msg=---
checkpoint_test.go:118: criu failed: type DUMP errno 0
=== RUN TestUsernsCheckpoint/1
=== RUN TestUsernsCheckpoint/2
- From https://cirrus-ci.com/task/5627926906929152?logs=unit_tests_1#L20
=== RUN TestUsernsCheckpoint
time="2024-12-18T04:09:25Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint1601804805/003/criu/dump.log\""
time="2024-12-18T04:09:25Z" level=warning msg="843:(00.143747) page-xfer: Transferring pages:"
time="2024-12-18T04:09:25Z" level=warning msg="844:(00.143748) page-xfer: \tbuf 1/1"
time="2024-12-18T04:09:25Z" level=warning msg="845:(00.143750) page-xfer: \tp 0x7ffcd5eec000 [1]"
time="2024-12-18T04:09:25Z" level=warning msg="846:(00.143756) page-xfer: \th 0x7ffcd5eed000 [1]"
time="2024-12-18T04:09:25Z" level=warning msg="847:(00.143758) page-xfer: Checking 0x7ffcd5eed000/4096 hole"
time="2024-12-18T04:09:25Z" level=warning msg="848:(00.143761) Error (criu/page-xfer.c:299): page-xfer: Missing 7ffcd5eed000 in parent pagemap"
time="2024-12-18T04:09:25Z" level=warning msg="849:(00.143764) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7ffcd5eed000/4096 not found in parent"
time="2024-12-18T04:09:25Z" level=warning msg="850:(00.143793) page-pipe: Killing page pipe"
time="2024-12-18T04:09:25Z" level=warning msg="851:(00.143820) ----------------------------------------"
time="2024-12-18T04:09:25Z" level=warning msg="852:(00.143822) Error (criu/mem.c:672): Can't dump page with parasite"
time="2024-12-18T04:09:25Z" level=warning msg=...
time="2024-12-18T04:09:25Z" level=warning msg="862:(00.144124) net: Unlock network"
time="2024-12-18T04:09:25Z" level=warning msg="863:(00.144129) Running network-unlock scripts"
time="2024-12-18T04:09:25Z" level=warning msg="864:(00.144131) \tRPC"
time="2024-12-18T04:09:25Z" level=warning msg="865:(00.155348) Unfreezing tasks into 1"
time="2024-12-18T04:09:25Z" level=warning msg="866:(00.155361) \tUnseizing 96793 into 1"
time="2024-12-18T04:09:25Z" level=warning msg="867:(00.155389) Error (criu/cr-dump.c:2111): Dumping FAILED."
time="2024-12-18T04:09:25Z" level=warning msg=---
checkpoint_test.go:113: criu failed: type DUMP errno 0
--- FAIL: TestUsernsCheckpoint (0.60s)
Steps to reproduce the issue:
I guess you can download runc and run make localunittest or go test -v -exec sudo -run Checkpoint ./libcontainer/integration.
Describe the results you received:
See above.
Describe the results you expected:
No errors.
Additional information you deem important (e.g. issue happens only occasionally):
CRIU logs and information:
I don't have full logs but let me know if it will be helpful if I amend CI to produce those.
CRIU version is from https://github.com/checkpoint-restore/criu/pull/2545
Additional environment details:
Cirrus-CI, almalinux-8 (see .cirrus.yml in https://github.com/opencontainers/runc/pull/4559)
Two more cases:
- https://cirrus-ci.com/task/4861362922323968
- https://cirrus-ci.com/task/6435038033608704
One more, this time it's non-userns test. https://cirrus-ci.com/task/6077670582124544?logs=unit_tests_1#L750
https://cirrus-ci.com/task/5833066755653632?logs=unit_tests_kir4#L752
Error (criu/page-xfer.c:299): page-xfer: Missing 7fff903b9000 in parent pagemap
@kolyshkin Would it be possible to confirm that the parent images have not been modified by another test?
Error (criu/page-xfer.c:299): page-xfer: Missing 7fff903b9000 in parent pagemap
@kolyshkin Would it be possible to confirm that the parent images have not been modified by another test?
The source code for the test is here: https://github.com/opencontainers/runc/blob/main/libcontainer/integration/checkpoint_test.go
It does not use t.Parallel(), meaning the tests are running one-by-one. Even if we would, each test case run uses its own temp dir.
If you have any ideas of how to confirm this in any other way, let me know.
Note that:
- I can only reproduce this on AlmaLinux 8 in Cirrus-CI environment
- I can't repro this in Cirrus CI using AlmaLinux 9 (newer kernel etc) in an otherwise similar setup
- I can't repro this using GHA CI (Ubuntu 20.04)
- I can't repro this locally using AlmaLinux 8 in a vagrant vm
A friendly reminder that this issue had no activity for 30 days.
Stil seeing this from time to time, and forgot that I already opened this bug. I guess I can try to git-bisect this.
I guess I can try to git-bisect this.
No I can't -- tried to repro this locally in an alma linux 8 VM and it never fails. Only in CI.
This means it can be something bad happening in CI only.
Re-posting this to a correct issue:
It can be a pagemap_scan issue. Could you compile CRIU without pagemap_scan and try it out? You can change the code here: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/kerndat.c#L88
another way is to add
--fault 135to a config file.
Originally posted by @avagin in #2650
It can be a pagemap_scan issue. Could you compile CRIU without pagemap_scan and try it out? You can change the code here: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/kerndat.c#L88 another way is to add
--fault 135to a config file.
No, it's not about PAGEMAP_SCAN per se
[kir@lima-alma8 ~]$ sudo criu -vvvv check --all 2>&1 | grep PAGEMAP_SCAN
(00.000677) PAGEMAP_SCAN isn't supported
[kir@lima-alma8 ~]$ uname -a
Linux lima-alma8 4.18.0-553.16.1.el8_10.x86_64 #1 SMP Thu Aug 8 07:11:46 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
[kir@lima-alma8 ~]$ criu --version
Version: 4.1
but it might be related to some patches added around that time (stock criu-3.18-5.module_el8.10.0+3926+f12484f5 does not have this error).
As it only happens in GHA CI it will take a few days (or weeks?) to bisect it (see e.g. https://github.com/opencontainers/runc/pull/4559).
Alternatively, we can disable these tests in EL8 since it's an older environment and I guess no one cares.
Played with claude a bit, and it suggested that commit cb64d73ad has a bug. When I wanted a fix, it found out that the very bug it found is already fixed in commit 53d3e0c7d by @mihalicyn.
I've checked the repo that we're using (adrian/criu-el8) and the build date (2025-04-09) suggests that the patch (written in May, merged in August) is not included.
@adrianreber can I ask you to build newer criu packages for el8 (in https://copr.fedorainfracloud.org/coprs/adrian/criu-el8/), with commit 53d3e0c7d included?
@avagin you might want to backport the above commit to 4.1.2.
Here's a copy-paste from code:
Great news! The fix has already been implemented in commit 53d3e0c7d ("criu/mem: refactor should_dump_page helper") by Alexander Mikhalitsyn.
The Bug in cb64d73ad
The bug I identified in commit cb64d73ad was exactly what's described in the fix commit:
Problem: should_dump_page() could return -1 (as a u64) when pmc_fill() failed, but the callers didn't expect this. They treated the return value as either:
- vaddr (page should be dumped)
- vaddr + PAGE_SIZE (skip to next page)
- Or other skip-ahead addresses
When -1 was returned (which is 0xFFFFFFFFFFFFFFFF as u64):
- The loop would do: vaddr = 0xFFFFFFFFFFFFFFFF - PAGE_SIZE
- Then increment: vaddr += PAGE_SIZE = 0xFFFFFFFFFFFFFFFF
- The loop condition vaddr < vma->e->end would fail
- Pages at the end of VMAs that needed parent hole entries were silently skipped
- On restore, page-xfer would fail with "Missing X in parent pagemap"
The Fix (commit 53d3e0c7d)
The fix changed should_dump_page() to:
- Return int (0 for success, -1 for error) instead of overloading the return value
- Pass results through a struct page_info structure with: - next: the next address to process - softdirty: whether the page is soft-dirty
- Update all call sites to check the return value: if (should_dump_page(pmc, vma->e, vaddr, &page_info)) return -1; // Proper error handling
This ensures that errors from pmc_fill() are properly propagated and don't cause silent data corruption.