criu icon indicating copy to clipboard operation
criu copied to clipboard

page-xfer error during TestUsernsCheckpoint in runc CI

Open kolyshkin opened this issue 11 months ago • 10 comments
trafficstars

Description

While testing https://github.com/checkpoint-restore/criu/pull/2545 in runc CI (https://github.com/opencontainers/runc/pull/4559), I saw this error three times for the last 36 or so hours. I don't think this is something I've seen before.

The error is always happens on AlmaLinux-8 in cirrus-ci. Here's a copy-paste of all three occurences.

  1. From https://cirrus-ci.com/task/6207980225429504
=== RUN   TestUsernsCheckpoint/0
=== RUN   TestUsernsCheckpoint/1
time="2024-12-17T06:57:27Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint12863306293/003/criu/dump.log\""
time="2024-12-17T06:57:27Z" level=warning msg="841:(00.111681) page-xfer: Transferring pages:"
time="2024-12-17T06:57:27Z" level=warning msg="842:(00.111682) page-xfer: \tbuf 1/1"
time="2024-12-17T06:57:27Z" level=warning msg="843:(00.111684) page-xfer: \tp 0x7fff903b8000 [1]"
time="2024-12-17T06:57:27Z" level=warning msg="844:(00.111689) page-xfer: \th 0x7fff903b9000 [1]"
time="2024-12-17T06:57:27Z" level=warning msg="845:(00.111691) page-xfer: Checking 0x7fff903b9000/4096 hole"
time="2024-12-17T06:57:27Z" level=warning msg="846:(00.111693) Error (criu/page-xfer.c:299): page-xfer: Missing 7fff903b9000 in parent pagemap"
time="2024-12-17T06:57:27Z" level=warning msg="847:(00.111697) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7fff903b9000/4096 not found in parent"
time="2024-12-17T06:57:27Z" level=warning msg="848:(00.111716) page-pipe: Killing page pipe"
time="2024-12-17T06:57:27Z" level=warning msg="849:(00.111760) ----------------------------------------"
time="2024-12-17T06:57:27Z" level=warning msg="850:(00.111764) Error (criu/mem.c:672): Can't dump page with parasite"
time="2024-12-17T06:57:27Z" level=warning msg=...
time="2024-12-17T06:57:27Z" level=warning msg="860:(00.112043) net: Unlock network"
time="2024-12-17T06:57:27Z" level=warning msg="861:(00.112046) Running network-unlock scripts"
time="2024-12-17T06:57:27Z" level=warning msg="862:(00.112048) \tRPC"
time="2024-12-17T06:57:27Z" level=warning msg="863:(00.133784) Unfreezing tasks into 1"
time="2024-12-17T06:57:27Z" level=warning msg="864:(00.133799) \tUnseizing 97673 into 1"
time="2024-12-17T06:57:27Z" level=warning msg="865:(00.133822) Error (criu/cr-dump.c:2111): Dumping FAILED."
time="2024-12-17T06:57:27Z" level=warning msg=---
    checkpoint_test.go:118: criu failed: type DUMP errno 0
=== RUN   TestUsernsCheckpoint/2
=== RUN   TestUsernsCheckpoint/3
  1. From https://cirrus-ci.com/task/6751216950050816:
=== RUN   TestUsernsCheckpoint
=== RUN   TestUsernsCheckpoint/0
time="2024-12-17T07:10:49Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint02954809155/003/criu/dump.log\""
time="2024-12-17T07:10:49Z" level=warning msg="842:(00.186487) page-xfer: Transferring pages:"
time="2024-12-17T07:10:49Z" level=warning msg="843:(00.186489) page-xfer: \tbuf 1/1"
time="2024-12-17T07:10:49Z" level=warning msg="844:(00.186491) page-xfer: \tp 0x7ffde716c000 [1]"
time="2024-12-17T07:10:49Z" level=warning msg="845:(00.186498) page-xfer: \th 0x7ffde716d000 [1]"
time="2024-12-17T07:10:49Z" level=warning msg="846:(00.186499) page-xfer: Checking 0x7ffde716d000/4096 hole"
time="2024-12-17T07:10:49Z" level=warning msg="847:(00.186502) Error (criu/page-xfer.c:299): page-xfer: Missing 7ffde716d000 in parent pagemap"
time="2024-12-17T07:10:49Z" level=warning msg="848:(00.186506) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7ffde716d000/4096 not found in parent"
time="2024-12-17T07:10:49Z" level=warning msg="849:(00.186529) page-pipe: Killing page pipe"
time="2024-12-17T07:10:49Z" level=warning msg="850:(00.186561) ----------------------------------------"
time="2024-12-17T07:10:49Z" level=warning msg="851:(00.186563) Error (criu/mem.c:672): Can't dump page with parasite"
time="2024-12-17T07:10:49Z" level=warning msg=...
time="2024-12-17T07:10:49Z" level=warning msg="861:(00.186977) net: Unlock network"
time="2024-12-17T07:10:49Z" level=warning msg="862:(00.186981) Running network-unlock scripts"
time="2024-12-17T07:10:49Z" level=warning msg="863:(00.186983) \tRPC"
time="2024-12-17T07:10:49Z" level=warning msg="864:(00.204552) Unfreezing tasks into 1"
time="2024-12-17T07:10:49Z" level=warning msg="865:(00.204578) \tUnseizing 95994 into 1"
time="2024-12-17T07:10:49Z" level=warning msg="866:(00.204602) Error (criu/cr-dump.c:2111): Dumping FAILED."
time="2024-12-17T07:10:49Z" level=warning msg=---
    checkpoint_test.go:118: criu failed: type DUMP errno 0
=== RUN   TestUsernsCheckpoint/1
=== RUN   TestUsernsCheckpoint/2
  1. From https://cirrus-ci.com/task/5627926906929152?logs=unit_tests_1#L20
=== RUN   TestUsernsCheckpoint
time="2024-12-18T04:09:25Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint1601804805/003/criu/dump.log\""
time="2024-12-18T04:09:25Z" level=warning msg="843:(00.143747) page-xfer: Transferring pages:"
time="2024-12-18T04:09:25Z" level=warning msg="844:(00.143748) page-xfer: \tbuf 1/1"
time="2024-12-18T04:09:25Z" level=warning msg="845:(00.143750) page-xfer: \tp 0x7ffcd5eec000 [1]"
time="2024-12-18T04:09:25Z" level=warning msg="846:(00.143756) page-xfer: \th 0x7ffcd5eed000 [1]"
time="2024-12-18T04:09:25Z" level=warning msg="847:(00.143758) page-xfer: Checking 0x7ffcd5eed000/4096 hole"
time="2024-12-18T04:09:25Z" level=warning msg="848:(00.143761) Error (criu/page-xfer.c:299): page-xfer: Missing 7ffcd5eed000 in parent pagemap"
time="2024-12-18T04:09:25Z" level=warning msg="849:(00.143764) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7ffcd5eed000/4096 not found in parent"
time="2024-12-18T04:09:25Z" level=warning msg="850:(00.143793) page-pipe: Killing page pipe"
time="2024-12-18T04:09:25Z" level=warning msg="851:(00.143820) ----------------------------------------"
time="2024-12-18T04:09:25Z" level=warning msg="852:(00.143822) Error (criu/mem.c:672): Can't dump page with parasite"
time="2024-12-18T04:09:25Z" level=warning msg=...
time="2024-12-18T04:09:25Z" level=warning msg="862:(00.144124) net: Unlock network"
time="2024-12-18T04:09:25Z" level=warning msg="863:(00.144129) Running network-unlock scripts"
time="2024-12-18T04:09:25Z" level=warning msg="864:(00.144131) \tRPC"
time="2024-12-18T04:09:25Z" level=warning msg="865:(00.155348) Unfreezing tasks into 1"
time="2024-12-18T04:09:25Z" level=warning msg="866:(00.155361) \tUnseizing 96793 into 1"
time="2024-12-18T04:09:25Z" level=warning msg="867:(00.155389) Error (criu/cr-dump.c:2111): Dumping FAILED."
time="2024-12-18T04:09:25Z" level=warning msg=---
    checkpoint_test.go:113: criu failed: type DUMP errno 0
--- FAIL: TestUsernsCheckpoint (0.60s)

Steps to reproduce the issue:

I guess you can download runc and run make localunittest or go test -v -exec sudo -run Checkpoint ./libcontainer/integration.

Describe the results you received:

See above.

Describe the results you expected:

No errors.

Additional information you deem important (e.g. issue happens only occasionally):

CRIU logs and information:

I don't have full logs but let me know if it will be helpful if I amend CI to produce those.

CRIU version is from https://github.com/checkpoint-restore/criu/pull/2545

Additional environment details:

Cirrus-CI, almalinux-8 (see .cirrus.yml in https://github.com/opencontainers/runc/pull/4559)

kolyshkin avatar Dec 18 '24 04:12 kolyshkin

Two more cases:

  1. https://cirrus-ci.com/task/4861362922323968
  2. https://cirrus-ci.com/task/6435038033608704

kolyshkin avatar Dec 18 '24 06:12 kolyshkin

One more, this time it's non-userns test. https://cirrus-ci.com/task/6077670582124544?logs=unit_tests_1#L750

kolyshkin avatar Dec 18 '24 09:12 kolyshkin

https://cirrus-ci.com/task/5833066755653632?logs=unit_tests_kir4#L752

kolyshkin avatar Dec 18 '24 10:12 kolyshkin

Error (criu/page-xfer.c:299): page-xfer: Missing 7fff903b9000 in parent pagemap

@kolyshkin Would it be possible to confirm that the parent images have not been modified by another test?

rst0git avatar Dec 18 '24 10:12 rst0git

Error (criu/page-xfer.c:299): page-xfer: Missing 7fff903b9000 in parent pagemap

@kolyshkin Would it be possible to confirm that the parent images have not been modified by another test?

The source code for the test is here: https://github.com/opencontainers/runc/blob/main/libcontainer/integration/checkpoint_test.go

It does not use t.Parallel(), meaning the tests are running one-by-one. Even if we would, each test case run uses its own temp dir.

If you have any ideas of how to confirm this in any other way, let me know.

kolyshkin avatar Dec 18 '24 21:12 kolyshkin

Note that:

  • I can only reproduce this on AlmaLinux 8 in Cirrus-CI environment
  • I can't repro this in Cirrus CI using AlmaLinux 9 (newer kernel etc) in an otherwise similar setup
  • I can't repro this using GHA CI (Ubuntu 20.04)
  • I can't repro this locally using AlmaLinux 8 in a vagrant vm

kolyshkin avatar Dec 18 '24 21:12 kolyshkin

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Jan 18 '25 00:01 github-actions[bot]

Stil seeing this from time to time, and forgot that I already opened this bug. I guess I can try to git-bisect this.

kolyshkin avatar Apr 18 '25 23:04 kolyshkin

I guess I can try to git-bisect this.

No I can't -- tried to repro this locally in an alma linux 8 VM and it never fails. Only in CI.

This means it can be something bad happening in CI only.

kolyshkin avatar Apr 19 '25 20:04 kolyshkin

Re-posting this to a correct issue:

It can be a pagemap_scan issue. Could you compile CRIU without pagemap_scan and try it out? You can change the code here: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/kerndat.c#L88

another way is to add --fault 135 to a config file.

Originally posted by @avagin in #2650

kolyshkin avatar Apr 29 '25 20:04 kolyshkin

It can be a pagemap_scan issue. Could you compile CRIU without pagemap_scan and try it out? You can change the code here: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/kerndat.c#L88 another way is to add --fault 135 to a config file.

Originally posted by @avagin in #2650

No, it's not about PAGEMAP_SCAN per se

[kir@lima-alma8 ~]$ sudo criu -vvvv check --all 2>&1 | grep PAGEMAP_SCAN
(00.000677) PAGEMAP_SCAN isn't supported
[kir@lima-alma8 ~]$ uname -a
Linux lima-alma8 4.18.0-553.16.1.el8_10.x86_64 #1 SMP Thu Aug 8 07:11:46 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
[kir@lima-alma8 ~]$ criu --version
Version: 4.1

but it might be related to some patches added around that time (stock criu-3.18-5.module_el8.10.0+3926+f12484f5 does not have this error).

As it only happens in GHA CI it will take a few days (or weeks?) to bisect it (see e.g. https://github.com/opencontainers/runc/pull/4559).

Alternatively, we can disable these tests in EL8 since it's an older environment and I guess no one cares.

kolyshkin avatar Oct 20 '25 21:10 kolyshkin

Played with claude a bit, and it suggested that commit cb64d73ad has a bug. When I wanted a fix, it found out that the very bug it found is already fixed in commit 53d3e0c7d by @mihalicyn.

I've checked the repo that we're using (adrian/criu-el8) and the build date (2025-04-09) suggests that the patch (written in May, merged in August) is not included.

@adrianreber can I ask you to build newer criu packages for el8 (in https://copr.fedorainfracloud.org/coprs/adrian/criu-el8/), with commit 53d3e0c7d included?

@avagin you might want to backport the above commit to 4.1.2.


Here's a copy-paste from code:

Great news! The fix has already been implemented in commit 53d3e0c7d ("criu/mem: refactor should_dump_page helper") by Alexander Mikhalitsyn.

The Bug in cb64d73ad

The bug I identified in commit cb64d73ad was exactly what's described in the fix commit:

Problem: should_dump_page() could return -1 (as a u64) when pmc_fill() failed, but the callers didn't expect this. They treated the return value as either:

  • vaddr (page should be dumped)
  • vaddr + PAGE_SIZE (skip to next page)
  • Or other skip-ahead addresses

When -1 was returned (which is 0xFFFFFFFFFFFFFFFF as u64):

  1. The loop would do: vaddr = 0xFFFFFFFFFFFFFFFF - PAGE_SIZE
  2. Then increment: vaddr += PAGE_SIZE = 0xFFFFFFFFFFFFFFFF
  3. The loop condition vaddr < vma->e->end would fail
  4. Pages at the end of VMAs that needed parent hole entries were silently skipped
  5. On restore, page-xfer would fail with "Missing X in parent pagemap"

The Fix (commit 53d3e0c7d)

The fix changed should_dump_page() to:

  1. Return int (0 for success, -1 for error) instead of overloading the return value
  2. Pass results through a struct page_info structure with: - next: the next address to process - softdirty: whether the page is soft-dirty
  3. Update all call sites to check the return value: if (should_dump_page(pmc, vma->e, vaddr, &page_info)) return -1; // Proper error handling

This ensures that errors from pmc_fill() are properly propagated and don't cause silent data corruption.

kolyshkin avatar Oct 20 '25 22:10 kolyshkin