runc

flaky test: TestUsernsCheckpoint

Open lifubang opened this issue 1 year ago • 6 comments

I have seen this happen many times on CentOS 7.

=== RUN   TestUsernsCheckpoint
time="2024-05-07T10:08:51Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint611938415/003/criu-parent/dump.log\""
time="2024-05-07T10:08:51Z" level=warning msg="116:(09.514467) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="117:(09.614644) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="118:(09.714816) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="119:(09.814957) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="120:(09.915110) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="121:(10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0"
time="2024-05-07T10:08:51Z" level=warning msg="122:(10.000563) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="123:(10.000694) Error (compel/src/lib/infect.c:234): Unseizable non-zombie 9017 found, state D, err -1/10"
time="2024-05-07T10:08:51Z" level=warning msg="124:(10.000773) Unfreezing tasks into 1"
time="2024-05-07T10:08:51Z" level=warning msg="125:(10.000778) \tUnseizing 9017 into 1"
time="2024-05-07T10:08:51Z" level=warning msg="126:(10.000783) Error (compel/src/lib/infect.c:355): Unable to detach from 9017: No such process"
time="2024-05-07T10:08:51Z" level=warning msg="127:(10.000800) Writing image inventory (version 1)"
time="2024-05-07T10:08:51Z" level=warning msg="128:(10.000976) Error (criu/cr-dump.c:1581): Pre-dumping FAILED."
time="2024-05-07T10:08:51Z" level=warning msg=---
    checkpoint_test.go:115: === /tmp/TestUsernsCheckpoint611938415/003/criu-parent/dump.log ===
    checkpoint_test.go:115: (00.000052) Version: 3.16 (gitid 0)
    checkpoint_test.go:115: (00.000067) Running on cirrus-task-5639495050067968 Linux 3.10.0-1160.114.2.el7.x86_64 #1 SMP Wed Mar 20 15:54:52 UTC 2024 x86_64
    checkpoint_test.go:115: (00.000070) Would overwrite RPC settings with values from /etc/criu/runc.conf
    checkpoint_test.go:115: (00.000094) Loaded kdat cache from /run/criu/criu.kdat
    checkpoint_test.go:115: (00.000142) rlimit: RLIMIT_NOFILE unlimited for self
    checkpoint_test.go:115: (00.000148) Enforcing memory tracking for pre-dump.
    checkpoint_test.go:115: (00.000156) Enforcing tasks run after pre-dump.
    checkpoint_test.go:115: (00.000170) irmap: Searching irmap cache in work dir
    checkpoint_test.go:115: (00.000180) No irmap-cache image
    checkpoint_test.go:115: (00.000181) irmap: Searching irmap cache in parent
    checkpoint_test.go:115: (00.000185) No parent images directory provided
    checkpoint_test.go:115: (00.000187) irmap: No irmap cache
    checkpoint_test.go:115: (00.000205) cpu: x86_family 25 x86_vendor_id AuthenticAMD x86_model_id AMD EPYC 7B13
    checkpoint_test.go:115: (00.000210) cpu: fpu: xfeatures_mask 0x5 xsave_size 832 xsave_size_max 2440 xsaves_size 832
    checkpoint_test.go:115: (00.000213) cpu: fpu: x87 floating point registers     xstate_offsets      0 / 0      xstate_sizes    160 / 160   
    checkpoint_test.go:115: (00.000215) cpu: fpu: AVX registers                    xstate_offsets    576 / 576    xstate_sizes    256 / 256   
    checkpoint_test.go:115: (00.000217) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:0
    checkpoint_test.go:115: (00.000338) Detected cgroup V1 freezer
    checkpoint_test.go:115: (00.000340) freezing processes: 100000 attempts with 100 ms steps
    checkpoint_test.go:115: (00.000351) freezer.state=THAWED
    checkpoint_test.go:115: (00.000358) freezer.state=FREEZING
    checkpoint_test.go:115: (00.100446) freezer.state=FREEZING
    checkpoint_test.go:115: (00.201766) freezer.state=FREEZING
    checkpoint_test.go:115: (00.301871) freezer.state=FREEZING
    checkpoint_test.go:115: (00.401990) freezer.state=FREEZING
    checkpoint_test.go:115: (00.502110) freezer.state=FREEZING
    checkpoint_test.go:115: (00.602214) freezer.state=FREEZING
    checkpoint_test.go:115: (00.702313) freezer.state=FREEZING
    checkpoint_test.go:115: (00.802425) freezer.state=FREEZING
    checkpoint_test.go:115: (00.902531) freezer.state=FREEZING
    checkpoint_test.go:115: (01.002635) freezer.state=FREEZING
    checkpoint_test.go:115: (01.102755) freezer.state=FREEZING
    checkpoint_test.go:115: (01.202870) freezer.state=FREEZING
    checkpoint_test.go:115: (01.303058) freezer.state=FREEZING
    checkpoint_test.go:115: (01.403208) freezer.state=FREEZING
    checkpoint_test.go:115: (01.503308) freezer.state=FREEZING
    checkpoint_test.go:115: (01.603429) freezer.state=FREEZING
    checkpoint_test.go:115: (01.703589) freezer.state=FREEZING
    checkpoint_test.go:115: (01.803726) freezer.state=FREEZING
    checkpoint_test.go:115: (01.903872) freezer.state=FREEZING
    checkpoint_test.go:115: (02.004022) freezer.state=FREEZING
    checkpoint_test.go:115: (02.104139) freezer.state=FREEZING
    checkpoint_test.go:115: (02.204270) freezer.state=FREEZING
    checkpoint_test.go:115: (02.304422) freezer.state=FREEZING
    checkpoint_test.go:115: (02.404578) freezer.state=FREEZING
    checkpoint_test.go:115: (02.504717) freezer.state=FREEZING
    checkpoint_test.go:115: (02.604860) freezer.state=FREEZING
    checkpoint_test.go:115: (02.704987) freezer.state=FREEZING
    checkpoint_test.go:115: (02.805144) freezer.state=FREEZING
    checkpoint_test.go:115: (02.905275) freezer.state=FREEZING
    checkpoint_test.go:115: (03.005410) freezer.state=FREEZING
    checkpoint_test.go:115: (03.105546) freezer.state=FREEZING
    checkpoint_test.go:115: (03.205676) freezer.state=FREEZING
    checkpoint_test.go:115: (03.305821) freezer.state=FREEZING
    checkpoint_test.go:115: (03.405941) freezer.state=FREEZING
    checkpoint_test.go:115: (03.506057) freezer.state=FREEZING
    checkpoint_test.go:115: (03.606181) freezer.state=FREEZING
    checkpoint_test.go:115: (03.706322) freezer.state=FREEZING
    checkpoint_test.go:115: (03.806446) freezer.state=FREEZING
    checkpoint_test.go:115: (03.906569) freezer.state=FREEZING
    checkpoint_test.go:115: (04.006738) freezer.state=FREEZING
    checkpoint_test.go:115: (04.106903) freezer.state=FREEZING
    checkpoint_test.go:115: (04.207032) freezer.state=FREEZING
    checkpoint_test.go:115: (04.307154) freezer.state=FREEZING
    checkpoint_test.go:115: (04.407273) freezer.state=FREEZING
    checkpoint_test.go:115: (04.507399) freezer.state=FREEZING
    checkpoint_test.go:115: (04.607502) freezer.state=FREEZING
    checkpoint_test.go:115: (04.707592) freezer.state=FREEZING
    checkpoint_test.go:115: (04.807698) freezer.state=FREEZING
    checkpoint_test.go:115: (04.907829) freezer.state=FREEZING
    checkpoint_test.go:115: (05.007957) freezer.state=FREEZING
    checkpoint_test.go:115: (05.108092) freezer.state=FREEZING
    checkpoint_test.go:115: (05.208199) freezer.state=FREEZING
    checkpoint_test.go:115: (05.308309) freezer.state=FREEZING
    checkpoint_test.go:115: (05.408418) freezer.state=FREEZING
    checkpoint_test.go:115: (05.508566) freezer.state=FREEZING
    checkpoint_test.go:115: (05.608724) freezer.state=FREEZING
    checkpoint_test.go:115: (05.708885) freezer.state=FREEZING
    checkpoint_test.go:115: (05.809035) freezer.state=FREEZING
    checkpoint_test.go:115: (05.909159) freezer.state=FREEZING
    checkpoint_test.go:115: (06.009283) freezer.state=FREEZING
    checkpoint_test.go:115: (06.109410) freezer.state=FREEZING
    checkpoint_test.go:115: (06.209537) freezer.state=FREEZING
    checkpoint_test.go:115: (06.309662) freezer.state=FREEZING
    checkpoint_test.go:115: (06.409787) freezer.state=FREEZING
    checkpoint_test.go:115: (06.509905) freezer.state=FREEZING
    checkpoint_test.go:115: (06.610031) freezer.state=FREEZING
    checkpoint_test.go:115: (06.710165) freezer.state=FREEZING
    checkpoint_test.go:115: (06.810288) freezer.state=FREEZING
    checkpoint_test.go:115: (06.910416) freezer.state=FREEZING
    checkpoint_test.go:115: (07.010552) freezer.state=FREEZING
    checkpoint_test.go:115: (07.110678) freezer.state=FREEZING
    checkpoint_test.go:115: (07.210806) freezer.state=FREEZING
    checkpoint_test.go:115: (07.310933) freezer.state=FREEZING
    checkpoint_test.go:115: (07.411069) freezer.state=FREEZING
    checkpoint_test.go:115: (07.511252) freezer.state=FREEZING
    checkpoint_test.go:115: (07.611415) freezer.state=FREEZING
    checkpoint_test.go:115: (07.711588) freezer.state=FREEZING
    checkpoint_test.go:115: (07.811742) freezer.state=FREEZING
    checkpoint_test.go:115: (07.911897) freezer.state=FREEZING
    checkpoint_test.go:115: (08.012029) freezer.state=FREEZING
    checkpoint_test.go:115: (08.112217) freezer.state=FREEZING
    checkpoint_test.go:115: (08.212392) freezer.state=FREEZING
    checkpoint_test.go:115: (08.312553) freezer.state=FREEZING
    checkpoint_test.go:115: (08.412734) freezer.state=FREEZING
    checkpoint_test.go:115: (08.512909) freezer.state=FREEZING
    checkpoint_test.go:115: (08.613067) freezer.state=FREEZING
    checkpoint_test.go:115: (08.713220) freezer.state=FREEZING
    checkpoint_test.go:115: (08.813373) freezer.state=FREEZING
    checkpoint_test.go:115: (08.913548) freezer.state=FREEZING
    checkpoint_test.go:115: (09.013704) freezer.state=FREEZING
    checkpoint_test.go:115: (09.113850) freezer.state=FREEZING
    checkpoint_test.go:115: (09.213999) freezer.state=FREEZING
    checkpoint_test.go:115: (09.314151) freezer.state=FREEZING
    checkpoint_test.go:115: (09.414305) freezer.state=FREEZING
    checkpoint_test.go:115: (09.514467) freezer.state=FREEZING
    checkpoint_test.go:115: (09.614644) freezer.state=FREEZING
    checkpoint_test.go:115: (09.714816) freezer.state=FREEZING
    checkpoint_test.go:115: (09.814957) freezer.state=FREEZING
    checkpoint_test.go:115: (09.915110) freezer.state=FREEZING
    checkpoint_test.go:115: (10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0
    checkpoint_test.go:115: (10.000563) freezer.state=FREEZING
    checkpoint_test.go:115: (10.000694) Error (compel/src/lib/infect.c:234): Unseizable non-zombie 9017 found, state D, err -1/10
    checkpoint_test.go:115: (10.000773) Unfreezing tasks into 1
    checkpoint_test.go:115: (10.000778) 	Unseizing 9017 into 1
    checkpoint_test.go:115: (10.000783) Error (compel/src/lib/infect.c:355): Unable to detach from 9017: No such process
    checkpoint_test.go:115: (10.000800) Writing image inventory (version 1)
    checkpoint_test.go:115: (10.000976) Error (criu/cr-dump.c:1581): Pre-dumping FAILED.
    checkpoint_test.go:115: === END ===
    checkpoint_test.go:119: criu failed: type PRE_DUMP errno 0
--- FAIL: TestUsernsCheckpoint (10.31s)

lifubang, May 07 '24 10:05

I've seen this a few times, too.

@lifubang this means that the kernel can't freeze the cgroup despite repeated attempts, so criu gives up.

Alas, this might be a kernel issue, and the CentOS 7 kernel is too old. In general, cgroup freezer is not very reliable, I previously had to implement some hacks in runc to work around it (see #2941 and the earlier PRs linked from there).

We can either try to add similar kludges to https://github.com/checkpoint-restore/criu, or skip these tests on CentOS 7.
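To illustrate the kind of kludge meant here, below is a minimal Go sketch of a freeze loop that retries and occasionally thaws when the state stays stuck at FREEZING. It is written in the spirit of the runc workaround mentioned above, but it is not the actual runc or criu code: the `stateFile` interface, the `fakeFreezer` type, and all retry counts are hypothetical and exist only so the example runs without a real cgroup.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// stateFile abstracts a cgroup v1 freezer.state file.
// Hypothetical interface for this sketch; not part of runc or criu.
type stateFile interface {
	Write(state string) error
	Read() (string, error)
}

// freezeWithRetries writes FROZEN and polls the state up to `attempts` times.
// Every thawEvery-th attempt it thaws the cgroup first, on the assumption
// that a stuck FREEZING state sometimes clears after a brief thaw.
func freezeWithRetries(f stateFile, attempts, thawEvery int, pause time.Duration) error {
	for i := 0; i < attempts; i++ {
		if thawEvery > 0 && i > 0 && i%thawEvery == 0 {
			if err := f.Write("THAWED"); err != nil {
				return err
			}
			time.Sleep(pause)
		}
		if err := f.Write("FROZEN"); err != nil {
			return err
		}
		time.Sleep(pause)
		state, err := f.Read()
		if err != nil {
			return err
		}
		switch state {
		case "FROZEN":
			return nil
		case "FREEZING":
			continue // not there yet, try again
		default:
			return fmt.Errorf("unexpected freezer state %q", state)
		}
	}
	// Give up, but do not leave the cgroup half-frozen.
	_ = f.Write("THAWED")
	return errors.New("unable to freeze")
}

// fakeFreezer reports FROZEN only after frozenAfter FROZEN writes,
// imitating a kernel that needs several attempts.
type fakeFreezer struct {
	frozenWrites int
	frozenAfter  int
	state        string
}

func (f *fakeFreezer) Write(s string) error {
	if s != "FROZEN" {
		f.state = s
		return nil
	}
	f.frozenWrites++
	if f.frozenWrites >= f.frozenAfter {
		f.state = "FROZEN"
	} else {
		f.state = "FREEZING"
	}
	return nil
}

func (f *fakeFreezer) Read() (string, error) { return f.state, nil }

func main() {
	f := &fakeFreezer{frozenAfter: 3, state: "THAWED"}
	err := freezeWithRetries(f, 10, 5, time.Millisecond)
	fmt.Println(err, f.state)
}
```

The sketch always thaws before giving up, so a failed attempt does not leave the container's processes stuck in a partially frozen state.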

kolyshkin, May 23 '24 20:05

> skip these tests on CentOS 7.

I have to rerun the CentOS 7 tests manually many times, so let's skip them on CentOS 7?

lifubang, Jun 01 '24 23:06

😢 It appears on Ubuntu now.

https://github.com/opencontainers/runc/actions/runs/9342659593/job/25711108530?pr=4300

Failure logs

=== RUN   TestCheckpoint
    checkpoint_test.go:115: === /tmp/TestCheckpoint1478934365/003/criu-parent/dump.log ===
    checkpoint_test.go:115: (00.000021) Version: 3.19 (gitid 5c35d75)
    checkpoint_test.go:115: (00.000035) Running on fv-az691-944 Linux 5.15.0-1064-azure #73~20.04.1-Ubuntu SMP Mon May 6 09:43:44 UTC 2024 x86_64
    checkpoint_test.go:115: (00.000038) Would overwrite RPC settings with values from /etc/criu/runc.conf
    checkpoint_test.go:115: (00.000061) Loaded kdat cache from /run/criu.kdat
    checkpoint_test.go:115: (00.000073) Hugetlb size 2 Mb is supported but cannot get dev's number
    checkpoint_test.go:115: (00.000081) Hugetlb size 1024 Mb is supported but cannot get dev's number
    checkpoint_test.go:115: (00.000391) rlimit: RLIMIT_NOFILE unlimited for self
    checkpoint_test.go:115: (00.000401) Enforcing memory tracking for pre-dump.
    checkpoint_test.go:115: (00.000403) Enforcing tasks run after pre-dump.
    checkpoint_test.go:115: (00.000428) irmap: Searching irmap cache in work dir
    checkpoint_test.go:115: (00.000437) No irmap-cache image
    checkpoint_test.go:115: (00.000440) irmap: Searching irmap cache in parent
    checkpoint_test.go:115: (00.000444) No parent images directory provided
    checkpoint_test.go:115: (00.000446) irmap: No irmap cache
    checkpoint_test.go:115: (00.000469) cpu: x86_family 25 x86_vendor_id AuthenticAMD x86_model_id AMD EPYC 7763 64-Core Processor
    checkpoint_test.go:115: (00.000476) cpu: fpu: xfeatures_mask 0x5 xsave_size 832 xsave_size_max 832 xsaves_size 832
    checkpoint_test.go:115: (00.000487) cpu: fpu: x87 floating point registers xstate_offsets 0 / 0 xstate_sizes 160 / 160
    checkpoint_test.go:115: (00.000491) cpu: fpu: AVX registers xstate_offsets 576 / 576 xstate_sizes 256 / 256
    checkpoint_test.go:115: (00.000494) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:1
    checkpoint_test.go:115: (00.000651) Detected cgroup V1 freezer
    checkpoint_test.go:115: (00.000655) freezing processes: 100000 attempts with 100 ms steps
    checkpoint_test.go:115: (00.000665) freezer.state=THAWED
    checkpoint_test.go:115: (00.000674) freezer.state=FREEZING
    checkpoint_test.go:115: (00.100754) freezer.state=FREEZING
    checkpoint_test.go:115: (00.200851) freezer.state=FREEZING
    checkpoint_test.go:115: (00.300941) freezer.state=FREEZING
    checkpoint_test.go:115: (00.401039) freezer.state=FREEZING
    checkpoint_test.go:115: (00.501138) freezer.state=FREEZING
    checkpoint_test.go:115: (00.601233) freezer.state=FREEZING
    checkpoint_test.go:115: (00.701325) freezer.state=FREEZING
    checkpoint_test.go:115: (00.801419) freezer.state=FREEZING
    checkpoint_test.go:115: (00.901518) freezer.state=FREEZING
    checkpoint_test.go:115: (01.001609) freezer.state=FREEZING
    checkpoint_test.go:115: (01.101707) freezer.state=FREEZING
    checkpoint_test.go:115: (01.201801) freezer.state=FREEZING
    checkpoint_test.go:115: (01.301898) freezer.state=FREEZING
    checkpoint_test.go:115: (01.402005) freezer.state=FREEZING
    checkpoint_test.go:115: (01.502110) freezer.state=FREEZING
    checkpoint_test.go:115: (01.602214) freezer.state=FREEZING
    checkpoint_test.go:115: (01.702327) freezer.state=FREEZING
    checkpoint_test.go:115: (01.802432) freezer.state=FREEZING
    checkpoint_test.go:115: (01.902530) freezer.state=FREEZING
    checkpoint_test.go:115: (02.002627) freezer.state=FREEZING
    checkpoint_test.go:115: (02.102735) freezer.state=FREEZING
    checkpoint_test.go:115: (02.202838) freezer.state=FREEZING
    checkpoint_test.go:115: (02.302932) freezer.state=FREEZING
    checkpoint_test.go:115: (02.403025) freezer.state=FREEZING
    checkpoint_test.go:115: (02.503113) freezer.state=FREEZING
    checkpoint_test.go:115: (02.603232) freezer.state=FREEZING
    checkpoint_test.go:115: (02.703337) freezer.state=FREEZING
    checkpoint_test.go:115: (02.803439) freezer.state=FREEZING
    checkpoint_test.go:115: (02.903534) freezer.state=FREEZING
    checkpoint_test.go:115: (03.003627) freezer.state=FREEZING
    checkpoint_test.go:115: (03.103735) freezer.state=FREEZING
    checkpoint_test.go:115: (03.203828) freezer.state=FREEZING
    checkpoint_test.go:115: (03.303924) freezer.state=FREEZING
    checkpoint_test.go:115: (03.404029) freezer.state=FREEZING
    checkpoint_test.go:115: (03.504143) freezer.state=FREEZING
    checkpoint_test.go:115: (03.604243) freezer.state=FREEZING
    checkpoint_test.go:115: (03.704340) freezer.state=FREEZING
    checkpoint_test.go:115: (03.804425) freezer.state=FREEZING
    checkpoint_test.go:115: (03.904534) freezer.state=FREEZING
    checkpoint_test.go:115: (04.004650) freezer.state=FREEZING
    checkpoint_test.go:115: (04.104787) freezer.state=FREEZING
    checkpoint_test.go:115: (04.204909) freezer.state=FREEZING
    checkpoint_test.go:115: (04.305027) freezer.state=FREEZING
    checkpoint_test.go:115: (04.405145) freezer.state=FREEZING
    checkpoint_test.go:115: (04.505259) freezer.state=FREEZING
    checkpoint_test.go:115: (04.605384) freezer.state=FREEZING
    checkpoint_test.go:115: (04.705527) freezer.state=FREEZING
    checkpoint_test.go:115: (04.805639) freezer.state=FREEZING
    checkpoint_test.go:115: (04.905750) freezer.state=FREEZING
    checkpoint_test.go:115: (05.005870) freezer.state=FREEZING
    checkpoint_test.go:115: (05.105985) freezer.state=FREEZING
    checkpoint_test.go:115: (05.206093) freezer.state=FREEZING
    checkpoint_test.go:115: (05.306197) freezer.state=FREEZING
    checkpoint_test.go:115: (05.406293) freezer.state=FREEZING
    checkpoint_test.go:115: (05.506414) freezer.state=FREEZING
    checkpoint_test.go:115: (05.606538) freezer.state=FREEZING
    checkpoint_test.go:115: (05.706664) freezer.state=FREEZING
    checkpoint_test.go:115: (05.806777) freezer.state=FREEZING
    checkpoint_test.go:115: (05.906886) freezer.state=FREEZING
    checkpoint_test.go:115: (06.006993) freezer.state=FREEZING
    checkpoint_test.go:115: (06.107105) freezer.state=FREEZING
    checkpoint_test.go:115: (06.207225) freezer.state=FREEZING
    checkpoint_test.go:115: (06.307351) freezer.state=FREEZING
    checkpoint_test.go:115: (06.407476) freezer.state=FREEZING
    checkpoint_test.go:115: (06.507600) freezer.state=FREEZING
    checkpoint_test.go:115: (06.607720) freezer.state=FREEZING
    checkpoint_test.go:115: (06.707852) freezer.state=FREEZING
    checkpoint_test.go:115: (06.807984) freezer.state=FREEZING
    checkpoint_test.go:115: (06.908105) freezer.state=FREEZING
    checkpoint_test.go:115: (07.008230) freezer.state=FREEZING
    checkpoint_test.go:115: (07.108347) freezer.state=FREEZING
    checkpoint_test.go:115: (07.208461) freezer.state=FREEZING
    checkpoint_test.go:115: (07.308576) freezer.state=FREEZING
    checkpoint_test.go:115: (07.408689) freezer.state=FREEZING
    checkpoint_test.go:115: (07.508813) freezer.state=FREEZING
    checkpoint_test.go:115: (07.608952) freezer.state=FREEZING
    checkpoint_test.go:115: (07.709072) freezer.state=FREEZING
    checkpoint_test.go:115: (07.809186) freezer.state=FREEZING
    checkpoint_test.go:115: (07.909295) freezer.state=FREEZING
    checkpoint_test.go:115: (08.009419) freezer.state=FREEZING
    checkpoint_test.go:115: (08.109523) freezer.state=FREEZING
    checkpoint_test.go:115: (08.209629) freezer.state=FREEZING
    checkpoint_test.go:115: (08.309736) freezer.state=FREEZING
    checkpoint_test.go:115: (08.409861) freezer.state=FREEZING
    checkpoint_test.go:115: (08.509985) freezer.state=FREEZING
    checkpoint_test.go:115: (08.610104) freezer.state=FREEZING
    checkpoint_test.go:115: (08.710225) freezer.state=FREEZING
    checkpoint_test.go:115: (08.810343) freezer.state=FREEZING
    checkpoint_test.go:115: (08.910458) freezer.state=FREEZING
    checkpoint_test.go:115: (09.010584) freezer.state=FREEZING
    checkpoint_test.go:115: (09.110701) freezer.state=FREEZING
    checkpoint_test.go:115: (09.210807) freezer.state=FREEZING
    checkpoint_test.go:115: (09.310927) freezer.state=FREEZING
    checkpoint_test.go:115: (09.411052) freezer.state=FREEZING
    checkpoint_test.go:115: (09.511165) freezer.state=FREEZING
    checkpoint_test.go:115: (09.611291) freezer.state=FREEZING
    checkpoint_test.go:115: (09.711398) freezer.state=FREEZING
    checkpoint_test.go:115: (09.811526) freezer.state=FREEZING
    checkpoint_test.go:115: (09.911645) freezer.state=FREEZING
    checkpoint_test.go:115: (10.000726) Error (criu/cr-dump.c:1784): Timeout reached. Try to interrupt: 0
    checkpoint_test.go:115: (10.000770) freezer.state=FREEZING
    checkpoint_test.go:115: (10.000850) Unfreezing tasks into 1
    checkpoint_test.go:115: (10.000857) Unseizing 12457 into 1
    checkpoint_test.go:115: (10.000872) Error (compel/src/lib/infect.c:418): Unable to detach from 12457: No such process
    checkpoint_test.go:115: (10.000879) Writing image inventory (version 1)
    checkpoint_test.go:115: (10.000952) Error (criu/cr-dump.c:1898): Pre-dumping FAILED.
    checkpoint_test.go:115: === END ===
    checkpoint_test.go:116: criu failed: type PRE_DUMP errno 0 log file: /tmp/TestCheckpoint1478934365/003/criu-parent/dump.log
time="2024-06-03T01:06:39Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/pids/test/integration: device or resource busy"
time="2024-06-03T01:06:39Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/blkio/test/integration: device or resource busy"
--- FAIL: TestCheckpoint (10.25s)

lifubang, Jun 03 '24 01:06

For CentOS 7, we use the somewhat dated criu v3.16 from https://copr.fedorainfracloud.org/coprs/adrian/criu-el7/builds/, while the latest criu release is v3.19 (@adrianreber might or might not want to look into that, as CentOS 7 will be EOL in a year).

For Ubuntu 20.04, we use the latest criu v3.19 (thanks @rst0git for keeping up with the builds!), but it's an older kernel (5.15), which I think might be the reason the cgroup freezer fails. Maybe @avagin can shed some light on why simple checkpointing might fail during freeze.

kolyshkin, Jun 04 '24 00:06

I would not worry about CentOS 7. It goes EOL at the end of June 2024. Just disable it. The CentOS 7 kernel never really supported everything, and CRIU support was always a tech preview. Newer versions of CRIU probably do not even build on CentOS 7, as we removed Python 2 support from CRIU. You can also disable the CentOS Stream 8 based tests; that went EOL at the end of May 2024.
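For anyone who prefers skipping the tests at runtime rather than dropping the CI job, here is a minimal Go sketch that detects CentOS 7 from os-release text. The helper names `parseOSRelease` and `isCentOS7` are made up for this example and are not part of runc; the sketch assumes the usual `ID` and `VERSION_ID` fields of `/etc/os-release`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseOSRelease extracts KEY=value pairs from os-release formatted text,
// dropping surrounding quotes from the values.
func parseOSRelease(content string) map[string]string {
	fields := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		fields[key] = strings.Trim(value, `"'`)
	}
	return fields
}

// isCentOS7 reports whether the os-release text describes CentOS 7.
func isCentOS7(content string) bool {
	f := parseOSRelease(content)
	return f["ID"] == "centos" && f["VERSION_ID"] == "7"
}

func main() {
	sample := "NAME=\"CentOS Linux\"\nID=\"centos\"\nVERSION_ID=\"7\"\n"
	fmt.Println(isCentOS7(sample))
}
```

In a test, one could read `/etc/os-release` with `os.ReadFile` and call `t.Skip` with a short reason when `isCentOS7` returns true.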

adrianreber, Jun 04 '24 06:06

> In general, cgroup freezer is not very reliable, I previously had to implement some hacks in runc to work around it (see https://github.com/opencontainers/runc/pull/2941 and the earlier PRs linked from there).

@kolyshkin Would it make sense to use a similar approach in freeze_processes()?

rst0git, Jun 04 '24 08:06

> @kolyshkin Would it make sense to use a similar approach in freeze_processes()?

Alas, with all that jazz it still fails sometimes, and people suggest even longer delays (see e.g. https://github.com/opencontainers/runc/pull/4388). The question is where to draw the line: how many attempts are enough?

kolyshkin, Oct 28 '24 22:10

> In general, cgroup freezer is not very reliable, I previously had to implement some hacks in runc to work around it (see #2941 and the earlier PRs linked from there).

> @kolyshkin Would it make sense to use a similar approach in freeze_processes()?

I've decided to go ahead with this: https://github.com/checkpoint-restore/criu/pull/2545

kolyshkin, Dec 13 '24 01:12

Should be fixed in criu v4.1 (unless they backport checkpoint-restore/criu#2545 to a 4.0.x release).

kolyshkin, Jan 16 '25 05:01