CI: runc cgroup error msg flake

Open Luap99 opened this issue 6 months ago • 1 comments

[+0588s] not ok 561 copy-file-relative-context-dir
[+0588s] # (from function `expect_line_count' in file ./helpers.bash, line 591,
[+0588s] #  in test file ./copy.bats, line 597)
[+0588s] #   `expect_line_count 1' failed
[+0588s] # /var/tmp/go/src/github.com/containers/buildah/tests /var/tmp/go/src/github.com/containers/buildah/tests
[+0588s] # # [checking for: docker.io/library/busybox]
[+0588s] # # [restoring from cache: /tmp/bats-run-lPAMEB/suite/buildah-image-cache / docker.io/library/busybox]
[+0588s] # Getting image source signatures
[+0588s] # Copying blob sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572
[+0588s] # Copying config sha256:f0b02e9d092d905d0d87a8455a1ae3e9bb47b4aa3dc125125ca5cd10d6441c9f
[+0588s] # Writing manifest to image destination
[+0588s] # # /var/tmp/go/src/github.com/containers/buildah/tests/./../bin/buildah from --quiet --signature-policy /var/tmp/go/src/github.com/containers/buildah/tests/./policy.json busybox
[+0588s] # busybox-working-container
[+0588s] # # /var/tmp/go/src/github.com/containers/buildah/tests/./../bin/buildah copy --contextdir /tmp/buildah_tests.ishht1/context busybox-working-container test_file /opt/
[+0588s] # 42145a076d5e72262bea80733dac7785067a42f76c7b45a5a047f70e4403440f
[+0588s] # # /var/tmp/go/src/github.com/containers/buildah/tests/./../bin/buildah run busybox-working-container ls -1 /opt/
[+0588s] # test_file
[+0588s] # time="2025-06-04T04:16:33-05:00" level=error msg="seek /sys/fs/cgroup/system.slice/runc-buildah-buildah4055347767.scope/cgroup.freeze: no such device"
[+0588s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0588s] # #| FAIL: buildah run busybox-working-container ls -1 /opt/
[+0588s] # #| Expected 1 lines of output, got 2
[+0588s] # #| Output was:
[+0588s] # #| >test_file
[+0588s] # #| >time="2025-06-04T04:16:33-05:00" level=error msg="seek /sys/fs/cgroup/system.slice/runc-buildah-buildah4055347767.scope/cgroup.freeze: no such device"
[+0588s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

https://api.cirrus-ci.com/v1/task/5992802590392320/logs/integration_test.log

Luap99 avatar Jun 04 '25 09:06 Luap99

That one's cropping up frequently.

nalind avatar Jun 04 '25 12:06 nalind

This is a really bad flake. I have rerun the test on my PR 5 times now and it is still failing.

time="2025-06-04T04:16:33-05:00" level=error msg="seek /sys/fs/cgroup/system.slice/runc-buildah-buildah4055347767.scope/cgroup.freeze: no such device"

Looking at the code, I suspect that the root cause is the cgroup read here: https://github.com/opencontainers/cgroups/blob/b970779131d3e4540132ccfb16dc49890491f8d5/fs2/freezer.go#L53-L71

I guess the issue is that the cgroup was deleted after the open but before the seek call. If the open path doesn't treat a missing cgroup as an error, then maybe the seek/read shouldn't fail on ENODEV either. I'm not sure why the code seeks at all: we just opened the file, so that seems like an unnecessary syscall.

@kolyshkin Any chance you could have a look at this? crun seems to handle this per https://github.com/containers/crun/pull/539/commits/c6bd3143e2434b4ee4163e37045595bb6298090c, which I believe is why we don't see the issue there.
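To make the suspected race concrete, here is a minimal Go sketch of the tolerant behavior suggested above: treat ENODEV from an operation on the already-open `cgroup.freeze` file the same way as a missing file at open time. The names `readFreezeState` and `cgroupGone` are hypothetical and are not the opencontainers/cgroups API; the sketch also skips the seek, and it is not the actual upstream fix.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"strings"
	"syscall"
)

// cgroupGone reports whether err indicates that the cgroup no longer exists.
// ENOENT is what opening a removed path returns; ENODEV is the error seen in
// the log above for an operation on an already-open cgroup.freeze file.
func cgroupGone(err error) bool {
	return errors.Is(err, syscall.ENOENT) || errors.Is(err, syscall.ENODEV)
}

// readFreezeState is a hypothetical helper (not the upstream API). It reads
// the contents of a cgroup.freeze file and reports gone=true, with no error,
// when the cgroup disappears at any point between open and read.
func readFreezeState(path string) (state string, gone bool, err error) {
	f, err := os.Open(path)
	if err != nil {
		if cgroupGone(err) {
			return "", true, nil
		}
		return "", false, err
	}
	defer f.Close()

	// No seek: the file was just opened, so the offset is already 0.
	buf := make([]byte, 8)
	n, err := f.Read(buf)
	if err != nil && !errors.Is(err, io.EOF) {
		if cgroupGone(err) {
			// The cgroup was removed after open: same outcome as ENOENT.
			return "", true, nil
		}
		return "", false, err
	}
	return strings.TrimSpace(string(buf[:n])), false, nil
}

func main() {
	// Illustrative path only; substitute a real scope on a cgroup v2 host.
	state, gone, err := readFreezeState("/sys/fs/cgroup/example.scope/cgroup.freeze")
	fmt.Printf("state=%q gone=%v err=%v\n", state, gone, err)
}
```

A caller that only wants to know whether a container is frozen can then treat `gone` as "nothing left to freeze" instead of logging an error.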

Luap99 avatar Jun 30 '25 16:06 Luap99

Looks like it is being tracked in https://github.com/opencontainers/runc/issues/4798

giuseppe avatar Jul 03 '25 08:07 giuseppe

@nalind This is flaking on basically every buildah PR I look at. Should we revert the runc testing here, at least until this flake gets sorted out in runc?

Luap99 avatar Jul 14 '25 16:07 Luap99

Yeah, I don't have knowledge of when it's going to be resolved.

nalind avatar Jul 14 '25 18:07 nalind

Being fixed by https://github.com/opencontainers/cgroups/pull/25, will do my best to fast-track it

kolyshkin avatar Jul 14 '25 19:07 kolyshkin

#6286 proposes moving those test tasks to their own non-blocking groups in the meantime.

nalind avatar Jul 14 '25 20:07 nalind

runc pr: https://github.com/opencontainers/runc/pull/4805

kolyshkin avatar Jul 14 '25 21:07 kolyshkin

This is now merged into runc main via https://github.com/opencontainers/runc/pull/4808. Guess this one can be closed if there are no new failures?

kolyshkin avatar Jul 15 '25 19:07 kolyshkin

@kolyshkin Thanks, but we need to wait until runc is released with this fix and then put into the distribution packages so we can update them in our CI env. As long as it still flakes in CI we should keep this open.

Luap99 avatar Jul 21 '25 09:07 Luap99

Sorry, I thought you were testing runc git HEAD. Will try to fix it in all supported branches.

kolyshkin avatar Jul 23 '25 06:07 kolyshkin

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Aug 23 '25 00:08 github-actions[bot]