unable to create pod cgroup: slice was already loaded or has a fragment file
[ Copy of https://github.com/containers/crun/issues/1560 ]
This is now the number one flake in parallel-podman-testing land. It is not manifesting in actual CI, only on my f40 laptop, and it's preventing me from parallelizing 250-systemd.bats:
not ok 204 |250| podman generate - systemd template no support for pod in 11179ms
# tags: ci:parallel
# (from function `bail-now' in file test/system/helpers.bash, line 192,
# from function `die' in file test/system/helpers.bash, line 969,
# from function `run_podman' in file test/system/helpers.bash, line 571,
# in test file test/system/250-systemd.bats, line 264)
# `run_podman pod create --name $podname' failed
#
# [14:04:02.496829054] $ bin/podman pod create --name p-t204-pqf7zzmu
# [14:04:12.662179170] Error: unable to create pod cgroup for pod b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8: creating cgroup user.slice/user-14904.slice/[email protected]/user.slice/user-libpod_pod_b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8.slice: Unit user-libpod_pod_b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8.slice was already loaded or has a fragment file.
# [14:04:12.664998385] [ rc=125 (** EXPECTED 0 **) ]
The trigger is enabling parallel tests in 250-systemd.bats. It reproduces very quickly (80-90%) with file_tags=ci:parallel, but also reproduces (~40%) if I just set test_tags on the envar or systemd-template tests. I had never seen this failure before adding tags to 250.bats, and have never seen it in any of the runs where I've removed the parallel tags from 250.bats. It is possible that service_setup() (which runs a bunch of systemctls) is to blame, but I am baffled as to how.
Kagi search finds https://github.com/containers/crun/issues/1138 but that's OOM-related and I'm pretty sure nothing is OOMing.
Workaround is easy, don't parallelize 250.bats.
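For reference, the file_tags/test_tags toggles mentioned above are just comment-based bats-core tags; a sketch of the two variants (test name abbreviated, exact placement illustrative rather than a verified diff of 250-systemd.bats):

```bats
# In test/system/250-systemd.bats

# File-wide: every test in this file may run in parallel.
# bats file_tags=ci:parallel

# ...or per-test, on just the one test being enabled:
# bats test_tags=ci:parallel
@test "podman generate - systemd template" {
    # ...
}
```

The workaround is simply to leave these tags out of 250-systemd.bats until the fragment flake is understood.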
Just started looking into this. Is it safe to run multiple service_setup/service_cleanup in parallel?
There's one part that I'm suspicious of and need to fix: the global SERVICE_NAME. I am working on a fix for that. However, that should only affect multiple same-file tests running at once. The fragment flake happens even if only one test from this file is parallel-enabled.
The bug reproduces even with the most carefulest parallel-safe-paranoia I can write. And, still, even with only one test in the 250 file enabled.
Based on a tip from the interwebz I ran systemctl --user reset-failed and the problem went away. Then it came back this week, and systemctl --user list-units --failed showed a ton of results, and it turns out our system tests are leaking some sort of systemd cruft. Lots of leaks in quadlet, a few in systemd tests, and some that I can't figure out in healthcheck tests. I've got a patch for the first two in my pet parallel PR, will test and submit one of these days.
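The diagnostic above can be scripted: list failed user units, pick out the ones matching the leaked pod-slice naming seen in the error message, and reset-failed each. A minimal sketch; the `user-libpod_pod_<hex>.slice` pattern is an assumption based on the error text above, and the `systemctl` invocation is shown commented out since it needs a live systemd user session:

```shell
#!/usr/bin/env bash
# Sketch: clear leaked libpod pod slices from the list of failed user units.

# Filter unit names matching the libpod pod-slice pattern (an assumption
# based on the "was already loaded or has a fragment file" error above)
# from `systemctl --user list-units --failed --plain --no-legend` output.
filter_libpod_slices() {
    awk '$1 ~ /^user-libpod_pod_[0-9a-f]+\.slice$/ {print $1}'
}

# Real invocation (commented out; requires a systemd user session):
# systemctl --user list-units --failed --plain --no-legend \
#     | filter_libpod_slices \
#     | xargs -r -n1 systemctl --user reset-failed

# Demo on sample output (unit names are made up for illustration):
printf '%s\n' \
    'user-libpod_pod_b88d54dc0463.slice loaded failed failed cgroup' \
    'dbus.service loaded failed failed D-Bus message bus' \
    | filter_libpod_slices
```

A blanket `systemctl --user reset-failed` (no arguments) also works, as noted above, but clears every failed unit rather than just the leaked pod slices.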