
podman stop: Unable to clean up network: unmounting network namespace: device or resource busy

Open edsantiago opened this issue 2 years ago • 39 comments

Seeing this quite often in my PR with #18442 cherrypicked:

? Enter [AfterEach] TOP-LEVEL
$ podman [options] stop --all -t 0
time="2023-08-22T18:28:28-05:00" level=error msg="Unable to clean up network for container XYZ:
    "unmounting network namespace for container XYZ:
        failed to remove ns path /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365:
            remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy\""

Example: int f38 rootless.

It almost always happens together with the "Storage ... removed" flake (#19702), e.g.:

So I mostly file under that issue, because my flake tool has no provision for multiple buckets.

No pattern yet (that I can see) in when it fails, which tests, etc.

edsantiago avatar Aug 23 '23 16:08 edsantiago

EBUSY should only be the error if the netns is still mounted. However, we unmount directly before the remove call, so that should not cause problems.

The umount call uses MNT_DETACH, so it may not actually unmount right away. I have no idea why this flag is used or how it interacts with the nsfs mount points.
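
For context, the cleanup path being discussed boils down to roughly the following sequence. This is a simplified sketch in Go, not the actual containers/common code; the function name and error wrapping are illustrative.

package netnssketch

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// unmountAndRemove sketches the netns cleanup sequence: lazily detach the
// bind mount, then unlink the mount-point file. The EBUSY in this flake
// comes from the remove step, which should only be possible if the path
// is somehow still a mount point.
func unmountAndRemove(nsPath string) error {
	// MNT_DETACH is a lazy unmount: the mount is detached from the mount
	// table immediately, but resources are released only once nothing is
	// using them anymore.
	if err := unix.Unmount(nsPath, unix.MNT_DETACH); err != nil {
		return fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
	}
	if err := os.Remove(nsPath); err != nil {
		return fmt.Errorf("failed to remove ns path: %w", err)
	}
	return nil
}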

Luap99 avatar Aug 24 '23 10:08 Luap99

Seeing this a lot, but again, usually with #19702 ("Storage ... removed") and I can't dual-assign flakes. So since that one already has a ton of sample failures, I'm going to start assigning flakes here. Maybe alternating. Here are last night's failures:

  • fedora-37 : int podman fedora-37 root host sqlite
    • 08-23 23:14 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
  • fedora-38 : int podman fedora-38 root host sqlite
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with DirectoryOrCreate HostPath type volume
  • fedora-38 : int podman fedora-38 rootless host boltdb
    • 08-22 19:57 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • rawhide : int podman rawhide rootless host sqlite
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options

One of those also includes this fun set of errors; does this help?

time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to get ips for container ID 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="tearing down network namespace configuration for container 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce: netavark: open container netns: open /run/netns/netns-0de355c1-cf59-05e0-c2cd-279f59a296f3: IO error: No such file or directory (os error 2)"

edsantiago avatar Aug 24 '23 11:08 edsantiago

I've got a little more data. There seem to be two not-quite-identical failure modes, depending on root or rootless:

f38 rootless:

Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to remove ns path /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365:
    remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy

f38 root:

IPAM error: failed to get ips for container ID XYZ on network podman
IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman
tearing down network namespace configuration for container XYZ:
    netavark: open container netns:
    open /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: IO error: No such file or directory (os error 2)
Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to unmount NS: at /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: no such file or directory

That is:

  1. root has two IPAM errors
  2. root has a "tearing down ... open container netns" error
  3. rootless says "failed to remove ns path ... EBUSY", root says "failed to unmount NS ... ENOENT"

They're too similar for me to split this into two separate issues, but I'll listen to the opinion of experts.

HTH.

edsantiago avatar Aug 24 '23 16:08 edsantiago

Do you think you can remove the MNT_DETACH?

rhatdan avatar Aug 24 '23 18:08 rhatdan

I have no idea why it was added in the first place; maybe it is needed?

Git blame goes all the way back to 8c52aa15f0e4927c0e570102efaa34dbe93d6156, which claims it needs MNT_DETACH but provides no explanation at all as to why. @mheon This code was forked from cni upstream, and that one never used it, so...


f38 root:

IPAM error: failed to get ips for container ID XYZ on network podman
IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman
tearing down network namespace configuration for container XYZ:
    netavark: open container netns:
    open /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: IO error: No such file or directory (os error 2)
Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to unmount NS: at /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: no such file or directory

That is:

1. root has two IPAM errors

2. root has a "tearing down ... open container netns" error

The root issue must be something else entirely; the symptoms look like we try to clean up twice.

Luap99 avatar Aug 25 '23 09:08 Luap99

WHEW! After much suffering, I removed MNT_DETACH. That causes absolutely everything to fail hard, even system tests, which so far have been immune to this flake.

edsantiago avatar Aug 25 '23 22:08 edsantiago

I think I originally added the MNT_DETACH flag because we were seeing intermittent failures during cleanup due to the namespace still being in use, and I was expecting that making the unmount lazy would resolve things.

mheon avatar Aug 26 '23 01:08 mheon

I'm giving up on this: I am pulling the stderr-on-teardown checks from my flake-check PR. It's too much, costing me way too much time between this and #19702. Until these two are fixed, I can't justify the time it takes me to sort through these flakes.

FWIW, here is the catalog so far:

  • fedora-37 : int podman fedora-37 root container sqlite
    • 08-28 08:57 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-27 09:08 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-26 18:25 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-24 11:23 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-24 11:23 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • fedora-37 : int podman fedora-37 root host sqlite
    • 08-26 18:25 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-24 11:25 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-23 23:14 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 08-28 21:50 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test HostAliases
    • 08-28 13:20 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-28 11:35 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-28 08:56 in TOP-LEVEL [AfterEach] Podman start podman start single container by id
    • 08-24 19:02 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-24 19:02 in TOP-LEVEL [AfterEach] Podman kube generate podman generate kube sharing pid namespace
    • 08-24 11:21 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • fedora-38 : int podman fedora-38 root container sqlite
    • 08-24 19:01 in TOP-LEVEL [AfterEach] Podman exec podman exec preserves container groups with --user and --group-add
  • fedora-38 : int podman fedora-38 root host sqlite
    • 08-28 21:52 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 08-28 13:22 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-28 11:35 in TOP-LEVEL [AfterEach] Podman play kube podman play kube replace non-existing pod
    • 08-28 11:35 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-27 09:05 in TOP-LEVEL [AfterEach] Podman container clone podman container clone basic test
    • 08-24 19:01 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-24 11:23 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with DirectoryOrCreate HostPath type volume
  • fedora-38 : int podman fedora-38 rootless host boltdb
    • 08-22 19:57 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 08-27 09:05 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • rawhide : int podman rawhide root host sqlite
    • 08-28 21:55 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-28 13:19 in TOP-LEVEL [AfterEach] Podman start podman container start single container by short id
    • 08-28 13:19 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-28 08:53 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-27 21:41 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with customized hostname
    • 08-26 22:23 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test optional env value from missing configmap
    • 08-24 19:00 in TOP-LEVEL [AfterEach] Podman play kube podman play kube --no-host
  • rawhide : int podman rawhide rootless host sqlite
    • 08-28 11:36 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-27 21:40 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 08-26 20:35 in TOP-LEVEL [AfterEach] Podman play kube podman play kube multi doc yaml with multiple services, pods and deployments
    • 08-26 20:35 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-24 19:00 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options

Seen in: int podman fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite

edsantiago avatar Aug 29 '23 12:08 edsantiago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Sep 29 '23 00:09 github-actions[bot]

  • fedora-37 : int podman fedora-37 root container sqlite
    • 08-28-2023 08:57 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-27-2023 09:08 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-26-2023 18:25 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-24-2023 11:23 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-24-2023 11:23 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • fedora-37 : int podman fedora-37 root host sqlite
    • 08-26-2023 18:25 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-24-2023 11:25 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-23-2023 23:14 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 08-28-2023 21:50 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test HostAliases
    • 08-28-2023 13:20 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-28-2023 11:35 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-28-2023 08:56 in TOP-LEVEL [AfterEach] Podman start podman start single container by id
    • 08-24-2023 19:02 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-24-2023 19:02 in TOP-LEVEL [AfterEach] Podman kube generate podman generate kube sharing pid namespace
    • 08-24-2023 11:21 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • fedora-38 : int podman fedora-38 root container sqlite
    • 08-24-2023 19:01 in TOP-LEVEL [AfterEach] Podman exec podman exec preserves container groups with --user and --group-add
  • fedora-38 : int podman fedora-38 root host sqlite
    • 08-28-2023 21:52 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 08-28-2023 13:22 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-28-2023 11:35 in TOP-LEVEL [AfterEach] Podman play kube podman play kube replace non-existing pod
    • 08-28-2023 11:35 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-27-2023 09:05 in TOP-LEVEL [AfterEach] Podman container clone podman container clone basic test
    • 08-24-2023 19:01 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-24-2023 11:23 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 08-23-2023 23:11 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with DirectoryOrCreate HostPath type volume
  • fedora-38 : int podman fedora-38 rootless host boltdb
    • PR #17831
      • 08-22-2023 19:57 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 08-27-2023 09:05 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • rawhide : int podman rawhide root host sqlite
    • 08-28-2023 21:55 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-28-2023 13:19 in TOP-LEVEL [AfterEach] Podman start podman container start single container by short id
    • 08-28-2023 13:19 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-28-2023 08:53 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-27-2023 21:41 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with customized hostname
    • 08-26-2023 22:23 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test optional env value from missing configmap
    • 08-24-2023 19:00 in TOP-LEVEL [AfterEach] Podman play kube podman play kube --no-host
  • rawhide : int podman rawhide rootless host sqlite
    • 08-28-2023 11:36 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-27-2023 21:40 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 08-26-2023 20:35 in TOP-LEVEL [AfterEach] Podman play kube podman play kube multi doc yaml with multiple services, pods and deployments
    • 08-26-2023 20:35 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-24-2023 19:00 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23-2023 23:11 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23-2023 23:11 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
  • rawhide : sys podman rawhide rootless host sqlite

Seen in: int/sys fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite

edsantiago avatar Sep 29 '23 00:09 edsantiago

I see different errors posted here that are not the same!

The original report says: remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy\""

Just to confirm: I looked at ip netns, which also does a simple bind mount for the netns paths; to delete, it likewise simply calls umount with MNT_DETACH followed by unlink(), which is exactly what our code does as well.

https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/ip/ipnetns.c#n735 https://github.com/containers/common/blob/6856d56252121a665cb820777982cc3f61f815af/pkg/netns/netns_linux.go#L191-L197

So this is not something I can understand: the unlink should not fail with EBUSY if the path is unmounted.
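
One way to sanity-check that would be to statfs the path right before the unlink and see whether it still reports nsfs. A hypothetical diagnostic sketch, continuing the earlier one (isStillNetnsMount is not a real helper; unix.NSFS_MAGIC is the nsfs filesystem magic number):

// Hypothetical diagnostic: reports whether path is still backed by an nsfs
// mount. After a successful unmount, Statfs should report the parent tmpfs
// (e.g. /run/user/$UID/netns), not nsfs, and the unlink should succeed.
func isStillNetnsMount(path string) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	return st.Type == unix.NSFS_MAGIC, nil
}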

But I also see: failed to unmount NS: at /run/user/3138/netns/netns-fa758949-40a2-a5e2-2226-5523b2a4c0e7: no such file or directory. These are two different things. "No such file or directory" means we are trying to clean up again after something else has already cleaned it up. This matches the other error that is logged when this happens: "Storage for container 86154f69fa1feff30896274cc265e37fa78745adb1bd2778927053f8bbe7be36 has been removed".

The same goes for errors like this:

time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to get ips for container ID 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="tearing down network namespace configuration for container 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce: netavark: open container netns: open /run/netns/netns-0de355c1-cf59-05e0-c2cd-279f59a296f3: IO error: No such file or directory (os error 2)"

Here there must be some way that we, for whatever reason, end up in the cleanup path twice.
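
To illustrate the double-cleanup theory: running the same unmount/remove sequence a second time against an already-cleaned path yields exactly the ENOENT messages quoted above. A sketch continuing the hypothetical unmountAndRemove from earlier (umount(2) on a nonexistent path returns ENOENT):

// Sketch: a second cleanup pass over an already-removed netns path.
func cleanupTwice(nsPath string) {
	_ = unmountAndRemove(nsPath) // first pass succeeds, the path is gone
	if err := unmountAndRemove(nsPath); err != nil {
		// second pass: "failed to unmount NS: at <path>: no such file or directory"
		fmt.Println("second cleanup:", err)
	}
}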

Luap99 avatar Sep 29 '23 12:09 Luap99

I do see the device or resource busy error every now and then. Oddly, just recently it seemed to clear itself up, eventually. When I podman unshare, I cannot umount the file, nor remove it: mount reports not mounted; rm reports device or resource busy. However, I believe that from outside the namespace I can delete the file. lsns --type=net does not show any corresponding nsfs entry. Quite odd indeed.

mhoran avatar Sep 29 '23 15:09 mhoran

I see different errors posted here that are not the same!

That is my fault, and I'm sorry. When scanning flakes to assign them to buckets, I look for common patterns but don't always compare everything exactly. I will be more careful.

edsantiago avatar Oct 02 '23 11:10 edsantiago

New flake. Not quite the same error message, but similar enough that I'm assigning here. From f39 rootless:

[+1802s] not ok 599 podman kube --network
...
<+010ms> # $ podman pod rm -t 0 -f test_pod
<+197ms> # time="2023-10-25T08:47:25-05:00" level=error msg="Unable to clean up network for container [infractrid]: \"unmounting network namespace for container [infractrid]: failed to unmount NS: at /run/user/6452/netns/netns-whatever: no such file or directory\""
         # 6b2f9b7e780b9986feb72921c5d8d059687b831537aa1a14d0456b3d67b3d3ee
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

edsantiago avatar Oct 25 '23 14:10 edsantiago

Another ENOENT, in f39 rootless:

[+1763s] not ok 607 [700] podman kube play --replace external storage
...
<+011ms> # $ podman stop -a -t 0
<+451ms> # 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec
         # b48e32245140e2fbe17564503b67b1ff5c1263e2fd4aa3fcdea8afde28c5a626
         #
<+011ms> # $ podman pod rm -t 0 -f test_pod
<+190ms> # time="2023-11-10T13:36:19-06:00" level=error msg="getting rootless network namespace: failed to Statfs \"/run/user/2815/netns/rootless-netns-bfe0fe1f8f170aff795c\": no such file or directory"
         # time="2023-11-10T13:36:19-06:00" level=error msg="Unable to clean up network for container 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec: \"unmounting network namespace for container 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec: failed to unmount NS: at /run/user/2815/netns/netns-6ae41715-c698-4a92-8658-020349c94f6f: no such file or directory\""
         # 19faf1a389d250f9a4a71fcdae449d6f2c38a89e2e51acc9bde5ec912db1093f
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

edsantiago avatar Nov 13 '23 20:11 edsantiago

The latest list. Note that some of these are ENOENT, and some are EBUSY. Until given a reason to treat these as different bugs, I will continue to lump them together.

  • fedora-37 : int podman fedora-37 rootless host sqlite
  • fedora-38 : int podman fedora-38 root host boltdb
    • 12-05 10:02 in Podman stop podman stop --ignore bogus container
    • 11-30 19:39 in Podman stop podman stop --ignore bogus container
    • 11-29 20:10 in Podman stop podman stop --ignore bogus container
    • 11-28 15:37 in Podman pod stop podman pod stop single pod by name
    • 11-06-2023 11:53 in Podman stop podman stop container by id
    • 11-06-2023 11:53 in Podman stop podman stop --ignore bogus container
  • fedora-38 : int podman fedora-38 root host sqlite
  • fedora-38 : int podman fedora-38 rootless host boltdb
  • fedora-38 : int podman fedora-38 rootless host sqlite
  • fedora-38 : sys podman fedora-38 root host boltdb
    • PR #20750
      • 11-22 15:32 in [sys] [050] podman stop print IDs or raw input
  • fedora-39 : int podman fedora-39 root host sqlite
  • fedora-39 : int podman fedora-39 rootless host boltdb
  • fedora-39 : int podman fedora-39 rootless host sqlite
  • fedora-39β : int podman fedora-39β root host boltdb
  • rawhide : int podman rawhide root host sqlite
    • 11-06-2023 07:35 in Podman stop podman stop --ignore bogus container
    • 09-27-2023 21:53 in Podman stop podman stop --ignore bogus container
    • 09-27-2023 15:38 in Podman prune podman system prune with running, exited pod and volume prune set true
  • rawhide : int podman rawhide rootless host sqlite
Flake counts by category:
  • test type: int(29) sys(1)
  • tool: podman(30)
  • distro: fedora-38(20) rawhide(4) fedora-39(3) fedora-39β(2) fedora-37(1)
  • root/rootless: root(19) rootless(11)
  • container/host: host(30)
  • database: boltdb(16) sqlite(14)

edsantiago avatar Dec 06 '23 20:12 edsantiago

The root issue must be something else entirely; the symptoms look like we try to clean up twice.

In case that provides any clues: I am observing similar behaviour when the container is configured with restart policy unless-stopped.

piotr-kubiak avatar Feb 04 '24 23:02 piotr-kubiak

With #18442, this is now blowing up. Different error messages, but I'm pretty sure it's all the same bug.

New EACCES variant:

# podman [options] stop --all -t 0
time="2024-03-04T06:21:40-06:00" level=error msg="Unable to clean up network for container 4f387d5ee492fa77fa287eadbbdb6725aa4e24b879de087cd6d89b6f59014e84: \"netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\""

and the old ENOENT variant:

# podman [options] stop --all -t 0
time="2024-03-04T06:26:19-06:00" level=error msg="IPAM error: failed to get ips for container ID f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe on network podman"
time="2024-03-04T06:26:19-06:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2024-03-04T06:26:19-06:00" level=error msg="netavark: open container netns: open /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: IO error: No such file or directory (os error 2)"
time="2024-03-04T06:26:19-06:00" level=error msg="Unable to clean up network for container f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe: \"unmounting network namespace for container f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe: failed to remove ns path: remove /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: no such file or directory, failed to unmount NS: at /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: no such file or directory\""

...and an ENOENT variant with a shorter error message:

$ podman [options] stop --all -t 0
time="2024-03-04T06:27:52-06:00" level=error msg="Unable to clean up network for container 52fd74d45cfc07d19cd49cbda438f9e57f07b37bc80627064cffb1bdc02461ad: \"unmounting network namespace for container 52fd74d45cfc07d19cd49cbda438f9e57f07b37bc80627064cffb1bdc02461ad: failed to remove ns path: remove /run/user/2221/netns/netns-db173238-0a50-83c3-bb71-ca4e7d30a28b: no such file or directory, failed to unmount NS: at /run/user/2221/netns/netns-db173238-0a50-83c3-bb71-ca4e7d30a28b: no such file or directory\""

Here are today's failures, plus one from January:

  • fedora-39 : int podman fedora-39 root host sqlite
    • 03-04 07:36 in TOP-LEVEL [AfterEach] Podman kube play with auto update annotations for first container only
  • fedora-39 : sys podman fedora-39 rootless host sqlite
    • PR #21331
  • rawhide : int podman rawhide root host sqlite
    • 03-04 07:35 in TOP-LEVEL [AfterEach] Podman start podman container start single container by id
  • rawhide : int podman rawhide rootless host sqlite
    • 03-04 07:35 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
Flake counts by category:
  • test type: int(3) sys(1)
  • tool: podman(4)
  • distro: fedora-39(2) rawhide(2)
  • root/rootless: root(2) rootless(2)
  • container/host: host(4)
  • database: sqlite(4)

edsantiago avatar Mar 04 '24 13:03 edsantiago

Still happening with brand-new (March 19) VMs. Three failures in just one CI run:

int podman fedora-38 root host boltdb

int podman fedora-39 rootless host sqlite

int podman rawhide root host sqlite

edsantiago avatar Mar 19 '24 16:03 edsantiago

Is this (f39 rootless) the same error???

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" \
    level=error \
    msg="Unable to clean up network for container SHA: \
        \"1 error occurred:\\n
        \\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

I'm going to assume it is, and file it as such, unless told otherwise.

edsantiago avatar Mar 20 '24 17:03 edsantiago

Is this (f39 rootless) the same error???

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" \
    level=error \
    msg="Unable to clean up network for container SHA: \
        \"1 error occurred:\\n
        \\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

I'm going to assume it is, and file it as such, unless told otherwise.

No, that is most likely something different.

Luap99 avatar Mar 20 '24 17:03 Luap99

My periodic ping. This seems to be happening a lot more with recent VMs:

  • fedora-38 : int podman fedora-38 rootless host boltdb
    • 04-02 12:16 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 03-25 16:49 in TOP-LEVEL [AfterEach] Podman kube play with TerminationGracePeriodSeconds set
  • fedora-39 : int podman fedora-39 root host sqlite
    • 04-01 15:46 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 03-25 17:55 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 03-20 13:40 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 03-19 17:53 in TOP-LEVEL [AfterEach] Podman start podman start container --filter
  • fedora-39 : int podman fedora-39 rootless host sqlite
    • 03-19 23:26 in TOP-LEVEL [AfterEach] Podman kube play test with reserved PublishAll annotation in yaml
    • 03-19 16:19 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • rawhide : int podman rawhide root host sqlite
    • 04-01 13:31 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 03-26 11:41 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 03-23 21:48 in TOP-LEVEL [AfterEach] Podman kube play with auto update annotations for first container only
    • 03-23 21:48 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source container in pod
  • rawhide : int podman rawhide rootless host sqlite
    • 04-01 15:46 in TOP-LEVEL [AfterEach] Podman kube play with image data
    • 04-01 13:27 in TOP-LEVEL [AfterEach] Podman kube play test with reserved privileged annotation in yaml
    • 03-20 13:37 in TOP-LEVEL [AfterEach] Podman kube play replace
    • 03-20 10:22 in TOP-LEVEL [AfterEach] Podman kube play override with udp should keep tcp from YAML file
    • 03-19 17:39 in TOP-LEVEL [AfterEach] Podman kube play cap add
    • 03-19 17:39 in TOP-LEVEL [AfterEach] Podman kube play override with tcp should keep udp from YAML file
Flake counts by category:
  • test type: int(18)
  • tool: podman(18)
  • distro: rawhide(10) fedora-39(6) fedora-38(2)
  • root/rootless: rootless(10) root(8)
  • container/host: host(18)
  • database: sqlite(16) boltdb(2)

edsantiago avatar Apr 02 '24 19:04 edsantiago

Where are we on this one? I just saw this failure, f40 root, on VMs from https://github.com/containers/automation_images/pull/349 with netavark 1.10.3-3:

# podman [options] stop --all -t 0
time="2024-05-08T08:40:48-05:00" level=error msg="IPAM error: failed to get ips for container ID 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69 on network podman"
time="2024-05-08T08:40:48-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2024-05-08T08:40:48-05:00" level=error msg="netavark: open container netns: open /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: IO error: No such file or directory (os error 2)"
time="2024-05-08T08:40:48-05:00" level=error msg="Unable to clean up network for container 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69: \"unmounting network namespace for container 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69: failed to remove ns path: remove /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: no such file or directory, failed to unmount NS: at /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: no such file or directory\""

(error in this one is ENOENT, not EBUSY)

edsantiago avatar May 08 '24 15:05 edsantiago

FYI: the reason you see this more is that I enabled the warnings check in AfterEach(): https://github.com/containers/podman/pull/18442

So previously we just didn't see them. In the logs above they all failed in AfterEach; as mentioned before and in other issues, the problem is that something tries to clean up twice, but I cannot see why or where this would be happening.
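
For readers who have not seen it, the check is conceptually an AfterEach hook that fails the test when teardown commands print anything to stderr. A rough sketch, not the actual #18442 code, assuming the usual e2e helpers (podmanTest.Podman, WaitWithDefaultTimeout, ErrorToString):

// Rough sketch of a teardown-warning check inside a Ginkgo e2e test file;
// details of the real #18442 implementation may differ.
AfterEach(func() {
	stop := podmanTest.Podman([]string{"stop", "--all", "-t", "0"})
	stop.WaitWithDefaultTimeout()
	// Any level=error or level=warning output during teardown fails the
	// test, which is what turns these cleanup races into visible flakes.
	Expect(stop.ErrorToString()).To(BeEmpty(), "no warnings expected during teardown")
})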

Luap99 avatar May 15 '24 14:05 Luap99

Here's one with a lot more context; does that help? (Three total failures in this log, so be sure to hit Page-End and then click on each individual failure.)

edsantiago avatar May 15 '24 14:05 edsantiago

Here's one with a lot more context; does that help? (Three total failures in this log, so be sure to hit Page-End and then click on each individual failure.)

Not really; all these messages are just complaining that they do not find what they expect when asked to clean up. The weird part is the sequence: why does it fail only on pod rm? Before that we run stop --all -t 0, which should stop all the containers and do the cleanup. So why does pod rm do the cleanup again?

Luap99 avatar May 15 '24 14:05 Luap99

@edsantiago Can you spin this patch through your testing PR:

diff --git a/libpod/container_internal.go b/libpod/container_internal.go
index 7f909b78d..ce5dbe0ef 100644
--- a/libpod/container_internal.go
+++ b/libpod/container_internal.go
@@ -1396,13 +1396,13 @@ func (c *Container) stop(timeout uint) error {
        if cannotStopErr != nil {
                return cannotStopErr
        }
-       if !c.batched {
+       if !c.batched && timeout > 0 {
                c.lock.Unlock()
        }
 
        stopErr := c.ociRuntime.StopContainer(c, timeout, all)
 
-       if !c.batched {
+       if !c.batched && timeout > 0 {
                c.lock.Lock()
                if err := c.syncContainer(); err != nil {
                        if errors.Is(err, define.ErrNoSuchCtr) || errors.Is(err, define.ErrCtrRemoved) {

Not really a fix, as I don't understand the cause, but this is one place where we unlock during a critical section, so if we no longer see the issues with this patch then at least I know where things start to go wrong.

Luap99 avatar May 15 '24 14:05 Luap99

Good timing. I just patched & pushed it out. If I see ~5 runs without any of these flakes, I will report back. (Or, of course, if I see the flake even with this patch).

edsantiago avatar May 15 '24 14:05 edsantiago

Sorry

edsantiago avatar May 15 '24 15:05 edsantiago

Sorry

No need to be sorry; knowing that is good enough for me. Also, the fact that this fails on a simple podman create + start test is very good: I was suspecting some weird pod/container dependency issues or something with the restart policy always, but this seems to exclude such possibilities.

Luap99 avatar May 15 '24 16:05 Luap99