e2e: kube play, huge annotation: podman rm hangs
Seeing a new flake recently, so far only in podman-remote root. Not OS-specific:
```
Podman kube play
  test with annotation size within limits
....
# podman-remote [options] kube play /var/tmp/pme2e-1564132875/pm3097852031/kube.yaml
[works fine]
← Exit [It] test with annotation size within limits
→ Enter [AfterEach] TOP-LEVEL
# podman-remote [options] stop --all -t 0
[works fine]
# podman-remote [options] pod rm -fa -t 0
[FAILED] Timed out after 90.000s.
```
- debian-13 : int remote debian-13 root host sqlite [remote]
  - 04-02 12:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-01 13:30 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 03-26 14:32 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- fedora-39 : int remote fedora-39 root host sqlite [remote]
  - 04-02 08:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- rawhide : int remote rawhide root host sqlite [remote]
  - 04-02 21:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
| Type | Remote | Distro | User | Host | DB |
|---|---|---|---|---|---|
| int(5) | remote(5) | debian-13(3) | root(5) | host(5) | sqlite(5) |
| | | rawhide(1) | | | |
| | | fedora-39(1) | | | |
I instrumented my no-retries PR to dump the YAML, and here it is (f39 remote root):
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2019-07-17T14:44:08Z"
  name: testPod
  labels:
    app: testPod
  annotations:
    name: SOMETHING TOO LONG FOR GITHUB TO LET ME PUT IN A COMMENT
spec:
  restartPolicy: Never
  hostname:
  hostNetwork: false
  hostAliases:
  initContainers:
  containers:
  - command:
    - top
    args:
    - -d
    - 1.5
    env:
    - name: HOSTNAME
    image: quay.io/libpod/testimage:20240123
    name: testCtr
    imagePullPolicy: missing
    securityContext:
      allowPrivilegeEscalation: true
      privileged: false
      readOnlyRootFilesystem: false
    ports:
    - containerPort:
      hostIP:
      hostPort:
      protocol: TCP
    workingDir: /
    volumeMounts:
status: {}
```
```
# podman-remote [options] kube play /var/tmp/podman-e2e-1368536477/subtest-2044718688/kube.yaml
Pod:
0ca4807ad93b155bc3c542fb835620d492581d937217f3f43cf260a0b90942b7
Container:
7db8ba2673e8a2f2fd99803932a9914b0273438795962bbf2b05b49b14535139
[hang]
```
Still happening. The recent logs below include a dump of the annotation, in case it helps, but I think it won't.
- debian-13 : int remote debian-13 root host sqlite [remote]
  - 07-22 11:54 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-17 17:11 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-11 23:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-04-2024 06:59 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-02-2024 12:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-01-2024 13:30 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 03-26-2024 14:32 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- fedora-39 : int remote fedora-39 root host boltdb [remote]
  - 07-12 10:23 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- fedora-39 : int remote fedora-39 root host sqlite [remote]
  - 05-02-2024 17:18 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-02-2024 08:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- rawhide : int remote rawhide root host sqlite [remote]
  - 04-02-2024 21:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
| Type | Remote | Distro | User | Host | DB |
|---|---|---|---|---|---|
| int(11) | remote(11) | debian-13(7) | root(11) | host(11) | sqlite(10) |
| | | fedora-39(3) | | | boltdb(1) |
| | | rawhide(1) | | | |
This one is failing multiple times a day in my no-retry PR. Here's the last two weeks:
- debian-13 : int remote debian-13 root host sqlite [remote]
  - PR #23275
  - 07-23 20:34 in Podman kube play test with annotation size within limits
  - 08-05 12:45 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - PR #23275
- fedora-39 : int remote fedora-39 root host boltdb [remote]
  - 08-05 09:46 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-26 07:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- fedora-40 : int remote fedora-40 root host sqlite [remote]
  - 08-06 11:34 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 08-06 10:35 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-31 20:49 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-31 13:56 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- rawhide : int remote rawhide root host sqlite [remote]
  - 08-06 12:38 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 08-06 10:36 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 08-05 09:50 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 08-01 07:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-31 23:06 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
| Type | Remote | Distro | User | Host | DB |
|---|---|---|---|---|---|
| int(13) | remote(13) | rawhide(5) | root(13) | host(13) | sqlite(11) |
| | | fedora-40(4) | | | boltdb(2) |
| | | fedora-39(2) | | | |
| | | debian-13(2) | | | |
I'll start poking at this one. Probably something to do with the sheer size of the annotation making our REST API rather angry.
Based on the error here, I see what is happening: the code reads stderr until EOF first, then stdout until EOF. There is no actual error (stderr is empty), but we never get EOF on stderr because the crun process must exit for EOF to arrive. Meanwhile, crun needs to write a very big JSON blob to the stdout pipe, and we do not start reading from that pipe until the crun process exits. So if the JSON is large enough to exceed the pipe buffer, the write on the crun side blocks until we start reading from the pipe, effectively deadlocking us.
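To make the deadlock concrete, here is a minimal, self-contained Go sketch of the pattern described above (just an illustration, not Podman's actual code): the parent drains stderr to EOF before touching stdout, while the child writes more data to stdout than the default 64 KiB Linux pipe buffer can hold, so the program hangs.

```go
// Sketch of the sequential-read deadlock: the child writes ~1 MiB to stdout
// (standing in for crun's huge state JSON), while the parent reads stderr to
// EOF first. stderr only hits EOF when the child exits, but the child is
// stuck writing stdout because nobody reads it yet -> deadlock.
package main

import (
	"fmt"
	"io"
	"os/exec"
)

func main() {
	cmd := exec.Command("sh", "-c", "head -c 1048576 /dev/zero")

	stdout, _ := cmd.StdoutPipe()
	stderr, _ := cmd.StderrPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Step 1: read stderr until EOF. This blocks forever: the child never
	// exits because its stdout write blocked once the pipe buffer filled.
	errOut, _ := io.ReadAll(stderr)

	// Step 2: never reached.
	out, _ := io.ReadAll(stdout)
	_ = cmd.Wait()
	fmt.Printf("stderr=%d bytes, stdout=%d bytes\n", len(errOut), len(out))
}
```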
In fact this is trivial to reproduce once we know this. The reason it is flaky is that UpdateContainerStatus is never called by default; it only gets called when the crun kill command fails, which can happen if the container was already stopped/exited (a normal race condition, because we unlock during stop).
And this isn't really related to remote either; remote most likely just makes the race that triggers this easier to hit.
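For completeness, a minimal sketch of the usual way out of this class of deadlock (again only an illustration, not necessarily the fix that landed in Podman): drain both pipes concurrently so the child can never block on a full pipe buffer.

```go
// Drain stdout and stderr concurrently; neither pipe can back up and stall
// the child, no matter how large the output is.
package main

import (
	"fmt"
	"io"
	"os/exec"
)

func main() {
	cmd := exec.Command("sh", "-c", "head -c 1048576 /dev/zero")

	stdout, _ := cmd.StdoutPipe()
	stderr, _ := cmd.StderrPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Read stderr on a separate goroutine while the main goroutine reads
	// stdout, so a large payload on either pipe cannot wedge the child.
	errCh := make(chan []byte, 1)
	go func() {
		b, _ := io.ReadAll(stderr)
		errCh <- b
	}()

	out, _ := io.ReadAll(stdout)
	errOut := <-errCh
	if err := cmd.Wait(); err != nil {
		panic(err)
	}
	fmt.Printf("stdout=%d bytes, stderr=%d bytes\n", len(out), len(errOut))
}
```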