e2e: kube play, huge annotation: podman rm hangs
Seeing a new flake recently, so far only in podman-remote root. Not OS-specific:
```
Podman kube play
  test with annotation size within limits
....
# podman-remote [options] kube play /var/tmp/pme2e-1564132875/pm3097852031/kube.yaml
[works fine]
← Exit [It] test with annotation size within limits
→ Enter [AfterEach] TOP-LEVEL
# podman-remote [options] stop --all -t 0
[works fine]
# podman-remote [options] pod rm -fa -t 0
[FAILED] Timed out after 90.000s.
```
- debian-13 : int remote debian-13 root host sqlite [remote]
  - 04-02 12:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-01 13:30 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 03-26 14:32 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- fedora-39 : int remote fedora-39 root host sqlite [remote]
  - 04-02 08:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- rawhide : int remote rawhide root host sqlite [remote]
  - 04-02 21:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
| Type | Remote | Distro | User | Host | DB |
|---|---|---|---|---|---|
| int(5) | remote(5) | debian-13(3) | root(5) | host(5) | sqlite(5) |
| | | rawhide(1) | | | |
| | | fedora-39(1) | | | |
I instrumented my no-retries PR to dump the YAML, and here it is (f39 remote root):
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2019-07-17T14:44:08Z"
  name: testPod
  labels:
    app: testPod
  annotations:
    name: SOMETHING TOO LONG FOR GITHUB TO LET ME PUT IN A COMMENT
spec:
  restartPolicy: Never
  hostname:
  hostNetwork: false
  hostAliases:
  initContainers:
  containers:
  - command:
    - top
    args:
    - -d
    - 1.5
    env:
    - name: HOSTNAME
    image: quay.io/libpod/testimage:20240123
    name: testCtr
    imagePullPolicy: missing
    securityContext:
      allowPrivilegeEscalation: true
      privileged: false
      readOnlyRootFilesystem: false
    ports:
    - containerPort:
      hostIP:
      hostPort:
      protocol: TCP
    workingDir: /
    volumeMounts:
status: {}
```
```
# podman-remote [options] kube play /var/tmp/podman-e2e-1368536477/subtest-2044718688/kube.yaml
Pod:
0ca4807ad93b155bc3c542fb835620d492581d937217f3f43cf260a0b90942b7
Container:
7db8ba2673e8a2f2fd99803932a9914b0273438795962bbf2b05b49b14535139
[hang]
```
Still happening. The recent logs below include a dump of the annotation, in case it helps, but I think it won't.
- debian-13 : int remote debian-13 root host sqlite [remote]
  - 07-22 11:54 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-17 17:11 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-11 23:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-04-2024 06:59 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-02-2024 12:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-01-2024 13:30 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 03-26-2024 14:32 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- fedora-39 : int remote fedora-39 root host boltdb [remote]
  - 07-12 10:23 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- fedora-39 : int remote fedora-39 root host sqlite [remote]
  - 05-02-2024 17:18 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 04-02-2024 08:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- rawhide : int remote rawhide root host sqlite [remote]
  - 04-02-2024 21:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
| Type | Remote | Distro | User | Host | DB |
|---|---|---|---|---|---|
| int(11) | remote(11) | debian-13(7) | root(11) | host(11) | sqlite(10) |
| | | fedora-39(3) | | | boltdb(1) |
| | | rawhide(1) | | | |
This one is failing multiple times a day in my no-retry PR. Here's the last two weeks:
- debian-13 : int remote debian-13 root host sqlite [remote]
  - PR #23275
  - 07-23 20:34 in Podman kube play test with annotation size within limits
  - 08-05 12:45 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - PR #23275
- fedora-39 : int remote fedora-39 root host boltdb [remote]
  - 08-05 09:46 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-26 07:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- fedora-40 : int remote fedora-40 root host sqlite [remote]
  - 08-06 11:34 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 08-06 10:35 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-31 20:49 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-31 13:56 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
- rawhide : int remote rawhide root host sqlite [remote]
  - 08-06 12:38 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 08-06 10:36 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 08-05 09:50 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 08-01 07:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  - 07-31 23:06 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
| Type | Remote | Distro | User | Host | DB |
|---|---|---|---|---|---|
| int(13) | remote(13) | rawhide(5) | root(13) | host(13) | sqlite(11) |
| | | fedora-40(4) | | | boltdb(2) |
| | | fedora-39(2) | | | |
| | | debian-13(2) | | | |
I'll start poking at this one. Probably something to do with the sheer size of the annotation making our REST API rather angry.
Based on the error here, I see what is happening: the code reads stderr until EOF first, then stdout until EOF. There is no actual error (stderr is empty), but we never get EOF on stderr because the crun process must exit for EOF to arrive. Meanwhile, crun needs to write a very big JSON blob to the stdout pipe, and we do not start reading from that pipe until the crun process exits. So if the JSON is large enough to exceed the pipe buffer, the write on the crun side blocks until we start reading from the pipe, effectively deadlocking us.
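To make the deadlock concrete, here is a minimal, self-contained Go sketch of the pattern described above (just an illustration, not Podman's actual code): the parent drains stderr to EOF before touching stdout, while the child writes more data to stdout than the default 64 KiB Linux pipe buffer can hold, so the program hangs.

```go
// Sketch of the sequential-read deadlock: the child writes ~1 MiB to stdout
// (standing in for crun's huge state JSON), while the parent reads stderr to
// EOF first. stderr only hits EOF when the child exits, but the child is
// stuck writing stdout because nobody reads it yet -> deadlock.
package main

import (
	"fmt"
	"io"
	"os/exec"
)

func main() {
	cmd := exec.Command("sh", "-c", "head -c 1048576 /dev/zero")

	stdout, _ := cmd.StdoutPipe()
	stderr, _ := cmd.StderrPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Step 1: read stderr until EOF. This blocks forever: the child never
	// exits because its stdout write blocked once the pipe buffer filled.
	errOut, _ := io.ReadAll(stderr)

	// Step 2: never reached.
	out, _ := io.ReadAll(stdout)
	_ = cmd.Wait()
	fmt.Printf("stderr=%d bytes, stdout=%d bytes\n", len(errOut), len(out))
}
```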
In fact this is trivial to reproduce once we know this. The reason it is flaky is that UpdateContainerStatus is never called by default; it only gets called when the crun kill command fails, which can happen if the container was already stopped/exited (a normal race condition, because we unlock during stop).
And this isn't really related to remote either; remote most likely just makes the race that triggers this easier to hit.
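For completeness, a minimal sketch of the usual way out of this class of deadlock (again only an illustration, not necessarily the fix that landed in Podman): drain both pipes concurrently so the child can never block on a full pipe buffer.

```go
// Drain stdout and stderr concurrently; neither pipe can back up and stall
// the child, no matter how large the output is.
package main

import (
	"fmt"
	"io"
	"os/exec"
)

func main() {
	cmd := exec.Command("sh", "-c", "head -c 1048576 /dev/zero")

	stdout, _ := cmd.StdoutPipe()
	stderr, _ := cmd.StderrPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Read stderr on a separate goroutine while the main goroutine reads
	// stdout, so a large payload on either pipe cannot wedge the child.
	errCh := make(chan []byte, 1)
	go func() {
		b, _ := io.ReadAll(stderr)
		errCh <- b
	}()

	out, _ := io.ReadAll(stdout)
	errOut := <-errCh
	if err := cmd.Wait(); err != nil {
		panic(err)
	}
	fmt.Printf("stdout=%d bytes, stderr=%d bytes\n", len(out), len(errOut))
}
```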