amazon-linux-2023

[Bug] - Timeouts in execution in forked commands

Open wonko opened this issue 1 year ago • 6 comments

Describe the bug
We had issues running haproxy with a check-command configured in a container on EKS, using AL2023-EKS as our image for the worker nodes.

The result was a timeout on the check-command. We believe this is only a single manifestation of the issue and suspect it affects other processes as well. It is unclear what exactly goes wrong, but it looks as if the process can't fork, or that once forked, the child can't run. We suspected ulimit, but those limits all seemed correct and sufficiently high. strace showed a large number of polling calls, but it was unclear why that would happen. dmesg contained nothing relevant.
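One way to test the fork theory directly would be to time repeated fork+exec cycles on an affected node. A minimal sketch of such a probe (hypothetical diagnostic, not something we ran; the 0.5 s threshold is an arbitrary assumption):

```python
# Sketch: time fork+exec of /bin/true in a loop. On a healthy node each
# iteration completes in a few milliseconds; outliers get printed.
import subprocess
import time

SLOW = 0.5  # seconds; assumed threshold that would explain a check timeout
for i in range(1000):
    start = time.monotonic()
    subprocess.run(["/bin/true"], check=True)
    elapsed = time.monotonic() - start
    if elapsed > SLOW:
        print(f"iteration {i}: fork+exec took {elapsed:.3f}s")
```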

To Reproduce
Steps to reproduce the behavior:

  1. Deploy EKS 1.29 cluster
  2. Deploy worker nodes with AL2023-EKS-1.29-20240415
  3. Using helm, install percona operator and the database for mysql/galera: https://docs.percona.com/percona-operator-for-mysql/pxc/helm.html
  4. Observe the behavior inside the haproxy containers:
  • the proxy layer never becomes ready; it keeps failing
  • high CPU load
  • reports that the external check command timed out

We tried changing the external check command to a plain /bin/true, but that didn't help, which ruled out the specific check script as the cause.
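For reference, an external check boils down to forking and executing the configured command and failing the check when it doesn't exit within the check timeout. A rough Python approximation of those semantics (haproxy implements this internally in C; the 2-second timeout below is an assumed value, not our actual configuration):

```python
# Approximation of external-check semantics: run the command and treat a
# missed deadline as a failed check. The timeout value is assumed.
import subprocess

try:
    subprocess.run(["/bin/true"], timeout=2, check=True)
    print("check passed")
except subprocess.TimeoutExpired:
    print("external check timed out")  # the symptom we see on AL2023
```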

When switching to Bottlerocket or AL2 and redeploying exactly the same workload, everything works. We verified by switching back, and the problems returned immediately. Various machine types have been tried, same issue. Various versions of the operator/database helm chart (using different images), same result.

Expected behavior
A stable environment, all systems online and working, with no restarts, as is the case with Bottlerocket or AL2.

wonko avatar Apr 19 '24 09:04 wonko

I am unsure if it's related, but I have experienced odd timeouts with EKS and AL2023 nodes. Under high load, we see networking timeouts just trying to reach the kubernetes service from pods, as well as when fetching logs from pods. Reverting to AL2 resolved these issues completely for us.

I brought this to AWS Support for EKS and they haven't been able to find anything of note.

code-eg avatar Jun 14 '24 02:06 code-eg

Any updates here? Having to revert to AL2 isn't a viable long-term solution.

opeologist avatar Jun 20 '24 14:06 opeologist

Can you specify the Node AMI version that displayed this behavior?

orsenthil avatar Jun 25 '24 00:06 orsenthil

All I have is "AL2023-EKS-1.29-20240415"

wonko avatar Jun 25 '24 06:06 wonko

@wonko - I launched a curl container inside a pod and had it communicate with the API server repeatedly (more than 100 times). We didn't see any connectivity issues on AL2023 with AMI 1.29.3-20240625.
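A rough in-cluster equivalent of that probe, as a sketch under assumptions (the actual test used a curl container; the /version endpoint, request count, and timeout here are illustrative):

```python
# Sketch: repeatedly hit the API server from inside a pod via the default
# service account. Endpoint, request count, and timeout are assumptions.
import ssl
import time
import urllib.request

SA = "/var/run/secrets/kubernetes.io/serviceaccount"
with open(f"{SA}/token") as f:
    token = f.read().strip()
ctx = ssl.create_default_context(cafile=f"{SA}/ca.crt")

failures = 0
for i in range(100):
    req = urllib.request.Request(
        "https://kubernetes.default.svc/version",
        headers={"Authorization": f"Bearer {token}"},
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=5, context=ctx) as resp:
            resp.read()
    except Exception as exc:
        failures += 1
        print(f"request {i}: failed after {time.monotonic() - start:.2f}s: {exc}")
print(f"{failures}/100 requests failed")
```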

Perhaps it is the kind of client that plays a part here?

orsenthil avatar Jun 27 '24 21:06 orsenthil

I didn't check with other AMI versions, so maybe the issue was solved in a newer one (there is roughly a two-month difference). Did you verify with the AMI version I provided, where we detected the problem?

wonko avatar Aug 12 '24 11:08 wonko