pty sessions keep growing in /proc/sys/kernel/pty/nr
Description
We are running Flatcar as the host operating system for Kubernetes cluster nodes on AWS, GCP and bare-metal instances. Since updating to version 4230.2.3 we have observed that the number of pty sessions in /proc/sys/kernel/pty/nr keeps growing and eventually reaches a point where we can no longer create new sessions.
Impact
An immediate side effect of this is that we can no longer exec into pods:
kubectl --context exp-1-merit -n sys-terraform-applier exec -ti pod/snapshot-test-29340305-n5pdz -- ash
error: Internal error occurred: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "2434a72a7d942d881f3db2057d524cf94eff1f7c8f957ec15452d41d265676f8": OCI runtime exec failed: exec failed: unable to start container process: open /dev/ptmx: no space left on device: unknown
More generally, pty sessions are exhausted quickly and we have to reboot nodes as a workaround to reset the count to 0.
Environment and steps to reproduce
All our affected nodes are currently running:
- OS version: Flatcar Container Linux by Kinvolk 4230.2.4 (Oklo)
- Kernel: 6.6.110-flatcar
- Kubernetes: v1.34.1 (kubelet binary runs as a systemd service)
- Container runtime: containerd://1.7.23
The issue seems to have appeared with the previous version, 4230.2.3.
Given enough time, the number of pty sessions in /proc/sys/kernel/pty/nr will grow very high and exhaust available sessions. For example:
core@worker-2 ~ $ cat /proc/sys/kernel/pty/nr
3072
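To see which processes are holding pty masters open, a rough diagnostic like the following can be run on an affected node (a sketch using only /proc, since lsof may not be available on a Flatcar host; needs root to read other processes' fd tables):

# Rough sketch: count open pty master fds per process.
for fd in /proc/[0-9]*/fd/*; do
  case "$(readlink "$fd" 2>/dev/null)" in
    /dev/ptmx|/dev/pts/ptmx) ;;   # pty master, opened via the host /dev/ptmx or a devpts mount
    *) continue ;;
  esac
  pid=${fd#/proc/}; pid=${pid%%/*}
  echo "$pid $(cat /proc/"$pid"/comm 2>/dev/null)"
done | sort | uniq -c | sort -rn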
At some point the pool, based on the default pty limits, gets exhausted:
core@worker-2 ~ $ sysctl kernel.pty.max
kernel.pty.max = 4096
The pool of PTYs available to non-host processes (like containers) is calculated as kernel.pty.max - kernel.pty.reserve. So in our case, this is 4096 - 1024 = 3072. Then the error surfaces when trying to create sessions:
exec failed: unable to start container process: open /dev/ptmx: no space left on device: unknown
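Putting those numbers together, a minimal sketch to see how much of the non-reserved pool is left on a node (assuming the standard sysctl paths under /proc/sys):

max=$(cat /proc/sys/kernel/pty/max)         # overall pty limit (4096 here)
reserve=$(cat /proc/sys/kernel/pty/reserve) # ptys reserved for the host devpts mount (1024 here)
nr=$(cat /proc/sys/kernel/pty/nr)           # ptys currently allocated (all mounts)
echo "container pool: $((max - reserve)), allocated: $nr, approx. headroom: $((max - reserve - nr))"

Note that kernel.pty.nr counts all allocated ptys, including the host's own, so the headroom figure is an approximation.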
Restarting the container runtime didn't help. The only workaround we have found so far is to reboot affected nodes.
Expected behaviour
In higher environments where we are still running 4230.2.2 (kernel 6.6.100-flatcar), pty sessions appear to be closed properly and the count looks like:
$ cat /proc/sys/kernel/pty/nr
2
Hello @ffilippopoulos,
The issue seems to appear since the previous version 4230.2.4
To be clear: did the issue start to appear with 4230.2.4 or with the previous version (4230.2.3)?
@tormath1 the issue started to appear with the previous version: 4230.2.3. Edited the description to address the typo 👍
Could this be it? https://github.com/giantswarm/roadmap/issues/3997 / https://github.com/containerd/containerd/issues/11160, fixed in containerd 1.7.26.
@jepio looks like the fix is also included in containerd >= 2.0.2 https://github.com/containerd/containerd/commit/e1b0bb601e33670fa24bd184ce61df80144a247d.
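For reference, the containerd version actually running on a node can be checked directly on the host (standard containerd CLI, nothing Flatcar-specific):

containerd --version   # version of the installed binary
ctr version            # client and daemon versions via the runtime socket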
I can see that the Beta channel currently ships containerd 2.0.5. We are happy to try that and report whether it mitigates the issue, if that helps.
@jepio actually the last "working" version is 4230.2.2, which ships the same containerd (containerd://1.7.23) as the versions where we've seen the issue. Regardless, I will try the latest Beta channel and report back, but it's worth noting this.
we should get containerd to 1.7.26 for stable/lts regardless...
I can also confirm that the issue seems resolved in the latest Beta version: Flatcar Container Linux by Kinvolk 4459.1.0 (Oklo), kernel 6.12.51-flatcar.
we should get containerd to 1.7.26 for stable/lts regardless...
I think it might be time for a Stable promotion? I'm aware of one Beta issue (https://github.com/flatcar/Flatcar/issues/1909) right now, but we can start thinking about it. What do you think @sayanchowdhury?
The next release would include a major Stable release.
What about LTS?
@ffilippopoulos containerd 2.0.x is now available on Stable. Let us know if you still see those growing pty sessions. Thanks!