pty sessions keep growing in /proc/sys/kernel/pty/nr

Open ffilippopoulos opened this issue 2 months ago • 11 comments

Description

We are running Flatcar as the host operating system for Kubernetes cluster nodes on AWS, GCP and bare metal instances. Since updating to version 4230.2.3 we have observed that the pty session count in /proc/sys/kernel/pty/nr keeps growing and eventually reaches a point where we can no longer create new sessions.

Impact

An immediate side effect of this is that we can no longer exec into pods:

kubectl --context exp-1-merit -n sys-terraform-applier exec -ti pod/snapshot-test-29340305-n5pdz -- ash
error: Internal error occurred: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "2434a72a7d942d881f3db2057d524cf94eff1f7c8f957ec15452d41d265676f8": OCI runtime exec failed: exec failed: unable to start container process: open /dev/ptmx: no space left on device: unknown

But generally the issue is that pty sessions are exhausted quickly and we need to reboot nodes as a workaround to reset the number to 0.

Environment and steps to reproduce

All our affected nodes are currently running:
OS Version: Flatcar Container Linux by Kinvolk 4230.2.4 (Oklo)
Kernel: 6.6.110-flatcar
Kubernetes: v1.34.1 (kubelet binary runs as systemd service)
Container runtime: containerd://1.7.23

The issue seems to have first appeared in the previous version, 4230.2.3.

Given enough time, the number of pty sessions in /proc/sys/kernel/pty/nr will grow very high and exhaust available sessions. For example:

core@worker-2 ~ $ cat /proc/sys/kernel/pty/nr
3072

At some point the pool of sessions allowed by the default pty limit gets exhausted:

core@worker-2 ~ $ sysctl kernel.pty.max
kernel.pty.max = 4096

The pool of PTYs available to non-host processes (like containers) is calculated as kernel.pty.max - kernel.pty.reserve. So in our case, this is 4096 - 1024 = 3072. Then the error surfaces when trying to create sessions:

exec failed: unable to start container process: open /dev/ptmx: no space left on device: unknown
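
The arithmetic above can be sketched as a small shell helper. This is only an illustration of the formula quoted in this issue (kernel.pty.max - kernel.pty.reserve, minus the current count); the function names `pty_headroom` and `pty_headroom_live` are hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Sketch: how many PTYs are still available to non-host processes,
# assuming the pool is kernel.pty.max - kernel.pty.reserve as described above.
# pty_headroom is a hypothetical helper name.
pty_headroom() {
    # $1 = kernel.pty.max, $2 = kernel.pty.reserve, $3 = kernel.pty.nr
    echo $(( $1 - $2 - $3 ))
}

# Hypothetical wrapper that reads the live values from procfs on a node.
pty_headroom_live() {
    pty_headroom "$(cat /proc/sys/kernel/pty/max)" \
                 "$(cat /proc/sys/kernel/pty/reserve)" \
                 "$(cat /proc/sys/kernel/pty/nr)"
}
```

With the values from this report, `pty_headroom 4096 1024 3072` prints 0, which matches the point at which opening /dev/ptmx starts failing. Running `pty_headroom_live` periodically on a node would show how quickly the remaining pool shrinks.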

Restarting the container runtime didn't seem to help. The only workaround we have so far is to reboot affected nodes.

Expected behaviour

In higher environments where we are still running 4230.2.2 (kernel 6.6.100-flatcar), pty sessions seem to be closed properly and the figures look like:

$ cat /proc/sys/kernel/pty/nr
2

ffilippopoulos avatar Oct 15 '25 08:10 ffilippopoulos

Hello @ffilippopoulos,

> The issue seems to appear since the previous version 4230.2.4

To be clear: did the issue start to appear with 4230.2.4 or with the previous version (4230.2.3)?

tormath1 avatar Oct 15 '25 09:10 tormath1

@tormath1 the issue started to appear with the previous version: 4230.2.3. Edited the description to address the typo 👍

ffilippopoulos avatar Oct 15 '25 09:10 ffilippopoulos

Could this be it? https://github.com/giantswarm/roadmap/issues/3997 https://github.com/containerd/containerd/issues/11160 (fixed in containerd 1.7.26)

jepio avatar Oct 15 '25 10:10 jepio

@jepio looks like the fix is also included in containerd >= 2.0.2 https://github.com/containerd/containerd/commit/e1b0bb601e33670fa24bd184ce61df80144a247d. I can see that beta channel is currently on: containerd - 2.0.5. We are happy to try that and report if it mitigates the issue in case that'll help.

ffilippopoulos avatar Oct 15 '25 10:10 ffilippopoulos

@jepio actually the last "working" version is 4230.2.2, which comes with the same containerd (containerd://1.7.23) as the versions where we've seen the issue. Regardless, I will try the latest beta channel and report back, but this is worth noting.

ffilippopoulos avatar Oct 15 '25 10:10 ffilippopoulos

we should get containerd to 1.7.26 for stable/lts regardless...

jepio avatar Oct 15 '25 14:10 jepio

I can also confirm that the issue seems resolved in the latest beta version Flatcar Container Linux by Kinvolk 4459.1.0 (Oklo) 6.12.51-flatcar

ffilippopoulos avatar Oct 15 '25 14:10 ffilippopoulos

> we should get containerd to 1.7.26 for stable/lts regardless...

I think it might be time for a Stable promotion? I'm aware of one Beta issue (https://github.com/flatcar/Flatcar/issues/1909) right now but we can start thinking about it. What do you think, @sayanchowdhury?

tormath1 avatar Oct 17 '25 08:10 tormath1

The next release would include a major Stable release.

sayanchowdhury avatar Oct 28 '25 05:10 sayanchowdhury

What about LTS?

jepio avatar Oct 28 '25 09:10 jepio

@ffilippopoulos containerd 2.0.x is now available on Stable. Let us know if you still see those growing pty sessions. Thanks!

tormath1 avatar Nov 13 '25 12:11 tormath1