
[BUG] In CentOS Stream 9, when a container gets OOM-killed, it kills the cluster

Open · charlesbihis opened this issue on Mar 18, 2023 · 0 comments

Hi all. I'm running some C# applications in a local k3d cluster on my CentOS Stream 9 machine and am noticing some troubling behavior. In short, when a container in my cluster gets OOM-killed, the kill propagates from the process to the parent container, the parent pod, and all the way up to the k3d node itself, taking out any other pods/containers running on that node and effectively breaking the cluster. This is somewhat mitigated by standing up a multi-node cluster, but the individual nodes die one by one as this happens until the entire cluster is eventually dead.

I've narrowed this down specifically to when the cgroup driver is systemd (default on CentOS Stream 9). I do not see this behavior when I switch the cgroup driver to cgroupfs.
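
For reference, a quick way to check which cgroup driver Docker is using (output should be either systemd or cgroupfs):

# print the cgroup driver Docker is currently configured with
docker info --format '{{.CgroupDriver}}'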

More details below.

What did you do

I took the following steps to repro...

  1. Create a new k3d cluster
  2. Run a pod in the cluster
  3. Simulate an OOM-kill scenario which should kill the container and restart the pod
  4. Observe that it instead kills the entire node, and in single node clusters, effectively breaks the cluster

The individual commands for the first 3 steps...

# create the cluster
k3d cluster create

# run a pod in the cluster
kubectl run testpod --image=bitnami/dotnet -- tail -f /dev/null

# simulate an OOM-kill scenario by tailing /dev/zero (https://askubuntu.com/a/1188074)
kubectl exec -it testpod -- tail /dev/zero

What happened

The OOM-killer kills the process, the parent container, the parent pod, as well as the parent k3d node, effectively breaking the cluster.
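
As a side check (not part of the original repro), Docker's own record of the node container's state can show whether the container itself was OOM-killed by the kernel or was killed some other way:

# inspect the k3d server container's last recorded state
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}} {{.State.Status}}' k3d-k3s-default-server-0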

When I try to check the state of my cluster, I see that Docker and k3d report the node as "restarting", and it never recovers...

> docker ps
CONTAINER ID   IMAGE                            COMMAND                  CREATED         STATUS                         PORTS                             NAMES
a0e0a101181e   ghcr.io/k3d-io/k3d-proxy:5.4.9   "/bin/sh -c nginx-pr…"   2 minutes ago   Up 2 minutes                   80/tcp, 0.0.0.0:43481->6443/tcp   k3d-k3s-default-serverlb
aa3536b6b0dd   rancher/k3s:v1.25.7-k3s1         "/bin/k3d-entrypoint…"   2 minutes ago   Restarting (1) 5 seconds ago                                     k3d-k3s-default-server-0

> k3d node list
NAME                       ROLE           CLUSTER       STATUS
k3d-k3s-default-server-0   server         k3s-default   restarting
k3d-k3s-default-serverlb   loadbalancer   k3s-default   running

> k3d cluster list
NAME          SERVERS   AGENTS   LOADBALANCER
k3s-default   1/1       0/0      true

> kubectl get nodes
Unable to connect to the server: EOF

I'm able to confirm with journalctl that the OOM-killer killed the offending process, and that systemd then killed everything else in the same scope, up to and including the k3d node itself...

> sudo journalctl --since "Mar 17 00:00:00" > with-systemd.log

# parsing that log, here are the relevant bits...
...
Mar 17 22:36:49 ip-172-31-15-98.us-west-1.compute.internal kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope/kubepods/besteffort/podd8dcbcc6-5695-4b90-823b-ac11f60809ec/d1d9d0f61ec5d843e253fabee7c19ccd6623b752a2a7cb2de8deda6e7bc067d0,task=tail,pid=20227,uid=0
Mar 17 22:36:49 ip-172-31-15-98.us-west-1.compute.internal kernel: Out of memory: Killed process 20227 (tail) total-vm:799520kB, anon-rss:797192kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1600kB oom_score_adj:1000
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: A process of this unit has been killed by the OOM killer.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17092 (metrics-server) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17105 (metrics-server) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17108 (metrics-server) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17109 (metrics-server) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17110 (metrics-server) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17115 (metrics-server) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17165 (n/a) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17166 (metrics-server) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 16667 (pause) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 16896 (coredns) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 16914 (coredns) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 16915 (coredns) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 16918 (coredns) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 16919 (coredns) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17167 (coredns) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 16685 (pause) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 19736 (pause) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17744 (pause) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 19896 (pause) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 19974 (bash) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 17848 (pause) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 18239 (traefik) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 18259 (traefik) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 18260 (n/a) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 18261 (traefik) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 18262 (traefik) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 18263 (traefik) with signal SIGKILL.
Mar 17 22:36:50 ip-172-31-15-98.us-west-1.compute.internal systemd[1]: docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope: Killing process 18264 (n/a) with signal SIGKILL.
... (and many more processes)

Note that in the above logs, the OOM-killer kills only the problematic tail process, but systemd then proceeds to kill every other process in the same scope/cgroup hierarchy, taking out all sibling containers, all pods on the k3d node, and the k3d node itself.
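
If systemd's per-unit OOM handling is what drives this, a diagnostic sketch for confirming it (the scope name below is copied from the journal output above and will differ per run) is to dump the scope's OOM-related properties and check the system-wide default:

# show how systemd is configured to react when a process inside this scope gets OOM-killed
systemctl show docker-fc22a244e4e9a1704d1b2e9d1a5bdf3e2f66c0de47ce462a109fb474ec7da0aa.scope | grep -i oom

# check whether a system-wide DefaultOOMPolicy is set (may be commented out or absent)
grep -i oom /etc/systemd/system.conf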

What I expected to happen

I expected the problematic process to be killed (and possibly even the parent container and pod), but not the entire node. More importantly, I expected to still have a functional k3d cluster.
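
In other words (a sketch of the expected outcome, not of what actually happens with systemd as the driver), the cluster should still look healthy afterwards:

# expected: the pod is still there (possibly restarted), and the node stays Ready
kubectl get pod testpod
kubectl get nodes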

As noted above, this only happens when the cgroup driver is systemd. When I change the cgroup driver to cgroupfs (see https://stackoverflow.com/a/65870152), nominal behavior returns. Executing the same repro steps as above, this is what I observe instead...

> docker ps
CONTAINER ID   IMAGE                            COMMAND                  CREATED              STATUS              PORTS                             NAMES
ee5c17adec35   ghcr.io/k3d-io/k3d-proxy:5.4.9   "/bin/sh -c nginx-pr…"   About a minute ago   Up About a minute   80/tcp, 0.0.0.0:45585->6443/tcp   k3d-k3s-default-serverlb
46af8629abf5   rancher/k3s:v1.25.7-k3s1         "/bin/k3d-entrypoint…"   About a minute ago   Up About a minute                                     k3d-k3s-default-server-0

> k3d node list
NAME                       ROLE           CLUSTER       STATUS
k3d-k3s-default-server-0   server         k3s-default   running
k3d-k3s-default-serverlb   loadbalancer   k3s-default   running

> k3d cluster list
NAME          SERVERS   AGENTS   LOADBALANCER
k3s-default   1/1       0/0      true

> kubectl get nodes
NAME                       STATUS   ROLES                  AGE   VERSION
k3d-k3s-default-server-0   Ready    control-plane,master   96s   v1.25.7+k3s1

And the journalctl logs also show nominal behavior...

> sudo journalctl --since "Mar 17 00:00:00" > with-cgroupfs.log

# once again parsing that log, here are the relevant bits...
...
Mar 17 22:43:49 ip-172-31-15-98.us-west-1.compute.internal kernel: Out of memory: Killed process 30435 (tail) total-vm:901596kB, anon-rss:899060kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:1800kB oom_score_adj:1000

Note that in the above logs, only the problematic tail process is killed; no other processes are affected.

Screenshots of terminal output

I have been able to replicate this on a vanilla CentOS Stream 9 instance in EC2, created from the Amazon CentOS Stream 9 AMI. Here are my results for the two configurations, with the repro and status commands highlighted...

NOTE: Can change cgroup drivers by following instructions here - https://stackoverflow.com/a/65870152
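
For completeness, the change from that answer boils down to roughly the following (a sketch; /etc/docker/daemon.json may need to be created first, and any existing keys merged rather than replaced):

# /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

# then restart Docker and recreate the cluster
sudo systemctl restart docker
k3d cluster delete && k3d cluster create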

With systemd as the cgroup driver

(screenshot attached to the original issue)

With cgroupfs as the cgroup driver

(screenshot attached to the original issue)

Which OS & Architecture

> k3d runtime-info
arch: x86_64
cgroupdriver: systemd
cgroupversion: "2"
endpoint: /var/run/docker.sock
filesystem: xfs
infoname: ip-172-31-15-98.us-west-1.compute.internal
name: docker
os: CentOS Stream 9
ostype: linux
version: 23.0.1

> cat /etc/os-release
NAME="CentOS Stream"
VERSION="9"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="9"
PLATFORM_ID="platform:el9"
PRETTY_NAME="CentOS Stream 9"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:centos:centos:9"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"

Which version of k3d

> k3d version
k3d version v5.4.9
k3s version v1.25.7-k3s1 (default)

Which version of docker

> docker version
Client: Docker Engine - Community
 Version:           23.0.1
 API version:       1.42
 Go version:        go1.19.5
 Git commit:        a5ee5b1
 Built:             Thu Feb  9 19:49:35 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          23.0.1
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.19.5
  Git commit:       bc3805a
  Built:            Thu Feb  9 19:46:32 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.18
  GitCommit:        2456e983eb9e37e47538f59ea18f2043c9a73640
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

> docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.16.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 23.0.1
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2456e983eb9e37e47538f59ea18f2043c9a73640
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.14.0-229.el9.x86_64
 Operating System: CentOS Stream 9
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 1.699GiB
 Name: ip-172-31-15-98.us-west-1.compute.internal
 ID: 71b23c0e-f99d-4a0b-8fbc-548459dd9832
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

charlesbihis · Mar 18 '23