Upgrade from Stable 2605.8.0 to 2605.9.0 breaks Kubernetes running with docker container runtime

Open ffilippopoulos opened this issue 3 years ago • 18 comments

The current stable release 2605.9.0 has broken all our clusters (AWS / GCP / Bare metal + dev / prod). The issue we observe is that some nodes become NotReady after a while and traffic going to containers on those nodes times out. Moreover it seems that runc processes are stuck in D state (uninterruptible sleep) and a remedy to make the nodes responsive again is to kill those processes.
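
For anyone checking their own nodes, the stuck processes can be spotted with something along these lines (just one way of doing it):

# list runc processes that are in uninterruptible sleep (D state)
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/ && /runc/'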

We are still in the midst of debugging and trying to find the root cause, but in the meantime we thought we would raise an issue detailing as much as we know and update it as we go. This might also be useful for anyone who is experiencing the same problem but hasn't managed to formalize it into an issue yet (like us).

Impact

Kubernetes cluster nodes enter a NotReady state and containers on those nodes stop receiving traffic, so after a while the cluster eventually breaks down.

Environment and steps to reproduce

  • Run a Kubernetes cluster on Flatcar Container Linux by Kinvolk 2605.9.0 (Oklo) hosts, with Kubernetes version v1.19.2.
  • Wait until containerd gets stuck and nodes break.
  • We haven't tested if an aggressive rotation of pods can bring this to the surface earlier.

Expected behavior

Everything should run as usual after the upgrade.

Additional information

Logs from kubelet: at some point we notice that the kubelet starts complaining about the container runtime not responding:

Dec 10 08:49:34 ip-10-66-23-28 kubelet[1295931]: E1210 08:49:34.050798 1295931 kubelet.go:1765] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg was last seen active 3m1.826056813s ago; threshold is 3m0s]

Also we can see a lot of timeouts:

Dec 09 20:28:36 ip-10-66-23-47 kubelet[747034]: E1209 20:28:36.860894  747034 remote_runtime.go:357] UpdateContainerResources "fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 09 20:28:36 ip-10-66-23-47 kubelet[747034]: E1209 20:28:36.860949  747034 cpu_manager.go:418] [cpumanager] reconcileState: failed to update container (pod: calico-node-dmrmn, container: calico-node, container id: fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304, cpuset: "0-1", error: rpc error: code = DeadlineExceeded desc = context deadline exceeded)
Dec 09 20:28:37 ip-10-66-23-47 kubelet[747034]: E1209 20:28:37.816557  747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-live -bird-live' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 09 20:28:39 ip-10-66-23-47 kubelet[747034]: E1209 20:28:39.937142  747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-ready -bird-ready' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Dec 09 20:28:47 ip-10-66-23-47 kubelet[747034]: W1209 20:28:47.221936  747034 cpu_manager.go:397] [cpumanager] reconcileState: ignoring terminated container (pod: graylog-5f7f9ff465-bfdsx, container id: fb82db45342e3b4ba8ae0e9f421d63fc06d2ffba2f897fcdf6e1f0d37854ce0f)
Dec 09 20:28:48 ip-10-66-23-47 kubelet[747034]: W1209 20:28:48.009743  747034 cpu_manager.go:397] [cpumanager] reconcileState: ignoring terminated container (pod: vault-1, container id: bd910527ffde5f1bb6cd1c212653d6af6ffdf39b8185b7df3e93fb22d045f310)
Dec 09 20:28:48 ip-10-66-23-47 kubelet[747034]: W1209 20:28:48.261703  747034 cpu_manager.go:397] [cpumanager] reconcileState: ignoring terminated container (pod: calico-node-dmrmn, container id: bba7f01e56fc3f557ddef9217aadfff5e20d4e5239b98cbb2073a5e739044c97)
Dec 09 20:30:37 ip-10-66-23-47 kubelet[747034]: E1209 20:30:37.817316  747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-live -bird-live' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Dec 09 20:30:39 ip-10-66-23-47 kubelet[747034]: E1209 20:30:39.938398  747034 remote_runtime.go:389] ExecSync fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304 '/bin/calico-node -felix-ready -bird-ready' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Dec 09 20:30:48 ip-10-66-23-47 kubelet[747034]: E1209 20:30:48.261900  747034 remote_runtime.go:357] UpdateContainerResources "fc09fafd0c40a9a31198f33fd63569be7e759f97426f056836fff15347ef4304" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Looking at the processes of a hung container we can see:

root       10263  0.0  0.0   4368   708 ?        Ds   Dec09   0:00  |   \_ /usr/local/bin/runsvdir -P /etc/service/enabled
root       10878  0.0  0.0   4216   808 ?        Ds   Dec09   0:00  |   |   \_ runsv felix
root       10884  0.7  0.8 1578988 63452 ?       Sl   Dec09   8:12  |   |   |   \_ calico-node -felix
root       10879  0.0  0.0   4216   836 ?        Ds   Dec09   0:00  |   |   \_ runsv bird
root       11318  0.0  0.0    884   656 ?        D    Dec09   0:08  |   |   |   \_ bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
root       10880  0.0  0.0   4216   820 ?        Ds   Dec09   0:00  |   |   \_ runsv monitor-addresses
root       10885  0.0  0.6 1136340 53668 ?       Sl   Dec09   0:03  |   |   |   \_ calico-node -monitor-addresses
root       10881  0.0  0.0   4216   740 ?        Ds   Dec09   0:00  |   |   \_ runsv confd
root       10886  0.0  0.7 1505000 56496 ?       Sl   Dec09   0:14  |   |   |   \_ calico-node -confd
root       10882  0.0  0.0   4216   832 ?        Ds   Dec09   0:00  |   |   \_ runsv bird6
root       11319  0.0  0.0    744     8 ?        D    Dec09   0:05  |   |   |   \_ bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
root       10883  0.0  0.0   4216   852 ?        Ds   Dec09   0:00  |   |   \_ runsv allocate-tunnel-addrs
root       10888  0.0  0.5 1210072 42616 ?       Sl   Dec09   0:02  |   |       \_ calico-node -allocate-tunnel-addrs
root     1111612  0.0  0.2 1232712 22308 ?       Sl   Dec09   0:00  |   \_ runc --root /var/run/docker/runtime-runc/moby --log /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49/log.json --log-format json --systemd-cgroup exec --process /tmp/runc-process535500753 --detach --pid-file /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49/ff1812165d7c7e84fc1e4dce4e86358c4f244430fea1d0c8445a1123f9e23fcb.pid bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49
root     1111635  0.0  0.2 1085760 18012 ?       Dsl  Dec09   0:00  |   |   \_ runc init
root     1111628  8.7  0.2 1233864 21296 ?       Sl   Dec09  45:51  |   \_ runc --root /var/run/docker/runtime-runc/moby --log /run/docker/libcontainerd/containerd/io.containerd.runtime.v1.linux/moby/bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49/log.json --log-format json --systemd-cgroup update --resources - bd138a9c7020cf1ec127d5934d9307a2af6d1fabfc8d79f8a5972d70cc57aa49

Killing the runc processes shown above seems to make the node function properly again.

Note that we are mainly observing this on calico-node pods (as also visible in the kubelet logs above), but we will try to reproduce it with other pods.

We also found https://github.com/moby/moby/issues/40817, which describes a somewhat similar behaviour (even though it is quite old, and the latest stable Flatcar comes with a containerd version > 1.4.0, which is the version blamed there).

ffilippopoulos avatar Dec 10 '20 09:12 ffilippopoulos

The upgrade from 2605.8.0 to 2605.9.0 bumped the containerd version from 1.3.7 to 1.4.3. The Docker version didn't change much: it went from 19.03.12 to 19.03.14.

The containerd issue identified in the linked moby issue seems to be this one: containerd/containerd/issues/4509 (Containers with curl based health checks becoming unresponsive). That issue was introduced in 1.4.0 and fixed in 1.4.1, so it shouldn't be the same issue. It could be another issue introduced with 1.4.0 that hadn't been found until now, but that raises the question of why it only surfaces with specific Flatcar setups...
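
For comparison, the versions shipped on a node can be checked directly, e.g.:

# containerd, runc and docker versions on the host
containerd --version
runc --version
docker version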

margamanterola avatar Dec 10 '20 12:12 margamanterola

The linked containerd issue has a bunch of strategies for getting more information about what's going on. Check https://github.com/containerd/containerd/issues/4509#issuecomment-689671931 in particular, which describes how to create goroutine dumps as well as how to configure containerd to generate debug output.
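
Roughly, that comes down to enabling debug output in containerd's config and asking the daemon for a stack dump; a sketch (see the linked comment for the exact steps):

# in /etc/containerd/config.toml, set the debug level and restart containerd:
#   [debug]
#     level = "debug"
# then make containerd dump its goroutine stacks into its log:
kill -USR1 "$(pidof containerd)"
journalctl -u containerd -e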

margamanterola avatar Dec 10 '20 12:12 margamanterola

@marga-kinvolk tyvm for the advice. I followed that comment and got a goroutine dump of the stuck containerd-shim process here: https://gist.githubusercontent.com/ffilippopoulos/72406746b8969e86cc735a97e364faed/raw/a3343972e7b90142621bbc19df3ce89aff216879/goroutine-stack-dump

I am not sure here, but it looks like it's stuck waiting to read from a Unix socket. Could this be related to the changes made for CVE-2020-15257: https://github.com/containerd/containerd/security/advisories/GHSA-36xw-fx78-c5r4?

ffilippopoulos avatar Dec 11 '20 09:12 ffilippopoulos

A little more information from the misbehaving container and its runc init process.

● docker-87a4886cbdaba184ef499c9a4f11debee4ae0b871cb5d1bdd65089db14884b76.scope - libcontainer container 87a4886cbdaba184ef499c9a4f11debee4ae0b871cb5d1bdd65089db14884b76
     Loaded: loaded (/run/systemd/transient/docker-87a4886cbdaba184ef499c9a4f11debee4ae0b871cb5d1bdd65089db14884b76.scope; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/docker-87a4886cbdaba184ef499c9a4f11debee4ae0b871cb5d1bdd65089db14884b76.scope.d
             └─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUShares.conf, 50-CPUQuotaPeriodSec.conf, 50-CPUQuota.conf
     Active: active (running) since Mon 2020-12-07 22:17:00 UTC; 3 days ago
      Tasks: 47 (limit: 8890)
     Memory: 107.6M
        CPU: 6min 34.971s
     CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1eaa232c_8a32_4893_b751_677bd7c942b6.slice/docker-87a4886cbdaba184ef499c9a4f11debee4ae0b871cb5d1bdd65089db14884b76.scope
             ├─  3595 /usr/local/bin/runsvdir -P /etc/service/enabled
             ├─  4477 runsv bird
             ├─  4478 runsv felix
             ├─  4479 runsv allocate-tunnel-addrs
             ├─  4480 runsv confd
             ├─  4481 runsv monitor-addresses
             ├─  4482 runsv bird6
             ├─  4485 calico-node -allocate-tunnel-addrs
             ├─  4487 calico-node -monitor-addresses
             ├─  4897 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
             ├─  4919 bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
             ├─ 13891 calico-node -confd
             ├─ 14090 calico-node -felix
             └─573668 runc init

Dec 07 22:17:00 worker-k8s-exp-1-d54s.c.uw-dev.internal systemd[1]: Started libcontainer container 87a4886cbdaba184ef499c9a4f11debee4ae0b871cb5d1bdd65089db14884b76.
# ps -fe | grep 57366[8]
root      573668  573651  0 Dec08 ?        00:00:00 runc init
# cat /proc/573668/status | grep State
State:  D (disk sleep)
# cat /proc/573668/stack
[<0>] flush_old_exec+0x3e9/0x6e0
[<0>] load_elf_binary+0x32e/0x1640
[<0>] search_binary_handler+0x84/0x1b0
[<0>] __do_execve_file.isra.0+0x4e3/0x830
[<0>] __x64_sys_execve+0x35/0x40
[<0>] do_syscall_64+0x4e/0x120
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
# lsof -p 573668
COMMAND      PID USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
runc:[2:I 573668 root  cwd       DIR              0,175     4096 2616498 /
runc:[2:I 573668 root  rtd       DIR              0,175     4096 2616498 /
runc:[2:I 573668 root  txt       REG                0,1 11169248 5497341 /memfd:runc_cloned:/proc/self/exe (deleted)
runc:[2:I 573668 root  mem       REG              254,0  1834704   33372 /usr/lib64/libc-2.32.so
runc:[2:I 573668 root  mem       REG              254,0   132968   32953 /usr/lib64/libseccomp.so.2.5.0
runc:[2:I 573668 root  mem       REG              254,0   109128   33373 /usr/lib64/libpthread-2.32.so
runc:[2:I 573668 root  mem       REG              254,0   177624   33357 /usr/lib64/ld-2.32.so
runc:[2:I 573668 root    0r      CHR                1,3      0t0    8338 /dev/null
runc:[2:I 573668 root    1w     FIFO               0,11      0t0 5497311 pipe
runc:[2:I 573668 root    2w     FIFO               0,11      0t0 5497312 pipe
runc:[2:I 573668 root    3u     unix 0xffff89accaf79800      0t0 5498207 type=STREAM
runc:[2:I 573668 root    4w     FIFO               0,11      0t0 5498209 pipe
runc:[2:I 573668 root    6u  a_inode               0,12        0    1029 [eventpoll]
runc:[2:I 573668 root    7r     FIFO               0,11      0t0 5497355 pipe
runc:[2:I 573668 root    8w     FIFO               0,11      0t0 5497355 pipe

The freezer state seems to be stuck in FREEZING:

# pwd
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1eaa232c_8a32_4893_b751_677bd7c942b6.slice/docker-87a4886cbdaba184ef499c9a4f11debee4ae0b871cb5d1bdd65089db14884b76.scope

# cat freezer.state
FREEZING

The cgroup was frozen by itself (self_freezing), not by its parent:

# cat freezer.parent_freezing
0

# cat freezer.self_freezing
1

I've found that I can recover the container and bring the node back to a Ready state by writing THAWED to the freezer state:

echo THAWED > freezer.state
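
To find any other cgroups stuck in the same state across the node, something like this should do it:

# list v1 freezer cgroups whose state is stuck at FREEZING
find /sys/fs/cgroup/freezer -name freezer.state -exec grep -l FREEZING {} + 2>/dev/null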

ribbybibby avatar Dec 11 '20 17:12 ribbybibby

Some more findings.

Firstly, the issue can be reproduced with a very simple exec livenessProbe:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: testcase
spec:
  selector:
    matchLabels:
      app: testcase
  template:
    metadata:
      labels:
        app: testcase
    spec:
      containers:
        - name: testcase
          image: alpine:3.12
          command:
            - sh
            - -c
            - |
              touch /tmp/healthy
              while true; do sleep 86400; done
          livenessProbe:
            exec:
              command:
                - sh
                - -c
                - cat /tmp/healthy
            initialDelaySeconds: 5
            periodSeconds: 5
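
To reproduce, deploying that manifest and leaving it soaking is enough, for example (assuming it is saved as testcase.yaml):

kubectl apply -f testcase.yaml
# then watch for nodes flipping to NotReady
kubectl get nodes -w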

I wasn't able to reproduce it with an equivalent docker healthcheck though, at least not within a day of soaking it across a number of nodes.

$ docker run -d --name testcase --health-cmd "cat /tmp/healthy" --health-interval 1s --rm alpine:3.12 sh -c "touch /tmp/healthy; while true; do sleep 86400; done"

Secondly, I tried running 2605.9.0 with binaries from the official docker 19.03.14 release as laid out here: https://kinvolk.io/docs/flatcar-container-linux/latest/container-runtimes/use-a-custom-docker-or-containerd-version/.

So far the runc hang hasn't been reproduced on those nodes.

There are quite a few differences between what ships with Flatcar and what comes bundled with the official Docker release.

Flatcar:

Client:
 Version:           19.03.14
 API version:       1.40
 Go version:        go1.15.5
 Git commit:        f8db474
 Built:             Tue Dec  1 19:10:28 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.14
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.15.5
  Git commit:       f8db474
  Built:            Tue Dec  1 19:10:28 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        6806845b4f638417933f68721e289c9aeda456b1
 runc:
  Version:          1.0.0-rc92+dev.docker-19.03
  GitCommit:        3d68c79de7184b0eba97946d4f478736f46bf207
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683b971d9c3ef73f284f176672c44b448662

Docker release:

Client: Docker Engine - Community
 Version:           19.03.14
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        5eb3275
 Built:             Tue Dec  1 19:14:24 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.14
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       5eb3275
  Built:            Tue Dec  1 19:21:08 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.9
  GitCommit:        ea765aba0d05254012b0b9e595e995c09186427f
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec368

Perhaps notably, the Go version is different: go1.15.5 for the Flatcar build vs go1.13.15 for the official Docker release. There are some comments on the containerd issue about avoiding Go 1.15: https://github.com/containerd/containerd/issues/4509#issuecomment-687389617. Perhaps this is relevant?

Finally, the issue doesn't reproduce for us when we use containerd as the runtime for Kubernetes as laid out here: https://kinvolk.io/docs/flatcar-container-linux/latest/container-runtimes/switching-from-docker-to-containerd-for-kubernetes/.

As we'll eventually have to switch to containerd anyway, we're going to push forward with that as a workaround.
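
For reference, the kubelet side of that switch boils down to pointing it at the containerd CRI socket, roughly:

# kubelet flags when running against containerd instead of dockershim
--container-runtime=remote
--container-runtime-endpoint=unix:///run/containerd/containerd.sock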

ribbybibby avatar Dec 17 '20 16:12 ribbybibby

Another update.

We started getting the same runc hangs on exec probes with containerd after enabling systemd cgroups:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

We were previously enabling the systemd cgroup driver in docker.

After discovering this with containerd, I tried to reproduce the issue with systemd cgroups disabled in docker and I couldn't. So it seems like systemd cgroups are significant here.
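
For context, the docker-side equivalent of that containerd option is the cgroup driver exec opt, i.e. something like:

# systemd cgroup driver enabled:
dockerd --exec-opt native.cgroupdriver=systemd
# back to the default:
dockerd --exec-opt native.cgroupdriver=cgroupfs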

ribbybibby avatar Jan 08 '21 16:01 ribbybibby

@ribbybibby The issue you linked was also about the unconsumed shim debug log. Did you disable shim debug, or did you use the Flatcar defaults, which have it enabled (at least for the releases up to now)?

Thanks for finding out that the cgroup driver makes a difference. The default on Flatcar is cgroupfs.

pothos avatar Jan 12 '21 12:01 pothos

Did you set the Kubelet cgroupDriver setting to the same value as the containerd setting?

pothos avatar Jan 12 '21 12:01 pothos

@pothos Yep, the kubelet cgroupDriver setting matched the containerd setting. For a while we were mistakenly running with a mismatch (cgroupDriver: systemd in the kubelet, but not in containerd) and the issue wasn't reproduced in that state.
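
For anyone else double-checking that the two settings line up, something like this works (typical file locations, adjust for your setup):

grep -i cgroupDriver /var/lib/kubelet/config.yaml
grep -i SystemdCgroup /etc/containerd/config.toml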

We've been using the Flatcar default shim debug setting.

ribbybibby avatar Jan 12 '21 12:01 ribbybibby

Ok, so the question is what freezes the cgroup (and doesn't unfreeze it). Just to make sure, this happens with cgroup v1, right? (This is the default under Flatcar unless one forces it to use a hybrid or v2 cgroup setup.) We only test and ship Docker on Flatcar with the cgroupfs driver and cgroup v1, but eventually it would be good to be able to switch and make sure there are no bugs.

pothos avatar Jan 12 '21 19:01 pothos

Just to make sure, this happens with cgroup v1, right? (This is the default under Flatcar unless one forces it to use a hybrid or v2 cgroup setup.)

Yep, we haven't touched the cgroup version so it must be v1. Not sure how to check that definitively though.
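
One way to check, for the record: the filesystem type mounted at /sys/fs/cgroup tells you (cgroup2fs means the unified v2 hierarchy, tmpfs means the legacy v1 layout):

stat -fc %T /sys/fs/cgroup/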

ribbybibby avatar Jan 13 '21 09:01 ribbybibby

We've seen something similar. We have core dumps disabled for security compliance reasons, but we see this in the kernel logs:

Jan 26 21:57:51 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: Core dump to |/dev/null pipe failed
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: INFO: task dockerd:2425 blocked for more than 120 seconds.
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:       Not tainted 4.19.143-flatcar #1
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: dockerd         D    0  2425      1 0x00000080
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: Call Trace:
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  __schedule+0x29f/0x840
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  schedule+0x28/0x80
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  wb_wait_for_completion+0x5e/0x90
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  ? wait_woken+0x80/0x80
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  sync_inodes_sb+0xce/0x280
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  sync_filesystem+0x5e/0x90
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  0xffffffffc041b222
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  sync_filesystem+0x78/0x90
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  generic_shutdown_super+0x22/0x100
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  kill_anon_super+0x14/0x30
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  deactivate_locked_super+0x34/0x70
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  cleanup_mnt+0x3f/0x70
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  task_work_run+0x8a/0xb0
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  exit_to_usermode_loop+0xd3/0xe0
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  do_syscall_64+0xe1/0x100
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: RIP: 0033:0x55d897adf8a0
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: Code: Bad RIP value.
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: RSP: 002b:000000c4240c4f08 EFLAGS: 00000216 ORIG_RAX: 00000000000000a6
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000055d897adf8a0
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000c42354fdc0
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: RBP: 000000c4240c4f68 R08: 0000000000000000 R09: 0000000000000000
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: R10: 0000000000000000 R11: 0000000000000216 R12: ffffffffffffffff
Jan 26 21:58:09 vmss-agent-worker1-08prodva7-jrlna000H7R kernel: R13: 0000000000000045 R14: 0000000000000044 R15: 0000000000000049
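
A quick way to spot affected nodes is to scan the kernel log for these hung-task reports, e.g.:

journalctl -k | grep -i 'blocked for more than'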

dharmab avatar Feb 01 '21 19:02 dharmab

Does it work with cgroupfs as cgroup driver in containerd/kubelet?

P.S.: Interesting that you are still on a 4.19 kernel ;)

pothos avatar Feb 01 '21 19:02 pothos

We had a mix of 4.19 and 5.4 running while we were debugging #328 :)

We had --exec-opt native.cgroupdriver=cgroupfs added to the dockerd flags, and use the default cgroupfs in the kubelet config. I don't think we are configuring containerd explicitly as documented here, though: https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd. Although we are using k8s 1.18.14, which still uses dockerd by default, IIRC?

dharmab avatar Feb 01 '21 19:02 dharmab

Thanks for the response. I wonder if this is really the same issue, because the above report is about self_freezing being turned on, while you saw a hung task in dmesg.

pothos avatar Feb 01 '21 20:02 pothos

I had a suspicion that the downgrade from Go 1.15 to 1.13 for the docker binaries in 2605.12.0 might squash this bug, but I can still reproduce it.

ribbybibby avatar Feb 09 '21 16:02 ribbybibby

Edit: I'm using K8s 1.18.18 along with Kops 1.18.3

I'm not 100% sure if I'm hitting the same issue but it feels similar.

I upgraded a Kubernetes cluster from 2605.12.0 to 2765.2.2. I see that Docker gets updated from 19.03.14 to 19.03.15.

The rollout went fine, but then I found out that (only) master nodes start flapping from Ready to NotReady. It happens very fast.

Checking the containerd logs, I see a lot of these lines repeated:

Apr 28 06:45:50 ip-172-15-62-176.ec2.internal env[1924]: time="2021-04-28T06:45:50.092158422Z" level=info msg="shim reaped" id=0e569ec15c912ba007ef5a1d0a2002ae4a4bcbb7003200353a644b9f71a781c3
Apr 28 06:45:50 ip-172-15-62-176.ec2.internal env[1924]: time="2021-04-28T06:45:50.291288272Z" level=warning msg="runtime v1 is deprecated since containerd v1.4, consider using runtime v2"
Apr 28 06:45:50 ip-172-15-62-176.ec2.internal env[1924]: time="2021-04-28T06:45:50.292082108Z" level=info msg="shim containerd-shim started" address="unix:///run/containerd/s/e94648437d3d307b9139cc384e736f77fec0f4cf63e52f8b2f6ac829931291a6" debug=false pid=3505968

The shim reaped messages do not look good 😇. Maybe it is time to get rid of the shim and Docker altogether? Or should I revert to the previous Flatcar release? Thanks

jlsan92 avatar Apr 28 '21 06:04 jlsan92

I'm seeing the shim reaped / shim started messages repeating in my containerd logs too. Our prod API mysteriously stops working until docker is restarted, and then it works for around 5-24 hours. We're not even using Kubernetes in front of it.

dukejones avatar May 21 '21 23:05 dukejones