
[Debian11/5.10.0-27] Network Controller heartbeat missed

Open TallFurryMan opened this issue 1 year ago • 3 comments

Environmental Info: K3s Version:

  • Observed with 1.23.9, 1.23.17, 1.28.5
  • First observed with k3s version v1.23.9+k3s1 (f45cf326) go version go1.17.5

Node(s) CPU architecture, OS, and Version:

  • Linux master1.glimps-internal.lan 5.10.0-27-amd64 #1 SMP Debian 5.10.205-2 (2023-12-31) x86_64 GNU/Linux
  • Observed on 5.10.0-27-cloud-amd64 also.

Cluster Configuration: KVM-based virtualised clusters on Ryzen7-based hosts with 128 GB RAM (QEMU 5.2):

  • Single-master k3s cluster with the same OS as mentioned here.
  • 1-master 3-compute k3s cluster, all nodes with the same OS as mentioned here.
  • 3-master 3-compute k3s cluster, all nodes with the same OS as mentioned here.

Describe the bug: Once the cluster reaches 200+ pods, the last pods to start, as well as restarting pods, take more than ten minutes to connect to the services they depend on.

Reverting to kernel 5.10.0-26-amd64 or its cloud counterpart resolves the issue, reducing that delay to under two minutes.

We observe Network Controller issues in the journal:

k3s[28851]: E0130 10:36:31.939242 28851 health_controller.go:117] Network Policy Controller heartbeat missed
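
These occurrences can be counted directly from the journal, assuming k3s runs under the k3s systemd unit shown further down:

$ journalctl -u k3s --since "1 hour ago" | grep -c "heartbeat missed"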

We observe high iptables serialization delays, varying from 1s to 3.5s (!) depending on the KVM host:

$ time sudo iptables-save | wc -l
# Warning: iptables-legacy tables present, use iptables-legacy-save to see them
1751

real    0m1.462s
user    0m0.017s
sys     0m0.023s

Both nf_tables and legacy modes for iptables were tested with the same result.
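
For reference, which backend a Debian host is actually using can be confirmed with the alternatives system and the iptables version string; a quick sketch:

$ iptables --version                      # prints e.g. "iptables v1.8.7 (nf_tables)" or "(legacy)"
$ update-alternatives --display iptables  # shows which backend /usr/sbin/iptables points to
$ time sudo iptables-legacy-save | wc -l  # times the legacy tables separately, as suggested by the warning above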

Our pods have init containers that wait for CoreDNS to register the services they depend on, and then for those services to answer (e.g. for an ES index to be accessible). Pods may wait in that state for up to 15 minutes before their connections are accepted by the services they depend on, even though those services remain perfectly reachable from pods that are already running.
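
The wait pattern looks roughly like the sketch below; the service name, namespace and images are illustrative placeholders, not our actual manifests:

apiVersion: v1
kind: Pod
metadata:
  name: example-consumer
spec:
  initContainers:
    - name: wait-for-elasticsearch
      image: busybox:1.36            # placeholder image
      command:
        - sh
        - -c
        - |
          # Wait for CoreDNS to resolve the dependent service...
          until nslookup elasticsearch.default.svc.cluster.local; do sleep 5; done
          # ...then wait for the service itself to answer.
          until wget -q -O- http://elasticsearch.default.svc.cluster.local:9200/_cluster/health; do sleep 5; done
  containers:
    - name: app
      image: example-app:latest      # placeholder image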

Here is the output of top on one example master where the Network Policy Controller heartbeat miss was observed and the iptables-save sample was taken, with the cluster otherwise idle:

top - 10:45:15 up 24 days, 16:48,  1 user,  load average: 1.59, 1.31, 1.27                                                                 
Tasks: 156 total,   1 running, 155 sleeping,   0 stopped,   0 zombie                                                                       
%Cpu(s):  3.1 us,  1.3 sy,  0.0 ni, 94.9 id,  0.4 wa,  0.0 hi,  0.3 si,  0.0 st
MiB Mem :   3823.0 total,    361.9 free,   2852.2 used,    608.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    724.4 avail Mem 

Steps To Reproduce:

Installed K3s (example given on a single-master cluster):

# k3s-install.sh \
        --kubelet-arg max-pods=300 \
        --node-ip 10.48.0.1 --node-external-ip 10.200.1.41 \
        --kubelet-arg registry-qps=10 \
        --flannel-iface enp1s0 \
        --kubelet-arg config=/etc/rancher/k3s/kubelet.config

K3s is started at boot by SystemD:

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server \
        --tls-san 10.48.0.1 --tls-san 10.200.1.41 \
        --cluster-init --kubelet-arg max-pods=300 \
        --node-taint CriticalAddonsOnly=true:NoExecute \
        --node-ip 10.48.0.1 --node-external-ip 10.200.1.41 \
        --kubelet-arg registry-qps=10 \
        --flannel-backend wireguard \
        --flannel-iface enp1s0 \
        --kubelet-arg config=/etc/rancher/k3s/kubelet.config

On that instance, kubelet.config is minimal:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 60s
shutdownGracePeriodCriticalPods: 60s

Same goes for k3s.service.env:

K3S_NODE_NAME=master1

Expected behavior: Pods fully start within one or two minutes of entering the Init state, depending on the capabilities of the underlying VM.

Actual behavior: Pods remain in CrashLoopBackOff, trying to contact the services they depend on, for more than ten minutes after their respective startup.

Additional context / logs: Reverting to kernel 5.10.0-26-amd64 or its cloud counterpart resolves the issue, reducing that delay to under two minutes.

Installing a fresh cluster with 5.10.0-26 or replacing 5.10.0-27 with 5.10.0-26 on a live cluster both resolve the issue. Note that the wireguard package pulls in the linux-image-amd64 meta-package, so dependencies need to be constrained with a kernel hold and a negative package pin for the downgrade to stick (see the sketch below).
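
A minimal sketch of that pinning, assuming stock Bullseye package names (versions should match the images actually in use):

$ sudo apt install linux-image-5.10.0-26-amd64   # keep the working kernel installed
$ sudo apt-mark hold linux-image-amd64           # stop the meta-package pulled in by wireguard from upgrading the kernel
$ # negative pin so the problematic revision is never selected
$ cat <<'EOF' | sudo tee /etc/apt/preferences.d/no-5.10.0-27
Package: linux-image-5.10.0-27-amd64 linux-image-5.10.0-27-cloud-amd64
Pin: version *
Pin-Priority: -1
EOF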

TallFurryMan avatar Jan 30 '24 10:01 TallFurryMan

> Reverting to kernel 5.10.0-26-amd64 or its cloud counterpart resolves the issue, reducing that delay to under two minutes.
>
> Installing a fresh cluster with 5.10.0-26 or replacing 5.10.0-27 with 5.10.0-26 on a live cluster both resolve the issue.

Am I understanding correctly that you've isolated this to an issue with specific kernel versions? If so I'm not sure what we can do on the K3s side.

> Observed with 1.23.9, 1.23.17, 1.28.5

Hmm, so two ancient versions and one recent one. That doesn't tell us much.

More recent releases of K3s include some improvements to the network policy controller that may make a difference for you. Can you try with the most recent release of K3s from any currently maintained minor version (1.26 - 1.29) and the --prefer-bundled-bin option to ensure that you're not seeing some weird issues with the host's iptables binary?

brandond avatar Jan 30 '24 20:01 brandond

> Am I understanding correctly that you've isolated this to an issue with specific kernel versions? If so I'm not sure what we can do on the K3s side.

Kernel 5.10.0-27 is the kernel embedded in the latest Debian Bullseye cloud images since Dec. 31. Kernel 5.10.0-26 is the one embedded just before that. I'm not sure what you can do with this information either, but I wish I could find a relevant item in the kernel changelog. There are changes touching nftables management, but I have no idea whether they are relevant.

> More recent releases of K3s include some improvements to the network policy controller that may make a difference for you. Can you try with the most recent release of K3s from any currently maintained minor version (1.26 - 1.29) and the --prefer-bundled-bin option to ensure that you're not seeing some weird issues with the host's iptables binary?

Yes, the upgrade to 1.28.5 and the backport release 1.23.17 were both attempted for that purpose. I agree 1.26+ has changes improving the routing part, but I should have seen an improvement with 1.28.5. I'll definitely try the --prefer-bundled-bin option, which should be available in 1.23.17.
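
On the setup above, that would just mean adding the flag to the ExecStart line of the unit file shown earlier (or re-running the install script with it) and restarting the service; the other flags stay unchanged:

ExecStart=/usr/local/bin/k3s server \
        --prefer-bundled-bin \
        --tls-san 10.48.0.1 --tls-san 10.200.1.41 \
        --cluster-init --kubelet-arg max-pods=300 \
        --node-taint CriticalAddonsOnly=true:NoExecute \
        --node-ip 10.48.0.1 --node-external-ip 10.200.1.41 \
        --kubelet-arg registry-qps=10 \
        --flannel-backend wireguard \
        --flannel-iface enp1s0 \
        --kubelet-arg config=/etc/rancher/k3s/kubelet.config

$ sudo systemctl daemon-reload
$ sudo systemctl restart k3s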

We also observed that running apt dist-upgrade on a single-node system, moving it from 5.10.0-22 to 5.10.0-27 (so no k3s change, only OS packages), immediately caused the Network Policy Controller to start missing heartbeats.

TallFurryMan avatar Feb 02 '24 17:02 TallFurryMan

It would appear that kernel 5.10.0-28 fixes the behaviour we observed with 5.10.0-27. I'll check its changelog and run more tests before confirming.
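
For the record, the Debian changelog for that revision can be pulled with apt once the package is in the configured repositories (package name assumed to follow the same pattern as above):

$ apt changelog linux-image-5.10.0-28-amd64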

However, sorry, I'm lagging on --prefer-bundled-bin and have no results yet from a deployment with that option.

TallFurryMan avatar Feb 23 '24 15:02 TallFurryMan

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

github-actions[bot] avatar Apr 08 '24 20:04 github-actions[bot]