Empty cache files cause "KillPodSandbox" errors when deleting pods
Description
I originally reported this issue at https://github.com/aws/amazon-vpc-cni-k8s/issues/2283, but I was asked to post the issue here instead.
I'm running into a problem where a node will occasionally have a bunch of pods stuck in the Terminating state because the pod's sandbox can't be deleted:
Pod event:
Warning FailedKillPod 43s (x229 over 49m) kubelet error killing pod: failed to "KillPodSandbox" for "a7ea2280-7361-4329-8566-89975485b20e" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": failed to get network \"aws-cni\" cached result: decoding version from network config: unexpected end of JSON input"
On the containerd side, I see similar logs. First, the reason given is cni plugin not initialized:
time="2023-03-01T20:08:10.167194705Z" level=info msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\""
time="2023-03-01T20:08:10.167346791Z" level=error msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\" failed" error="failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": cni plugin not initialized"
This repeats a few times, then the error message changes to unexpected end of JSON input:
time="2023-03-01T20:08:34.067124024Z" level=error msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\" failed" error="failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": failed to get network \"aws-cni\" cached result: decoding version from network config: unexpected end of JSON input"
On the host itself, I noticed that the /var/lib/cni/results directory contains a cache file for the container with the ID 4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399 that's completely empty (zero bytes):
bash-5.1# ls -alh /var/lib/cni/results | grep 4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399
-rw-------. 1 root root 0 Mar 1 06:31 aws-cni-4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399-eth0
Meanwhile, the other aws-cni-*-eth0 files are not empty, and contain valid JSON.
So I believe the unexpected end of JSON input error is caused by containerd attempting to json.Unmarshal an empty file, which appears to happen here:
https://github.com/containerd/containerd/blob/a217b5ac8f66cfdbffd730f16b63f13e313013b8/vendor/github.com/containernetworking/cni/pkg/types/create/create.go#L31-L34
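For illustration, here is a minimal standalone Go snippet (not the actual libcni code) showing that json.Unmarshal on an empty byte slice fails with exactly this message:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Stand-in for the version field libcni decodes from a cached result.
	var conf struct {
		CNIVersion string `json:"cniVersion"`
	}
	empty := []byte{} // contents of a zero-byte file under /var/lib/cni/results
	if err := json.Unmarshal(empty, &conf); err != nil {
		// Prints: decoding version from network config: unexpected end of JSON input
		fmt.Printf("decoding version from network config: %v\n", err)
	}
}
```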
I can temporarily resolve the issue by deleting the empty cache file; the stuck pods can be deleted immediately once I do.
Another note: it looks like most of the caching logic is actually handled in https://github.com/containernetworking/cni, so perhaps this particular issue needs to be raised upstream. I'm not too sure.
My theory is that some other problem causes the AWS VPC CNI plugin to crash or misbehave, and containerd mistakenly writes an empty cache file to /var/lib/cni/results while this is happening. Then, when the AWS VPC CNI plugin starts working again, this empty cache file is read, causing the error I'm seeing.
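If the cache file really is created and filled in two non-atomic steps, a crash or reboot in between would leave exactly this kind of zero-byte file. Here is a rough sketch of the write-to-temp-then-rename pattern that would avoid it (the helper name is hypothetical; I haven't checked what libcni actually does):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeCacheAtomically is a hypothetical helper (not the actual libcni code):
// it writes the data to a temp file in the same directory and only renames it
// into place after a successful sync, so an interruption mid-write cannot
// leave a zero-byte file at the final path.
func writeCacheAtomically(dir, name string, data []byte) error {
	tmp, err := os.CreateTemp(dir, name+".tmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup; harmless after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush before the file becomes visible under its final name
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), filepath.Join(dir, name)) // atomic on the same filesystem
}

func main() {
	err := writeCacheAtomically(os.TempDir(), "aws-cni-example-eth0", []byte(`{"cniVersion":"0.4.0"}`))
	fmt.Println(err)
}
```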
Steps to reproduce the issue
I wish I were able to reliably reproduce the underlying issue. The best I've been able to do is to delete the contents of one of the cache files within /var/lib/cni/results, which reproduces the unexpected end of JSON input error message. However, I have been unable to reproduce whatever causes the empty cache file to exist in the first place.
Describe the results you received and expected
I would expect an empty cache file never to be written to /var/lib/cni/results. Or, if this does happen, containerd could perhaps detect the bad cache file and remove it before attempting to read from it.
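For example, a defensive read along these lines would turn a zero-byte cache file into an ordinary cache miss instead of a permanently failing teardown (a hypothetical sketch with made-up names, not actual containerd or libcni code):

```go
package main

import (
	"fmt"
	"os"
)

// readCachedResult is a hypothetical defensive reader: a zero-byte cache file
// is removed and treated as a cache miss instead of being handed to the JSON
// decoder, which would otherwise fail forever with
// "unexpected end of JSON input".
func readCachedResult(path string) (data []byte, found bool, err error) {
	data, err = os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, false, nil // no cache entry at all: an ordinary cache miss
	}
	if err != nil {
		return nil, false, err
	}
	if len(data) == 0 {
		// Truncated/empty cache file: drop it and report a miss so teardown can proceed.
		if rmErr := os.Remove(path); rmErr != nil {
			return nil, false, fmt.Errorf("removing empty cache file %s: %w", path, rmErr)
		}
		return nil, false, nil
	}
	return data, true, nil
}

func main() {
	data, found, err := readCachedResult("/var/lib/cni/results/aws-cni-example-eth0")
	fmt.Println(len(data), found, err)
}
```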
What version of containerd are you using?
ctr github.com/containerd/containerd 1.6.15+bottlerocket
Any other relevant information
bash-5.1# runc --version
runc version 1.1.4+bottlerocket
commit: 5fd4c4d144137e991c4acebb2146ab1483a97925
spec: 1.0.2-dev
go: go1.19.4
libseccomp: 2.5.4
bash-5.1# uname -a
Linux ip-10-0-143-2.us-west-2.compute.internal 5.15.79 #1 SMP Wed Jan 25 03:35:23 UTC 2023 aarch64 GNU/Linux
bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.12.0 (aws-k8s-1.24)"
PRETTY_NAME="Bottlerocket OS 1.12.0 (aws-k8s-1.24)"
VARIANT_ID=aws-k8s-1.24
VERSION_ID=1.12.0
BUILD_ID=6ef1139f
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"
CNI Plugin is https://github.com/aws/amazon-vpc-cni-k8s v1.12.0
Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.10-eks-48e63af", GitCommit:"9176fb99b52f8d5ff73d67fea27f3a638f679f8a", GitTreeState:"clean", BuildDate:"2023-01-24T19:17:48Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Show configuration if it is related to CRI plugin.
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
disabled_plugins = [
"io.containerd.internal.v1.opt",
"io.containerd.snapshotter.v1.aufs",
"io.containerd.snapshotter.v1.devmapper",
"io.containerd.snapshotter.v1.native",
"io.containerd.snapshotter.v1.zfs",
]
[grpc]
address = "/run/containerd/containerd.sock"
[plugins."io.containerd.grpc.v1.cri"]
enable_selinux = true
# Pause container image is specified here, shares the same image as kubelet's pod-infra-container-image
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.1-eksbuild.1"
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "shimpei"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.shimpei]
runtime_type = "io.containerd.runc.v2"
base_runtime_spec = "/etc/containerd/cri-base.json"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.shimpei.options]
SystemdCgroup = true
BinaryName = "shimpei"
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://docker-registry.redacted.com"]
hi @mrparkers, I think I have the exact same thing as you. I have all the same versions and everything and am even using a registry mirror.
I noticed that the empty aws-cni file seems to have a creation time coincident with a host restart (16:36:15 below). I was wondering if you see the same thing?
bash-5.1# ls -la --time-style=full-iso /var/lib/cni/results/
total 16
drwx------. 2 root root 4096 2023-03-07 05:07:37.644843952 +0000 .
drwx------. 3 root root 4096 2023-03-05 16:12:59.115640029 +0000 ..
-rw-------. 1 root root 2289 2023-03-05 16:37:32.858972330 +0000 aws-cni-aae559259afb19d5086194c3e0fd58fd71a67619a285bf10c33a2d1ceccfb809-eth0
-rw-------. 1 root root 0 2023-03-05 16:36:15.930198021 +0000 aws-cni-b10821e595aef417667637f481b45b5d0183fbc2f6341f53a0e5059d506eaaf5-eth0
bash-5.1# journalctl --list-boots
-1 3a275f535dc245a6ab9f1a0758dc90d8 Sun 2023-03-05 16:12:30 UTC—Sun 2023-03-05 16:35:46 UTC
0 d940a617b7c74bcd9676d2ffd02679c0 Sun 2023-03-05 16:37:15 UTC—Tue 2023-03-07 06:11:56 UTC
The previous boot's logs run out at 16:35:46, and the new boot's logs start at 16:37:15. Somehow the CNI results file is created in between. It feels like a new issue in 1.24 to me; I don't think this happened in 1.23.
Hi @gbucknel, thanks for sharing your experience here. The information you provided is really useful; I didn't think to check the boot logs when I was debugging this last week. However, this was one of the events on the node that I was troubleshooting:
Warning Rebooted 23m kubelet Node ip-10-0-159-84.us-west-2.compute.internal has been rebooted, boot id: c492843a-e20f-4f07-a811-cba4401eed96
So it's possible that this was caused by a reboot of the node too, although I don't have access to the boot logs of this node to confirm (the node is long gone).
Unfortunately, I have been unable to reproduce this manually, but the next time this happens I'll check those boot logs and respond here.
Are you also running Bottlerocket by chance?
hi @mrparkers, yeah we're on Bottlerocket. It almost looks like we're on the same cluster! ;)
Just for completeness:
bash-5.1# runc --version
runc version 1.1.4+bottlerocket
commit: 5fd4c4d144137e991c4acebb2146ab1483a97925
spec: 1.0.2-dev
go: go1.19.4
libseccomp: 2.5.4
bash-5.1# uname -a
Linux 10.208.43.221 5.15.79 #1 SMP Wed Jan 25 03:35:24 UTC 2023 x86_64 GNU/Linux
bash-5.1#
bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.12.0 (aws-k8s-1.24)"
PRETTY_NAME="Bottlerocket OS 1.12.0 (aws-k8s-1.24)"
VARIANT_ID=aws-k8s-1.24
VERSION_ID=1.12.0
BUILD_ID=6ef1139f
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"
The thing I changed today was the CNI image: I was on 1.12.0 like you, but updated one of my clusters to 1.12.5 today. It seems like some sort of race condition to me, given that it seems to happen during boot.
These two PRs in 1.12.1 look like they're related to initialisation (if you squint hard enough!), so I thought it was worth a try.
https://github.com/aws/amazon-vpc-cni-k8s/pull/2138
https://github.com/aws/amazon-vpc-cni-k8s/pull/2145
I'll write back once I've tested it some more.
@gbucknel the CNI version should not matter here, as writes to /var/lib/cni/results are all on the containerd side of the contract. I think we'll need someone from containerd to help investigate the behavior around reload.
hi @jdn5126, cool. Yes, I still see it with CNI 1.12.5. Is there any more info I can gather around this to help?
I am not sure. At least from a CNI perspective, I cannot think of any. I think we just need a containerd maintainer to help here.
xref: https://github.com/containernetworking/cni/issues/1055
xref: https://github.com/k3s-io/k3s/issues/6185
Is this issue solved by https://github.com/containernetworking/cni/pull/1072? If so, can we rev up libcni for testing on the next release? @dims for feedback.
@ryan-beisner that libcni fix looks legit to me; it was pulled into main by https://github.com/containerd/containerd/pull/10106
older containerd release branches are still on 1.1.x of libcni, which doesn't have this fix yet. AFAICT, libcni doesn't do release branches, so I don't see a way to get the fix cherry-picked there. On this end, not sure if #10106 can be picked to release/1.7.
Hi, @mrparkers. I'm Dosu, and I'm helping the containerd team manage their backlog. I'm marking this issue as stale.
Issue Summary
- Pods are getting stuck in the Terminating state due to "KillPodSandbox" errors.
- The issue may be linked to network configuration issues and node reboots, suggesting a potential race condition.
- A potential fix related to libcni has been integrated into the main branch but is not yet available in older release branches.
- Further testing with the updated libcni is suggested to verify the fix.
Next Steps
- Please let us know if this issue is still relevant to the latest version of containerd by commenting on this issue.
- If there is no further activity, this issue will be automatically closed in 7 days.
Thank you for your understanding and contribution!
older containerd release branches are still on 1.1.x of libcni, which doesn't have this fix yet. AFAICT, libcni doesn't do release branches, so I don't see a way to get the fix cherry-picked there. On this end, not sure if https://github.com/containerd/containerd/pull/10106 can be picked to release/1.7.
Here's the diff from 1.1.2 (used in containerd release/1.7 right now) to 1.2.0: https://github.com/containernetworking/cni/compare/v1.1.2...v1.2.0. At 159 commits, it's not small. The release notes cover some relatively big changes including changed behavior for the STATUS verb: https://github.com/containernetworking/cni/releases/tag/v1.2.0
1.7 is an LTS release, and our LTS policy says this:
They may also accept a wider range of patches than non-LTS releases to support the longer term maintainability of the branch, including library dependency, toolchain (including Go) and other version updates which are needed to ensure each release is built with fully supported dependencies.
I think given the policy I'd be open to taking the CNI dependency change to 1.2.0 in release/1.7. Curious what other maintainers think.
cc @dmcgowan @mikebrow @fuweid
Hi folks! Is there any traction on this release? This issue is still recurring on my end with EKS v1.31.7-eks-473151a nodes running lots of workloads.
More context: this issue happens on containerd github.com/containerd/containerd 1.7.27 (05044ec0a9a75232cad458027ca83437aae3f4da) but not on containerd github.com/containerd/containerd 1.7.25 (bcc810d6b9066471b0b6fa75f557a15a1cbf31bb).
I think given the policy I'd be open to taking the CNI dependency change to 1.2.0 in release/1.7. Curious what other maintainers think.
@samuelkarp +1 from me
Hi, when will cni 1.2.x be added to containerd 1.7?