
Empty cache files cause "KillPodSandbox" errors when deleting pods

Open mrparkers opened this issue 2 years ago • 10 comments

Description

I originally reported this issue at https://github.com/aws/amazon-vpc-cni-k8s/issues/2283, but I was asked to post the issue here instead.

I'm running into a problem where a node will occasionally have a bunch of pods stuck in the Terminating state because the pod's sandbox can't be deleted:

Pod event:

Warning  FailedKillPod   43s (x229 over 49m)  kubelet  error killing pod: failed to "KillPodSandbox" for "a7ea2280-7361-4329-8566-89975485b20e" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": failed to get network \"aws-cni\" cached result: decoding version from network config: unexpected end of JSON input"

On the containerd side, I see similar logs. First, the reason given is cni plugin not initialized:

time="2023-03-01T20:08:10.167194705Z" level=info msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\""
time="2023-03-01T20:08:10.167346791Z" level=error msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\" failed" error="failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": cni plugin not initialized"

This repeats a few times, then the error message changes to unexpected end of JSON input:

time="2023-03-01T20:08:34.067124024Z" level=error msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\" failed" error="failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": failed to get network \"aws-cni\" cached result: decoding version from network config: unexpected end of JSON input"

On the host itself, I noticed that the /var/lib/cni/results directory contains a cache file for the container with the ID 4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399 that's completely empty (zero bytes):

bash-5.1# ls -alh /var/lib/cni/results | grep 4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399
-rw-------. 1 root root    0 Mar 1 06:31 aws-cni-4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399-eth0

Meanwhile, the other aws-cni-*-eth0 files are not empty and contain valid JSON. So I believe the unexpected end of JSON input error is caused by containerd attempting to json.Unmarshal an empty file, which appears to happen here:

https://github.com/containerd/containerd/blob/a217b5ac8f66cfdbffd730f16b63f13e313013b8/vendor/github.com/containernetworking/cni/pkg/types/create/create.go#L31-L34
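
For illustration, a minimal standalone Go program (not containerd's actual code path) shows that json.Unmarshal on zero bytes produces exactly this error:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Simulate reading a zero-byte cache file from /var/lib/cni/results.
	empty := []byte{}

	var conf struct {
		CNIVersion string `json:"cniVersion"`
	}
	if err := json.Unmarshal(empty, &conf); err != nil {
		fmt.Println(err) // prints: unexpected end of JSON input
	}
}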

I am able to temporarily resolve the issue by deleting the empty cache file; the stuck pods can be deleted immediately afterwards.

Another note: it looks like most of the caching logic is actually handled in https://github.com/containernetworking/cni, so perhaps this particular issue needs to be raised upstream. I'm not too sure.


My theory is that some other problem causes the AWS VPC CNI plugin to crash or misbehave, and containerd mistakenly writes an empty cache file to /var/lib/cni/results while this is happening. Then, when the AWS VPC CNI plugin starts working again, this empty cache file is read, causing the error I'm seeing.
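
If that theory holds, one plausible mechanism (an assumption on my part, not something confirmed in containerd or libcni) is a non-atomic cache write: the file is created or truncated before its JSON body actually reaches disk, and an unclean shutdown in that window leaves a zero-byte file behind. The sketch below is purely illustrative; writeCacheNonAtomic and writeCacheAtomic are hypothetical helpers, not real libcni functions:

package cachesketch

import (
	"os"
	"path/filepath"
)

// writeCacheNonAtomic creates the file first and writes the body afterwards.
// A crash or unclean reboot between the two steps (or before the data is
// flushed to disk) can leave a zero-byte file behind.
func writeCacheNonAtomic(dir, name string, data []byte) error {
	f, err := os.Create(filepath.Join(dir, name)) // file exists now, zero bytes
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(data) // a reboot before this (or before fsync) loses the body
	return err
}

// writeCacheAtomic writes to a temp file, syncs it, then renames it into
// place, so readers only ever see either no file or a complete file.
func writeCacheAtomic(dir, name string, data []byte) error {
	tmp, err := os.CreateTemp(dir, name+".tmp-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup if we fail before the rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // make sure the bytes are on disk
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename within the same directory/filesystem is atomic.
	return os.Rename(tmp.Name(), filepath.Join(dir, name))
}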

Steps to reproduce the issue

I wish I were able to reliably reproduce the underlying issue. The best I've been able to do is to delete the contents of one of the cache files within /var/lib/cni/results, which reproduces the unexpected end of JSON input error message. However, I have been unable to reproduce whatever causes the empty cache file to exist in the first place.

Describe the results you received and expected

I would expect an empty cache file never to be written to /var/lib/cni/results. Or, if this does happen, perhaps containerd could detect the empty file and remove it before attempting to read from it.
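
As a rough sketch of that second idea, a guard along these lines could treat a zero-byte cache file as a cache miss and clean it up (readCachedResult and errCacheMiss are made-up names for illustration, not existing containerd or libcni APIs):

package cacheguard

import (
	"errors"
	"os"
)

// errCacheMiss is a hypothetical sentinel the caller could treat as
// "no cached result, nothing to tear down".
var errCacheMiss = errors.New("cni cache entry missing or empty")

// readCachedResult returns the cached result bytes for a sandbox, treating a
// missing or zero-byte file as a cache miss instead of a fatal decode error.
func readCachedResult(path string) ([]byte, error) {
	info, err := os.Stat(path)
	if os.IsNotExist(err) {
		return nil, errCacheMiss
	}
	if err != nil {
		return nil, err
	}
	if info.Size() == 0 {
		// Corrupt/empty entry: remove it so sandbox teardown can proceed
		// instead of failing repeatedly with "unexpected end of JSON input".
		_ = os.Remove(path)
		return nil, errCacheMiss
	}
	return os.ReadFile(path)
}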

What version of containerd are you using?

ctr github.com/containerd/containerd 1.6.15+bottlerocket

Any other relevant information

bash-5.1# runc --version
runc version 1.1.4+bottlerocket
commit: 5fd4c4d144137e991c4acebb2146ab1483a97925
spec: 1.0.2-dev
go: go1.19.4
libseccomp: 2.5.4

bash-5.1# uname -a
Linux ip-10-0-143-2.us-west-2.compute.internal 5.15.79 #1 SMP Wed Jan 25 03:35:23 UTC 2023 aarch64 GNU/Linux

bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.12.0 (aws-k8s-1.24)"
PRETTY_NAME="Bottlerocket OS 1.12.0 (aws-k8s-1.24)"
VARIANT_ID=aws-k8s-1.24
VERSION_ID=1.12.0
BUILD_ID=6ef1139f
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"

CNI Plugin is https://github.com/aws/amazon-vpc-cni-k8s v1.12.0

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.10-eks-48e63af", GitCommit:"9176fb99b52f8d5ff73d67fea27f3a638f679f8a", GitTreeState:"clean", BuildDate:"2023-01-24T19:17:48Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Show configuration if it is related to CRI plugin.

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
disabled_plugins = [
    "io.containerd.internal.v1.opt",
    "io.containerd.snapshotter.v1.aufs",
    "io.containerd.snapshotter.v1.devmapper",
    "io.containerd.snapshotter.v1.native",
    "io.containerd.snapshotter.v1.zfs",
]

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri"]
enable_selinux = true
# Pause container image is specified here, shares the same image as kubelet's pod-infra-container-image
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.1-eksbuild.1"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "shimpei"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.shimpei]
runtime_type = "io.containerd.runc.v2"
base_runtime_spec = "/etc/containerd/cri-base.json"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.shimpei.options]
SystemdCgroup = true
BinaryName = "shimpei"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://docker-registry.redacted.com"]

mrparkers avatar Mar 02 '23 16:03 mrparkers

hi @mrparkers, I think I have exactly the same thing as you. I'm on all the same versions and am even using a registry mirror.

I noticed that the empty aws-cni file's creation time seems to coincide with a host restart (16:36:15 below). Do you see the same thing?

bash-5.1# ls -la --time-style=full-iso /var/lib/cni/results/
total 16
drwx------. 2 root root 4096 2023-03-07 05:07:37.644843952 +0000 .
drwx------. 3 root root 4096 2023-03-05 16:12:59.115640029 +0000 ..
-rw-------. 1 root root 2289 2023-03-05 16:37:32.858972330 +0000 aws-cni-aae559259afb19d5086194c3e0fd58fd71a67619a285bf10c33a2d1ceccfb809-eth0
-rw-------. 1 root root    0 2023-03-05 16:36:15.930198021 +0000 aws-cni-b10821e595aef417667637f481b45b5d0183fbc2f6341f53a0e5059d506eaaf5-eth0
bash-5.1# journalctl --list-boots
 -1 3a275f535dc245a6ab9f1a0758dc90d8 Sun 2023-03-05 16:12:30 UTC—Sun 2023-03-05 16:35:46 UTC
  0 d940a617b7c74bcd9676d2ffd02679c0 Sun 2023-03-05 16:37:15 UTC—Tue 2023-03-07 06:11:56 UTC

The previous boot's logs run out at 16:35:46, and the new boot's logs start at 16:37:15; somehow the CNI results file is created in between. It feels like a new issue in 1.24 to me; I don't think this happened in 1.23.

gbucknel avatar Mar 07 '23 06:03 gbucknel

Hi @gbucknel, thanks for sharing your experience here. The information you provided is really useful, I didn't think to check the boot logs when I was debugging this last week. However, this was one of the events on the node that I was troubleshooting:

Warning  Rebooted                 23m                   kubelet     Node ip-10-0-159-84.us-west-2.compute.internal has been rebooted, boot id: c492843a-e20f-4f07-a811-cba4401eed96

So it's possible that this was caused by a reboot of the node too, although I don't have access to the boot logs of this node to confirm (the node is long gone).

Unfortunately, I have been unable to reproduce this manually, but the next time this happens I'll check those boot logs and respond here.

Are you also running Bottlerocket by chance?

mrparkers avatar Mar 07 '23 18:03 mrparkers

hi @mrparkers, yeah, we're on Bottlerocket. It almost looks like we're on the same cluster! ;)

Just for completeness:

bash-5.1# runc --version
runc version 1.1.4+bottlerocket
commit: 5fd4c4d144137e991c4acebb2146ab1483a97925
spec: 1.0.2-dev
go: go1.19.4
libseccomp: 2.5.4

bash-5.1# uname -a
Linux 10.208.43.221 5.15.79 #1 SMP Wed Jan 25 03:35:24 UTC 2023 x86_64 GNU/Linux
bash-5.1#

bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.12.0 (aws-k8s-1.24)"
PRETTY_NAME="Bottlerocket OS 1.12.0 (aws-k8s-1.24)"
VARIANT_ID=aws-k8s-1.24
VERSION_ID=1.12.0
BUILD_ID=6ef1139f
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"

The thing I changed today was the CNI image: I was on 1.12.0 like you, but I updated one of my clusters to 1.12.5 today. It seems like some sort of race condition to me, given it appears to happen during boot.

These two PRs in 1.12.1 look like they are related to initialisation (if you squint hard enough!), so I thought it was worth a try:
https://github.com/aws/amazon-vpc-cni-k8s/pull/2138 https://github.com/aws/amazon-vpc-cni-k8s/pull/2145

I'll write back once I've tested it some more.

gbucknel avatar Mar 08 '23 04:03 gbucknel

@gbucknel the CNI version should not matter here, as writes to /var/lib/cni/results are all on the containerd side of the contract. I think we'll need someone from containerd to help investigate the behavior around reload.

jdn5126 avatar Mar 08 '23 16:03 jdn5126

hi @jdn5126, cool. Yes, I still see it with CNI 1.12.5. Is there any more info I can gather around this to help?

gbucknel avatar Mar 17 '23 01:03 gbucknel

I am not sure. At least from a CNI perspective, I cannot think of anything. I think we just need a containerd maintainer to help here.

jdn5126 avatar Mar 17 '23 15:03 jdn5126

xref: https://github.com/containernetworking/cni/issues/1055

dims avatar Feb 14 '24 14:02 dims

xref: https://github.com/k3s-io/k3s/issues/6185

dims avatar Feb 14 '24 22:02 dims

Is this issue solved by https://github.com/containernetworking/cni/pull/1072? If so, can we rev up libcni for testing on the next release? @dims for feedback.

ryan-beisner avatar Jul 29 '24 22:07 ryan-beisner

@ryan-beisner that libcni fix looks legit to me; it was pulled into main by https://github.com/containerd/containerd/pull/10106

older containerd release branches are still on 1.1.x of libcni, which doesn't have this fix yet. AFAICT, libcni doesn't do release branches, so I don't see a way to get the fix cherry-picked there. On this end, not sure if #10106 can be picked to release/1.7.

cartermckinnon avatar Aug 08 '24 16:08 cartermckinnon

Hi, @mrparkers. I'm Dosu, and I'm helping the containerd team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • Pods are getting stuck in the Terminating state due to "KillPodSandbox" errors.
  • The issue may be linked to network configuration issues and node reboots, suggesting a potential race condition.
  • A potential fix related to libcni has been integrated into the main branch but is not yet available in older release branches.
  • Further testing with the updated libcni is suggested to verify the fix.

Next Steps

  • Please let us know if this issue is still relevant to the latest version of containerd by commenting on this issue.
  • If there is no further activity, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar Dec 25 '24 16:12 dosubot[bot]

older containerd release branches are still on 1.1.x of libcni, which doesn't have this fix yet. AFAICT, libcni doesn't do release branches, so I don't see a way to get the fix cherry-picked there. On this end, not sure if https://github.com/containerd/containerd/pull/10106 can be picked to release/1.7.

Here's the diff from 1.1.2 (used in containerd release/1.7 right now) to 1.2.0: https://github.com/containernetworking/cni/compare/v1.1.2...v1.2.0. At 159 commits, it's not small. The release notes cover some relatively big changes including changed behavior for the STATUS verb: https://github.com/containernetworking/cni/releases/tag/v1.2.0

1.7 is an LTS release, and our LTS policy says this:

They may also accept a wider range of patches than non-LTS releases to support the longer term maintainability of the branch, including library dependency, toolchain (including Go) and other version updates which are needed to ensure each release is built with fully supported dependencies.

I think given the policy I'd be open to taking the CNI dependency change to 1.2.0 in release/1.7. Curious what other maintainers think.

cc @dmcgowan @mikebrow @fuweid

samuelkarp avatar May 28 '25 00:05 samuelkarp

Hi folks! Is there any traction on this release? This issue is still recurring on my end with EKS v1.31.7-eks-473151a nodes running lots of workloads.

More context: this issue happens on containerd github.com/containerd/containerd 1.7.27 05044ec0a9a75232cad458027ca83437aae3f4da but not on containerd github.com/containerd/containerd 1.7.25 bcc810d6b9066471b0b6fa75f557a15a1cbf31bb

MeNsaaH avatar Jul 09 '25 13:07 MeNsaaH

I think given the policy I'd be open to taking the CNI dependency change to 1.2.0 in release/1.7. Curious what other maintainers think.

@samuelkarp +1 from me

dims avatar Jul 22 '25 19:07 dims

Hi, when will cni 1.2.x be added to containerd 1.7?

jade710 avatar Sep 08 '25 16:09 jade710