amazon-eks-ami
kubelet - PLEG is not healthy - flapping between ready and notReady
What happened:
kubelet is flapping between NodeReady and NodeNotReady because of the following error:
"Skipping pod synchronization" err="PLEG is not healthy: pleg was last seen active 3m11.637412161s ago; threshold is 3m0s"
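For anyone trying to track how far past the threshold the PLEG gets over time, here is a minimal sketch that extracts the two durations from a log line of the shape shown above. It is not part of kubelet or any AWS tooling; the function names `parse_go_duration` and `parse_pleg_line` are hypothetical, and it assumes the message format matches the error quoted above and that durations only use the `h`, `m`, `s`, and `ms` units.

```python
import re

# Seconds per Go-style duration unit (assumption: only these units appear).
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}

_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_DURATION_FULL = re.compile(r"(?:\d+(?:\.\d+)?(?:ms|h|m|s))+")

# Matches the PLEG health message as quoted in this report.
_PLEG_MSG = re.compile(
    r"pleg was last seen active (?P<elapsed>[0-9.hms]+) ago; "
    r"threshold is (?P<threshold>[0-9.hms]+)"
)


def parse_go_duration(text):
    """Convert a Go-style duration such as '3m11.637412161s' to seconds."""
    if not _DURATION_FULL.fullmatch(text):
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit]
               for num, unit in _DURATION_PART.findall(text))


def parse_pleg_line(line):
    """Return (elapsed_seconds, threshold_seconds), or None if the line
    does not contain the PLEG health message."""
    match = _PLEG_MSG.search(line)
    if match is None:
        return None
    return (parse_go_duration(match.group("elapsed")),
            parse_go_duration(match.group("threshold")))


if __name__ == "__main__":
    sample = ('"Skipping pod synchronization" err="PLEG is not healthy: '
              'pleg was last seen active 3m11.637412161s ago; threshold is 3m0s"')
    elapsed, threshold = parse_pleg_line(sample)
    print(f"over threshold by {elapsed - threshold:.3f}s")
```

Feeding kubelet log lines through `parse_pleg_line` and recording the overrun per timestamp could help show whether the stalls grow over time or stay just past the threshold.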
The node shows no high load or resource usage, and docker is responsive:
07:31:31 up 8 days, 15:01, 1 user, load average: 0.76, 0.77, 0.73
$ time docker ps
real 0m0.040s
user 0m0.022s
sys 0m0.016s
Restarting dockerd stops the flapping; without a restart, the issue keeps recurring.
I have checked the kubelet and docker logs, but nothing in them suggests the cause of the issue.
Container runtime versions:
$ containerd -v
containerd github.com/containerd/containerd 1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
$ runc -v
runc version 1.1.4
commit: 5fd4c4d144137e991c4acebb2146ab1483a97925
spec: 1.0.2-dev
go: go1.18.6
libseccomp: 2.4.1
$ docker -v
Docker version 20.10.17, build 100c701
Note: this is on GovCloud, so our AMI was made FIPS ready, which may also play into the issue.
What you expected to happen:
I would expect PLEG and dockerd to recover from the error without restarting dockerd.
How to reproduce it (as minimally and precisely as possible):
I have not found a way to reproduce it yet; it seems to occur randomly.
Anything else we need to know?:
Environment:
- AWS Region: us-gov-west-1
- Instance Type(s): r5.xlarge
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.10"
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.22"
- AMI Version: v20230217
- Kernel (e.g. uname -a): 5.4.228-132.418.amzn2.x86_64
- Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0b23a4a7e969b46f0"
BUILD_TIME="Fri Feb 17 21:59:24 UTC 2023"
BUILD_KERNEL="5.4.228-132.418.amzn2.x86_64"
ARCH="x86_64"
Any help is appreciated.