amazon-eks-ami
Nodes are experiencing PLEG issues with recent AMI
What happened: Multiple nodes using ami-0506f8cf28abec02d / amazon-eks-node-1.17-v20210628 have been experiencing PLEG issues (PLEG is not healthy: pleg was last seen active 3m54.717586412s ago; threshold is 3m0s).
What you expected to happen: Node remains healthy
How to reproduce it (as minimally and precisely as possible): Not entirely clear; possibly very large pods (e.g., 63 CPUs, 490 GiB of memory).
Anything else we need to know?: Region and instance type are listed in the Environment section below.
Environment:
- AWS Region: us-east-1a
- Instance Type(s): r5.16xlarge
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.8
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.17
- AMI Version: ami-0506f8cf28abec02d / amazon-eks-node-1.17-v20210628
- Kernel (e.g. uname -a): N/A
- Release information (run cat /etc/eks/release on a node): N/A (no shell access)
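For the PLEG message above, a minimal way to confirm the condition without shell access to the node is to read the affected node's Ready condition via kubectl (the node name is a placeholder):

kubectl describe node <node-name> | grep -i pleg
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'

If node shell access were available, the same message would also appear in the kubelet journal, e.g. journalctl -u kubelet | grep -i pleg.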
I have upgraded the nodegroup to 1.17.12-20210722. The other 4 node groups do not see this issue: this is the only node group that was using this AMI (the others are on some newer, some older versions) and the only r5.16xlarge node group.
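For reference, if these are EKS managed node groups, moving a node group to a specific AMI release version like the one above can be done with the AWS CLI along these lines (cluster and node group names are placeholders):

aws eks update-nodegroup-version \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name> \
  --release-version 1.17.12-20210722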
Comment https://github.com/awslabs/amazon-eks-ami/issues/195#issuecomment-833901688 suggests creating a new issue if this resurfaces (and it did for us)
Hi @codyharris-h2o-ai, what is your container runtime: Docker or containerd?
@rajakshay: We're using the Docker runtime (19.3.13).
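For anyone else checking this, each node's runtime and version can be read from the Kubernetes API without node access:

kubectl get nodes -o wide
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'

The first command shows a CONTAINER-RUNTIME column; the second prints just the node name and runtime version.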
Facing a similar issue. #837
@sedflix, if you are using Calico, we found that we had to increase its memory limit, and that pretty much resolved the issue.
@codyharris-h2o-ai what are the approximate resources you gave to the Calico node, Typha, and kube-controllers components? This is my current config:
componentResources:
  - componentName: Node
    resourceRequirements:
      requests:
        cpu: 200m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 256Mi
  - componentName: Typha
    resourceRequirements:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 200m
        memory: 512Mi
  - componentName: KubeControllers
    resourceRequirements:
      requests:
        cpu: 500m
        memory: 3Gi
      limits:
        cpu: 500m
        memory: 3Gi
We increased the KubeControllers resources and observed a decrease in PLEG issues. Now we are trying to increase the calico-node resources as well.
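If that componentResources block is the spec.componentResources section of the Tigera operator's Installation resource (typically named default), one way to adjust it and confirm the limits actually reached the calico-node pods is roughly:

kubectl edit installation default
kubectl get installation default -o jsonpath='{.spec.componentResources}'
kubectl -n calico-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].resources}'

The resource name and the calico-system namespace are assumptions based on a standard operator install; adjust them to match your setup.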
Thanks a ton for the response.
We specified the following values for calico:
calico:
  node:
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "256Mi"
After that change, we have gone from seeing multiple PLEG issues per week down to maybe one in 3 months.
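Those look like Helm-style values (a chart exposing calico.node.resources); assuming a values file named calico-values.yaml and calico-node running in kube-system, applying and verifying the change might look like:

helm upgrade calico <calico-chart> -n kube-system -f calico-values.yaml
kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].resources}'

The chart reference, release name, and namespace are placeholders; the second command just confirms the memory request and limit landed on the calico-node daemonset.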