
Nodes are experiencing PLEG issues with recent AMI

Open · codyharris-h2o-ai opened this issue 3 years ago · 6 comments

What happened: Multiple nodes using ami-0506f8cf28abec02d/amazon-eks-node-1.17-v20210628 have been experiencing PLEG issues ("PLEG is not healthy: pleg was last seen active 3m54.717586412s ago; threshold is 3m0s")
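For anyone triaging the same symptom: the PLEG message surfaces in the node's Ready condition, so affected nodes can be found without shell access to the node. A minimal sketch, assuming kubectl access to the cluster (the node name is a placeholder):

  # List nodes and look for NotReady entries
  kubectl get nodes

  # The PLEG message appears in the Ready condition of the affected node
  kubectl describe node <node-name> | grep -i pleg

  # Or pull the Ready condition message directly
  kubectl get node <node-name> \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'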

What you expected to happen: Node remains healthy

How to reproduce it (as minimally and precisely as possible): Not entirely clear; possibly a very large pod (i.e., 63 CPUs, 490 GiB of memory)

Anything else we need to know?: Region and instance type are listed below

Environment:

  • AWS Region: us-east-1 (availability zone us-east-1a)
  • Instance Type(s): r5.16xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.8
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.17
  • AMI Version: ami-0506f8cf28abec02d / amazon-eks-node-1.17-v20210628
  • Kernel (e.g. uname -a): N/A
  • Release information (run cat /etc/eks/release on a node): N/A (no shell access)

I have since upgraded the nodegroup to 1.17.12-20210722. The other four node groups do not see this issue: this is the only nodegroup affected, the only one using this AMI (the others use some newer, some older), and the only r5.16xlarge nodegroup.
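For reference, on managed node groups an AMI bump like this can be applied via the AWS CLI; the names below are placeholders, and self-managed groups need a launch template or ASG update instead:

  aws eks update-nodegroup-version \
    --cluster-name my-cluster \
    --nodegroup-name my-nodegroup \
    --release-version 1.17.12-20210722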

Comment https://github.com/awslabs/amazon-eks-ami/issues/195#issuecomment-833901688 suggests creating a new issue if this resurfaces, and it did for us.

codyharris-h2o-ai · Aug 09 '21 22:08

Hi @codyharris-h2o-ai what is your container runtime? Docker or containerd?

rajakshay · Aug 31 '21 23:08

@rajakshay: We're using the Docker runtime (19.3.13)
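PLEG health depends on the kubelet's periodic relist completing against the container runtime, so a slow or wedged dockerd is a common culprit. A rough check for anyone who does have node access (a sketch, not part of the original report):

  # If this takes more than a few seconds, the kubelet relist is likely timing out too
  time sudo docker ps

  # Look for daemon stalls or restarts around the time PLEG went unhealthy
  sudo journalctl -u docker --since "2 hours ago" --no-pager | tail -n 50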

codyharris-h2o-ai · Sep 13 '21 18:09

Facing a similar issue. #837

sedflix · Mar 30 '22 10:03

@sedflix: if you are using Calico, we found that increasing the memory limit pretty much resolved the issue for us

codyharris-h2o-ai · Apr 01 '22 21:04

@codyharris-h2o-ai what are the approximate resources you gave to the calico node, typha, and controller components? This is my current config:

  componentResources:
    - componentName: Node
      resourceRequirements:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
    - componentName: Typha
      resourceRequirements:
        requests:
          cpu: 200m
          memory: 512Mi
        limits:
          cpu: 200m
          memory: 512Mi
    - componentName: KubeControllers
      resourceRequirements:
        requests:
          cpu: 500m
          memory: 3Gi
        limits:
          cpu: 500m
          memory: 3Gi

We increased the KubeControllers resources and observed a decrease in PLEG issues. Now we are trying to increase the calico-node resources as well.
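One way to sanity-check whether those limits are actually the ceiling being hit (assuming metrics-server is installed; the namespace is calico-system for operator-based installs, kube-system for manifest installs):

  # Per-container CPU/memory usage for the Calico components
  kubectl top pod -n calico-system --containers

  # Check calico-node for OOMKilled restarts
  kubectl describe pod -n calico-system -l k8s-app=calico-node | grep -B1 -A3 'Last State'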

Thanks a ton for the response.

sedflix · Apr 04 '22 06:04

We specified the following values for Calico:

calico:
  node:
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "256Mi"

After that change, we have gone from multiple PLEG issues per week to maybe one every three months.

codyharris-h2o-ai · Apr 04 '22 13:04