Configure worker node Image Garbage Collection threshold

Open sj-williams opened this issue 8 months ago • 0 comments

Background

In recent months we have responded to more frequent high-priority root volume capacity alarms by increasing the worker node root volume size.

We know that many users' applications have large container images, and as our cluster node count increases, so does the chance that some nodes hold a large number of cached images.

This might be addressed by increasing our node recycle frequency. Alternatively, we could (and should) look into the viability of tuning our node image garbage collection thresholds. By default, GC kicks in at 85% volume usage. Given the size of some of the larger container images, we may be getting too close to 100% before cleanup can occur.
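To make the headroom argument concrete, here is a rough sketch of how much free space remains when usage reaches a given GC threshold. The 100 GiB volume size and the 70% alternative threshold are hypothetical figures for illustration, not our actual settings.

```python
def free_space_at_threshold(volume_gib: float, threshold_percent: int) -> float:
    """Free space (GiB) left on the volume when usage reaches the threshold."""
    return volume_gib * (100 - threshold_percent) / 100


# Hypothetical root volume size; substitute the real worker node value.
volume_gib = 100

for threshold in (85, 70):
    headroom = free_space_at_threshold(volume_gib, threshold)
    print(f"GC at {threshold}%: {headroom} GiB free when cleanup is triggered")
```

Since GC runs periodically rather than at the instant the threshold is crossed, a handful of multi-GiB image pulls landing between runs could plausibly consume the smaller headroom before cleanup starts.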

Approach

Test editing our worker node kubelet configuration to lower the image GC thresholds. Guidance here:

https://repost.aws/knowledge-center/eks-worker-nodes-image-cache
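As a starting point for testing, the change amounts to setting two fields in the node's `KubeletConfiguration`: `imageGCHighThresholdPercent` and `imageGCLowThresholdPercent`. The sketch below patches a config dictionary and prints the result as JSON. The helper name `set_image_gc_thresholds` and the 70/50 values are assumptions for illustration; how the resulting config reaches our nodes (launch template, bootstrap arguments, etc.) is not covered here.

```python
import json


def set_image_gc_thresholds(config: dict, high: int, low: int) -> dict:
    """Return a copy of a KubeletConfiguration dict with image GC thresholds set.

    high: disk usage percent at which image GC is triggered.
    low:  disk usage percent that GC tries to free down to.
    """
    if not 0 <= low < high <= 100:
        raise ValueError("thresholds must satisfy 0 <= low < high <= 100")
    updated = dict(config)  # shallow copy; leave the caller's dict untouched
    updated["imageGCHighThresholdPercent"] = high
    updated["imageGCLowThresholdPercent"] = low
    return updated


# Minimal stand-in for an existing kubelet config; a real one has more fields.
base = {
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "kind": "KubeletConfiguration",
}

# Illustrative values only; the right thresholds are what this ticket should determine.
print(json.dumps(set_image_gc_thresholds(base, high=70, low=50), indent=2))
```

The low threshold has to stay below the high one, which is why the helper rejects inverted values before writing anything.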

Which part of the user docs does this impact

Communicate changes

  • [ ] post for #cloud-platform-update
  • [ ] Weeknotes item
  • [ ] Show the Thing/P&A All Hands/User CoP
  • [ ] Announcements channel

Questions / Assumptions

Definition of done

  • [ ] readme has been updated
  • [ ] user docs have been updated
  • [ ] another team member has reviewed
  • [ ] smoke tests are green
  • [ ] prepare demo for the team

Reference

How to write good user stories

sj-williams, May 29 '24 15:05