cloud-platform
Configure worker node Image Garbage Collection threshold
Background
In recent months we have responded to a growing number of high-priority root volume capacity alarms by increasing the size of the worker node root volume.
We know that many users' applications have large container images, and as our cluster node count increases, so does the chance that some nodes accumulate a large number of cached images.
This might be addressed by recycling nodes more frequently, or alternatively we could (should) look into the viability of tuning the image garbage collection thresholds on our nodes. By default GC kicks in at 85% volume usage. Given the size of some of the larger container images, we may be getting too close to 100% before cleanup can occur.
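To make the headroom concern concrete, here is a rough sketch of how much free space remains at the moment GC triggers. The function name and the 100 GiB volume size are hypothetical and chosen only for illustration; the 85% figure is the default mentioned above.

```python
def headroom_gib(volume_gib: float, high_threshold_percent: int) -> float:
    """Free space (GiB) left on the volume when usage reaches the GC trigger threshold."""
    if not 0 <= high_threshold_percent <= 100:
        raise ValueError("threshold must be a percentage between 0 and 100")
    return volume_gib * (100 - high_threshold_percent) / 100


if __name__ == "__main__":
    # Hypothetical 100 GiB root volume, compared at the default and a lower threshold.
    for threshold in (85, 70):
        print(f"{threshold}%: {headroom_gib(100, threshold)} GiB free when GC starts")
```

If a handful of large images are pulled at around the same time, the headroom at 85% can plausibly be used up before cleanup frees any space, which is the scenario a lower threshold is meant to avoid.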
Approach
Test editing the kubelet configuration on our worker nodes to lower the image garbage collection thresholds. Guidance here:
https://repost.aws/knowledge-center/eks-worker-nodes-image-cache
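As a starting point for the test, a minimal sketch of the change itself, assuming the kubelet config is available as JSON. `imageGCHighThresholdPercent` and `imageGCLowThresholdPercent` are the KubeletConfiguration fields for image GC; the helper name and the 70/50 values are illustrative assumptions, not agreed settings. How the updated config reaches the nodes (launch template, user data, etc.) is not covered here.

```python
import json


def with_image_gc_thresholds(config: dict, high: int, low: int) -> dict:
    """Return a copy of a kubelet config dict with the image GC thresholds set.

    high: disk usage percentage at which image GC starts.
    low:  disk usage percentage that image GC tries to free down to.
    """
    # The low threshold has to sit below the high one for the settings to make sense.
    if not (0 <= low < high <= 100):
        raise ValueError("expected 0 <= low < high <= 100")
    updated = dict(config)  # shallow copy so the caller's dict is not mutated
    updated["imageGCHighThresholdPercent"] = high
    updated["imageGCLowThresholdPercent"] = low
    return updated


if __name__ == "__main__":
    # Stand-in for the real kubelet config file contents (assumption).
    sample = {
        "kind": "KubeletConfiguration",
        "apiVersion": "kubelet.config.k8s.io/v1beta1",
    }
    print(json.dumps(with_image_gc_thresholds(sample, high=70, low=50), indent=2))
```

The kubelet needs a restart to pick up config file changes, so on existing nodes this would be tested on a single node (or a fresh node group) first.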
Which part of the user docs does this impact
Communicate changes
- [ ] post for #cloud-platform-update
- [ ] Weeknotes item
- [ ] Show the Thing/P&A All Hands/User CoP
- [ ] Announcements channel
Questions / Assumptions
Definition of done
- [ ] readme has been updated
- [ ] user docs have been updated
- [ ] another team member has reviewed
- [ ] smoke tests are green
- [ ] prepare demo for the team