Work on API memory exhaustion / stability issues
I've been working on this in the background for a while; I just wanted an issue on the board to track it.
- [x] #1445
- [x] #1438
- [x] #1410
What we've been seeing, and trying to mitigate:
- Pods had a JS heap limit larger than the container memory limit, so pods were getting OOMKilled before they did any garbage collection
- Setting the heap size below the container limit (3GB) resulted in heap allocation failures, since that heap wasn't large enough, and caused CPU usage to spike, since we were constantly garbage collecting
- Setting a 7GB heap resulted in fewer heap allocation failures and improved CPU utilization, but we were still getting the odd container kill on very large short-term heap allocations
- Setting a 10GB heap seems to fix most of the heap allocation failures, except for some outliers (see the sketch after this list for how the heap limit relates to the container limit).
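For context on the heap-vs-container relationship, here's a minimal sketch (not our actual startup code) of a check that the V8 heap limit fits inside the container limit. `CONTAINER_MEMORY_LIMIT_BYTES` is a hypothetical env var injected from the pod spec; the heap itself would be sized with `--max-old-space-size` via `NODE_OPTIONS`.

```ts
import { getHeapStatistics } from 'node:v8';

// Hypothetical env var mirroring the pod spec's resources.limits.memory, in bytes.
const containerLimitBytes = Number(process.env.CONTAINER_MEMORY_LIMIT_BYTES ?? 0);

// heap_size_limit reflects --max-old-space-size (e.g. 10240 MB for the 10GB heap
// described above) plus V8's other heap spaces.
const { heap_size_limit: heapLimitBytes } = getHeapStatistics();

if (containerLimitBytes > 0 && heapLimitBytes >= containerLimitBytes) {
  // If the JS heap can grow past the container limit, the kernel OOMKills the pod
  // before V8 ever feels memory pressure -- the first failure mode listed above.
  console.warn(
    `V8 heap limit (${heapLimitBytes} bytes) is not below the container memory ` +
      `limit (${containerLimitBytes} bytes); expect OOMKills instead of GC.`
  );
}
```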
Still tracking down some of our other pod restarts -- not all of these seem to be memory-related, and we still get pod restarts roughly twice per day with the latest resource limit increase.
Instability patterns that we're seeing regularly:
- Pods fail their health check, which removes them from the load balancing pool. All requests then go to the other API server, which overwhelms it and causes it to also become unhealthy. We swap back and forth throughout the day
- API pods crash with a RangeError stack trace. The common cause appears to be attempting to serialize very large JSON objects into strings (https://the-tgg.slack.com/archives/C03P7FA3W3T/p1710772444516509); see the sketch below
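A minimal sketch of that failure mode and one possible guard (a hypothetical helper, not a fix we've landed): `JSON.stringify` throws `RangeError: Invalid string length` when the result would exceed V8's maximum string length, and if that throw is uncaught it takes the API process down.

```ts
// Sketch only; `payload` stands in for a very large GraphQL result.
function safeStringify(payload: unknown): string | null {
  try {
    return JSON.stringify(payload);
  } catch (error) {
    // V8 throws a RangeError ("Invalid string length") when the serialized
    // string would exceed its maximum string length; uncaught, this is the
    // crash pattern described above.
    if (error instanceof RangeError) {
      console.error('Response too large to serialize as a single string', error);
      return null; // caller can respond with an error instead of crashing
    }
    throw error;
  }
}
```

Streaming the response or paginating the underlying query would avoid building the giant string in the first place; the guard above only turns a process crash into an error response.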