Work on API memory exhaustion / stability issues
I've been working on this in the background for a while; I just wanted an issue on the board to track it.
- [x] #1445
- [x] #1438
- [x] #1410
What we've been seeing, and trying to mitigate:
- Pods had a JS heap limit larger than the container memory limit, so pods were getting OOMKilled before they did any garbage collection
- Setting the heap size below the container limit (3GB) resulted in heap allocation failures, since that heap wasn't large enough, and caused CPU usage to spike, since we were constantly garbage collecting
- Setting a 7GB heap resulted in fewer heap allocation failures and improved CPU utilization, but we were still getting the odd container kill on very large short-term heap allocations
- Setting a 10GB heap seems to fix most of the heap allocation failures, except for some outliers (see the sketch after this list for how the heap limit relates to the container limit).
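For context on the heap-vs-container relationship, here's a minimal sketch (not our actual startup code) of a check that the V8 heap limit fits inside the container limit. `CONTAINER_MEMORY_LIMIT_BYTES` is a hypothetical env var injected from the pod spec; the heap itself would be sized with `--max-old-space-size` via `NODE_OPTIONS`.

```ts
import { getHeapStatistics } from 'node:v8';

// Hypothetical env var mirroring the pod spec's resources.limits.memory, in bytes.
const containerLimitBytes = Number(process.env.CONTAINER_MEMORY_LIMIT_BYTES ?? 0);

// heap_size_limit reflects --max-old-space-size (e.g. 10240 MB for the 10GB heap
// described above) plus V8's other heap spaces.
const { heap_size_limit: heapLimitBytes } = getHeapStatistics();

if (containerLimitBytes > 0 && heapLimitBytes >= containerLimitBytes) {
  // If the JS heap can grow past the container limit, the kernel OOMKills the pod
  // before V8 ever feels memory pressure -- the first failure mode listed above.
  console.warn(
    `V8 heap limit (${heapLimitBytes} bytes) is not below the container memory ` +
      `limit (${containerLimitBytes} bytes); expect OOMKills instead of GC.`
  );
}
```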
Still tracking down some of our other pod restarts -- not all of these seem to be memory-related, and we still get pod restarts roughly twice per day with the latest resource limit increase.
Instability patterns that we're seeing regularly:
- Pods fail their health check, which removes them from the load balancing pool. All requests then go to the other API server, which overwhelms it and causes it to also become unhealthy. We swap back and forth throughout the day
- API pods crash with a RangeError stack trace. The common cause appears to be attempting to serialize very large JSON objects into strings (https://the-tgg.slack.com/archives/C03P7FA3W3T/p1710772444516509); see the sketch below
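A minimal sketch of that failure mode and one possible guard (a hypothetical helper, not a fix we've landed): `JSON.stringify` throws `RangeError: Invalid string length` when the result would exceed V8's maximum string length, and if that throw is uncaught it takes the API process down.

```ts
// Sketch only; `payload` stands in for a very large GraphQL result.
function safeStringify(payload: unknown): string | null {
  try {
    return JSON.stringify(payload);
  } catch (error) {
    // V8 throws a RangeError ("Invalid string length") when the serialized
    // string would exceed its maximum string length; uncaught, this is the
    // crash pattern described above.
    if (error instanceof RangeError) {
      console.error('Response too large to serialize as a single string', error);
      return null; // caller can respond with an error instead of crashing
    }
    throw error;
  }
}
```

Streaming the response or paginating the underlying query would avoid building the giant string in the first place; the guard above only turns a process crash into an error response.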