zoekt-webserver has a memory leak
Hi there friends 👋
We noticed over at GitLab that zoekt-webserver seems to have a pretty obvious memory leak that is correlated with an increase in searches.
We resolved the incident for now by simply restarting pods and allocating more memory. You can read more about the incident here: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16192
Browsing through the recent commits, I don't see anything related to memory specifically, but we'll go ahead and update to a newer version of zoekt to see if that helps.
In the meantime, I thought I'd report ASAP in case this wasn't known. Have y'all seen this before?
Happy to contribute 🤝
Zoekt version: 5f25b3073480520aae1cd145d9f3f57226ff7fbc
Hello @binarymason. Zoekt uses mmap, so depending on how you measure memory usage, this might not reflect the Go heap but rather the file cache, which the kernel can safely evict. cc @ggilmore
We do export the Go runtime metrics to Prometheus, which should include a metric around heap size. Additionally, I am not sure which Docker images you use; if you use the Dockerfile in this repository, I believe we have some tweaks around GOGC which we haven't revisited in a while. Given the amount of work that has gone into the Go garbage collector since then, it may be worth revisiting. cc @stefanhengl
@keegancsmith Thanks for your response!
> Zoekt uses mmap, so depending on how you measure memory usage, this might not reflect the Go heap but rather the file cache, which the kernel can safely evict.
We are currently using this for calculating memory saturation:
`container_memory_working_set_bytes / kube_pod_container_resource_limits`
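For reference, the full PromQL looks roughly like the sketch below; the `container="zoekt-webserver"` selector and the labels used for matching are illustrative and depend on the kube-state-metrics/cadvisor label names in a given cluster:

```
# Sketch: working set as a fraction of the container memory limit.
# The selector and the (namespace, pod, container) labels are assumptions
# and may need adjusting for your kube-state-metrics / cadvisor setup.
sum by (namespace, pod, container) (
  container_memory_working_set_bytes{container="zoekt-webserver"}
)
/
sum by (namespace, pod, container) (
  kube_pod_container_resource_limits{container="zoekt-webserver", resource="memory"}
)
```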
> We do export the Go runtime metrics to Prometheus, which should include a metric around heap size.
I see `proc_metrics_memory_map_current_count` and `proc_metrics_memory_map_max_limit`. Will definitely take a look at those.
> Additionally, I am not sure which Docker images you use; if you use the Dockerfile in this repository, I believe we have some tweaks around GOGC
We are using the default value of GOGC. I see that y'all set `GOGC=25` here. We will try that out as well.
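To gauge the effect of a lower GOGC, we'll probably watch the standard Go runtime metrics alongside it; something like the queries below (the `app="zoekt-webserver"` label is just an illustrative selector):

```
# GC cycles per second over the last 5 minutes; a lower GOGC should increase this.
rate(go_gc_duration_seconds_count{app="zoekt-webserver"}[5m])

# Heap size target at which the next GC cycle will trigger.
go_memstats_next_gc_bytes{app="zoekt-webserver"}
```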
We'll report back if we have any insights to share. 👍
> We are currently using this for calculating memory saturation:
> `container_memory_working_set_bytes / kube_pod_container_resource_limits`
If I am not mistaken, this will include memory that can be evicted. To test that theory, you can run a command like `echo 1 > /proc/sys/vm/drop_caches` and see whether memory use goes down.
Now, I am not sure which metric the OOM killer is monitoring, so in practice this may still be an important metric to watch; that will depend on your setup.
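One way to see how much of the working set is reclaimable page cache (the mmapped shards) rather than heap is to compare the cadvisor metrics directly; a sketch, with an illustrative selector:

```
# Total charged usage minus RSS is roughly page cache plus kernel memory,
# i.e. the part the kernel can reclaim under pressure.
container_memory_usage_bytes{container="zoekt-webserver"}
  - container_memory_rss{container="zoekt-webserver"}

# Or look at the page cache metric directly.
container_memory_cache{container="zoekt-webserver"}
```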
> I see `proc_metrics_memory_map_current_count` and `proc_metrics_memory_map_max_limit`. Will definitely take a look at those.
Those are useful as well. The metric I was thinking about is what the Go runtime reports for how much memory it believes it has requested from the OS, which is `go_memstats_heap_sys_bytes`. If there is a huge discrepancy between that value and `container_memory_working_set_bytes`, that implies to me there isn't a leak here, but rather that this is us using mmap as intended.
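Concretely, plotting those two next to each other should make it clear; again a sketch with illustrative selectors:

```
# What the Go runtime believes it has asked for from the OS for the heap.
go_memstats_heap_sys_bytes{app="zoekt-webserver"}

# What the kernel charges the container with, including the mmapped shards.
container_memory_working_set_bytes{container="zoekt-webserver"}
```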
I think we can close this issue. Thank you! 🤝