[BUG] - Conda store workers maintain high memory usage even when not currently building an environment
Describe the bug
I noticed no builds were going on, but the conda-store worker was using 12 GB of memory. After I killed the pod, the replacement conda-store worker pod used only ~2.2 GB of memory.
Expected behavior
Ideally, conda store workers should not retain high memory usage when not building anything.
OS and architecture in which you are running Nebari
Linux x86-64
How to Reproduce the problem?
I haven't tried to reproduce this, but I imagine the following would work:
Deploy on GCP, build some conda environments, and notice that the worker's memory usage does not return to the level of a freshly started worker.
Command output
No response
Versions and dependencies used.
No response
Compute environment
GCP
Integrations
conda-store
Anything else?
https://github.com/nebari-dev/nebari/pull/2384 (scaling workers down to 0 when not in use) may avoid this issue for Nebari, though it does not solve the underlying problem.
We need to verify that the pod is actually using that much memory, and not just caching things in memory for future use. If it is caching, it is probably a good thing since that makes future builds faster.
According to the table in the Grafana dashboard, the memory usage appears to be RSS, not cache. I deleted the conda-store worker at the end and started a new one, which is why the memory drops off sharply at the end of the graph.
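As a cross-check independent of Grafana, something like the following could be run inside the worker pod to see whether the worker processes themselves hold the memory as RSS (a rough sketch only; it assumes psutil is installed in the worker image):

import psutil

# Print every process in the pod holding more than ~100 MB of resident memory.
for proc in psutil.process_iter(["pid", "name"]):
    try:
        rss_mb = proc.memory_info().rss / (1024 * 1024)
    except psutil.NoSuchProcess:
        continue
    if rss_mb > 100:
        print(f"{proc.pid} {proc.info['name']}: {rss_mb:.0f} MB RSS")

If the bulk of the usage shows up as process RSS rather than container-level cache, that would line up with what the Grafana table suggests.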
Something else we could check out: Celery has a way to automatically terminate workers based on limits such as resource consumption or the number of tasks received. I remember talking with Chris about this a while back, and we implemented something along those lines, though I can't remember which event we used to restart the workers at the time. This might also be worth exploring in case the workers do have memory leaks.
Looks like Celery's max-tasks-per-child or max-memory-per-child settings could be useful if we can't track down the cause (see the sketch below).
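For reference, a minimal sketch of how these limits can be set, either as worker command-line flags or in the Celery app configuration (the app name and the numbers below are placeholders, not conda-store's actual values):

# CLI form (standard Celery worker options):
#   celery -A proj worker --max-tasks-per-child=10 --max-memory-per-child=2000000
#   (--max-memory-per-child is in kilobytes, so 2000000 is roughly 2 GB)

# Equivalent settings in the Celery app configuration:
from celery import Celery

app = Celery("proj")  # placeholder app name
app.conf.worker_max_tasks_per_child = 10          # recycle a child process after 10 tasks
app.conf.worker_max_memory_per_child = 2_000_000  # kilobytes; recycle when RSS exceeds ~2 GB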
Update: I've confirmed that setting --max-tasks-per-child=10 mitigates the leak (by restarting the worker process periodically).
Some people have suggested that setting a time limit on tasks causes a leak (see here or here), though those comments are a few years old. We are setting a time limit on task_solve_conda_environment, as seen below.
tasks.task_solve_conda_environment.apply_async(
    args=[solve.id],
    time_limit=settings.conda_max_solve_time,
    task_id=task_id,
)
Update: I tested commenting out the time_limit= line, but it didn't help with the memory leak.
I might try listing the size of all the variables after each Celery task finishes to help determine why memory is accumulating.
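A rough sketch of one way to do that, using Celery's task_postrun signal and the standard-library tracemalloc module (untested against conda-store; the handlers would need to be imported somewhere the worker loads):

import tracemalloc

from celery.signals import task_postrun, worker_process_init


@worker_process_init.connect
def start_tracemalloc(**kwargs):
    # Begin tracking allocations when each worker child process starts.
    tracemalloc.start()


@task_postrun.connect
def report_memory(task_id=None, task=None, **kwargs):
    # Print the top allocation sites still holding memory after the task finishes.
    snapshot = tracemalloc.take_snapshot()
    print(f"Top allocations after task {task.name} ({task_id}):")
    for stat in snapshot.statistics("lineno")[:10]:
        print(f"  {stat}")

A lighter-weight variant would be to log only the process RSS (e.g. via psutil) in the same signal handler and watch how it grows from task to task.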