[BUG] - Conda store workers maintain high memory usage even when not currently building an environment

Open • Adam-D-Lewis opened this issue 1 year ago • 6 comments

Describe the bug

I noticed that no builds were going on, but the conda-store worker was using 12 GB of memory. I killed the pod, and the new conda-store worker pod that replaced it used only ~2.2 GB of memory.

Expected behavior

Ideally, conda store workers should not retain high memory usage when not building anything.

OS and architecture in which you are running Nebari

Linux x86-64

How to Reproduce the problem?

I haven't tried to reproduce this, but I imagine the following would work:

Deploy on GCP, build some conda environments, and notice that the worker's memory usage doesn't return to the level of a freshly created worker.

Command output

No response

Versions and dependencies used.

No response

Compute environment

GCP

Integrations

conda-store

Anything else?

https://github.com/nebari-dev/nebari/pull/2384 (scaling workers down to 0 when not in use) may avoid this issue for Nebari, though it does not solve the underlying issue.

Adam-D-Lewis, Apr 19 '24 15:04

We need to verify that the pod is actually using that much memory and not just caching things in memory for future use. If it is caching, that's probably a good thing, since it makes future builds faster.
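
For reference, one way to check this from inside the pod is to read the cgroup memory statistics directly, which split usage into anonymous memory (RSS) and page cache. Below is a minimal sketch, assuming a cgroup v1 layout (under cgroup v2 the file is /sys/fs/cgroup/memory.stat and the relevant keys are "anon" and "file"):

    # Sketch: break the container's memory usage into RSS vs page cache
    # by parsing the cgroup v1 memory controller stats (run inside the pod).
    from pathlib import Path

    STAT_FILE = Path("/sys/fs/cgroup/memory/memory.stat")  # cgroup v1 path

    def cgroup_memory_breakdown():
        stats = {}
        for line in STAT_FILE.read_text().splitlines():
            key, value = line.split()
            stats[key] = int(value)
        return {
            "rss_mib": stats.get("rss", 0) / 2**20,      # anonymous memory, not reclaimable
            "cache_mib": stats.get("cache", 0) / 2**20,  # page cache, reclaimable under pressure
        }

    if __name__ == "__main__":
        print(cgroup_memory_breakdown())

If the rss value tracks the large number observed, it is real allocation rather than reclaimable cache.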

dcmcand, Apr 29 '24 12:04

According to the table in the Grafana dashboard, the memory usage appears to be RSS, not cache. I deleted the conda-store worker at the end and started a new one, which is why the memory drops off sharply at the end of the graph.

[Screenshots: Grafana dashboard panels showing conda-store worker memory usage over time, dropping sharply after the worker pod was recreated]

Adam-D-Lewis, Jun 05 '24 14:06

Something else we could check out: Celery has a way to automatically restart workers based on limits such as resource consumption or the number of tasks received. I remember talking with Chris about this a while back, and we implemented something along those lines, though I can't remember which event we used to trigger the worker restarts at the time. This might also be worth exploring in case the workers do have memory leaks.
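
For reference, the Celery knobs being described here appear to be worker_max_tasks_per_child and worker_max_memory_per_child, which recycle a worker child process after a fixed number of tasks or once it exceeds a memory threshold. A minimal sketch of setting them on a Celery app (the app name, broker URL, and values are placeholders, not conda-store's actual configuration):

    # Sketch: recycle Celery worker child processes automatically, either
    # after a fixed number of tasks or once a child exceeds a memory limit.
    from celery import Celery

    app = Celery("conda_store_worker_example", broker="redis://localhost:6379/0")  # placeholder broker

    app.conf.update(
        worker_max_tasks_per_child=10,          # replace a child after 10 tasks
        worker_max_memory_per_child=2_000_000,  # replace a child above ~2 GB (value is in kilobytes)
    )

Because only the pool child process is replaced rather than the whole worker, queued tasks are not lost while the leaked memory is reclaimed.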

viniciusdc, Jun 18 '24 14:06

Looks like Celery's max-tasks-per-child setting or max-memory-per-child setting could be useful if we can't track down the cause.

Update: I've confirmed that setting --max-tasks-per-child=10 mitigates the leak (by restarting the worker process periodically).
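
For anyone wanting to reproduce this mitigation outside of Nebari, the same flag can be passed when the worker starts. A minimal sketch using Celery's programmatic entry point (the module name and broker URL are placeholders, not the actual conda-store worker invocation):

    # Sketch: start a Celery worker with --max-tasks-per-child=10 so each
    # child process is replaced after 10 tasks, bounding any per-process leak.
    from celery import Celery

    app = Celery("conda_store_worker_example", broker="redis://localhost:6379/0")  # placeholder broker

    if __name__ == "__main__":
        app.worker_main(
            argv=[
                "worker",
                "--loglevel=INFO",
                "--max-tasks-per-child=10",
            ]
        )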

Adam-D-Lewis, Jul 02 '24 01:07

Some people have suggested that setting a time limit on tasks causes a leak (see here or here), though those comments are a few years old. We are setting a time limit on task_solve_conda_environment, as seen below.

        tasks.task_solve_conda_environment.apply_async(
            args=[solve.id],
            time_limit=settings.conda_max_solve_time,
            task_id=task_id,
        )

Update: I tested commenting out the time_limit= line, but it didn't help with the memory leak.

Adam-D-Lewis, Jul 02 '24 01:07

I might try listing the sizes of all the live variables after each Celery task finishes, to help determine why memory is accumulating.
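
A minimal sketch of what that instrumentation could look like, using Celery's task_postrun signal together with gc and sys.getsizeof (the sizes reported are shallow approximations, and this hook is illustrative rather than anything conda-store ships):

    # Sketch: after every task, log the largest live objects so that growth
    # between tasks points at what is accumulating. Sizes are shallow; for
    # deeper attribution, tracemalloc snapshots would be more precise.
    import gc
    import sys

    from celery.signals import task_postrun

    @task_postrun.connect
    def log_largest_objects(sender=None, task_id=None, **kwargs):
        sized = sorted(
            ((sys.getsizeof(obj, 0), type(obj).__name__) for obj in gc.get_objects()),
            reverse=True,
        )[:20]
        task_name = sender.name if sender is not None else "unknown task"
        print(f"[{task_id}] largest objects after {task_name}:")
        for size, type_name in sized:
            print(f"  {size / 2**20:8.3f} MiB  {type_name}")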

Adam-D-Lewis, Jul 02 '24 19:07