Close isolated worker processes when idle for X minutes

Open lucasavila00 opened this issue 10 months ago • 5 comments

Is your feature request related to a problem? Please describe. I have a BullMQ deployment with dozens of queues. All of them use isolated processors. Each queue is configured to run with high concurrency to handle load spikes. On average the queues are mostly idle. Whenever there is a spike, worker processes are created, and they never finish/close. This makes the BullMQ deployment leak memory until we restart the Docker image.

Describe the solution you'd like An isolated worker process that has been idle for X minutes should be closed.

Describe alternatives you've considered I tried not using isolated workers, instead deploying many BullMQ 'parent processes', but I had lots of stalled jobs. I need to use isolated workers to prevent one queue from impacting another. When I add a new queue to the deployment it is usually broken at first, and it's good that it doesn't hurt the other queues.

Additional context I created a repository that reproduces my issue: https://github.com/lucasavila00/bull-worker-leak/tree/main The repository's README contains instructions on how to run it. The repository shows the list of processes spawned by BullMQ (https://github.com/lucasavila00/bull-worker-leak/blob/main/top_results.txt); the processes are never closed, even after minutes of being idle.
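For reference, a minimal sketch of the kind of setup described above; the queue name, file paths, connection details and concurrency value are placeholders for illustration, not the actual project code:

```ts
// worker.ts — one of many queues, each using a sandboxed (isolated) processor
import * as path from 'path';
import { Worker } from 'bullmq';

const worker = new Worker(
  'partner-import',                                        // hypothetical queue name
  path.join(__dirname, 'processors', 'partner-import.js'), // path to the sandboxed processor file
  {
    connection: { host: 'localhost', port: 6379 },
    // High concurrency to absorb load spikes; each concurrent sandboxed job
    // runs in its own child process.
    concurrency: 100,
  },
);
```

```ts
// processors/partner-import.ts (compiled to .js) — runs in a child process
import { SandboxedJob } from 'bullmq';

export default async (job: SandboxedJob) => {
  // hypothetical work: fetch from the partner API and write the results to the database
};
```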

lucasavila00 avatar Feb 25 '25 17:02 lucasavila00

Whenever there is a spike, worker processes are created, and they never finish/close. This makes the BullMQ deployment leak memory until we restart the Docker image.

Why would this leak memory? Currently, BullMQ keeps a pool of processes equal to the concurrency factor. This pool lives as long as the worker lives, but it is not leaking memory: it just keeps the processes and reuses them as new jobs come into the queue.

manast avatar Feb 25 '25 22:02 manast

Btw, something to consider: if the jobs are very CPU intensive, you should not use a concurrency factor larger than your actual number of CPU cores, because that would only add overhead (memory consumption and context switching between processes) with no performance gain. If part of the work is not CPU intensive but IO bound, you could split the process into a CPU-intensive part and an IO-intensive part and use a different queue for each, ideally on different machines: machines allocated to the CPU-intensive tasks with low concurrency, and machines allocated to the IO tasks with very high concurrency (and not using sandboxed processors for the IO tasks, as that would not make optimal use of the available resources). On a machine dedicated to IO you could instead run several processes in parallel, using something like pm2, to exploit both the CPU cores and the per-worker concurrency. A rough sketch of this split follows below.
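A rough sketch of the suggested split; the queue names, processor path, connection details and concurrency values are illustrative assumptions, not a prescription for this specific deployment:

```ts
import * as os from 'os';
import * as path from 'path';
import { Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// CPU-bound stage: sandboxed processor, concurrency capped at the core count.
const cpuWorker = new Worker(
  'transform-data',                                   // hypothetical queue name
  path.join(__dirname, 'processors', 'transform.js'), // hypothetical processor file
  { connection, concurrency: os.cpus().length },
);

// IO-bound stage: in-process (non-sandboxed) processor with high concurrency,
// since these jobs mostly wait on the network/database rather than the CPU.
const ioWorker = new Worker(
  'write-records',                                    // hypothetical queue name
  async (job) => {
    // e.g. call an external API or write job.data to the database
  },
  { connection, concurrency: 200 },
);
```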

manast avatar Feb 26 '25 10:02 manast

Thank you for your response.

My BullMQ usage has dozens of logically different queues.

For example, Queue 1 imports data from partner APIs and writes it to Database 1. Queue 2 extracts data from Database 1 and writes it to Database 2, and so on.

Why would this leak memory?

Each queue is configured with a high concurrency factor to handle load spikes. Once the spike is over, the isolated workers are not closed, so from my point of view that RAM is allocated and never used again (until there is another spike); to me, that is a RAM leak.

use a different queue for each, ideally on different machines

The cost and complexity of moving this data around between different machines is too high; we cannot split the job.

(and not using sandboxed processors for the IO tasks, as that would not make optimal use of the available resources)

We have a big team working on our BullMQ-based project, and we value that a newly added queue cannot break or exhaust the resources of the other, established queues.

For example, with this approach, if someone makes a mistake and adds a CPU-intensive part to the IO job, it all breaks.


Overall, we use BullMQ because it makes sense for DevOps, onboarding, testing, and so on. We don't mind the overhead. But we cannot add complexity like splitting jobs or deployments, or risk new queues breaking all the other queues.

BullMQ has been perfect for us, aside from the resource usage of the isolated workers, which is what I hope can be fixed 😃

lucasavila00 avatar Feb 26 '25 11:02 lucasavila00

Each queue is configured with a high concurrency factor to handle load spikes. Once the spike is over, the isolated workers are not closed, so from my point of view that RAM is allocated and never used again (until there is another spike); to me, that is a RAM leak.

Yeah, but that's not really a leak. A memory leak happens when the application is literally "leaking" memory: memory consumption increases without bound and ends up consuming all the available memory.

We could add an option so that processes are not reused, but if we do that, you should know that they will take longer to start, as starting a new process is a heavy operation; that is how Node works and we cannot do anything about it. Still, I think that even if this feature existed, there would be a moment in time where memory could be exhausted if several workers are running at the same time in a peak-load scenario.

For example, with this approach, if someone makes a mistake and adds a CPU-intensive part to the IO job, it all breaks.

Yes, but I mean, it is not a lot to ask of a developer to make sure he knows what he is doing, right? :)

manast avatar Feb 26 '25 14:02 manast

there would be a moment in time where memory could be exhausted if several workers are running at the same time in a peak-load scenario.

Agreed. Closing the isolated workers after a period of being idle (or not reusing them) wouldn't guarantee that resource exhaustion never happens.

A perfect solution for our usage would be a global limit on the number of child processes; that would prevent resource exhaustion.

And, for better resource usage in our case, a way of running multiple jobs (of the same queue) on a single isolated worker process. We need queue-level isolation, not really job-level isolation.

Yes, but I mean, it is not a lot to ask of a developer to make sure he knows what he is doing, right? :)

In our case we have SLAs and are not willing to take the risk 😄

lucasavila00 avatar Feb 26 '25 15:02 lucasavila00