
[Bug]: Scheduled jobs queue randomly stops running all job schedulers

Open adamplumb opened this issue 2 months ago • 13 comments

Version

v5.61.2

Platform

NodeJS

What happened?

We are building a new BullMQ-based job-running system for a project with a mix of scheduled and non-scheduled jobs. So far we have a "Schedule" queue that runs only the job schedulers, plus other queues for the non-scheduled jobs. We currently have 35 job schedulers, using a mix of every and pattern to define the schedules.
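For context, here's a reduced sketch of what such a setup typically looks like with BullMQ's job schedulers. The scheduler names and intervals are illustrative, not our actual 35:

```typescript
// Reduced, illustrative sketch of the setup described above. In BullMQ,
// each scheduler is registered via queue.upsertJobScheduler(name, repeat, opts),
// where `repeat` is either { every: ms } or { pattern: cronExpression }.
type RepeatSpec = { every: number } | { pattern: string };

const schedulers: Record<string, RepeatSpec> = {
  "sync-accounts": { every: 5 * 60 * 1000 },  // interval-based: every 5 minutes
  "nightly-report": { pattern: "0 2 * * *" }, // cron-based: 02:00 daily
};

// Registration needs a live Redis connection, so it's shown as a comment:
// for (const [name, repeat] of Object.entries(schedulers)) {
//   await scheduleQueue.upsertJobScheduler(name, repeat, { name });
// }
```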

What's happening is that the jobs run fine for some hours and then, all of a sudden, the workers stop running new jobs. This seems to happen at random and doesn't fix itself.

Some notes about what we've tried:

  • We're listening for queue events and don't see any events/errors triggered. Nothing shows up in our log output or in stdout/stderr logs.
  • We see that when this happens, calling getJobCounts() on the Schedule queue shows that 35 delayed jobs are present.
  • Calling getJobSchedulers() on the queue shows 35 job schedulers.
  • The delayed jobs' timestamp + delay puts their run time in the past. So they are delayed, the time to run has passed, and they should already have run but haven't.
  • Calling getWorkers() correctly shows one worker, which is running and accepting jobs from another queue.
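To make the "in the past" check concrete, here's a small sketch. The field names follow BullMQ's Job shape (timestamp is the enqueue time in ms, delay the configured delay), but the helper itself is hypothetical:

```typescript
// A delayed job's intended run time is its enqueue timestamp plus its
// configured delay. If that sum is already in the past, the job should
// have been promoted and run, but in the stuck state it wasn't.
interface DelayedJobLike {
  timestamp: number; // enqueue time (ms since epoch)
  delay: number;     // configured delay (ms)
}

function isOverdue(job: DelayedJobLike, now: number = Date.now()): boolean {
  return job.timestamp + job.delay < now;
}
```

In the stuck state described above, all 35 delayed jobs satisfied this check yet none of them ran.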

More importantly, I was able to manually add a non-delayed job to the Schedule queue and it ran. Once it did, all the delayed jobs suddenly started running again and the scheduled jobs resumed normally. So adding a new job "unclogged" whatever was preventing them from running.
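That observation can be turned into a stopgap: when the queue looks stuck, push a throwaway job to wake the worker. A hedged sketch, where the addJob parameter stands in for BullMQ's queue.add (which needs a live Redis connection):

```typescript
// Stopgap based on the observation above: adding any non-delayed job
// wakes the blocked worker, which then also picks up the overdue
// delayed jobs. `addJob` is a stand-in for queue.add(name, data, opts).
async function nudgeQueue(
  addJob: (name: string, data: object, opts?: object) => Promise<unknown>
): Promise<void> {
  // removeOnComplete keeps the throwaway job from accumulating in Redis.
  await addJob("noop", {}, { removeOnComplete: true });
}
```

This is a workaround, not a fix; the root cause of the worker not waking on its own is what this issue is about.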

One other note: there is only one server/worker pulling in schedule jobs, but there are other servers/workers handling other queues.

How to reproduce

I don't have a way to explicitly/reliably reproduce this issue.

Relevant log output


Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

adamplumb avatar Oct 24 '25 14:10 adamplumb

Hi @adamplumb, which Redis version are you using?

roggervalf avatar Oct 25 '25 02:10 roggervalf

Hi @roggervalf, it's an Elasticache Redis cluster running v7.1.0.

adamplumb avatar Oct 25 '25 13:10 adamplumb

I was looking at this page (https://docs.bullmq.io/guide/redis-tm-hosting/aws-elasticache) and just noticed the section about maxmemory-policy. I hadn't changed the policy to no-eviction as it recommends, but I've just made the change and will check whether it makes a difference.

Update: OK, I did make that change but the problem still happens, so that probably isn't it.

adamplumb avatar Oct 27 '25 13:10 adamplumb

There seem to be multiple bugs related to this: #3500 and #3499. Unfortunately there has been very little response.

reginsmol avatar Oct 27 '25 20:10 reginsmol

By any chance, did you try any version before v5.58.8? I would like to determine whether this is something that can be replicated even with our last refactor of the job schedulers.

roggervalf avatar Oct 28 '25 04:10 roggervalf

Hey, we originally started with v5.56.10 and were seeing the problem there; it didn't change after we upgraded to v5.61.2.

  • Would it be worth testing on an older version of redis to see if that's the issue?
  • I can access the redis-cli monitor, though I don't see anything obvious in the output. Is there any further info I could pull for you?

adamplumb avatar Oct 28 '25 14:10 adamplumb

Hi @adamplumb, could you please validate which values are being passed to the bzpopmin command when it gets stuck?

roggervalf avatar Oct 28 '25 14:10 roggervalf

Here's what I see (replaced ip addresses):

# redis-cli --tls -h xyz monitor | grep -i bzpopmin
1761662641.623045 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
1761662646.660051 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
1761662651.698127 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
1761662656.734718 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"

FYI our scheduled jobs are running on the bull:schedule queue and I'm not seeing that here, but I do see the schedule one when the jobs are running.
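One way to sift the monitor output programmatically — a hypothetical helper, assuming the default bull: key prefix seen in the lines above — is to extract which queues' marker keys workers are actually blocking on:

```typescript
// Hypothetical helper: given `redis-cli monitor` lines like the ones
// above, extract the queue names whose marker keys workers are blocking
// on via bzpopmin. Assumes the default "bull:" key prefix.
function polledMarkerQueues(monitorLines: string[]): string[] {
  const queues = new Set<string>();
  for (const line of monitorLines) {
    const m = line.match(/"bzpopmin" "bull:([^:"]+):marker"/i);
    if (m) queues.add(m[1]);
  }
  return [...queues];
}
```

Run against the sample above, this returns only ["priority"], matching the observation that nothing was waiting on bull:schedule's marker in the stuck state.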

adamplumb avatar Oct 28 '25 14:10 adamplumb

Do you have any queue with the name priority?

roggervalf avatar Oct 28 '25 14:10 roggervalf

> Here's what I see (replaced ip addresses):
>
> # redis-cli --tls -h xyz monitor | grep -i bzpopmin
> 1761662641.623045 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
> 1761662646.660051 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
> 1761662651.698127 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
> 1761662656.734718 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
>
> FYI our scheduled jobs are running on the bull:schedule queue and I'm not seeing that here, but I do see the schedule one when the jobs are running.

Is that the output of redis-cli monitor unfiltered? In other words, are no other commands being sent to Redis besides bzpopmin?

manast avatar Oct 28 '25 15:10 manast

> do you have any queue with priority name?

Yeah, we have four queues: standard, priority, events, and schedule. I was only seeing the priority one show up in that monitor output, though; normally I see all four.

> Is that the output of redis-cli monitor unfiltered? In other words, are no other commands being sent to Redis besides bzpopmin?

I was grepping for the bzpopmin string in that command, so yes, everything else was filtered out. Unfortunately we're no longer in the broken state at the moment, but I can try monitoring unfiltered when it happens again.

One other thing I'm trying this morning is testing against Redis running on an EC2 instance to see if that has any effect on the behavior. I'll report back on that.

adamplumb avatar Oct 28 '25 15:10 adamplumb

@adamplumb I would be interested in all the commands for that particular queue; I would like to know if there are other commands issued besides bzpopmin.

manast avatar Oct 28 '25 16:10 manast

Hi all, just a quick follow-up. I tested using Redis v7.0.15 installed on an EC2 instance with a basically default configuration, and it has been running successfully since Friday without any additional "pauses". Last night I pointed back to the Elasticache instance, ran it overnight, and it "paused" again. So I'm going to move on and stick with the EC2-backed service, since that works for me. Feel free to close this issue, since my problem is resolved. Thanks!

adamplumb avatar Nov 04 '25 18:11 adamplumb