[Bug]: Scheduled jobs queue randomly stops running all job schedulers
Version
v5.61.2
Platform
NodeJS
What happened?
We are building a new BullMQ-based job running system for a project with a mix of scheduled and non-scheduled jobs. So far we have a "Schedule" queue that only runs the job schedulers, plus other queues for non-scheduled jobs. We currently have 35 job schedulers using a mix of `every` and `pattern` for defining the schedule.
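For context, a setup like this can be sketched with BullMQ's `upsertJobScheduler`, which accepts either an `every` interval (milliseconds) or a cron `pattern`. The scheduler ids, job names, and timings below are hypothetical, and the queue is typed structurally so the sketch stays self-contained:

```typescript
// A sketch (not the exact code from this report) of how the Schedule queue's
// job schedulers might be registered. The real system registers 35 of these.

// Structural type for the slice of BullMQ's Queue API used here, so this
// file compiles without importing bullmq.
type SchedulerQueue = {
  upsertJobScheduler(
    id: string,
    repeat: { every?: number; pattern?: string },
    template?: { name?: string; data?: unknown },
  ): Promise<unknown>;
};

export const FIVE_MINUTES_MS = 5 * 60 * 1000;

export async function registerSchedulers(scheduleQueue: SchedulerQueue) {
  // `every`-based scheduler: repeats on a fixed millisecond interval.
  await scheduleQueue.upsertJobScheduler(
    'cleanup-every-5m',
    { every: FIVE_MINUTES_MS },
    { name: 'cleanup', data: { kind: 'cleanup' } },
  );
  // `pattern`-based scheduler: repeats on a cron expression (02:00 daily).
  await scheduleQueue.upsertJobScheduler(
    'report-nightly',
    { pattern: '0 2 * * *' },
    { name: 'report', data: { kind: 'report' } },
  );
}
```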
What's happening is the jobs will run fine for some hours and then, all of a sudden, the workers stop running new jobs. This seems to happen at random and doesn't fix itself.
Some notes about what we've tried:
- We're listening for queue events and don't see any events/errors triggered. Nothing shows up in our log output or in stdout/stderr logs.
- We see that when this happens, calling `getJobCounts()` on the Schedule queue shows that 35 delayed jobs are present.
- Calling `getJobSchedulers()` on the queue shows 35 job schedulers.
- The delayed jobs' timestamp + delay put their run time "in the past". So they are delayed and the time to run has passed, meaning they should have already run but haven't.
- Calling `getWorkers()` correctly shows that there is one worker, which is running and accepting jobs from another queue.
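The checks above can be sketched as a small diagnostic routine, assuming BullMQ's documented `getJobCounts` / `getJobSchedulers` / `getWorkers` / `getDelayed` methods; the `isOverdue` helper is added here only to express the "timestamp + delay is in the past" check:

```typescript
// Helper: a delayed job whose scheduled run time (timestamp + delay) is
// earlier than `now` should already have been picked up by a worker.
export function isOverdue(timestamp: number, delay: number, now: number): boolean {
  return timestamp + delay < now;
}

// Structural type for the slice of BullMQ's Queue API used below, so the
// sketch stays self-contained without importing bullmq.
type InspectableQueue = {
  getJobCounts(...states: string[]): Promise<Record<string, number>>;
  getJobSchedulers(): Promise<unknown[]>;
  getWorkers(): Promise<unknown[]>;
  getDelayed(): Promise<{ timestamp: number; opts: { delay?: number } }[]>;
};

export async function inspect(queue: InspectableQueue) {
  const counts = await queue.getJobCounts('delayed', 'waiting', 'active');
  const schedulers = await queue.getJobSchedulers();
  const workers = await queue.getWorkers();
  const delayed = await queue.getDelayed();
  // Count delayed jobs whose run time has already passed.
  const overdue = delayed.filter((j) =>
    isOverdue(j.timestamp, j.opts.delay ?? 0, Date.now()),
  );
  console.log({
    counts,
    schedulers: schedulers.length,
    workers: workers.length,
    overdue: overdue.length,
  });
}
```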
More importantly, I was able to manually add a non-delayed job to the Schedule queue and it ran. Once that job ran, all the delayed jobs suddenly started running again and the scheduled jobs resumed normally. So adding a new job "unclogged" whatever was preventing the jobs from running.
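That "unclog" workaround can be sketched as follows. BullMQ's `Job#promote()` moves a delayed job back to the waiting list explicitly, and, as observed above, simply adding an immediate job also appeared to wake the worker. The `noop` job name and `dueJobs` helper are hypothetical:

```typescript
// Pure helper: which delayed jobs are already due at `now`?
export function dueJobs<T extends { timestamp: number; opts: { delay?: number } }>(
  jobs: T[],
  now: number,
): T[] {
  return jobs.filter((j) => j.timestamp + (j.opts.delay ?? 0) <= now);
}

// Structural type for the slice of BullMQ's Queue/Job API used below.
type UncloggableQueue = {
  getDelayed(): Promise<
    { timestamp: number; opts: { delay?: number }; promote(): Promise<void> }[]
  >;
  add(name: string, data: unknown): Promise<unknown>;
};

export async function unclog(queue: UncloggableQueue) {
  // Explicitly promote every delayed job that should already have run...
  for (const job of dueJobs(await queue.getDelayed(), Date.now())) {
    await job.promote();
  }
  // ...or enqueue any immediate job, which was enough to wake the worker here.
  await queue.add('noop', {});
}
```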
One other note. There is only one server/worker pulling in schedule jobs, but there are other servers/workers that handle other queues.
How to reproduce.
I don't have a way to explicitly/reliably reproduce this issue.
Relevant log output
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Hi @adamplumb, which Redis version are you using?
Hi @roggervalf, it is an ElastiCache Redis cluster running v7.1.0.
I was looking at this page (https://docs.bullmq.io/guide/redis-tm-hosting/aws-elasticache) and just noticed the section about maxmemory-policy. I hadn't changed the policy to no-eviction as it recommends, but I've just made the change and will check whether it makes a difference.
Update: OK, so I did make that change but the problem still happens, so that probably isn't it.
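For anyone else checking this, the effective policy can be verified with `CONFIG GET`; on ElastiCache the value is changed through a custom parameter group rather than `CONFIG SET`. The host and parameter-group names below are placeholders:

```shell
# Verify the effective eviction policy (BullMQ expects "noeviction"):
redis-cli --tls -h xyz config get maxmemory-policy

# On ElastiCache, CONFIG SET is restricted; change the value via a custom
# parameter group attached to the cluster instead:
aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name my-redis-params \
  --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=noeviction"
```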
There seem to be multiple bugs related to this: #3500 and #3499. Unfortunately there has been very little response.
By any chance, did you try any version before v5.58.8? I would like to determine whether this is something that can be replicated even with our last refactor of the job schedulers.
Hey, we originally started with v5.56.10 and were seeing the problem there; it didn't change after we upgraded to v5.61.2.
- Would it be worth testing on an older version of Redis to see if that's the issue?
- I can access the redis-cli monitor, though I don't see anything obvious in the output. Is there any further info I could pull for you?
Hi @adamplumb, could you please validate which values are being passed to the `bzpopmin` command when it gets stuck?
Here's what I see (replaced ip addresses):
```
# redis-cli --tls -h xyz monitor | grep -i bzpopmin
1761662641.623045 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
1761662646.660051 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
1761662651.698127 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
1761662656.734718 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
```
FYI our scheduled jobs are running on the `bull:schedule` queue and I'm not seeing that here, but I do see the schedule one when the jobs are running.
Do you have any queue named `priority`?
> Here's what I see (replaced ip addresses):
>
> ```
> # redis-cli --tls -h xyz monitor | grep -i bzpopmin
> 1761662641.623045 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
> 1761662646.660051 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
> 1761662651.698127 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
> 1761662656.734718 [0 abc:38234] "bzpopmin" "bull:priority:marker" "5"
> ```
>
> FYI our scheduled jobs are running on the `bull:schedule` queue and I'm not seeing that here, but I do see the schedule one when the jobs are running.
Is that the output of `redis-cli monitor` unfiltered? In other words, no other commands are being sent to Redis other than `bzpopmin`?
> do you have any queue with priority name?
Yeah, we have four queues: standard, priority, events, schedule. I was only seeing the priority one show up in that monitoring, though; normally I see all four.
> Is that the output of redis-cli monitor unfiltered? in other words, no other commands are being send to Redis other than bzpopmin?
I was grepping for the `bzpopmin` string in that command, so yeah, everything else was filtered out. If you need more, unfortunately we're no longer in the broken state at the moment, but I can try monitoring unfiltered when it happens again.
One other thing I'm trying this morning is testing against redis running on an ec2 instance to see if that has any effect on the behavior. I'll report back about that.
@adamplumb I would be interested in all the commands for that particular queue; I would like to know if there are other commands issued besides `bzpopmin` for that queue.
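An unfiltered per-queue capture along these lines should show every command touching the stuck queue's keys, not just `bzpopmin` (the `bull:schedule` prefix is taken from the output above; the host and log filename are placeholders):

```shell
# Log every command that touches the schedule queue's keys while stuck:
redis-cli --tls -h xyz monitor | grep -i 'bull:schedule' | tee schedule-monitor.log
```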
Hi all, just a quick follow-up. I tested using Redis v7.0.15 installed on an EC2 instance with a basically default configuration and have been running that successfully since Friday without any additional "pauses". Last night I pointed back to the ElastiCache instance, ran it overnight, and it "paused" again. So I'm going to move on from this and stick with the EC2-backed service, since that is working for me. Feel free to close this issue if you want, since my issue is resolved. Thanks!