[Bug]: Worker stops processing jobs
Version
^5.41.3
Platform
NodeJS
What happened?
Some workers will eventually stop processing jobs until the node process is restarted, at which point all jobs will start running again.
The worker list for the queue returns a connection, but its idle time and age are the same, which makes me think the worker is created but is stalled for some reason. Yet no jobs are actually stalled when I check the queue status using bull-board.
Sorry if this isn't the right forum to post this.
How to reproduce.
I don't really know how to reproduce this issue
Relevant log output
// return of queue.getWorkers
[
{
id: '1711546',
addr: 'redacted',
laddr: 'redacted',
fd: '29',
name: 'myjob',
age: '11336',
idle: '11336',
flags: 'b',
db: '0',
sub: '0',
psub: '0',
ssub: '0',
multi: '-1',
qbuf: '0',
'qbuf-free': '0',
'argv-mem': '40',
'multi-mem': '0',
rbs: '1024',
rbp: '0',
obl: '0',
oll: '0',
omem: '0',
'tot-mem': '2608',
events: 'r',
cmd: 'bzpopmin',
user: 'default',
redir: '-1',
resp: '2',
rawname: 'bull:SW1hcFNjYW5NYWlsYm94'
}
]
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
We need more information to give you a proper answer:
- Redis host (elasticache, dragonfly, etc).
- Type of jobs?
- Are there jobs in Active status?
- Screenshot of taskforce.sh for the queue showing the counts in the different statuses
- options used for the jobs
- etc. anything that can help understand your particular case.
I'm indeed using AWS's Elasticache.
This is how the queue and workers are being created:
import { Redis } from 'ioredis';
import * as BullMQ from 'bullmq';

const taskName = "myQueue"

const connection = new Redis({
  host: process.env.BULLMQ_REDIS_HOST,
  port: parseInt(process.env.BULLMQ_REDIS_PORT, 10),
  maxRetriesPerRequest: null,
  retryStrategy(times: number) {
    return Math.max(Math.min(Math.exp(times), 20000), 1000);
  },
})

const queue = new BullMQ.Queue(taskName, {
  connection: this.defaultConnection ?? connection,
  defaultJobOptions: { backoff: { delay: 1000 * 60 * 5, type: 'fixed' }, attempts: 10 },
});

const worker = new BullMQ.Worker(taskName, handler, {
  concurrency,
  connection: connection ?? this.defaultConnection,
  removeOnComplete: {
    age: 0,
    count: 0,
  },
  removeOnFail: {
    age: 60 * 60,
    count: 50,
  },
});

worker.on('error', console.error);
worker.on('ready', () => {
  console.log(`[bullmq]: '${taskName}' worker is ready`);
});
worker.on('ioredis:close', () => {
  console.log('[bullmq]: redis has closed');
});
worker.on('closed', () => {
  console.log('[bullmq]: worker closed');
});
Jobs are published like this:
queue.add("some-name", { id: accountId }, { deduplication: { id: someid }, delay: 1000 * 60 * 15 })
The handler is a self-publishing job: every 15 minutes it opens a connection to an IMAP server, does its thing, closes the connection, and then publishes the next task with a 15-minute delay into the same queue. I couldn't use a repeatable job because the way it schedules the next job first wouldn't work for my use case.
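Roughly, the handler passed to the worker above looks like this (a simplified sketch, not the real code; doImapWork is a placeholder for the actual IMAP logic):
const handler = async (job: BullMQ.Job<{ id: string }>) => {
  // Placeholder for the real work: open the IMAP connection, scan the mailbox, close it.
  await doImapWork(job.data.id);
  // Publish the next run for the same account back into the same queue with a 15-minute delay.
  await queue.add(
    'some-name',
    { id: job.data.id },
    { deduplication: { id: job.data.id }, delay: 1000 * 60 * 15 },
  );
};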
I'll be on the lookout for when it happens again so I can take a screenshot and post whatever I can find.
Anyway, I suspect it has something to do with my Elasticache instance, but I'm not really sure what I should be looking for to confirm it.
We have the same behavior with Redis Sentinel in Kubernetes, using:
"ioredis": "^5.6.0",
"bullmq": "^5.44.4",
"@nestjs/bullmq": "^10.2.3",
We have several processors and they just stop processing jobs one by one at random times. We have checked that there are no Active jobs in the queue when it gets stuck.
In the logs we periodically observe Error: Connection is closed. from Redis, as well as All sentinels are unreachable errors, but they do not always lead to stuck jobs. Also, our other services that use ioredis are working fine and are not experiencing any issues with Redis.
We'd appreciate it if you could advise how we could debug such a case.
In general queues do not get stuck, so the issue is mostly related to the workers not reconnecting properly. You can listen to other events such as the "ready" event from the worker to see if it is actually reconnecting after these disconnection errors, and you can also place debug logs on the reconnection events and on the retryStrategy callback. You can also call getWorkers or use a dashboard to check whether there are indeed workers online when they stop processing jobs.
> In general queues do not get stuck, so the issue is mostly related to the workers not reconnecting properly. You can listen to other events such as the "ready" event from the worker to see if it is actually reconnecting after these disconnection errors, and you can also place debug logs on the reconnection events and on the retryStrategy callback. You can also call getWorkers or use a dashboard to check whether there are indeed workers online when they stop processing jobs.
so we have added listeners on these events:
@OnWorkerEvent('ready')
async onWorkerReady() {
  Logger.log(`${this.queueName}: detected worker ready event`, UpdateGenericProcessor.name);
}

@OnWorkerEvent('paused')
async onWorkerPaused() {
  Logger.error(`${this.queueName}: detected worker paused event`, UpdateGenericProcessor.name);
}

@OnWorkerEvent('closed')
async onWorkerClose() {
  Logger.error(`${this.queueName}: detected worker closed event`, UpdateGenericProcessor.name);
}

@OnWorkerEvent('ioredis:close')
async onRedisClose() {
  Logger.error(`${this.queueName}: detected worker ioredis:close event`, UpdateGenericProcessor.name);
}
We have checked that after we restarted the Redis master node, we received the ioredis:close and ready log events.
Periodically we see errors like this in the logs:
Error: All sentinels are unreachable. Retrying from scratch after 1000ms.
And immediately after that, the worker ready event is raised.
But when the queue gets stuck, we don't see this ready event for that queue. At the same time, however, other workers received the ready event and continued working fine.
Ok, so it seems like the reconnect mechanism in IORedis sometimes fails when sentinels are involved, or what is your conclusion?
Probably, but how does the BullMQ worker behave in that case, when ioredis has dropped the connection for whatever reason and didn't reconnect? Should the worker monitor it and try to reinitialize the connection, or restart?
The worker relies on IORedis for all the reconnection logic; we cannot build a new reconnection logic on top of that, we need to assume it is working. In your particular case, you could write new logic on top of it: if there is a disconnect event and no ready event after a certain amount of time, close the worker and create a new one.
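A minimal sketch of that workaround could look like this (createWorker() is a hypothetical factory that builds the worker with its processor and options; the 60-second threshold is arbitrary):
let worker = createWorker();
attachWatchdog(worker);

function attachWatchdog(w: BullMQ.Worker) {
  let watchdog: NodeJS.Timeout | null = null;

  w.on('ioredis:close', () => {
    // If no 'ready' event follows the disconnect within 60s, assume the
    // reconnection is stuck, close this worker and start a fresh one.
    if (!watchdog) {
      watchdog = setTimeout(async () => {
        await w.close();
        worker = createWorker();
        attachWatchdog(worker);
      }, 60_000);
    }
  });

  w.on('ready', () => {
    // Reconnection succeeded, cancel the pending restart.
    if (watchdog) {
      clearTimeout(watchdog);
      watchdog = null;
    }
  });
}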
@evheniyt let me know if you are able to implement a working workaround.
> In general queues do not get stuck, so the issue is mostly related to the workers not reconnecting properly. You can listen to other events such as the "ready" event from the worker to see if it is actually reconnecting after these disconnection errors, and you can also place debug logs on the reconnection events and on the retryStrategy callback. You can also call getWorkers or use a dashboard to check whether there are indeed workers online when they stop processing jobs.
So, I've attached listeners to the following events:
redis
redis.on('close', () => {
  console.log('[redis]: closed');
});
redis.on('reconnecting', () => {
  console.log('[redis]: reconnecting');
});
redis.on('connect', () => {
  console.log('[redis]: connected');
});
redis.on('error', (err) => {
  console.log('[redis]: error', err);
});
redis.on('connecting', () => {
  console.log('[redis]: connecting');
});
redis.on('ready', () => {
  console.log('[redis]: ready');
});

retryStrategy: function (times) {
  const value = Math.max(Math.min(Math.exp(times), 20000), 1000);
  console.log(`[redis]: retry strategy callback value: ${value}`);
  return value;
}
worker.js
worker.on('ready', () => {
  console.log(`[bullmq]: '${taskName}' worker is ready`);
});
worker.on('error', (err) => {
  console.error('[bullmq]: error', err);
});
worker.on('ioredis:close', () => {
  console.log('[bullmq]: redis has closed');
});
worker.on('closed', () => {
  console.log('[bullmq]: worker closed');
});
Every couple of days I'll see logs from the retryStrategy callback followed by the worker's ready event and nothing else. Like this:
2025-04-07T01:14:15.007565411Z [redis]: retry strategy callback value: 1000
2025-04-07T01:14:16.014949486Z [bullmq]: 'Queue' worker is ready
but the jobs are actually 'stuck' in the delayed state until I restart the server, at which point everything works as expected again.
I'm wondering if there's anything particular to Elasticache, or to my implementation where the worker self-publishes after completing a job, that I shouldn't be doing.
But are there online workers?
> But are there online workers?
I've checked using .getWorkers and apparently yes (idle/age will have the same value). .getWorkersCount will also return one.
So the facts are:
- the queue has delayed jobs; it seems like they are delayed by about 5 minutes.
- at least one worker is online
- there is a disconnection event
- jobs are not processed anymore
Based on these facts it would be very useful to have a reproducible test case. Something else to test is running the Redis MONITOR command after the worker stops processing, to see what is going on in Redis: normally you should see a BZPOPMIN command (and others too) every 5 seconds when the queue is idle.
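For example, something along these lines can log the blocking pops from a small side script, so you can see when they stop arriving (a sketch using ioredis's monitor() command; host and queue name are placeholders, and keep in mind that MONITOR itself has a performance cost):
import { Redis } from 'ioredis';

async function watchQueuePolling(host: string, queueName: string) {
  // monitor() opens a dedicated connection in MONITOR mode and emits every
  // command the server receives.
  const monitor = await new Redis({ host }).monitor();
  monitor.on('monitor', (time: string, args: string[]) => {
    const command = String(args[0] ?? '').toUpperCase();
    if (command === 'BZPOPMIN' && String(args[1] ?? '').startsWith(`bull:${queueName}:`)) {
      console.log(`[monitor] ${time} ${args.join(' ')}`);
    }
  });
}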
Also seeing the same on our workers; we are, however, not using Sentinels, but we are using delayed/repeated jobs.
We see the BZPOPMIN command no longer being executed (see graph) and the memory usage of the worker steadily increasing.
We're using:
"@nestjs/bullmq": "^11.0.2",
"bullmq": "^5.48.1"
> So the facts are:
> - the queue has delayed jobs; it seems like they are delayed by about 5 minutes.
> - at least one worker is online
> - there is a disconnection event
> - jobs are not processed anymore
> Based on these facts it would be very useful to have a reproducible test case. Something else to test is running the Redis MONITOR command after the worker stops processing, to see what is going on in Redis: normally you should see a BZPOPMIN command (and others too) every 5 seconds when the queue is idle.
I still haven't found a way to reproduce this reliably, but when I monitor my Redis instance with
redis-cli -h {host} monitor | grep -i "MyQueue"
I'll see the following on repeat every 5 or so seconds:
1745335462.617480 [0 172.29.1.40:54472] "evalsha" "7ee422a91ed052c944eea0a4f5784e2ceb37b278" "9" "bull:MyQueue:stalled" "bull:MyQueue:wait" "bull:MyQueue:active" "bull:MyQueue:failed" "bull:MyQueue:stalled-check" "bull:MyQueue:meta" "bull:MyQueue:paused" "bull:MyQueue:marker" "bull:MyQueue:events" "1" "bull:MyQueue:" "1745335462617" "30000"
1745335462.617523 [0 lua] "EXISTS" "bull:MyQueue:stalled-check"
1745335462.617530 [0 lua] "SET" "bull:MyQueue:stalled-check" "1745335462617" "PX" "30000"
1745335462.617537 [0 lua] "HGET" "bull:MyQueue:meta" "opts.maxLenEvents"
1745335462.617544 [0 lua] "XTRIM" "bull:MyQueue:events" "MAXLEN" "~" "10000"
1745335462.617549 [0 lua] "SMEMBERS" "bull:MyQueue:stalled"
1745335462.617555 [0 lua] "LRANGE" "bull:MyQueue:active" "0" "-1"
I see no BZPOPMIN commands at all. I do, however, see them for the queues that aren't "stuck".
I'll post here again when I find a way to replicate this behavior.
I think that on the queues that are stuck, the last command sent to Redis was probably a BZPOPMIN; after that there was a disconnection event, and when the connection came back online the command got stuck forever. I saw this some years ago and we had to improve the reconnection system so that it would not happen. At the root of it is a bug in IORedis that does not handle the blocked call properly in the event of a disconnection. However, I have not been able to reproduce this since we resolved it several years ago.
In our case, we found a correlation between the times when workers get stuck and CPU spikes on the pod. We had a job that periodically spiked the CPU, and during that time other workers could get stuck. After we reduced the CPU spikes, the issue was resolved.
So it could probably be reproduced by processing some CPU-intensive tasks while other workers are processing jobs.
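Something like this might work as a starting point for a repro (an untested sketch; 'cpu-hog' is an illustrative queue name and connection is the same Redis connection used for the normal worker):
// A CPU-bound processor that blocks the Node.js event loop for a minute.
// While it runs in the same process, the other worker's Redis connections
// cannot service replies or pings, which mirrors the situation described above.
const cpuHog = new BullMQ.Worker(
  'cpu-hog',
  async () => {
    const end = Date.now() + 60_000;
    let x = 0;
    while (Date.now() < end) {
      x += Math.sqrt(Math.random()); // busy loop, never yields
    }
    return x;
  },
  { connection },
);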
We have no CPU spikes, but are seeing this happening consistently.
We use Google MemoryStore with Redis version 7.0
This has just happened again today, and MONITOR shows these commands being issued, but nothing is happening on the worker. There are no reconnection/connection logs.
We tried downgrading to BullMQ 5.34.8, but nothing changed.
1745485588.882165 [0 10.128.0.18:31952] "evalsha" "cc4eded989ba9b04d25cc2407a1142f33a30400e" "1" "bull:analytics:stalled" "bull:analytics:" "\x91\xd9*90456d59-b067-4142-b169-f51cae98bdc9:14757" "\x91\xa6305903" "30000"
1745485588.882223 [0 lua] "GET" "bull:analytics:305903:lock"
1745485588.882239 [0 lua] "SET" "bull:analytics:305903:lock" "90456d59-b067-4142-b169-f51cae98bdc9:14757" "PX" "30000"
1745485588.882252 [0 lua] "SREM" "bull:analytics:stalled" "305903"
The code above seems to be the renewal of a lock for a job, so that particular job seems to still be processing. What concurrency factor are you using for this worker that stopped processing new jobs?
Another thing: when this happens, does getWorkers() return any workers?
@manast Concurrency factor is just the default, so 1.
getWorkers returns [ { name: 'GCP does not support client list' } ]
getWorkersCount returns 1
We only have one worker running, and you're right: it just seems to be renewing the lock endlessly while printing no errors and doing no work.
We're using a regular NestJS implementation.
Bull getActive() reports one active job for a repeating job.
We are using two Workers in the same NestJS application; when we did that, we started to see this issue (we also started using repeated jobs for the first time with the second worker). Both workers/queues get blocked when this happens. Could that be related?
Ok, so the issue here is that the job is still processing, so no new jobs are processed because this one job is consuming the maximum concurrency. It is not uncommon to write processor functions that keep dangling forever, maybe because of some promise that never resolves or something like that. There are several ways to debug this; usually putting logs before and after every async call will help you find the call that is hanging. Another alternative, which is not optimal as it does not solve the underlying issue, is implementing a timeout: https://docs.bullmq.io/patterns/timeout-jobs
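One way to implement such a timeout is to race the job's work against a timer, roughly like this (a sketch rather than the exact code from the linked pattern; doWork and the 30-second limit are placeholders):
// Fail the job if its work does not settle within the time limit, so a
// dangling promise cannot occupy the worker's concurrency slot forever.
// Note: this frees the slot but does not cancel the underlying work.
const processor = async (job: BullMQ.Job) => {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`job ${job.id} timed out`)), 30_000);
  });
  try {
    return await Promise.race([doWork(job), timeout]); // doWork is a placeholder
  } finally {
    if (timer) clearTimeout(timer);
  }
};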
@manast I'm surprised that this would be the issue given the application code, but I will investigate nonetheless. Would a stuck/dangling processor/job affect the other queue, though? Both queues are affected by this, and they run in the same process.
Good day, @manast! We've been encountering a similar issue for about a week. The worker has stopped consuming jobs, and we don't see any BZPOPMIN commands in the logs after that, despite other queues continuing to work. What's interesting is that this is random behavior: any worker could stop at any time, not just that one.
The worker itself is fairly simple: it just sends messages to Kafka. It's unlikely that this is related to the async function, as our logs confirm that it completes successfully.
We're running on an AWS ElastiCache Cluster (Valkey 8.0.1) with 2 shards.
We're running the service with DEBUG=ioredis:*, so I can share anything you need that could help resolve this issue.
Versions:
"@nestjs/bullmq": "^11.0.0",
"bullmq": "^5.53.0",
"ioredis": "^5.6.1",
What we did a week ago:
- Updated BullMQ from "^5.51.1" to "^5.52.1"
- Updated NodeJS from 23.11 to 24.1
@imwexpex when this happens, do you have any active jobs at all?
@imwexpex I think the issue may be related to the worker losing the connection and for some reason being unable to reconnect again.
@manast
> when this happens, do you have any active jobs at all?
Yes, I have active jobs in other queues, and a lot of delayed jobs in the queue that is not working.
> the worker losing the connection
Theoretically, yes. But there are no logs from ioredis or the worker about this at all.
I have more details: we're running Stage and Prod environments with exactly the same infrastructure, except that on Stage we're using a cluster with only 1 shard, and on Prod a cluster with 2 shards. We're running tests and can't reproduce the issue on Stage.
Moreover, what's strange is that the Bull Dashboard displays different numbers of "Blocked Clients": on Prod it constantly changes between 0 and 7 (the number of queues), while on Stage this count remains static at 7 and doesn't fluctuate.
And when a worker gets stuck, we see a decreased number of blocked clients, which again points to a connection issue.
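For reference, the blocked clients count can also be read outside of the dashboard, roughly like this (a sketch using the INFO clients section via ioredis; in a cluster each node would need to be queried separately):
import { Redis } from 'ioredis';

// Each worker sitting in BZPOPMIN should show up as one blocked client,
// so this number dropping below the number of queues matches what we see.
async function blockedClients(redis: Redis): Promise<number> {
  const info = await redis.info('clients'); // returns the "# Clients" section as text
  const match = info.match(/blocked_clients:(\d+)/);
  return match ? parseInt(match[1], 10) : 0;
}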