bullmq
[Bug]: Delayed jobs don't move to waiting state after some days
Version
v5.7.3
Platform
NodeJS
What happened?
Hello, I have a Node application that schedules delayed jobs with delays ranging from 2 seconds to 1 hour. When a job finishes, I remove it from the queue and add a new one (with the same id/name) with a new delay that depends on the result.
Everything works fine for a few days (1 to 3), and then, for no apparent reason, the worker stops running jobs: no more jobs are processed. My Node.js application still answers web requests, so it is still alive.
I added logs to all event handlers and didn't notice any errors.
However, the "waiting" event from QueueEvents is not fired at the time a job needs to be launched.
What is strange is that if I manually add a new job to the queue some hours later (or at any time), the worker "wakes up" and runs all the old delayed jobs.
- How can I debug this case? As I said, I attached an event listener to every Queue, Worker and QueueEvents event, but I didn't see anything unusual.
- What could prevent a job from moving to the 'waiting' state when it is time to handle it?
Thanks for your help
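One way to approach the debugging question above is to periodically check whether any delayed job is past its due time. The sketch below is only an illustration: `findOverdueJobs` and the 5-second grace period are hypothetical, and the plain objects stand in for delayed jobs, assuming each one carries a creation `timestamp` and a `delay` in milliseconds (for example, as returned by something like `queue.getDelayed()`).

```javascript
// Hypothetical helper, not part of BullMQ.
// Each job is assumed to carry `timestamp` (creation time, ms since epoch)
// and `delay` (ms to wait before the job becomes runnable).
function findOverdueJobs(jobs, now, graceMs = 5000) {
  return jobs.filter((job) => {
    const dueAt = job.timestamp + job.delay;
    // Overdue means the due time plus the grace period has already passed.
    return now - dueAt > graceMs;
  });
}

// Plain objects standing in for delayed jobs (made-up values).
const checkTime = 1_000_000;
const sampleJobs = [
  { id: "a", timestamp: checkTime - 60_000, delay: 2_000 },    // due 58 s ago
  { id: "b", timestamp: checkTime - 1_000, delay: 3_600_000 }, // due in about 1 h
  { id: "c", timestamp: checkTime - 4_000, delay: 2_000 },     // due 2 s ago, within grace
];

const overdueIds = findOverdueJobs(sampleJobs, checkTime).map((job) => job.id);
console.log(overdueIds); // only "a" is past the 5 s grace period
```

Logging a non-empty result with a timestamp would at least show when the queue stopped promoting delayed jobs, which is the moment to look at on the Redis side.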
How to reproduce.
No response
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hi @oanguenot, we are tracking this in https://github.com/taskforcesh/bullmq/issues/2466
By the way, it would help us to see which values are passed to the bzpopmin command.
Thanks @roggervalf for your quick answer! On the one hand, I'm happy to see that the problem doesn't seem to be in my code, because I spent days tracking it without success; on the other hand, it is still a problem in front of us :-) I added a comment to #2466 and will be happy to help one way or another.
I don't know what bzpopmin is. How or where can I find the values? Thanks
Hi @oanguenot, to see your commands you can connect to your Redis instance with redis-cli and then use the monitor command.
Is this what you need?
Should I leave the monitor open until it blocks and check whether I get a timeout of zero?
Yeah, we would like to know which value is blocking that command, as we're doing some fixes to prevent passing 0.
Also, the value that is blocking that command could be something other than 0; that's what we want to know.
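If the monitor output is saved to a file, the timeout passed to bzpopmin can be pulled out with a few lines of script. This is a rough sketch, assuming the usual MONITOR line shape (`<timestamp> [<db> <client>] "<command>" "<arg>" ...`) and relying on the fact that the timeout is the last argument of BZPOPMIN. The function name, the sample lines and the key name in them are made up for illustration.

```javascript
// Hypothetical helper for reading saved `redis-cli monitor` output.
// Returns the BZPOPMIN timeout as a number, or null for any other line.
function parseBzpopminTimeout(line) {
  // Collect every double-quoted token, allowing escaped characters inside.
  const tokens = [...line.matchAll(/"((?:[^"\\]|\\.)*)"/g)].map((m) => m[1]);
  if (tokens.length < 3 || tokens[0].toLowerCase() !== "bzpopmin") {
    return null; // not a BZPOPMIN call
  }
  // BZPOPMIN key [key ...] timeout: the timeout is always the last argument.
  const raw = tokens[tokens.length - 1];
  const timeout = Number(raw);
  return raw === "" || Number.isNaN(timeout) ? null : timeout;
}

// Made-up sample lines; the key name is illustrative only.
const monitorLines = [
  '1716000000.000001 [0 127.0.0.1:50000] "bzpopmin" "bull:services:marker" "5"',
  '1716000001.000002 [0 127.0.0.1:50000] "bzpopmin" "bull:services:marker" "0"',
  '1716000002.000003 [0 127.0.0.1:50001] "evalsha" "abc" "1" "bull:services:wait"',
];

const timeouts = monitorLines.map(parseBzpopminTimeout);
console.log(timeouts); // one entry per line; null for non-BZPOPMIN commands
```

A timeout of 0 makes BZPOPMIN block indefinitely, so that is the value to watch for, along with any unexpectedly large one.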
Hey @oanguenot, by the way, what are your queue settings, and which values are you using when adding delayed jobs?
Hi @roggervalf,
Here are my settings:
queue = new Queue("services", {
  connection: {
    host: CONFIG().redisDbUrl,
    port: CONFIG().redisDbPort,
  },
});
I use the following when adding new jobs:
const job = await queue.add(
  `${service}-${instance.id}`,
  {
    userId: instance.userId,
    instanceId: instance.id,
    serviceId: service,
    immediate: false,
    retriedCounter,
  },
  {
    jobId: `${service}-${instance.id}`,
    removeOnComplete: true,
    removeOnFail: true,
    delay: delay + randomDelay,
  }
);
Nothing really special, I think.
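Since the question was about the values used when adding delayed jobs, here is a small sketch of how a `delay + randomDelay` value could be validated before it reaches `queue.add`. It is not the code from this application: `computeDelay`, its parameters and the clamping are assumptions, and the 2-second and 1-hour bounds are taken from the range described in the issue.

```javascript
// Hypothetical helper; the bounds mirror the 2 s to 1 h range from the issue.
const MIN_DELAY_MS = 2_000;
const MAX_DELAY_MS = 3_600_000;

// `random` is injectable so the jitter can be made deterministic in tests.
function computeDelay(baseDelayMs, maxJitterMs, random = Math.random) {
  // Random jitter in [0, maxJitterMs), rounded down to whole milliseconds.
  const jitter = Math.floor(random() * maxJitterMs);
  const total = baseDelayMs + jitter;
  if (!Number.isFinite(total)) {
    // Reject NaN or Infinity instead of passing it on as a job delay.
    throw new RangeError(`invalid delay: ${total}`);
  }
  // Clamp into the expected range (a design choice of this sketch).
  return Math.min(MAX_DELAY_MS, Math.max(MIN_DELAY_MS, total));
}

// Fixed random source so the example is reproducible.
const exampleDelay = computeDelay(10_000, 1_000, () => 0.5);
console.log(exampleDelay); // 10000 ms base plus 500 ms jitter
```

Logging the computed value next to the job id each time a job is re-added makes it easier to answer questions like the one above when the problem reappears.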
On my side, after around 60 hours, all jobs have been processed on time (Redis monitoring active).
Thank you @oanguenot, please let us know if it happens again. One last question: how frequently did it happen before?
It happened every 2 or 3 days, but I can't remember when it started. It seems to have worked very well a few versions ago, or I didn't notice because of other manual restarts I did myself.
Everything has been running smoothly for the past 6 days. No problem so far.
Thank you @oanguenot. We also released a performance change related to this topic; you can try version 5.7.6. Please let us know how it goes.
I would recommend upgrading to 5.7.7 even, as it will mitigate a potential issue we have discovered with IORedis in the case of network partitions.
looks like this issue is already resolved