bullmq [Bug]: Delayed jobs don't move to waiting state after some days

Version

v5.7.3

Platform

NodeJS

What happened?

Hello, I have a Node application that schedules delayed jobs with a delay between 2 seconds and 1 hour. When a job finishes, I remove it from the queue and add a new one (with the same id/name) with a new delay (depending on the result).

Everything works fine for a few days (1 to 3), and then, for no apparent reason, the worker stops running jobs: no more jobs are processed. My Node.js application still answers web requests, so it is still alive.

I added logs to all event handlers. I didn't notice any errors.

However, the "waiting" event from QueueEvents is not fired at the time a job needs to be launched.

What is strange is that if, some hours later (or at any time), I manually add a new job to the queue, the worker "wakes up" and runs all the old delayed jobs.

How can I debug this case? As said, I attached a listener to every Queue, Worker and QueueEvents event, but I didn't see anything unusual.
What could prevent a job from moving to the 'waiting' state when it is time to handle it?
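
One way to notice this state without waiting for events is to periodically check whether any delayed job is past its due time. The following is a minimal sketch, not a BullMQ feature: `DelayedJobInfo` and `findOverdueJobs` are hypothetical names, and the `timestamp` and `delay` fields are assumed to mirror the properties of the same name on BullMQ's `Job`.

```typescript
// Minimal shape of a delayed job. The field names are assumed to mirror
// BullMQ's Job: `timestamp` is the creation time in ms, `delay` is in ms.
interface DelayedJobInfo {
  id: string;
  timestamp: number;
  delay: number;
}

// Hypothetical helper: returns the jobs whose due time (timestamp + delay)
// passed more than `graceMs` milliseconds before `now`.
function findOverdueJobs(
  jobs: DelayedJobInfo[],
  now: number,
  graceMs: number = 5000
): DelayedJobInfo[] {
  return jobs.filter((job) => job.timestamp + job.delay + graceMs < now);
}
```

In an application, the list could presumably come from `await queue.getDelayed()` on a timer, with a log line or alert whenever the result is not empty.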

Thanks for your help

How to reproduce.

No response

Relevant log output

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Apr 21 '24 11:04 oanguenot

hi @oanguenot we are tracking that in this issue https://github.com/taskforcesh/bullmq/issues/2466

Apr 21 '24 15:04 roggervalf

btw, what would help us is to see which values are passed to the bzpopmin command

Apr 21 '24 15:04 roggervalf

Thanks @roggervalf for your quick answer! On one hand, I'm happy to see that the problem does not seem to be in my code, because I spent days tracking it without success; on the other hand, it is still a problem in front of us :-) I added a comment to #2466 and will be happy to help one way or another.

I don't know what bzpopmin is. How or where can I find the values? Thanks

Apr 21 '24 15:04 oanguenot

hi @oanguenot, in order to see your commands you may need to connect to your Redis instance with redis-cli and then use the monitor command
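
Since `monitor` prints every command the server receives, the output is easier to inspect once filtered. The following is a minimal sketch of extracting the timeout from a logged `bzpopmin` call, assuming the usual MONITOR line layout (timestamp, `[db client-address]`, then quoted arguments) and relying on the timeout being the last argument of BZPOPMIN. `BzpopminCall` and `parseBzpopmin` are hypothetical names, and the key in the usage note is only an illustrative value.

```typescript
interface BzpopminCall {
  at: number; // server timestamp in seconds, as printed by MONITOR
  keys: string[]; // sorted-set keys the command blocks on
  timeout: number; // seconds; 0 means "block indefinitely"
}

// Hypothetical helper: parses one MONITOR output line and returns the
// BZPOPMIN details, or null if the line is not a BZPOPMIN call.
function parseBzpopmin(line: string): BzpopminCall | null {
  const header = /^(\d+\.\d+) \[[^\]]*\] /.exec(line);
  if (!header) return null;

  // Collect every double-quoted argument, allowing backslash escapes.
  const args: string[] = [];
  const argPattern = /"((?:[^"\\]|\\.)*)"/g;
  let match: RegExpExecArray | null;
  while ((match = argPattern.exec(line)) !== null) {
    args.push(match[1]);
  }

  // Expect at least: command name, one key, timeout.
  if (args.length < 3 || args[0].toLowerCase() !== "bzpopmin") return null;

  const timeout = Number(args[args.length - 1]);
  if (Number.isNaN(timeout)) return null;

  return { at: Number(header[1]), keys: args.slice(1, -1), timeout };
}
```

For example, feeding it a line such as `1713712345.123456 [0 127.0.0.1:51234] "bzpopmin" "bull:services:marker" "0"` yields a `timeout` of `0`, which is the value of interest here.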

Apr 21 '24 16:04 roggervalf

Is this what you need?

Should I leave the monitor open until it blocks, and check whether I see a timeout of zero?

Apr 21 '24 18:04 oanguenot

yeah, we would like to know which value is blocking that command, as we're doing some fixes to prevent passing 0

Apr 21 '24 19:04 roggervalf

also the value that is blocking that command could be a different value than 0, that's what we want to know

Apr 21 '24 19:04 roggervalf

hey @oanguenot, btw what are your queue settings, and which values are you using when adding delayed jobs?

Apr 23 '24 04:04 roggervalf

Hi @roggervalf,

Here are my settings:

queue = new Queue("services", {
  connection: {
    host: CONFIG().redisDbUrl,
    port: CONFIG().redisDbPort,
  },
});

I use the following when adding new jobs:

const job = await queue.add(
  `${service}-${instance.id}`,
  {
    userId: instance.userId,
    instanceId: instance.id,
    serviceId: service,
    immediate: false,
    retriedCounter,
  },
  {
    jobId: `${service}-${instance.id}`,
    removeOnComplete: true,
    removeOnFail: true,
    delay: delay + randomDelay,
  }
);

Nothing really special, I think.

On my side, after around 60 hours, all jobs have been processed on time (with Redis monitoring active).
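
For illustration, the value passed as `delay` could be computed along these lines. This is a minimal sketch, not the code used in the application: `computeDelay`, `MIN_DELAY_MS` and `MAX_DELAY_MS` are hypothetical names, and the 2 second to 1 hour bounds are taken from the range described in the issue.

```typescript
const MIN_DELAY_MS = 2000; // 2 seconds
const MAX_DELAY_MS = 60 * 60 * 1000; // 1 hour

// Hypothetical helper: clamps the base delay to the expected range and adds
// a random jitter in [0, maxJitterMs). Throws on non-finite or negative
// input so a bad value never reaches the queue.
function computeDelay(
  baseDelayMs: number,
  maxJitterMs: number,
  random: () => number = Math.random
): number {
  if (!Number.isFinite(baseDelayMs) || !Number.isFinite(maxJitterMs) || maxJitterMs < 0) {
    throw new RangeError("delay and jitter must be finite, and jitter non-negative");
  }
  const clamped = Math.min(Math.max(baseDelayMs, MIN_DELAY_MS), MAX_DELAY_MS);
  const jitter = Math.floor(random() * maxJitterMs);
  return Math.floor(clamped) + jitter;
}
```

The `random` parameter exists only so the jitter can be made deterministic when testing.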

Apr 23 '24 18:04 oanguenot

thank you @oanguenot, pls let us know if it happens again. One last question: how frequently did it happen before?

Apr 24 '24 13:04 roggervalf

It happened every 2 or 3 days, but I can't remember when it started. It seems to have worked very well a few versions ago, or I didn't notice because of other manual restarts I did.

Apr 24 '24 18:04 oanguenot

Everything has been running smoothly for the past 6 days. No problem so far.

Apr 27 '24 19:04 oanguenot

thank you @oanguenot, we also released a performance change regarding this topic. You can try version 5.7.6. Pls let us know how it goes

Apr 27 '24 20:04 roggervalf

I would even recommend upgrading to 5.7.7, as it mitigates a potential issue we have discovered with IORedis in the case of network partitions.

Apr 30 '24 09:04 manast

looks like this issue is already resolved

Aug 02 '24 05:08 roggervalf
