bullmq [Bug]: waitUntilFinished() got stuck and resolved 1,5h later for > 10 jobs in 100ms

Version

5.45.2

Platform

NodeJS

What happened?

Hi everyone!

In short, we use a setup like this:

  try {
    const result = await job.waitUntilFinished(queueObj.queueEvents);
    return result;
  } catch (error) {
    Log.warn(`callOnWorker() - Job ${jobId} on queue ${queue.name} failed!`, { prefix: 'QUEUE' });
    throw error;
  }

I am aware that using waitUntilFinished() in production is not recommended, but we have been using this setup for years without issues (before with bull!).

We encountered a very strange behavior today:

Somehow, roughly 10 jobs got stuck over a longer period of time (at least 20 minutes!), but then randomly resolved at 10:46 today, almost all of them within the same 100 ms! So it can't be a TTL issue because, as I said, the jobs were added to the queue over a span of at least 20 minutes.

Does anyone have an idea what could have happened here?

We analyzed everything we can think of, but we can't see anything out of the ordinary for our server, elasticache, EFS drive, ... we checked everything.

Thanks, best Patrick

How to reproduce.

No response

Relevant log output

Code of Conduct

[x] I agree to follow this project's Code of Conduct

Jun 11 '25 10:06 Twisterking

Quick update on this: It seems like we completely lost connection to Redis at some point or something ... we are on AWS using ElastiCache. Has anyone ever seen an error like this? We have been using this setup for years now and have never seen it - it really happened out of the blue!

Any input is much appreciated!

Jun 11 '25 21:06 Twisterking

Any updates on this issue? We ran into the same problem yesterday, and BullMQ isn't processing any jobs at all now.

Jun 12 '25 13:06 Mario2206

@Mario2206 really interesting! For us, all is fine again to be honest since yesterday 10:46.

But if you had the same errors YESTERDAY, it might really have been an AWS issue?

Do you also host on AWS / ElastiCache by any chance?

Jun 12 '25 13:06 Twisterking

Yes, we also host on AWS/ElastiCache. We redeployed our ECS multiple times yesterday. It worked for about 10 minutes before we encountered this error. There were no updates on our side, so we believe it is an AWS issue.

Jun 12 '25 13:06 Mario2206

Hi guys, we actually recommend not using that method anymore. This post has some recommendations: https://blog.taskforce.sh/do-not-wait-for-your-jobs-to-complete/
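
As an illustration of not blocking on a job, here is a minimal sketch (not taken from the linked post) in which the caller only receives the job id and looks up the outcome later. The `QueueLike` and `StoredJob` interfaces and the `enqueue` and `lookupStatus` helpers are hypothetical names; the interfaces only list the members used here, on the assumption that BullMQ's `Queue` and `Job` satisfy them structurally.

```typescript
// Hypothetical minimal types; BullMQ's Queue and Job are assumed to fit them.
interface StoredJob {
  id?: string;
  returnvalue: unknown;
  failedReason?: string;
  getState(): Promise<string>;
}
interface QueueLike {
  add(name: string, data: unknown): Promise<StoredJob>;
  getJob(jobId: string): Promise<StoredJob | undefined>;
}

type JobStatus =
  | { state: 'completed'; result: unknown }
  | { state: 'failed'; reason: string }
  | { state: 'pending' }
  | { state: 'missing' };

// Enqueue and return right away with the job id instead of awaiting the result.
async function enqueue(queue: QueueLike, name: string, data: unknown): Promise<string> {
  const job = await queue.add(name, data);
  if (job.id === undefined) throw new Error('queue did not assign a job id');
  return job.id;
}

// Look the job up later, e.g. from a status endpoint or a poll loop.
async function lookupStatus(queue: QueueLike, jobId: string): Promise<JobStatus> {
  const job = await queue.getJob(jobId);
  if (!job) return { state: 'missing' };
  const state = await job.getState();
  if (state === 'completed') return { state: 'completed', result: job.returnvalue };
  if (state === 'failed') return { state: 'failed', reason: job.failedReason ?? 'unknown' };
  return { state: 'pending' };
}
```

Because the status is read from the stored job rather than from a one-off event, a lookup made after a reconnect can still see the final state, provided the job has not been removed in the meantime.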

Jun 13 '25 02:06 roggervalf

Experiencing the same issues with BullMQ suddenly not working on AWS ElastiCache Redis OSS, but we are not using the waitUntilFinished function. It started happening today, so it must be related to a recent release. Has anybody solved this or found the root cause? We also received the ECONNRESET error.

Jun 17 '25 01:06 aeftink

We did not make any updates to either Redis on ElastiCache or our bullmq package. So I honestly also do not understand why we are suddenly encountering these issues.

We are still on Redis 5.0.6 btw - are you by any chance also still on an old redis version @aeftink & @Mario2206 ?!

Jun 18 '25 12:06 Twisterking

> Quick update on this: It seems like we completely lost connection to Redis at some point or something ... we are on AWS using ElastiCache. Has anyone ever seen an error like this? We have been using this setup for years now and have never seen it - it really happened out of the blue!
> Any input is much appreciated!

Yes. One of the reasons we do not recommend this method is that you do not get any guarantees. We make a best effort to resolve the promise, but we cannot guarantee it. In your case you mention a disconnection, which is an edge case that could prevent this method from resolving: the job could succeed or fail while the connection is down, and after reconnection the event could have been missed. This is not a bug; we simply do not offer any guarantees, so the behaviour you are experiencing is expected.
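
For callers who keep using waitUntilFinished() anyway, a hedged sketch of one mitigation: pass the optional `ttl` argument so the wait rejects instead of hanging, then re-read the job from Redis to see whether it finished while the event was missed. `waitWithFallback`, `JobLike`, `JobSnapshot`, and the `refetch` callback (for example `() => queue.getJob(jobId)`) are hypothetical names, and the interfaces assume BullMQ's `Job` satisfies them structurally.

```typescript
// Hypothetical minimal types; BullMQ's Job is assumed to satisfy both.
interface JobLike<R> {
  waitUntilFinished(queueEvents: unknown, ttl?: number): Promise<R>;
}
interface JobSnapshot<R> {
  getState(): Promise<string>;
  returnvalue: R;
}

// Wait for the finish event, but fall back to the stored job state
// if the wait rejects (for example after the ttl expires).
async function waitWithFallback<R>(
  job: JobLike<R>,
  queueEvents: unknown,
  refetch: () => Promise<JobSnapshot<R> | undefined>,
  ttlMs: number,
): Promise<R> {
  try {
    return await job.waitUntilFinished(queueEvents, ttlMs);
  } catch (waitError) {
    // The event may have been missed; ask Redis for the current state.
    const fresh = await refetch();
    if (fresh && (await fresh.getState()) === 'completed') {
      return fresh.returnvalue;
    }
    // Still running, failed, or removed: surface the original error.
    throw waitError;
  }
}
```

This narrows the window in which a missed event goes unnoticed but does not add a guarantee: if the completed job has already been removed, the fallback has nothing to read.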

Jun 18 '25 13:06 manast

> Experiencing the same issues with BullMQ suddenly not working on AWS ElastiCache Redis OSS, but we are not using the waitUntilFinished function. It started happening today, so it must be related to a recent release. Has anybody solved this or found the root cause? We also received the ECONNRESET error.

If you are not using waitUntilFinished, then by definition it is not the same issue. Please open a new issue with the specifics of your case; otherwise we will not be able to help.

Jun 18 '25 13:06 manast

> We did not make any updates to either Redis on ElastiCache or our bullmq package. So I honestly also do not understand why we are suddenly encountering these issues.
>
> We are still on Redis 5.0.6 btw - are you by any chance also still on an old redis version @aeftink & @Mario2206 ?!

5.0.6 is an old version of Redis that is not fully supported by the latest versions of BullMQ; you should use 6.2.0 or higher. We cannot provide the same guarantees on older versions of Redis. 5.0.6 was released 6 years ago, and it is possible to upgrade to the newest versions without breaking backwards compatibility.
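
To check which version a deployment is actually running, the `redis_version` field of the Redis INFO output can be compared against 6.2.0. The sketch below works on the INFO text only; `parseRedisVersion` and `isAtLeast` are hypothetical helper names, and fetching the text (for example with ioredis via `await redis.info('server')`) is assumed rather than shown.

```typescript
// Extract "redis_version" from the text returned by the Redis INFO command.
function parseRedisVersion(info: string): string | undefined {
  const match = /^redis_version:(\S+)/m.exec(info);
  return match ? match[1] : undefined;
}

// Compare dotted numeric versions; true when `actual` >= `minimum`.
function isAtLeast(actual: string, minimum: string): boolean {
  const a = actual.split('.').map(Number);
  const b = minimum.split('.').map(Number);
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) return diff > 0;
  }
  return true;
}

// Sample INFO fragment using the version mentioned in this thread.
const sampleInfo = '# Server\r\nredis_version:5.0.6\r\nredis_mode:standalone\r\n';
const detected = parseRedisVersion(sampleInfo);
console.log(detected, detected !== undefined && isAtLeast(detected, '6.2.0'));
// prints: 5.0.6 false
```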

Jun 18 '25 13:06 manast
