[Bug]: waitUntilFinished() got stuck and resolved 1.5 h later for > 10 jobs within 100 ms
Version
5.45.2
Platform
NodeJS
What happened?
Hi everyone!
In short, we use a setup like this:
try {
  const result = await job.waitUntilFinished(queueObj.queueEvents);
  return result;
} catch (error) {
  Log.warn(`callOnWorker() - Job ${jobId} on queue ${queue.name} failed!`, { prefix: 'QUEUE' });
  throw error;
}
I am aware that using waitUntilFinished() in production is not recommended, but we have been using this setup for years without issues (before with bull!).
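For anyone who wants to keep waiting on a job but bound how long the promise can hang, here is a minimal sketch of a generic timeout wrapper. The helper name `withTimeout` and the 60 second value are hypothetical, not part of BullMQ. As far as I know, `waitUntilFinished()` also accepts an optional second `ttl` argument in milliseconds, which may serve the same purpose.

```typescript
// Hypothetical helper (not a BullMQ API): rejects if `promise` has not
// settled within `ms` milliseconds, otherwise passes its result through.
function withTimeout<T>(promise: Promise<T>, ms: number, label = "operation"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Assumed usage with the snippet above (60_000 ms is an arbitrary example value):
// const result = await withTimeout(job.waitUntilFinished(queueObj.queueEvents), 60_000, `job ${jobId}`);
```

A timeout only stops the caller from hanging. It says nothing about whether the job itself completed or failed.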
We encountered a very strange behavior today:
Somehow, roughly 10 jobs got stuck over a longer period of time (at least 20 minutes!), but then suddenly resolved at 10:46 today, almost all of them within the same 100 ms! So it can't be a TTL issue because, as I said, the jobs were added to the queue over a timespan of at least 20 minutes.
Does anyone have an idea what could have happened here?
We analyzed everything we could think of, but we can't see anything out of the ordinary for our server, ElastiCache, EFS drive, ... we checked everything.
Thanks, best Patrick
How to reproduce.
No response
Relevant log output
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Quick update on this: It seems like we completely lost connection to Redis at some point or something ... we are on AWS using ElastiCache. Has anyone ever seen an error like this? We have been using this setup for years now and have never seen it - it really happened out of the blue!
Any input is much appreciated!
Any updates on this issue? We ran into the same problem yesterday—BullMQ isn’t processing any jobs at all now.
@Mario2206 really interesting! For us, all is fine again to be honest since yesterday 10:46.
But if you had the same errors YESTERDAY, it might really have been an AWS issue?
Do you also host on AWS / ElastiCache by any chance?
Yes, we also host on AWS/ElastiCache. We redeployed our ECS multiple times yesterday. It worked for about 10 minutes before we encountered this error. There were no updates on our side, so we believe it is an AWS issue.
Hi guys, we actually recommend not using that method anymore. Here are some recommendations: https://blog.taskforce.sh/do-not-wait-for-your-jobs-to-complete/
Experiencing the same issues with BullMQ suddenly not working in AWS ElastiCache Redis OSS but we are not using waitUntilFinished function. Started happening today so must be related to a recent release. Anybody solve this or find out the root issue? Also received the ECONNRESET error
We did not make any updates to either Redis on ElastiCache or our bullmq package. So I honestly do not understand why we are suddenly encountering these issues.
We are still on Redis 5.0.6 btw - are you by any chance also still on an old redis version @aeftink & @Mario2206 ?!
> Quick update on this: It seems like we completely lost connection to Redis at some point or something ... we are on AWS using ElastiCache. Has anyone ever seen an error like this? We have been using this setup for years now and have never seen it - it really happened out of the blue!
>
> Any input is much appreciated!
Yes. One of the reasons we do not recommend this method is that you do not get any guarantees. We make a best effort to resolve the promise, but we cannot guarantee it. In your case you mention a disconnection, which is an edge case that could prevent this method from resolving: the job could succeed or fail while the connection is down, and after reconnection the event could have been missed. This is not a bug. We simply do not offer any guarantees, and therefore the behaviour you are experiencing is expected.
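Since a finish event can be missed during a disconnection, one possible workaround is to check the job's stored state directly instead of relying only on the event. Below is a minimal sketch of that idea, not an official recommendation. `JobLike` and `pollUntilSettled` are hypothetical names; the only BullMQ call it assumes is `job.getState()`, which returns strings such as "completed" or "failed".

```typescript
// Minimal shape the helper needs; a BullMQ Job is assumed to satisfy it via getState().
interface JobLike {
  getState(): Promise<string>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: polls the job state until it is "completed" or "failed",
// or returns "timeout" once `deadlineMs` has elapsed.
async function pollUntilSettled(
  job: JobLike,
  intervalMs: number,
  deadlineMs: number,
): Promise<"completed" | "failed" | "timeout"> {
  const deadline = Date.now() + deadlineMs;
  while (Date.now() < deadline) {
    const state = await job.getState();
    if (state === "completed" || state === "failed") return state;
    await sleep(intervalMs);
  }
  return "timeout";
}

// Assumed usage (interval and deadline are arbitrary example values):
// const outcome = await pollUntilSettled(job, 1_000, 60_000);
```

Polling reads the state from Redis, so it does not depend on an event arriving while the connection was up. Note that if jobs are removed on completion, the state may no longer be available, so this sketch would need adapting for that setup.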
> Experiencing the same issues with BullMQ suddenly not working in AWS ElastiCache Redis OSS but we are not using waitUntilFinished function. Started happening today so must be related to a recent release. Anybody solve this or find out the root issue? Also received the ECONNRESET error
If you are not using waitUntilFinished, then by definition it is not the same issue. Please open a new issue with the specifics of your particular case, otherwise we will not be able to help.
> We did not make any updates to either Redis on ElastiCache or our bullmq package. So I honestly do not understand why we are suddenly encountering these issues.
>
> We are still on Redis 5.0.6 btw - are you by any chance also still on an old redis version @aeftink & @Mario2206 ?!
Redis 5.0.6 is an old version that is not fully supported by the latest versions of BullMQ; you should use 6.2.0 or higher. We cannot provide the same guarantees on older versions of Redis. 5.0.6 was released 6 years ago, and it is possible to upgrade to the newest versions without breaking backwards compatibility.
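To check which version a server actually runs, the output of the Redis `INFO` command contains a `redis_version` line. Here is a minimal sketch that parses it and compares it against the 6.2.0 minimum mentioned above. `parseRedisVersion` and `isAtLeast` are hypothetical helpers, and how you fetch the INFO text depends on your client.

```typescript
// Hypothetical helper: extracts the "redis_version" value from INFO output.
function parseRedisVersion(info: string): string | undefined {
  const match = /^redis_version:(\S+)/m.exec(info);
  return match?.[1];
}

// Hypothetical helper: true when `version` >= `minimum`, comparing the
// dot-separated parts numerically (missing parts count as 0).
function isAtLeast(version: string, minimum: string): boolean {
  const a = version.split(".").map(Number);
  const b = minimum.split(".").map(Number);
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return true;
}

// Assumed usage with an ioredis connection named `connection`:
// const version = parseRedisVersion(await connection.info("server"));
// if (version && !isAtLeast(version, "6.2.0")) console.warn(`Redis ${version} is older than 6.2.0`);
```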