
After update to bullmq: Error: could not renew lock for job

Open Twisterking opened this issue 10 months ago • 27 comments

Version

v5.34.2

Platform

NodeJS

What happened?

We used the predecessor bull successfully and very heavily over many months at our company. Now we have updated to bullmq and, to be honest, we are having quite a few issues.

Our queues get stuck quite frequently (we never had this issue!), and we sometimes run into this error:

Error: could not renew lock for job xyz

This just continues like that and never resolves until we do a restart, which is quite bad for us.

I also could not really find anything about this in other issues. What can we do here?

How to reproduce.

No response

Relevant log output


Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

Twisterking avatar Feb 05 '25 15:02 Twisterking

As far as I know, BullMQ is at least as stable as Bull and has a much larger test suite, so in general you should see fewer issues, not more. However, it is possible that during the migration you made some assumptions about how BullMQ works which may not hold true coming from Bull. The best thing would be if you could post a case that reproduces those issues so we can give you hints, or look deeper into it if it happens to be a bug.

Furthermore, you mentioned that you run into a particular error. That error is only produced via an event, and it is triggered only if a lock cannot be renewed for a given job. This is quite unusual, so it is probably related to the migration work. I also wonder: are you using TypeScript?
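
For reference, that error is emitted on the worker's error event, so you should be able to observe it with something like the following sketch (queue name, processor and connection are just placeholders):

import { Worker } from 'bullmq';

// Sketch: replace queue name, processor and connection with your own setup.
const worker = new Worker(
  'my-queue',
  async (job) => {
    // ... job logic ...
  },
  { connection: { host: 'localhost', port: 6379 } }
);

// Internal worker errors, including failed lock renewals, are emitted here.
worker.on('error', (err) => {
  console.error('Worker error:', err);
});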

manast avatar Feb 05 '25 16:02 manast

I am having the same issue. It happens randomly and I cannot even destroy the queue. I have to restart each time.

melihplt avatar Feb 06 '25 00:02 melihplt

Hi folks, just out of curiosity, how did you migrate from bull to bullmq? Did you create new queues for bullmq, or did you use a different prefix?

roggervalf avatar Feb 06 '25 02:02 roggervalf

Hello everyone,

No, we do not use TypeScript, just vanilla JS. The thing is that we continue to run into this issue. We have now even set these 2 options on our workers:

{
  maxStalledCount: 0, // do NOT allow to "retry" a stalled job. This CAN lead to a situation, where MULTIPLE workers work on the same job!
  stalledInterval: 1 * 60 * 1000 // 1 minute
}

... and we continue to have this issue. We need to restart the whole Node instance (Docker container) to make it start up again.
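
For context, we pass these options straight into the Worker constructor, roughly like this (sketch – queue name, processor and connection are placeholders for our real setup):

import { Worker } from 'bullmq';

const worker = new Worker(
  'imports', // placeholder queue name
  async (job) => {
    // ... read files/APIs and run MongoDB bulk operations ...
  },
  {
    connection: { host: 'localhost', port: 6379 }, // placeholder connection
    concurrency: 1,
    maxStalledCount: 0,
    stalledInterval: 1 * 60 * 1000 // 1 minute
  }
);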

We run into the failed event with the error: Error: could not renew lock for job xyz.

We are already trying not to overload the CPU as best we can. Of course we "fluctuate around 100%", but to me this does not mean we leave ZERO headroom for the CPU to even renew the lock. :/

Migration:

We are using a different prefix (bullmq). So we kinda discontinued the old bull queue and deployed all our instances in the right order to make the transition as smooth as possible.

Twisterking avatar Feb 06 '25 08:02 Twisterking

Nine times out of ten, errors of this nature stem from passing options or arguments incorrectly when not using TypeScript, especially when coming from Bull, which does not have the same signatures.

It is difficult to assess whether your issue is related to high CPU usage, as you mentioned that you are sometimes at 100%. Without more information about the specifics of your use case, and some test case that shows the problem, there is not much we can do to help you.

manast avatar Feb 06 '25 09:02 manast

I am having the same issue. It happens randomly and I cannot even destroy the queue. I have to restart each time.

It is highly unlikely that you are having "the same issue", especially when we do not even know yet what the issue is. So please, if you have an issue, post a reproducible case in a new issue and we will look into it.

manast avatar Feb 06 '25 09:02 manast

Thanks for the reply @manast. Could you please add more details on the "wrong passing of options"?

I don't understand how some "code bug" on our end could trigger this particular error. My understanding was that the Error: could not renew lock for job xyz error should, in 90% of cases, simply not happen, and IF it does, it is most often triggered by a stalled job. Maybe I got this wrong?

Sidenote: we do use TS checks in our VS Code setup and do not get any errors about wrongly passed options to e.g. Queue or Worker. Us passing completely wrong options somewhere seems unlikely to me.

On that note: we do have a QueueEvents listener on the stalled event and do NOT see any logging from it.
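
Roughly, that listener looks like this (sketch – queue name and connection are placeholders):

import { QueueEvents } from 'bullmq';

const queueEvents = new QueueEvents('imports', {
  connection: { host: 'localhost', port: 6379 } // placeholder connection
});

// Should fire whenever the stalled-job checker detects a job whose lock has expired.
queueEvents.on('stalled', ({ jobId }) => {
  console.warn(`Job ${jobId} stalled`);
});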

It is almost impossible for me to give you a reproduction example, also because of how randomly the issue occurs for us.

Our usecase:

We use the queue to connect our main (MeteorJS) app to our workers (plain Node.js apps). These workers handle huge data imports. Basically, all our jobs in the queue consist of reading data from files or APIs, creating MongoDB bulk update operations, and running these bulk operations on our MongoDB.

Twisterking avatar Feb 06 '25 09:02 Twisterking

But when this happens, what is the status of the job that could not renew the lock?

manast avatar Feb 06 '25 09:02 manast

We will implement some more logging today and get back to you. Thanks a lot for your responsiveness, highly appreciated!

We do use bullboard, and for some reason I cannot find these jobIds in our "failed" list. So I am also confused about where these jobs disappear to.

Twisterking avatar Feb 06 '25 09:02 Twisterking

How many jobs do you usually run concurrently?

manast avatar Feb 06 '25 09:02 manast

On these workers, only 1! We do have 2 Docker containers, but each only runs 1 job at a time. So "in total" you could say 2 jobs might run concurrently, but they run in separate Docker containers on separate Node processes.

Twisterking avatar Feb 06 '25 09:02 Twisterking

Are these jobs blocking the NodeJS event loop? Did you try using sandboxes instead?

manast avatar Feb 06 '25 10:02 manast

@Twisterking to find a pattern, I want to ask if you have the same setup as me.

  • Are you using Heroku or some container?
  • Do you add new jobs inside a worker?
  • Do you connect a websocket inside the worker?

melihplt avatar Feb 06 '25 11:02 melihplt

Quick update from my end:

It looks like, indeed, we identified some nested for loops and such that block the event loop. It just took us a very long time to find them. 😬

Will report back when I know (even) more!

@melihplt

  • we use self hosted docker containers on AWS EC2
  • yes, we do add jobs inside the workers
  • actually yes we do. We have Meteor's DDP connection in place to connect to our main server.

Twisterking avatar Feb 06 '25 16:02 Twisterking

According to some logging, in my case the job gets stuck connecting to Discord via a socket, "sometimes". But I don't understand why I cannot force the worker process to be killed. I will dig more too. Thanks @Twisterking.

melihplt avatar Feb 09 '25 13:02 melihplt

@melihplt could it be a bug in NodeJS where the connection enters an infinite loop? Have you tried with a different runtime such as Bun to see if you get the same result?

manast avatar Feb 09 '25 21:02 manast

Hello again,

I have some updates! We were able to improve the situation a bit, but we still run into the could not renew lock error very frequently.

What we do not understand at all is this: we have the following 2 settings set on ALL workers of the affected queue:

{
  maxStalledCount: 0, // do NOT allow to "retry" a stalled job. This CAN lead to a situation, where MULTIPLE workers work on the same job!
  stalledInterval: 3 * 60 * 1000 // 3 minutes
}

We did just realize that there are also 2 more options we did NOT change, and which are therefore left at their defaults: lockDuration and lockRenewTime.

But even with that, i.e. lockDuration left at its default of 30 seconds, how is it possible that we see the error in our logs this often (see the timestamp at the very left!):

[Image: screenshot of logs showing the could not renew lock error repeated in quick succession]

I'd appreciate your input! We need to give our workers enough time so that jobs do NOT stall. We DO have some parts in the code that sometimes block Node's event loop for over 30 seconds, which is fine for us.

But we are confused about which settings we need to make this possible.
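
Is this roughly the direction we should take? (sketch – the values are only for illustration and assume our longest blocking section stays well under 2 minutes):

import { Worker } from 'bullmq';

const worker = new Worker(
  'imports', // placeholder queue name
  async (job) => { /* ... import logic that can block the event loop ... */ },
  {
    connection: { host: 'localhost', port: 6379 }, // placeholder connection
    maxStalledCount: 0,
    stalledInterval: 3 * 60 * 1000, // 3 minutes
    lockDuration: 2 * 60 * 1000,    // keep the lock longer than the longest blocking section (default is 30s)
    lockRenewTime: 60 * 1000        // renew well before the lock would expire
  }
);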

Twisterking avatar Feb 10 '25 08:02 Twisterking

Why don't you use sandboxed processors, which are precisely designed for handling cases where you keep the NodeJS event loop busy?
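
Roughly, it just means moving the processor function into its own file and passing that file's path to the Worker, something like this sketch (file layout and connection are placeholders):

// processor.js – runs in a separate child process, so it cannot block the worker's event loop
export default async function (job) {
  // ... heavy import logic ...
  return { done: true };
}

// worker.js
import { Worker } from 'bullmq';
import path from 'node:path';
import { fileURLToPath } from 'node:url';

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const worker = new Worker('imports', path.join(__dirname, 'processor.js'), {
  connection: { host: 'localhost', port: 6379 } // placeholder connection
});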

manast avatar Feb 10 '25 10:02 manast

@manast

I tried this back then with bull and ran into huge issues. Since bull was an "oldschool" require() package, we ran into issues inside our "type": "module" ESM node app.

For this, and other reasons, I would like to avoid doing this.

Twisterking avatar Feb 10 '25 12:02 Twisterking

@melihplt could it be a bug in NodeJS where the connection enters an infinite loop? Have you tried with a different runtime such as Bun to see if you get the same result?

Yes, it could be. I added some more logging to catch it. Btw, I also realized I had not registered a global error listener on worker creation. All my logging was inside the worker.

So I also added this one:

worker.on('error', err => {
  console.error(err);
});

@Twisterking FYI.

melihplt avatar Feb 13 '25 10:02 melihplt

@manast

I tried this back then with bull and ran into huge issues. Since bull was an "oldschool" require() package, we ran into issues inside our "type": "module" ESM node app.

For this, and other reasons, I would like to avoid doing this.

BullMQ has much more advanced logic for loading sandboxed processors. If you are using plain JS, I think it will work easily; worth trying in my opinion.

manast avatar Feb 13 '25 13:02 manast

just started having this too.... no idea

afonsomatos avatar Feb 27 '25 09:02 afonsomatos

looking into how to reproduce but also encountering this issue locally after leaving the worker process running for a while. unrecoverable until restart.

kylealwyn avatar Mar 06 '25 01:03 kylealwyn

Important here is that you use v5.40+, as there is a known issue before that version that could produce stalled jobs when closing workers gracefully.

manast avatar Mar 09 '25 16:03 manast

+1 have same issue

could not renew lock for job 2214Worker error: could not renew lock for job 2214
could not renew lock for job 2214Worker error: could not renew lock for job 2214
could not renew lock for job 2214Worker error: could not renew lock for job 2214
could not renew lock for job 2214Worker error: could not renew lock for job 2214
could not renew lock for job 2214Worker error: could not renew lock for job 2214
could not renew lock for job 2214Worker error: could not renew lock for job 2214
Error processing job ID: 2214, at Fri May 02 2025 18:16:36 GMT+0000 (Coordinated Universal Time): TimeoutError: Timed out after waiting 30000ms
Missing lock for job 2214. moveToFinishedWorker error: Missing lock for job 2214. moveToFinished

kosiakMD avatar May 02 '25 18:05 kosiakMD

@kosiakMD how do you know it is the same issue? You are getting similar errors, but it is not possible to know whether what caused the poster's issue is what caused yours. The most useful thing to do, if you want a quick and effective resolution, is to create a new issue with all the information you can provide about your particular use case and, especially useful, a code snippet that reproduces the issue you have.

manast avatar May 02 '25 19:05 manast

Same happening here. In my case it is very easy to replicate:

  1. create a job and a worker
  2. get the job and log a message
  3. call extendLock on the job to 60000
  4. update progress to 100
  5. call moveToCompleted

Then I got the message Error: could not renew lock for job 45 every few seconds.
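
Roughly, the flow looks like this (sketch – queue name, connection and token handling are simplified):

import { Queue, Worker } from 'bullmq';
import { randomUUID } from 'node:crypto';

const connection = { host: 'localhost', port: 6379 }; // placeholder

const queue = new Queue('test', { connection });
await queue.add('my-job', { foo: 'bar' }); // 1. create a job

// Worker created without a processor so the job can be fetched manually.
const worker = new Worker('test', null, { connection });

const token = randomUUID();
const job = await worker.getNextJob(token); // 2. get the job

if (job) {
  console.log('processing job', job.id);
  await job.extendLock(token, 60000);       // 3. extend the lock to 60000 ms
  await job.updateProgress(100);            // 4. update progress to 100
  await job.moveToCompleted('done', token); // 5. move to completed
}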

I "fixed" it by adding these options to worker:

	skipLockRenewal: true,
	skipStalledCheck: true,

xjrcode avatar May 29 '25 22:05 xjrcode