
Possible infinite loop if BRPOPLPUSH fails?

Open · smennesson opened this issue 5 years ago · 25 comments

Description

Hello, today we hit what looks like an infinite loop in the Bull library. Our service is hosted on Heroku with the Redis add-on, and today we reached the memory quota of our Redis DB. What happened is that we got an enormous stack of logs saying this:

BRPOPLPUSH { ReplyError: OOM command not allowed when used memory > 'maxmemory'. 
    at parseError (/app/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:179:12) 
    at parseType (/app/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:302:14) 
  command: 
   { name: 'brpoplpush', 
     args: 
      [ 'bull:<name-of-our-job>:wait', 
        'bull:<name-of-our-job>:active', 
        '5' ] } } 

The logs grew to about 1 GB in a few minutes, until we fixed the quota issue.

Looking at the code in lib/queue.js, it seems that the error from BRPOPLPUSH is ignored in Queue.prototype.getNextJob. So my guess is that the loop looking for new jobs to process kept hitting the same error over and over.

I don't know enough about Bull's internals to propose a fix, but I think this is something that should be handled, maybe by detecting repeated BRPOPLPUSH errors and adding a waiting period when they happen too frequently.
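Something like this, purely to illustrate the idea (this is not Bull's actual code; the function name, key names and backoff numbers here are all made up):

    const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

    // Illustration only: `client` is an ioredis client, the keys follow the
    // usual bull:<queue>:wait / bull:<queue>:active pattern.
    async function getNextJobIdWithBackoff(client, waitKey, activeKey) {
      let consecutiveErrors = 0;
      for (;;) {
        try {
          const jobId = await client.brpoplpush(waitKey, activeKey, 5);
          consecutiveErrors = 0;
          if (jobId) {
            return jobId;
          }
        } catch (err) {
          consecutiveErrors += 1;
          console.error('BRPOPLPUSH failed:', err.message);
          // Wait longer the more often it fails, capped at 30 seconds,
          // instead of retrying immediately and flooding the logs.
          await delay(Math.min(1000 * 2 ** consecutiveErrors, 30000));
        }
      }
    }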

Bull version

v3.7.0

(I've just seen that 3.8 is out; reading the changelog, it doesn't look like this is fixed in that version.)

smennesson avatar May 02 '19 12:05 smennesson

Exactly the same issue under the same circumstances: Heroku, Redis (Premium 0 instance, v3.2.12), BRPOPLPUSH :(


What are the steps to prevent Redis from reaching the 50 MB limit? What can I do after a job is done?

P.S. I've upgraded my Redis instance to 4.0.14 ... I'll report if I get the same issue

balexandre avatar May 09 '19 22:05 balexandre

@balexandre actually reaching the Redis limit is not the real problem here. There can be multiple reasons for your Redis DB being full, and there are a few tools that show which keys take up the most memory, so you can either clean up a bit or upgrade your plan.
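For example, with an ioredis client (the same library Bull uses) you can check the overall memory stats and the approximate size of the usual Bull keys; note that MEMORY USAGE needs Redis 4+, and the queue name below is just a placeholder:

    const Redis = require('ioredis');
    const redis = new Redis(process.env.REDIS_URL);

    async function inspectMemory(queueName) {
      // Overall stats: used_memory, maxmemory, etc.
      console.log(await redis.info('memory'));

      // Approximate per-key size of a few well-known Bull keys for this queue.
      for (const suffix of ['wait', 'active', 'completed', 'failed']) {
        const key = `bull:${queueName}:${suffix}`;
        const bytes = await redis.call('MEMORY', 'USAGE', key);
        console.log(key, bytes, 'bytes');
      }
    }

    inspectMemory('my-queue').catch(console.error).finally(() => redis.disconnect());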

The problem is that the Bull library goes into an infinite loop when this occurs, generating a huge volume of logs while ignoring the error, so we have no way to be warned that it is occurring.

smennesson avatar May 10 '19 07:05 smennesson

I have the same issue...

Is there some way to catch this error and log a different message?

I am not even sure where it's thrown from.

kand617 avatar Sep 14 '19 00:09 kand617

Seems to have been introduced in this commit: https://github.com/OptimalBits/bull/blob/f8ff3684915dd034492f10ba91fe5c2d5df0dbef/lib/queue.js#L893-L896

Perhaps the fix is simply to not swallow any errors here? Wouldn't this fire an error event on the queue if not swallowed?
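Roughly what I have in mind, as a sketch rather than the actual patch (the key handling here is simplified):

    // Sketch only: surface the error on the queue's 'error' event instead of
    // a silent catch, then rethrow so the caller can decide whether to retry.
    async function getNextJobId(queue) {
      const waitKey = queue.toKey('wait');     // e.g. bull:<name>:wait
      const activeKey = queue.toKey('active'); // e.g. bull:<name>:active
      try {
        return await queue.client.brpoplpush(waitKey, activeKey, 5);
      } catch (err) {
        queue.emit('error', err); // picked up by queue.on('error', handler)
        throw err;
      }
    }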

mmelvin0 avatar Dec 27 '19 22:12 mmelvin0

@mmelvin0 Did you receive a response or have any updates on this? Thanks!

masiamj avatar Mar 09 '20 19:03 masiamj

Actually I think it would be as easy as only swallowing the "Connection is closed" error, since this error happens when we force-close the connection with Redis. Should be an easy fix.

manast avatar Mar 09 '20 21:03 manast

This blew up our logging pipeline too. It filled up our Papertrail daily limit in two minutes. I love Bull and would like to move on this quickly. @manast how can I or my team help fix this? We actually have an open ticket about this on our backlog, because it can potentially break our production at https://checklyhq.com if this happens and we don't intervene manually.

tnolet avatar Mar 11 '20 10:03 tnolet

If you provide a PR I will merge it quickly and make a release. As mentioned, it is as simple as only swallowing the error when the connection is closed, and maybe adding a small 5-second delay before trying again.
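i.e. something along these lines (just a sketch; the exact error-message check is approximate):

    // Sketch: only the error we expect when we close the connection ourselves
    // would be swallowed; anything else (like the OOM error above) is reported.
    function isExpectedShutdownError(err) {
      return err.message && err.message.includes('Connection is closed');
    }

    async function handleGetNextJobError(queue, err) {
      if (isExpectedShutdownError(err)) {
        return; // we are shutting down on purpose, nothing to report
      }
      queue.emit('error', err);
      // Small pause before retrying, instead of a tight loop that floods the logs.
      await new Promise(resolve => setTimeout(resolve, 5000));
    }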

manast avatar Mar 11 '20 11:03 manast

@manast I will make a PR

tnolet avatar Mar 11 '20 12:03 tnolet

@manast I think this will do it. I refrained from adding a 5s delay because of unclear impact. I think not spamming the log is the first issue here. https://github.com/OptimalBits/bull/pull/1674

tnolet avatar Mar 11 '20 12:03 tnolet

Would love feedback or insights on the PR I mentioned above. This is still a significant risk of overwhelming any logging pipeline when things go south.

tnolet avatar Mar 27 '20 08:03 tnolet

This is high prio, will try to merge today.

manast-apsis avatar Mar 31 '20 08:03 manast-apsis

@manast-apsis thanks, let me know if I can help anywhere.

tnolet avatar Mar 31 '20 19:03 tnolet

@manast @manast-apsis Let me know what I need to do to get this merged. Thanks

tnolet avatar Apr 06 '20 15:04 tnolet

@manast @manast-apsis hope all is well, let me know if I can assist with merging the PR for this?

tnolet avatar Apr 20 '20 16:04 tnolet

The thing is that the current PR doesn't do anything valuable. I'm not sure which solution is best here. Basically I think we just need to delay a few seconds in case of error, maybe configurable. But sometimes the connection will fail with a "connection closed" error because we are manually closing the queue, and in that case we don't want to delay, so we need to check whether the closing promise is set and skip the delay if it is.
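Roughly like this, building on the earlier sketch but checking the closing promise instead of matching the error message (retryDelayOnError is a made-up name for a configurable setting):

    const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

    // Sketch: always report the error, then delay before the next attempt
    // unless the queue is being closed.
    async function onGetNextJobError(queue, err) {
      queue.emit('error', err);
      if (queue.closing) {
        // close() was called, so a "connection is closed" error is expected
        // here and we should not hold up the shutdown with a delay.
        return;
      }
      await delay((queue.settings && queue.settings.retryDelayOnError) || 5000);
    }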

manast avatar Apr 20 '20 20:04 manast

@manast what is your suggestion for moving this ticket forward? I have the feeling we are going in circles a bit. I would be happy to remove/close my PR in favour of a better solution.

The reason I'm pushing this is that it is still a risk for many folks using logging solutions with hard limits.

As mentioned, I'm ready to help or have my team also look at it.

tnolet avatar Apr 21 '20 08:04 tnolet

same problem here, only in production, using docker

danielkv avatar May 18 '20 00:05 danielkv

The thing is that the current PR doesn't do anything valuable. I'm not sure which solution is best here. Basically I think we just need to delay a few seconds in case of error, maybe configurable. But sometimes the connection will fail with a "connection closed" error because we are manually closing the queue, and in that case we don't want to delay, so we need to check whether the closing promise is set and skip the delay if it is.

@manast what do you mean when you say "we need to just delay some seconds"?

  • What are we delaying? The console.log()?
  • What happens after the delay? Do we just log, or do we take some other action?

I think I'm just really confused by the "delay" part. This will not solve spamming the logging channel; it will just spam with a delay...

tnolet avatar May 18 '20 10:05 tnolet

I posted a possible solution to this problem in #1746

dobesv avatar May 23 '20 01:05 dobesv

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 12 '21 14:07 stale[bot]

@tnolet not sure if you are aware, but this issue was addressed some time ago: https://github.com/OptimalBits/bull/pull/1974

Let me know if that is not the case.

manast avatar Jul 14 '21 14:07 manast

@manast I was not aware of this. Thanks for mentioning and fixing this! I will remove my thumbs down.

tnolet avatar Jul 15 '21 08:07 tnolet

Hello, I am currently running into this. I'm finding that if the Redis memory is full, BRPOPLPUSH fails and spams the logs. This also prevents our app from processing jobs in the queue, so the queue stays full.

Seems to be failing here: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L1202

And the catch block doesn't catch anything; no error is thrown. We get stuck in an infinite loop.

Jimmycon210 avatar Apr 12 '23 21:04 Jimmycon210

Hello, I am currently running into this. I'm finding that if the Redis memory is full, BRPOPLPUSH fails and spams the logs. This also prevents our app from processing jobs in the queue, so the queue stays full.

Seems to be failing here: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L1202

And the catch block doesn't catch anything; no error is thrown. We get stuck in an infinite loop.

If Redis is full, then the best option is to delete some data by hand or upgrade the instance to one with more memory; most hosting providers offer seamless upgrade procedures.
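On the application side you can also keep the data set small, for example by removing finished jobs as they complete and periodically cleaning up old ones (the queue name and grace period below are just placeholders):

    const Queue = require('bull');
    const queue = new Queue('my-queue', process.env.REDIS_URL);

    // Drop job data from Redis as soon as jobs finish instead of keeping it around.
    queue.add({ some: 'payload' }, { removeOnComplete: true, removeOnFail: true });

    // Periodically remove anything that finished more than an hour ago.
    async function cleanup() {
      await queue.clean(60 * 60 * 1000, 'completed');
      await queue.clean(60 * 60 * 1000, 'failed');
    }

    setInterval(() => cleanup().catch(console.error), 60 * 60 * 1000);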

manast avatar Apr 12 '23 21:04 manast