
Possible infinite loop if BRPOPLPUSH fails?

Open · smennesson opened this issue 5 years ago · 25 comments

Description

Hello, today we hit what looks like an infinite loop in the Bull library. Our service is hosted on Heroku with the Redis add-on, and today we reached the memory quota of our Redis DB. What happened is that we got an enormous stack of logs saying this:

BRPOPLPUSH { ReplyError: OOM command not allowed when used memory > 'maxmemory'. 
    at parseError (/app/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:179:12) 
    at parseType (/app/node_modules/ioredis/node_modules/redis-parser/lib/parser.js:302:14) 
  command: 
   { name: 'brpoplpush', 
     args: 
      [ 'bull:<name-of-our-job>:wait', 
        'bull:<name-of-our-job>:active', 
        '5' ] } } 

The logs grew to about 1 GB in a few minutes, until we fixed the quota issue.

Looking at the code in lib/queue.js, it seems that the error from BRPOPLPUSH is ignored in Queue.prototype.getNextJob. So my guess is that the loop looking for new jobs to process kept hitting the same error over and over.

I don't know enough about Bull's internals to propose a fix, but I think this is something that should be handled, maybe by detecting repeated BRPOPLPUSH errors and adding a waiting period when they happen too frequently.
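Something like this, purely to illustrate the idea (this is not Bull's actual code; the function name, key names and backoff numbers here are all made up):

    const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

    // Illustration only: `client` is an ioredis client, the keys follow the
    // usual bull:<queue>:wait / bull:<queue>:active pattern.
    async function getNextJobIdWithBackoff(client, waitKey, activeKey) {
      let consecutiveErrors = 0;
      for (;;) {
        try {
          const jobId = await client.brpoplpush(waitKey, activeKey, 5);
          consecutiveErrors = 0;
          if (jobId) {
            return jobId;
          }
        } catch (err) {
          consecutiveErrors += 1;
          console.error('BRPOPLPUSH failed:', err.message);
          // Wait longer the more often it fails, capped at 30 seconds,
          // instead of retrying immediately and flooding the logs.
          await delay(Math.min(1000 * 2 ** consecutiveErrors, 30000));
        }
      }
    }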

Bull version

v3.7.0

(I've just seen that 3.8 is out; reading the changelog, it doesn't look like this is fixed in that version.)

smennesson avatar May 02 '19 12:05 smennesson

Exactly the same issue under the same circumstances: Heroku, Redis (Premium 0 instance, v3.2.12), BRPOPLPUSH :(


What are the steps to prevent Redis from reaching the 50 MB limit? What can I do after a job is done?

P.S. I've upgraded my Redis instance to 4.0.14 ... I'll report if I get the same issue

balexandre avatar May 09 '19 22:05 balexandre

@balexandre actually reaching the Redis limit is not the real problem here. There can be multiple reasons for your Redis DB being full, and there are a few tools that show which keys take up the most memory, so you can either clean up a bit or upgrade your plan.
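For example, with an ioredis client (the same library Bull uses) you can check the overall memory stats and the approximate size of the usual Bull keys; note that MEMORY USAGE needs Redis 4+, and the queue name below is just a placeholder:

    const Redis = require('ioredis');
    const redis = new Redis(process.env.REDIS_URL);

    async function inspectMemory(queueName) {
      // Overall stats: used_memory, maxmemory, etc.
      console.log(await redis.info('memory'));

      // Approximate per-key size of a few well-known Bull keys for this queue.
      for (const suffix of ['wait', 'active', 'completed', 'failed']) {
        const key = `bull:${queueName}:${suffix}`;
        const bytes = await redis.call('MEMORY', 'USAGE', key);
        console.log(key, bytes, 'bytes');
      }
    }

    inspectMemory('my-queue').catch(console.error).finally(() => redis.disconnect());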

The problem is that the Bull library goes into an infinite loop when this occurs, generating a huge volume of logs while ignoring the error, so we have no way to be warned that it is occurring.

smennesson avatar May 10 '19 07:05 smennesson

I have the same issue...

Is there some way to catch this error and log a different message?

I am not even sure where it's thrown from.

kand617 avatar Sep 14 '19 00:09 kand617

Seems to have been introduced in this commit: https://github.com/OptimalBits/bull/blob/f8ff3684915dd034492f10ba91fe5c2d5df0dbef/lib/queue.js#L893-L896

Perhaps the fix is simply to not swallow any errors here? Wouldn't this fire an error event on the queue if not swallowed?
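Roughly what I have in mind, as a sketch rather than the actual patch (the key handling here is simplified):

    // Sketch only: surface the error on the queue's 'error' event instead of
    // a silent catch, then rethrow so the caller can decide whether to retry.
    async function getNextJobId(queue) {
      const waitKey = queue.toKey('wait');     // e.g. bull:<name>:wait
      const activeKey = queue.toKey('active'); // e.g. bull:<name>:active
      try {
        return await queue.client.brpoplpush(waitKey, activeKey, 5);
      } catch (err) {
        queue.emit('error', err); // picked up by queue.on('error', handler)
        throw err;
      }
    }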

mmelvin0 avatar Dec 27 '19 22:12 mmelvin0

@mmelvin0 Did you receive a response or have any updates on this? Thanks!

masiamj avatar Mar 09 '20 19:03 masiamj

Actually I think it would be as easy as only swallowing the "Connection is closed" error, since this error happens when we force-close the connection with Redis. Should be an easy fix.

manast avatar Mar 09 '20 21:03 manast

This blew up our logging pipeline too. It filled up our Papertrail daily limit in two minutes. I love Bull and would like to move on this quickly. @manast how can I or my team help fix this? We actually have an open ticket about this on our backlog, because it can potentially break our production at https://checklyhq.com if this happens and we don't intervene manually.

tnolet avatar Mar 11 '20 10:03 tnolet

If you provide a PR I will merge it quickly and make a release. As mentioned, it is as simple as only swallowing the error when the connection is closed, and maybe adding a small 5-second delay before trying again.
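i.e. something along these lines (just a sketch; the exact error-message check is approximate):

    // Sketch: only the error we expect when we close the connection ourselves
    // would be swallowed; anything else (like the OOM error above) is reported.
    function isExpectedShutdownError(err) {
      return err.message && err.message.includes('Connection is closed');
    }

    async function handleGetNextJobError(queue, err) {
      if (isExpectedShutdownError(err)) {
        return; // we are shutting down on purpose, nothing to report
      }
      queue.emit('error', err);
      // Small pause before retrying, instead of a tight loop that floods the logs.
      await new Promise(resolve => setTimeout(resolve, 5000));
    }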

manast avatar Mar 11 '20 11:03 manast

@manast I will make a PR

tnolet avatar Mar 11 '20 12:03 tnolet

@manast I think this will do it. I refrained from adding a 5s delay because of unclear impact. I think not spamming the log is the first issue here. https://github.com/OptimalBits/bull/pull/1674

tnolet avatar Mar 11 '20 12:03 tnolet

Would love feedback or insights on the PR I mentioned above. This is still a significant risk of overwhelming any logging pipeline when things go south.

tnolet avatar Mar 27 '20 08:03 tnolet

This is high prio, will try to merge today.

manast-apsis avatar Mar 31 '20 08:03 manast-apsis

@manast-apsis thanks, let me know if I can help anywhere.

tnolet avatar Mar 31 '20 19:03 tnolet

@manast @manast-apsis Let me know what I need to do to get this merged. Thanks

tnolet avatar Apr 06 '20 15:04 tnolet

@manast @manast-apsis hope all is well, let me know if I can assist with merging the PR for this?

tnolet avatar Apr 20 '20 16:04 tnolet

The thing is that the current PR doesn't do anything valuable. I'm not sure which solution is best here. Basically I think we just need to delay a few seconds in case of error, maybe configurable. But sometimes the connection will fail with a "connection closed" error because we are manually closing the queue, and in that case we don't want to delay, so we need to check whether the closing promise is set and skip the delay if it is.
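Roughly like this, building on the earlier sketch but checking the closing promise instead of matching the error message (retryDelayOnError is a made-up name for a configurable setting):

    const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

    // Sketch: always report the error, then delay before the next attempt
    // unless the queue is being closed.
    async function onGetNextJobError(queue, err) {
      queue.emit('error', err);
      if (queue.closing) {
        // close() was called, so a "connection is closed" error is expected
        // here and we should not hold up the shutdown with a delay.
        return;
      }
      await delay((queue.settings && queue.settings.retryDelayOnError) || 5000);
    }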

manast avatar Apr 20 '20 20:04 manast

@manast what is your suggestion for moving this ticket forward? I have the feeling we are going in circles a bit. I would be happy to remove/close my PR in favour of a better solution.

The reason I'm pushing this is that it is still a risk for many folks using logging solutions with hard limits.

As mentioned, I'm ready to help or have my team also look at it.

tnolet avatar Apr 21 '20 08:04 tnolet

same problem here, only in production, using docker

danielkv avatar May 18 '20 00:05 danielkv

The thing is that the current PR doesn't do anything valuable. I'm not sure which solution is best here. Basically I think we just need to delay a few seconds in case of error, maybe configurable. But sometimes the connection will fail with a "connection closed" error because we are manually closing the queue, and in that case we don't want to delay, so we need to check whether the closing promise is set and skip the delay if it is.

@manast what do you mean when you say "we need to just delay some seconds"?

  • What are we delaying? The console.log()?
  • What happens after the delay? Do we just log, or do we take some other action?

I think I'm just really confused by the "delay" part. This will not solve spamming the logging channel; it will just spam with a delay...

tnolet avatar May 18 '20 10:05 tnolet

I posted a possible solution to this problem in #1746

dobesv avatar May 23 '20 01:05 dobesv

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 12 '21 14:07 stale[bot]

@tnolet not sure if you are aware, but this issue was addressed some time ago: https://github.com/OptimalBits/bull/pull/1974

Let me know if that is not the case.

manast avatar Jul 14 '21 14:07 manast

@manast I was not aware of this. Thanks for mentioning and fixing this! I will remove my thumbs down.

tnolet avatar Jul 15 '21 08:07 tnolet

Hello, I am currently running into this. I'm finding that if the Redis memory is full, BRPOPLPUSH fails and spams the logs. This also prevents our app from processing jobs in the queue, so the queue stays full.

Seems to be failing here: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L1202

And the catch block doesn't catch anything; no error is thrown. We get stuck in an infinite loop.

Jimmycon210 avatar Apr 12 '23 21:04 Jimmycon210

Hello, I am currently running into this. I'm finding that if the Redis memory is full, BRPOPLPUSH fails and spams the logs. This also prevents our app from processing jobs in the queue, so the queue stays full.

Seems to be failing here: https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L1202

And the catch block doesn't catch anything; no error is thrown. We get stuck in an infinite loop.

If Redis is full, then the best option is to delete some data by hand or upgrade the instance to one with more memory; most hosting providers offer seamless upgrade procedures.
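On the application side you can also keep the data set small, for example by removing finished jobs as they complete and periodically cleaning up old ones (the queue name and grace period below are just placeholders):

    const Queue = require('bull');
    const queue = new Queue('my-queue', process.env.REDIS_URL);

    // Drop job data from Redis as soon as jobs finish instead of keeping it around.
    queue.add({ some: 'payload' }, { removeOnComplete: true, removeOnFail: true });

    // Periodically remove anything that finished more than an hour ago.
    async function cleanup() {
      await queue.clean(60 * 60 * 1000, 'completed');
      await queue.clean(60 * 60 * 1000, 'failed');
    }

    setInterval(() => cleanup().catch(console.error), 60 * 60 * 1000);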

manast avatar Apr 12 '23 21:04 manast