
Intermittent queue addJob rejections "disconnected before secure TLS connection was established"

Open ctrlaltdylan opened this issue 2 years ago • 14 comments

Description

Bull.js is occasionally rejecting new jobs added to the queue. I have a dedicated service hosted on Amazon ECS that interfaces with Bull.js, using Amazon-managed Redis.

I send jobs to this service via an HTTP API.

Occasionally, jobs are rejected with a mysterious TLS error:

Error: Client network socket disconnected before secure TLS connection was established

It's only happening intermittently, which makes me think this is a configuration issue. I'm not sure whether the blocked-clients count in the Bull dashboard has something to do with it:

[Screenshot: Bull dashboard showing the blocked clients count, 2022-03-21]

Minimal, Working Test code to reproduce the issue.

(An easy to reproduce test case will dramatically decrease the resolution time.)

Sorry, it's a high-volume, intermittent failure on infrastructure, not something I can code up in an example.

Bull version

4.5.1

Additional information

Queue hosted on Amazon ECS with Fargate, Amazon Redis as the memory store.
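
(For readers hitting the same TLS error: a minimal, hypothetical sketch of passing TLS and retry options through Bull to the underlying ioredis client. None of this comes from the reporter's setup; the queue name, environment variables, and timeout values are placeholders to adapt to your Elasticache or Heroku Redis instance.)

```js
const Queue = require('bull');

// Bull forwards the `redis` options object to ioredis.
const jobQueue = new Queue('jobs', {
  redis: {
    host: process.env.REDIS_HOST,            // placeholder: your TLS-enabled endpoint
    port: Number(process.env.REDIS_PORT) || 6379,
    password: process.env.REDIS_PASSWORD,
    tls: {},                                  // enable TLS for in-transit encryption
    // tls: { rejectUnauthorized: false },    // sometimes needed for self-signed certs (e.g. Heroku Redis)
    connectTimeout: 30000,                    // allow extra time for the TLS handshake
    retryStrategy: (times) => Math.min(times * 200, 2000), // reconnect with capped backoff
  },
});
```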

ctrlaltdylan avatar Mar 21 '22 14:03 ctrlaltdylan

I'm having this issue as well. Queue hosted on Heroku with the heroku-redis add-on. After processing some records in the queue, it looks like the worker stops picking up jobs off the queue. Restarting the worker seems to temporarily alleviate this. It happened once during the afternoon when I redeployed the app, after which 5 unprocessed jobs ran immediately. I also manually restarted the workers a few minutes ago to get the last batch of records processed (3 more). We're occasionally getting the TLS error mentioned above, which is what triggered the team to look at it.

CassandraTrilus avatar Mar 30 '22 12:03 CassandraTrilus

@CassandraTrilus I tried await'ing the Queue.add call, and I was really hoping that was it, but no luck.

I'm wondering if there's some kind of non-Promise-ified callback inside the Redis library or the Queue.add call that's causing a race condition between your HTTP request finishing and the job being added.

I'm using a serverless setup, but it sounds like you're using a regular Heroku dyno as the producer of queue jobs?

ctrlaltdylan avatar Mar 30 '22 13:03 ctrlaltdylan

@ctrlaltdylan We are using a regular Heroku dyno. The weird thing is how difficult this issue is to reproduce: I have two apps in different stages using the same code branch, same bull version, same heroku-redis add-on, etc., and only one is experiencing the issue. I'm also curious whether this behavior shows up when using bullmq. I know support for bull is limited now, and some of the recommendations for issues here have been to use bullmq instead. Really curious what is causing this, as it seemed to come out of nowhere.

CassandraTrilus avatar Mar 30 '22 14:03 CassandraTrilus

@CassandraTrilus good call. The Job API is just about the same across bull and bullmq, so I might just try bullmq or downgrade bull. It's clearly a new bug; I never encountered this in bull v3.x.

ctrlaltdylan avatar Mar 30 '22 15:03 ctrlaltdylan

@CassandraTrilus when the workers stop processing new jobs, you can 1) check whether there are active jobs, which would mean the workers are stuck processing jobs and not completing them for some reason, and 2) use the getWorkers() API or the Redis CLIENT LIST command to see whether any workers are actually connected to Redis.
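
(A minimal sketch of those two checks using Bull's own API; `myQueue` is a placeholder for an existing Queue instance and the logging is illustrative only.)

```js
async function inspectQueue(myQueue) {
  // 1) Jobs stuck in the "active" state suggest workers picked them up
  //    but never completed them.
  const activeCount = await myQueue.getActiveCount();
  const activeJobs = await myQueue.getActive();
  console.log(`active jobs: ${activeCount}`,
    activeJobs.map((job) => ({ id: job.id, attempts: job.attemptsMade })));

  // 2) getWorkers() is backed by the Redis CLIENT LIST command and shows
  //    which worker connections Redis actually sees.
  const workers = await myQueue.getWorkers();
  console.log(`connected workers: ${workers.length}`,
    workers.map((w) => w.addr)); // fields come straight from CLIENT LIST
}
```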

manast avatar Mar 30 '22 15:03 manast

@manast what's your recommendation for avoiding this error on both the processing and producing sides?

Should we somehow increase the number of connections or workers?

ctrlaltdylan avatar Mar 30 '22 15:03 ctrlaltdylan

I do not have any recommendation because I do not know why this happens for you...

manast avatar Mar 30 '22 16:03 manast

Btw, if you know of an older version of bull that did not exhibit this problem, it would be useful to know, so we can narrow down the code that produces it.

manast avatar Mar 31 '22 02:03 manast

Just an update: in my case, switching to bullmq 1.78.2 appears to have fixed the issue.

Another silly mistake to check for: make sure you're await'ing every single Queue.add call; otherwise your HTTP API handler will close before the job is enqueued.
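
(As a concrete illustration of that point, a hypothetical Express-style producer; the route, queue name, and payload handling are assumptions, not code from this thread.)

```js
const express = require('express');
const Queue = require('bull');

const app = express();
app.use(express.json());

const myQueue = new Queue('jobs', process.env.REDIS_URL); // placeholder connection string

app.post('/jobs', async (req, res) => {
  try {
    // Awaiting ensures the job is written to Redis before the handler
    // (or serverless function) returns and the runtime can freeze/exit.
    const job = await myQueue.add(req.body, { attempts: 3 });
    res.status(202).json({ jobId: job.id });
  } catch (err) {
    // Intermittent TLS/connection failures surface here instead of
    // becoming unhandled promise rejections.
    res.status(503).json({ error: err.message });
  }
});

app.listen(3000);
```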

I'm still seeing this error on the processing side, but I attribute that to some async actions taking too long and the handlers timing out due to Vercel's timeout limits.

ctrlaltdylan avatar Apr 07 '22 13:04 ctrlaltdylan

Any updates on this issue?

MarcusHSmith avatar May 12 '22 17:05 MarcusHSmith

@MarcusHSmith what issue are you having, and can you provide reproducible code?

manast avatar May 13 '22 01:05 manast

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 12 '22 22:07 stale[bot]

Each time I've run into this issue, it's been because I hit the connection limit of my current Heroku Redis tier. Check out data.heroku.com and it will show you the number of connections in use. Each time, upgrading the Redis tier has improved our situation.

nicholasoxford avatar Jul 13 '22 18:07 nicholasoxford

Thanks for the tip!

I'm using Elasticache; I'll check my connection limits and share any graphs. I hope it's something this simple.
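
(For anyone checking connection limits outside Heroku, a rough sketch using the ioredis client a Bull queue already holds; the parsing here is simplistic, and CONFIG is disabled on some managed offerings such as Elasticache, so that part may fail.)

```js
async function logConnectionUsage(queue) {
  // INFO clients reports how many connections the server currently sees.
  const clientsInfo = await queue.client.info('clients');
  const match = /connected_clients:(\d+)/.exec(clientsInfo);
  console.log('connected_clients =', match && match[1]);

  // CONFIG GET maxclients shows the server-side connection limit.
  try {
    const [, maxclients] = await queue.client.config('GET', 'maxclients');
    console.log('maxclients =', maxclients);
  } catch (err) {
    console.log('CONFIG not available on this Redis:', err.message);
  }
}
```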


ctrlaltdylan avatar Jul 13 '22 20:07 ctrlaltdylan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 11 '22 23:09 stale[bot]