[Bug]: Parent task part of a flow doesn't get scheduled

Open • vicpara opened this issue 7 months ago • 1 comment

Version

v5.8.2

Platform

NodeJS

What happened?

I'm running a flow where one job depends directly on one of 15 other child jobs. In about 5% of cases, processing of the child jobs stops (say, 12 completed and 3 never started) and the parent tree of jobs never gets scheduled.

I'm running BullMQ using AWS Redis MemoryDB and AWS Fargate.

In the logs I get no errors indicating any failed jobs, even though I have extensive logging for many kinds of failure. The only two odd lines that appear after the last child job is executed are:

[2024-07-11 11:22:39.008][15][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 7 cluster(s), remove 0 cluster(s)
[2024-07-11 11:22:39.008][15][info][upstream] [source/common/upstream/cds_api_helper.cc:69] cds: added/updated 0 cluster(s), skipped 7 unmodified cluster(s)

I also checked the initial flow job definition and the flow looks correct. If I restart the flow, everything works fine. This has been a recurring pattern for the last few months.

What kind of logging can I add to the child jobs to track their progress towards completion, and to see how many child jobs are still pending before the parent can become active? The child jobs are scheduled on a separate queue that runs under a rate limit because of external dependencies.

How can I diagnose this?
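
As a possible starting point for that logging, here is a minimal sketch of the kind of progress snapshot I have in mind. It assumes that the parent job exposes `getState()` and `getDependenciesCount()` the way a BullMQ `Job` does, and that the latter resolves to optional `processed` / `unprocessed` counts. The names `ParentLike`, `FlowProgress`, `summarizeProgress` and `logFlowProgress` are made up for illustration and are not part of BullMQ.

```typescript
// Counts as assumed to be returned by a BullMQ Job's getDependenciesCount();
// only the two fields used here are modelled.
interface DependencyCounts {
  processed?: number;
  unprocessed?: number;
}

// Hypothetical structural view of a parent job. A BullMQ Job is assumed
// to satisfy this shape; any stub with the same methods works too.
interface ParentLike {
  id?: string;
  getState(): Promise<string>;
  getDependenciesCount(): Promise<DependencyCounts>;
}

// Hypothetical snapshot type, used only for logging.
interface FlowProgress {
  parentId: string;
  state: string;
  processed: number;
  unprocessed: number;
  total: number;
}

// Pure helper: turn a state and raw counts into a loggable snapshot.
function summarizeProgress(
  parentId: string,
  state: string,
  counts: DependencyCounts,
): FlowProgress {
  const processed = counts.processed ?? 0;
  const unprocessed = counts.unprocessed ?? 0;
  return { parentId, state, processed, unprocessed, total: processed + unprocessed };
}

// Fetch the parent's state and dependency counts, then log one JSON line.
async function logFlowProgress(parent: ParentLike): Promise<FlowProgress> {
  const [state, counts] = await Promise.all([
    parent.getState(),
    parent.getDependenciesCount(),
  ]);
  const progress = summarizeProgress(parent.id ?? "unknown", state, counts);
  console.log(JSON.stringify(progress));
  return progress;
}
```

Calling something like `logFlowProgress` after each child job completes (with the parent loaded by its id) would show whether the parent is still in `waiting-children` and how many children remain unprocessed at the moment the flow stalls.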

How to reproduce.

?

Relevant log output

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

vicpara, Jul 11 '24 16:07