bullmq [Bug]: Flow jobs child failure handling is broken

Version

v4.6.3

Platform

NodeJS

What happened?

This is a follow up for #1469.

If a child job is failing, worker.on('failed', () => { ... } is invoked for the child job.
It is not invoked for the parent job failing, even if failParentOnFailure is set on the child.
Redis event stream shows: event: failed, jobId: xxx, failedReason: child xyz failed, prev: waiting-children, so the event is created, but the callback is never invoked.

According to #1469, this was fixed by #1481. However, it seems a regression was introduced with #1953.

Additionally, failParentOnFailure does not work with removeOnFail. Expected behaviour: If failParentOnFailure is set for a child and the parent has removeOnFail, the parent should be removed from the queue.

How to reproduce.

No response

Relevant log output

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Oct 13 '23 09:10 CodingMeSwiftly

hi @CodingMeSwiftly, for the first point, worker failed events is emitted only for the current job that is moved to failed, so if the parent is also moved to failed, only the child as the current job will be send in the worker failed event, it's more useful a queueEvents instance in this case. For the 3rd point we have a test case where the event is invoked https://github.com/taskforcesh/bullmq/blob/master/tests/test_flow.ts#L1616-L1624, you can take a look

Oct 14 '23 05:10 roggervalf

btw, the prs and issues your are linking are related to events that are used by queueEvents instance, workers don't read our event streams, workers only emit events for the current job that is being processed

Oct 14 '23 05:10 roggervalf

Thank you for clearing things up about the difference between worker events and queue events.

Do you prefer to keep this issue open to track the failParentOnFailure / removeOnFail case, or should I open a dedicated issue for that (closing this issue)?

Oct 14 '23 07:10 CodingMeSwiftly

Additionally, failParentOnFailure does not work with removeOnFail. Expected behaviour: If failParentOnFailure is set for a child and the parent has removeOnFail, the parent should be removed from the queue.

I am experiencing this too, with BullMQ v4.12.3

Oct 15 '23 14:10 matpen

This 'issue' might be more tricky to solve than initially thought.

Consider this scenario:

A parent waiting for two children which are running simultaneously
Child A fails, while Child B continues to run
If failParentOnFailure / removeOnFail would attempt to remove the parent, Child B could not be removed, because it is locked

According to https://docs.bullmq.io/guide/flows#jobs-removal:

If any of the jobs that would be removed happen to be locked, none of the jobs will be removed, and an exception will be thrown.

Oct 18 '23 13:10 CodingMeSwiftly

I must say upfront that I just started using BullMQ, but my experience is that, when failParentOnFailure is set, the parent fails only once all children complete (and any one of them fails): it does not fail immediately. I especially tested this, because it is an important scenario in our use case.

In consideration of the scenario described by @CodingMeSwiftly, I would therefore dare to say that the expected behavior is the sequence as follows:

Child A starts
Child B starts
Child A fails (note that nothing happens to the parent at this stage)
Child B continues to run (this is the desired and current behavior I remarked above)
Child B succeeds
Parent fails in virtue of failParentOnFailure
Parent shall be removed in virtue of removeOnFail: according to https://docs.bullmq.io/guide/flows#jobs-removal, this should recursively remove all children, which have meanwhile completed and therefore hold no lock on the parent anymore.

The key would be to defer the failure and removal to the end of the process. My understanding is that the former is already implemented this way, so the hope is that the expert maintainers can tap into the same mechanism to implement the latter.

Oct 18 '23 15:10 matpen

hi @matpen, is Child A having failParentOnFailure as true as well? also for the case of removeOnFail we are still thinking on it

Oct 20 '23 03:10 roggervalf

@roggervalf in my actual configuration, yes: all children have both failParentOnFailure and removeOnFail.

Please find my actual config within, if it can help

await flowProducer.add({
    name: 'batch0',
    queueName: queue.name,
    data: { step: 0 },
    opts: {
        jobId: `${job.data.type}_${job.data.procedure}_batch0`,
        failParentOnFailure: true,
        removeOnComplete: true,
        removeOnFail: true,
    },
    children: childJobs.map(procedure => ({
        name: procedure,
        queueName: queue.name,
        data: { procedure },
        opts: {
            jobId: `${job.data.type}_${procedure}`,
            attempts: 2,
            backoff: {
                type: 'fixed',
                delay: 300_000,
            },
            failParentOnFailure: true,
            removeOnComplete: true,
            removeOnFail: true,
        },
    })),
});

For the general case I described in https://github.com/taskforcesh/bullmq/issues/2229#issuecomment-1768739131, my guess is that it would make sense to fail the parent if just one of the children has failParentOnFailure and fails.

Oct 20 '23 07:10 matpen

hi @matpen, I modified the test case a little bit and I cannot replicate your case https://github.com/taskforcesh/bullmq/pull/2241 as you can see from there, the failed event from the parent is emitted and one of the children is still active

Oct 21 '23 03:10 roggervalf

pls try to modify one of our test cases to see if you can replicate your case

Oct 21 '23 03:10 roggervalf

Thank you, @roggervalf for trying to reproduce the case I described. By looking at the code you linked, I see now that there is a misunderstanding:

when you say "the parent fails", you seem to mean "the job failed event is emitted", and additionally perhaps "the job is moved to failed set"
when I said "the parent fails" in my comment above I meant "the parent prevents the children from running"

IIUC this means that, differently from what I described in my comment, the current sequence of events is

Child A starts
Child B starts
Child A fails ~(note that nothing happens to the parent at this stage)~ (the system immediately fails the parent at this stage)
Parent fails in virtue of failParentOnFailure, and should be removed in virtue of removeOnFail: this gives an error because Child B is still running (as documented here)
Child B continues to run (this is the desired and current behavior I remarked above)
Child B succeeds
System is now in a bad state

If I got this right, the next question would be whether this approach is by design, or simply because it is (understandably) the most straightforward solution. And if so, whether it is possible to complicate the system and defer the removal of parents (as per the sequence in my previous comment).

Otherwise, it seems to me that failParentOnFailure cannot be really used effectively together with removeOnFail, unless we are sure that there is only one child.

Oct 21 '23 08:10 matpen

when I said "the parent fails" in my comment above I meant "the parent prevents the children from running"

Same problem here. I would like prevent next items to be processed.

Mar 10 '24 15:03 klawdyo