[Bug]: Flow jobs child failure handling is broken
Version
v4.6.3
Platform
NodeJS
What happened?
This is a follow up for #1469.
- If a child job is failing,
worker.on('failed', () => { ... }is invoked for the child job. - It is not invoked for the parent job failing, even if
failParentOnFailureis set on the child. - Redis event stream shows:
event: failed, jobId: xxx, failedReason: child xyz failed, prev: waiting-children, so the event is created, but the callback is never invoked.
According to #1469, this was fixed by #1481. However, it seems a regression was introduced with #1953.
Additionally, failParentOnFailure does not work with removeOnFail.
Expected behaviour: If failParentOnFailure is set for a child and the parent has removeOnFail, the parent should be removed from the queue.
How to reproduce.
No response
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
hi @CodingMeSwiftly, for the first point, worker failed events is emitted only for the current job that is moved to failed, so if the parent is also moved to failed, only the child as the current job will be send in the worker failed event, it's more useful a queueEvents instance in this case. For the 3rd point we have a test case where the event is invoked https://github.com/taskforcesh/bullmq/blob/master/tests/test_flow.ts#L1616-L1624, you can take a look
btw, the prs and issues your are linking are related to events that are used by queueEvents instance, workers don't read our event streams, workers only emit events for the current job that is being processed
Thank you for clearing things up about the difference between worker events and queue events.
Do you prefer to keep this issue open to track the failParentOnFailure / removeOnFail case, or should I open a dedicated issue for that (closing this issue)?
Additionally, failParentOnFailure does not work with removeOnFail. Expected behaviour: If failParentOnFailure is set for a child and the parent has removeOnFail, the parent should be removed from the queue.
I am experiencing this too, with BullMQ v4.12.3
This 'issue' might be more tricky to solve than initially thought.
Consider this scenario:
- A parent waiting for two children which are running simultaneously
-
Child Afails, whileChild Bcontinues to run - If
failParentOnFailure/removeOnFailwould attempt to remove the parent,Child Bcould not be removed, because it is locked
According to https://docs.bullmq.io/guide/flows#jobs-removal:
If any of the jobs that would be removed happen to be locked, none of the jobs will be removed, and an exception will be thrown.
I must say upfront that I just started using BullMQ, but my experience is that, when failParentOnFailure is set, the parent fails only once all children complete (and any one of them fails): it does not fail immediately. I especially tested this, because it is an important scenario in our use case.
In consideration of the scenario described by @CodingMeSwiftly, I would therefore dare to say that the expected behavior is the sequence as follows:
-
Child Astarts -
Child Bstarts -
Child Afails (note that nothing happens to the parent at this stage) -
Child Bcontinues to run (this is the desired and current behavior I remarked above) -
Child Bsucceeds -
Parentfails in virtue offailParentOnFailure -
Parentshall be removed in virtue ofremoveOnFail: according to https://docs.bullmq.io/guide/flows#jobs-removal, this should recursively remove all children, which have meanwhile completed and therefore hold no lock on the parent anymore.
The key would be to defer the failure and removal to the end of the process. My understanding is that the former is already implemented this way, so the hope is that the expert maintainers can tap into the same mechanism to implement the latter.
hi @matpen, is Child A having failParentOnFailure as true as well? also for the case of removeOnFail we are still thinking on it
@roggervalf in my actual configuration, yes: all children have both failParentOnFailure and removeOnFail.
Please find my actual config within, if it can help
await flowProducer.add({
name: 'batch0',
queueName: queue.name,
data: { step: 0 },
opts: {
jobId: `${job.data.type}_${job.data.procedure}_batch0`,
failParentOnFailure: true,
removeOnComplete: true,
removeOnFail: true,
},
children: childJobs.map(procedure => ({
name: procedure,
queueName: queue.name,
data: { procedure },
opts: {
jobId: `${job.data.type}_${procedure}`,
attempts: 2,
backoff: {
type: 'fixed',
delay: 300_000,
},
failParentOnFailure: true,
removeOnComplete: true,
removeOnFail: true,
},
})),
});
For the general case I described in https://github.com/taskforcesh/bullmq/issues/2229#issuecomment-1768739131, my guess is that it would make sense to fail the parent if just one of the children has failParentOnFailure and fails.
hi @matpen, I modified the test case a little bit and I cannot replicate your case https://github.com/taskforcesh/bullmq/pull/2241 as you can see from there, the failed event from the parent is emitted and one of the children is still active
pls try to modify one of our test cases to see if you can replicate your case
Thank you, @roggervalf for trying to reproduce the case I described. By looking at the code you linked, I see now that there is a misunderstanding:
- when you say "the parent fails", you seem to mean "the job failed event is emitted", and additionally perhaps "the job is moved to failed set"
- when I said "the parent fails" in my comment above I meant "the parent prevents the children from running"
IIUC this means that, differently from what I described in my comment, the current sequence of events is
-
Child Astarts -
Child Bstarts -
Child Afails ~(note that nothing happens to the parent at this stage)~ (the system immediately fails the parent at this stage) -
Parentfails in virtue offailParentOnFailure, and should be removed in virtue ofremoveOnFail: this gives an error becauseChild Bis still running (as documented here) -
Child Bcontinues to run (this is the desired and current behavior I remarked above) -
Child Bsucceeds - System is now in a bad state
If I got this right, the next question would be whether this approach is by design, or simply because it is (understandably) the most straightforward solution. And if so, whether it is possible to complicate the system and defer the removal of parents (as per the sequence in my previous comment).
Otherwise, it seems to me that failParentOnFailure cannot be really used effectively together with removeOnFail, unless we are sure that there is only one child.
- when I said "the parent fails" in my comment above I meant "the parent prevents the children from running"
Same problem here. I would like prevent next items to be processed.