elsa-core

Retry faulted workflows

Open sfmskywalker opened this issue 2 years ago • 8 comments

Like in Elsa 2, in Elsa 3 we should be able to retry faulted workflows.

sfmskywalker avatar Apr 07 '23 19:04 sfmskywalker

Is there a way to not retry faulted workflows in Elsa 2?

Also, in Elsa 2, when it retried faulted workflows, it overwrote the journal of the previous run. So, we can't trace why it faulted in the first place.

programatix avatar Jun 09 '23 04:06 programatix

> Is there a way to not retry faulted workflows in Elsa 2?

Yes, retries aren't automatic, so as long as you don't explicitly retry a faulted workflow, it will not be retried.

> Also, in Elsa 2, when it retried faulted workflows, it overwrote the journal of the previous run. So, we can't trace why it faulted in the first place.

That sounds like a bug - the journal is not supposed to be overwritten.

sfmskywalker avatar Jun 10 '23 10:06 sfmskywalker

I followed the dashboard + host guide to set up the app. A faulted workflow indeed doesn't retry, but if I stop the host and restart it, the faulted workflow reruns. I had a few incidents where the workflow faulted and caused the host to terminate. When I started the host again, it ran the faulted workflow again and crashed. The only way to resolve this was to open the Elsa DB and delete the workflow from one of the tables.

The host crashes with an access violation. I'm not sure what went wrong, but I think I mistakenly used an HTTP Response activity with its content set to JavaScript while writing plain text in it. When running as an Azure web app, it kept crashing indefinitely: the web app restarted, the workflow was rerun, and it crashed again.

programatix avatar Jun 10 '23 10:06 programatix

When the host crashes, and you inspect the DB, can you see if the workflow instance's Status column is indeed Faulted (4)? The expected behaviour is that when the app starts, only workflow instances with the Running (2) status are automatically resumed.
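To make the expected startup rule concrete, here is a minimal sketch. It is not Elsa's actual code: `WorkflowStatus`, `WorkflowInstance`, and `instances_to_resume` are hypothetical names, and only the two status values quoted above (Running = 2, Faulted = 4) come from this thread.

```python
from dataclasses import dataclass
from enum import IntEnum


class WorkflowStatus(IntEnum):
    # Only the two values mentioned in the thread are modelled here.
    RUNNING = 2
    FAULTED = 4


@dataclass
class WorkflowInstance:
    # Hypothetical stand-in for a row in the workflow instance table.
    id: str
    status: WorkflowStatus


def instances_to_resume(instances):
    """Return the instances a host would pick up again at startup."""
    return [i for i in instances if i.status == WorkflowStatus.RUNNING]


stored = [
    WorkflowInstance("a", WorkflowStatus.RUNNING),
    WorkflowInstance("b", WorkflowStatus.FAULTED),
]
resumed_ids = [i.id for i in instances_to_resume(stored)]
```

Under a rule like this, an instance whose fault was never persisted (for example, because the host crashed first) would still read as Running and would be resumed on the next start.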

sfmskywalker avatar Jun 10 '23 11:06 sfmskywalker

I'll have to try to reproduce this again this Monday and report back the findings.

But I do remember that faulted workflows that weren't crashing the app (status seen in the Elsa Dashboard) did rerun on the next run: I was debugging while developing a custom activity, and the breakpoint was hit. Anyway, I'll try to reproduce this again and report back with more detailed findings.

programatix avatar Jun 10 '23 11:06 programatix

@sfmskywalker, I can confirm that the workflow was re-run because there is no FaultedAt in the WorkflowInstance table. [image]

Somehow my app crashed when an exception occurred in an Elsa workflow, when it shouldn't have. So, this may have caused the workflow to rerun. I can't replicate the crash in a new project, though.

programatix avatar Jun 10 '23 13:06 programatix

Should a faulted workflow not be considered failed?

The individual activities, by contrast, should have a typical retry policy such as those found in Polly, with exponential back-off and jitter strategies.
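As a rough illustration of what such a per-activity policy does, here is a minimal sketch in Python. It is neither Polly's nor Elsa's API: `retry_with_backoff` and all of its parameters are hypothetical, and the "full jitter" variant (sleeping a random fraction of the back-off ceiling) is just one common choice.

```python
import random
import time


def retry_with_backoff(action, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, rng=random.random):
    """Call action(); on an exception, wait and retry with back-off plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the fault surface to the workflow
            # Exponential back-off: base_delay, 2x, 4x, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the ceiling.
            sleep(ceiling * rng())
```

The `sleep` and `rng` parameters are injectable only so the policy can be exercised without real delays. Once the attempts are exhausted, the exception propagates, which is the point at which the workflow itself would fault.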

Faulted workflows should then be reduced to those caused by configuration issues and plain old bugs.

The question is: is there no way for the incident management strategy of a faulted workflow to execute an alternate workflow?

Possible use case would be:

  1. A record is added to the application data store, which invokes a post-save event on the application event pipeline
  2. The event pipeline invokes the appropriate Elsa workflow
  3. The workflow faults due to an issue requiring human intervention and fails to complete, leaving the application in an inconsistent state

Is it not possible for an appropriate failure-management workflow to then execute, putting the record into an appropriate failed state and escalating to a human, whether by generating a Jira ticket or pushing a Slack message?
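The use case above can be sketched in a few lines of Python. This does not reflect any existing Elsa feature: `run_with_fault_handler`, `failing_workflow`, and `escalate` are hypothetical names, and the `notifications` list merely stands in for a Jira ticket or Slack message.

```python
def run_with_fault_handler(workflow, on_fault, record):
    """Run workflow(record); if it raises, pass the record and error to on_fault."""
    try:
        workflow(record)
        return True
    except Exception as error:
        on_fault(record, error)
        return False


def failing_workflow(record):
    # Stands in for a workflow that faults and needs human intervention.
    raise ValueError("needs human review")


notifications = []  # stands in for a ticketing or chat integration


def escalate(record, error):
    # Put the record into an explicit failed state, then notify a human.
    record["state"] = "failed"
    notifications.append(f"record {record['id']} failed: {error}")


record = {"id": 42, "state": "pending"}
completed = run_with_fault_handler(failing_workflow, escalate, record)
```

The design point is that the fallback runs outside the faulted workflow, so the record leaves its inconsistent state even though the original workflow never completes.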

Vaughan-mci avatar Jun 09 '24 09:06 Vaughan-mci

I think my above comment is mostly addressed in issue #4325

Vaughan-mci avatar Jun 09 '24 09:06 Vaughan-mci