Workflow Execution Recovery

Open sfmskywalker opened this issue 1 year ago • 8 comments

Overview: In the event of an application crash, currently running workflow executions are lost. There's a critical need for a reliable method to restart these abruptly interrupted workflows. This feature introduces a robust solution to address this challenge.

Key Mechanism: The cornerstone of this feature is the utilisation of the workflow instance's Status field. Under normal operations, a workflow instance transitions through various states like "Finished", "Suspended", or "Faulted". However, if an instance is unexpectedly terminated due to an application crash, its Status remains as "Running". This state indicates that the workflow was active at the time of the crash and did not conclude naturally.

Terminology: To ensure clarity and avoid confusion with existing processes, we introduce the term "Restart" for this feature. This term is distinct from "Resume", which is already used for restarting suspended workflows. "Restart" specifically refers to the process of restarting workflows that were actively running and got interrupted due to an application crash.

Restarting Methods: This feature encompasses two primary methods for restarting interrupted workflows:

  1. Alteration-Based Recovery:
    • This method allows for the manual initiation of the recovery process.
    • It involves altering specific parameters or settings to trigger the restart of the interrupted workflow, for example supplying additional input or choosing whether to run the workflow synchronously or asynchronously.
    • This option provides control and flexibility, particularly useful in scenarios where selective recovery is needed.
  2. Automatic Recovery During Application Startup:
    • This is an automated approach designed to streamline the recovery process.
    • Upon application restart, the system automatically scans for workflows that still have a "Running" status but were actually halted by the crash.
    • These identified workflows are then automatically recovered, ensuring minimal disruption and swift continuation of business processes.
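
As a rough illustration of the startup scan described above, here is a minimal C# sketch. It is not Elsa's implementation: `InstanceStatus`, `InstanceRecord`, and `RecoveryScan` are hypothetical names made up for this example, the "store" is just an in-memory collection, and the actual restart is delegated to a caller-supplied callback.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical status values mirroring the states named in this issue.
public enum InstanceStatus { Running, Suspended, Finished, Faulted }

// Hypothetical, minimal stand-in for a persisted workflow instance record.
public sealed record InstanceRecord(string Id, InstanceStatus Status);

public static class RecoveryScan
{
    // Returns the IDs of instances still marked Running in the persisted data.
    // Assumption: this runs at startup, before any new workflow is started,
    // so a Running status can only be left over from an interrupted execution.
    public static IReadOnlyList<string> FindInterrupted(IEnumerable<InstanceRecord> persisted) =>
        persisted
            .Where(instance => instance.Status == InstanceStatus.Running)
            .Select(instance => instance.Id)
            .ToList();

    // Invokes the caller-supplied restart callback once per interrupted
    // instance and returns how many instances were handed to it.
    public static int RestartInterrupted(IEnumerable<InstanceRecord> persisted, Action<string> restart)
    {
        var interruptedIds = FindInterrupted(persisted);
        foreach (var id in interruptedIds)
            restart(id);
        return interruptedIds.Count;
    }
}
```

The sketch assumes a single application instance. With several nodes sharing one database, a "Running" status alone would not distinguish a crashed execution from one that is still active on another node, so some additional signal would be needed.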

Conclusion: This feature is a significant step towards enhancing the resilience and reliability of our workflow management. By accurately identifying and efficiently restarting interrupted workflows, we ensure continuity and reduce the impact of unexpected application crashes.

Tasks

  • [ ] #4831
  • [ ] #4832
  • [ ] #4835

sfmskywalker avatar Jan 24 '24 19:01 sfmskywalker

Hello @sfmskywalker, I agree with what you are proposing. In fact, from my tests, I think that resuming workflows that were suspended by a service restart or a service crash, for example, is not possible with what exists on the main branch and in v3, because nothing is saved in the database before the workflow or orchestration ends (I tested with MongoDB and SQL Server). I mean that the payload and everything sent to the workflow only become available in the database after the workflow has finished (finished or failed...).

Example, based on this sample: https://github.com/elsa-workflows/elsa-core/blob/main/src/bundles/Elsa.Server.Web/Endpoints/DynamicWorkflows/Post/Endpoint.cs


    // (enclosing endpoint class declaration not shown in the original comment)
    public override async Task HandleAsync(CancellationToken ct)
    {
        var workflow = new Workflow
        {
            Identity = new WorkflowIdentity("DynamicWorkflow1", 1, "DynamicWorkflow1:v1"),
            Root = new Sequence
            {
                Activities =
                {
                    new OneActivity
                    {
                        ServiceName = new Input<string>("ServiceOne"),
                    },
                    new OneActivity
                    {
                        ServiceName = new Input<string>("ServiceTwo"),
                    }
                }
            }
        };

        await workflowRegistry.RegisterAsync(workflow, ct);
        await workflowRuntime.StartWorkflowAsync("DynamicWorkflow1", new StartWorkflowRuntimeOptions());
    }
}

public class OneActivity : CodeActivity
{
    public required Input<string> ServiceName { get; set; }

    protected override async ValueTask ExecuteAsync(ActivityExecutionContext context)
    {
        // Put a breakpoint here and stop debugging on the second activity call.
        var serviceName = ServiceName.Get(context);
    }
}

Just put a breakpoint in OneActivity and stop debugging at the second activity call. When checking the database collection, there is no data related to our orchestration, no payload... I tested this with many implementations.

bbenameur avatar Feb 09 '24 16:02 bbenameur

Thanks for the input @bbenameur , we can use your test case to verify the feature proposed here 👍🏻

sfmskywalker avatar Feb 09 '24 17:02 sfmskywalker

Hello @sfmskywalker, I noticed that this feature, which is quite critical for our requirements, has been omitted from Elsa 3.1. Could you please let me know whether there is an anticipated timeline for its reintroduction? Thank you.

hsnsalhi avatar Mar 04 '24 14:03 hsnsalhi

Hi @hsnsalhi, indeed, unfortunately we will not be able to include this capability in time for the 3.1 release, which is slated for this month. It will be picked up shortly thereafter, which means it will be included with 3.2, which will be released in June. And, as always, the feature will be part of the normal preview builds once it's available. Sorry for the delay on this one.

sfmskywalker avatar Mar 04 '24 19:03 sfmskywalker

Hi! Is there any ETA for this feature? I see it's been moved to the 3.3 milestone.

rosca-sabina avatar Jun 19 '24 07:06 rosca-sabina

Hi @rosca-sabina, unfortunately this feature has been pushed back again due to other priorities. It's unknown at this point when it can be picked up.

sfmskywalker avatar Jun 19 '24 07:06 sfmskywalker

If you do it on 3.4, will it be done by end of year?

edward-yuen-tfs avatar Aug 01 '24 16:08 edward-yuen-tfs

It depends on the situation. Features driven by customer requests are typically implemented quickly, while other features are developed more organically, making them harder to plan for.

sfmskywalker avatar Sep 30 '24 19:09 sfmskywalker

Hi @sfmskywalker,

I noticed that this feature (Workflow Execution Recovery) was recently removed from the Elsa 3.4 milestone. Could you please clarify if there's still an anticipated timeline for this feature?

This capability seems essential for building resilient applications, especially in production environments where workflow interruptions due to crashes must be reliably recoverable. Without this functionality, it would be challenging to confidently deploy workflows in scenarios where no data loss or execution interruptions can be tolerated.

Thanks for your continued efforts—looking forward to your feedback!

Ramedlaw-knil avatar Jul 18 '25 07:07 Ramedlaw-knil

Hi @Ramedlaw-knil,

> This capability seems essential for building resilient applications, especially in production environments where workflow interruptions due to crashes must be reliably recoverable. Without this functionality, it would be challenging to confidently deploy workflows in scenarios where no data loss or execution interruptions can be tolerated.

100% agreed!

The following two features are crucial to that, and are already completed and part of 3.4:

  • https://github.com/elsa-workflows/elsa-core/issues/4832
  • https://github.com/elsa-workflows/elsa-core/issues/4835

This means that interrupted workflows are now restarted automatically.

One other feature that is equally important to prevent loss of work is the Graceful Shutdown feature. Although I cannot give an exact timeline, this is currently being designed.

  • https://github.com/elsa-workflows/elsa-core/issues/6401

sfmskywalker avatar Jul 18 '25 08:07 sfmskywalker

Hey @sfmskywalker,

do you have a rough ETA for this feature?

Thanks in advance! :)

lecramr avatar Oct 07 '25 08:10 lecramr

We haven’t fully committed to the Graceful Shutdown feature yet, but odds are that development will start somewhere within the next 4 to 8 weeks. From that point, it will likely take another 4 weeks to complete from start to end.

The other sub-features mentioned in this issue are available today as part of 3.5.

sfmskywalker avatar Oct 07 '25 09:10 sfmskywalker