orchestrator-core

[Feature]: Discuss / Design Options for workflow roll back

Open srichmond opened this issue 1 year ago • 2 comments

Contact Details

[email protected]

What should we build?

We would like to look at what options are available and how difficult it would be to implement roll back for a given workflow.

Relevant pseudo code

No response

srichmond, Dec 04 '24 15:12

I've been looking at this and have some ideas. First though, it's worth elaborating on the problem at hand.

Workflows are currently designed to always run to completion. A failure in a step is expected to be both transient and isolated, so a workflow can be restarted from the failed step. The problem occurs when those assumptions no longer hold. If a workflow fails in an unrecoverable way, subscriptions can be left out-of-sync (and potentially in an invalid state) with no easy way to back out. This is the main problem, because an out-of-sync, invalid subscription can be impossible to fix. If you have a subscription that is

  • out-of-sync,
  • invalid, and
  • used by other subscriptions,

then it cannot be validated or terminated. While it can be forced back in-sync, doing so while it is in an invalid state will cause problems. Fixing the subscription requires knowing what is broken, and that depends on which workflow failed and why. This is where rollback comes in.

Rolling back a workflow should return the state of the system to what it was before the workflow was started. Doing this fully automatically is difficult, since we also want to release any resources the workflow acquired before it failed.

Inverse Steps

There needs to be a way for a developer to define the inverse of a step. During rollback, the inverse of each step would be executed. This is likely to be necessary for any useful implementation of rollback. It can be done either with a new step variant:

def undo_thing(...): ...

@rollbackstep("Do Thing", undo_thing)
def do_thing(...): ...

or by "registering" the inverse function as such with the step:

@step("Do Thing")
def do_thing(...): ...

@do_thing.rollback("Undo Thing")
def undo_thing(...): ...
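
As a rough illustration of the second, registration-style option, here is a minimal sketch. The `Step` class and `step` decorator below are hypothetical stand-ins written for this discussion, not orchestrator-core's actual implementation, and the workflow state is assumed to be a plain dict.

```python
from typing import Callable, Optional

State = dict  # assumption: workflow state is a plain dict


class Step:
    """Hypothetical stand-in for a workflow step (not the real orchestrator-core class)."""

    def __init__(self, name: str, func: Callable[[State], State]) -> None:
        self.name = name
        self.func = func
        self.inverse: Optional["Step"] = None  # filled in by .rollback(...)

    def __call__(self, state: State) -> State:
        return self.func(state)

    def rollback(self, name: str) -> Callable[[Callable[[State], State]], "Step"]:
        """Return a decorator that registers a function as this step's inverse."""

        def decorator(func: Callable[[State], State]) -> "Step":
            self.inverse = Step(name, func)
            return self.inverse

        return decorator


def step(name: str) -> Callable[[Callable[[State], State]], Step]:
    """Hypothetical @step decorator that wraps a function in a Step."""

    def decorator(func: Callable[[State], State]) -> Step:
        return Step(name, func)

    return decorator


@step("Do Thing")
def do_thing(state: State) -> State:
    # Pretend to acquire a resource and record its id in the state.
    return {**state, "thing_id": 42}


@do_thing.rollback("Undo Thing")
def undo_thing(state: State) -> State:
    # Drop the reference that do_thing added.
    return {k: v for k, v in state.items() if k != "thing_id"}
```

One property of this shape is that the forward step and its inverse stay next to each other in the source, and a step without a registered inverse is easy to detect (`inverse is None`).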

Rolling Back

The most obvious way to handle rolling back is to offer "rollback" as one of the actions you can perform on a failed workflow. This fits neatly with the "abort" and "retry" actions already present and seems like the most intuitive option to me.

For running the inverse steps, constructing an ad-hoc workflow to run in the existing engine is a good option. It allows the rollback to be retried in the face of transient errors (e.g. the IMS being temporarily unavailable) and takes advantage of the full power of the engine.
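
To make the ad-hoc workflow idea concrete, here is a small sketch under some assumptions: `Step`, `run` and `build_rollback` are hypothetical names, the "engine" is reduced to a loop, and only steps that completed before the failure are inverted (whether the failed step itself also needs cleanup is left open here).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

State = dict  # assumption: workflow state is a plain dict


@dataclass
class Step:
    name: str
    func: Callable[[State], State]
    inverse: Optional[Step] = None


def run(steps: list[Step], state: State, start: int = 0) -> tuple[State, Optional[int]]:
    """Toy engine: run steps from `start`, return (state, failed index or None)."""
    for i in range(start, len(steps)):
        try:
            state = steps[i].func(state)
        except Exception:
            return state, i  # a real engine would persist this for retry
    return state, None


def build_rollback(steps: list[Step], failed_index: int) -> list[Step]:
    """Inverses of the steps that completed before the failure, newest first."""
    completed = steps[:failed_index]
    return [s.inverse for s in reversed(completed) if s.inverse is not None]


def fail(state: State) -> State:
    raise RuntimeError("IMS unavailable")


release_ip = Step("Release IP", lambda s: {k: v for k, v in s.items() if k != "ip"})
delete_port = Step("Delete port", lambda s: {k: v for k, v in s.items() if k != "port"})
workflow = [
    Step("Reserve IP", lambda s: {**s, "ip": "192.0.2.1"}, inverse=release_ip),
    Step("Create port", lambda s: {**s, "port": 7}, inverse=delete_port),
    Step("Activate", fail),
]

state, failed = run(workflow, {})            # fails at "Activate"
rollback = build_rollback(workflow, failed)  # [delete_port, release_ip]
state, rb_failed = run(rollback, state)      # same engine runs the rollback
```

Because the rollback is just another list of steps, a transient failure during rollback can be retried from the failed inverse step via the `start` argument, the same way a normal workflow would be.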

Rollback Opt-In?

Some steps implemented in the core make sense to have inverses, which would mean that almost every workflow would have at least one rollback-capable step. If we allow rollback for any workflow that has rollback-capable steps, it could mislead users of existing workflows, since very little meaningful rollback would be performed. Worse, it could cause hidden resource leaks if the created or updated reference to an external resource is rolled back while the resource itself is not. Requiring opt-in to rollback functionality avoids this problem, at the cost of some implementation overhead.

James-REANNZ, Jun 27 '25 00:06

@James-REANNZ nice writeup, thanks :)

I think it makes a lot of sense to define the rollback at step-level. There is a certain similarity to defining downgrade paths for migrations of (SQL) databases.

Making this opt-in is also a good point, as this is a fairly advanced way of using WFO. It could become a setting on the workflow's definition, one that, if enabled, would require all steps of the workflow to have a rollback step defined (unless explicitly set to a "no-op" rollback).
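
A sketch of what such a check could look like when a workflow is defined. Everything here is hypothetical (the `rollback_enabled` flag, the `no_op_rollback` marker and the `Workflow` and `Step` classes are made up for illustration and are not existing WFO API).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

State = dict  # assumption: workflow state is a plain dict


def no_op_rollback(state: State) -> State:
    """Explicit marker: this step intentionally has nothing to undo."""
    return state


@dataclass
class Step:
    name: str
    func: Callable[[State], State]
    inverse: Optional[Callable[[State], State]] = None


@dataclass
class Workflow:
    name: str
    steps: list[Step]
    rollback_enabled: bool = False  # hypothetical opt-in setting

    def __post_init__(self) -> None:
        # Workflows that do not opt in are left exactly as they are today.
        if not self.rollback_enabled:
            return
        missing = [s.name for s in self.steps if s.inverse is None]
        if missing:
            raise ValueError(
                f"Workflow {self.name!r} enables rollback, "
                f"but these steps define no inverse: {missing}"
            )


store = Step("Store state", lambda s: s, inverse=no_op_rollback)
create = Step("Create resource", lambda s: {**s, "id": 1})

legacy = Workflow("legacy", [store, create])  # accepted: rollback not enabled

try:
    Workflow("new", [store, create], rollback_enabled=True)
    error = None
except ValueError as exc:
    error = str(exc)  # names "Create resource" as missing an inverse
```

Failing at definition time rather than at rollback time means a workflow can never advertise a "rollback" action that would silently skip a step.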

Mark90, Jul 01 '25 07:07