cadence Proposal: Reset to a reset point with pending ChildWorkflows

Is your feature request related to a problem? Please describe. Right now it's not allowed to reset workflows when the current run has pending childWorkflows or users want to reset to a point that has pending ChildWorkflows. This is painful to use Cadence with ChildWorkflows

There are lots of complains from Cadence users. E.g: https://uber-cadence.slack.com/archives/CL22WDF70/p1610966432021400

Proposed Solution

using StartChildWorkflowExecutionFailedEvent and ChildWorkflowExecutionFailedEvent https://github.com/uber/cadence/blob/3a1e1bcd79bb5bbe76c572e41be73fbdb458fe41/proto/public/uber/cadence/api/v1/history.proto#L77

When resetting a parent workflow to a point that has pending childWFs, disregarding the current or future of the childWF, always fail the childWFs using StartChildWorkflowExecutionFailedEvent/ChildWorkflowExecutionFailedEvent, with additional information that the failure reason = reset
- If the pending workflow hasn't started at the reset point, then use StartChildWorkflowExecutionFailedEvent. The history is like below after a reset:

StartChildWorkflowExecutionInitiated  
...
DecisionTaskStarted
DecisionTaskFailed(cause=reset)
StartChildWorkflowExecutionFailed(cause=reset)
DecisionScheduled
DecisionTaskStarted

If the pending workflow has started, then use ChildWorkflowExecutionFailedEvent . The history is like below after a reset:

ChildWorkflowExecutionStarted
...
DecisionTaskStarted
DecisionTaskFailed(cause=reset)
ChildWorkflowExecutionFailed(cause=reset)
DecisionScheduled
DecisionTaskStarted

Reset will also terminate all pending workflows in the base run and current run(so that starting new ones won't fail).
- As an edge case, we may let terminating in a recursive way if there are multiple levels of childWFs to terminate.
Update client SDK to handle the error of failure reason = reset when starting childWF + getting results from childWF
Or we can ask user update workflow to handle the error of failure reason = reset

basically from

childFuture := Workflow.ExecuteChildWorkflow(...)
runID, err := childFuture.getRunID() // returns when childWF started or started fails
if err != nil{ return err}
results, err := childFuture.getResults() // returns when childWF completed

to (we could to it in the client SDK)

var childFuture 
var results
var err
var runID
for { // use a loop in case of multiple reset
    if err == ERR_FAILURE_BY_RESET{
        // user can decide if reset will start a a new childWF from beginning, or by resetting the previous childWF
        childFuture = Workflow.ExecuteChildWorkflowAfterReset(...) 
    }else{
        childFuture = Workflow.ExecuteChildWorkflow(...)
    }
    
    runID, err = childFuture.getRunID() // returns when childWF started or started fails
    if err == ERR_FAILURE_BY_RESET{
          continue
    }else if err != nil{
          return err
    }
    results, err = childFuture.getResults() // returns when childWF completed
    if err == ERR_FAILURE_BY_RESET{
          continue
    }else if err != nil{
          return err
    }
    // finally get a workflow result and runID
    break
}

Note that in ExecuteChildWorkflowAfterReset, workflow code can decide either start a a new childWF from beginning, or by resetting the previous childWF

Below are old stuff. Just for reference.

It's super hard to come up with a simple solution to meet all user scenarios.

The problem with pending childWorkflows is that after reset, the new parent will probably not get the results of the children of old parent. Because they are two different workflows, probably in two different shards, it's very difficult to make an atomic operation. There is some edge case like at the moment of resetting parent, the children are closing, or already closed(but not reported to old parent because old parent probably closed earlier), or even not really started yet.

Allow passing different options from user when resetting could be the a good idea. Option 1: Bypass checking and let it be. So new parent will not get the results if they are waiting. Option 2: Always starting new children Option 3: Best effort to save children: If not started then continue to start children, if not finished then let children report to new parent when finish. But if children already finished, then let new parent will get the old results(reapply history events). Option 4: Combination of opt2 and opt3 --- on top of opt3, if children already finished, then starting new children. Option 5: From @meiliang86 , instead of per reset request, based on Option 1,2,3,4, we allow workflow to provide different reset options for different childWFID in their code or in request. ~~Option 6: see https://github.com/uber/cadence/issues/3914#issuecomment-814638943~~ : doesn't work for some cases Option 7: use StartChildWorkflowExecutionFailedEvent and ChildWorkflowExecutionFailedEvent see details in https://github.com/uber/cadence/issues/3914#issuecomment-815283618

Challenge in opt2 is that old parent workflow history may already have ChildWorkflowStarted event which records the runID of childWorkflow. It's impossible to overwrite because immutability is the basic invariance of Cadence history protocol. To fix this:

We dont provide runID after childWF started --- but this is a breaking change. A mitigate of that breaking changes could be that we provide runID when ChildWF completed.
Or we just annouce that childFuture.getRunID() is not guaranteed under reset. That will work. But just that feature becomes less useful.
Introduce another event like ChildWorkflowRunUpdated to update the new runID. And client should provide a new runID after receiving this event. This idea may also solve the problem for continueAsNew of childWorkflows.

Overall:

opt1 is small effort to implement
opt3 is a big effort
opt2 is also big effort depends on how we deal with the RunID updates issue, it may require to design protocol.
opt4 will be easy if we have opt2 and opt3.
opt5 will be based on 1,2,3,4
~~opt6 is only slightly larger than opt1~~
opt 7 is slightly larger than opt1

Additional context N/A

Jan 22 '21 05:01 longquanzheng

How exactly do we treat activities that are started-and-running at the point we reset to? They have the same lifecycle: schedule, then start, then complete, they can fail, and you can reset to "between" any of those. I assume they eventually time out, as the response will never be accepted? Or are they eagerly failed?

Either way, it seems like we can treat children-workflows the same way. In-progress ones are abandoned (their behavior follows ParentClosePolicy), and in principle their Future/attempt resolves (but may be forgotten). If they have retries (... do we have child-workflow-retry-policy? I feel like no), a retry is spawned - if not, it isn't.

Children workflows can error or fail to start, it's not like this is a new possibility, though this will likely increase exposure to it. Races between child-complete and parent-abandoning-child could be considered the same as races with activities: it's a race, there is no wrong decision. Like any RPC, you can write to your local DB and "succeed" but fail to tell your caller.

More generally, from thinking about this in the past, I'm kinda convinced that the core problem is that we have child workflows at all.

If these were "regular" workflows, and you could get a Future for other workflows, it would both be far more flexible in general and resolve this kind of ambiguity: if the "parent" is reset, this is just a WorkflowResetError(new_run_id) in their Future, and the reset-workflow treats start-workflow identically to activities. Observing-workflows can do whatever they deem appropriate, quite possibly "query observed-workflow, find out what was forgotten, hook up any missing signal channels again". Starting workflows can be a pseudo-activity, with exactly the same semantics as activities: success/fail/retries, and we already have the duplicate-ID policy to determine what happens when it succeeds more than once.

You can do all of this today with an Activity for your workflow client calls, except for the "get a Future for another workflow" part. Currently that requires polling to detect all completion scenarios, which isn't particularly scalable.

Apr 07 '21 03:04 Groxx

@Groxx Excited that you have those Qs/comments, because you are getting closed to the truth(core of the problem). My proposal is indeed too brief so hopefully replying to your comments will make them more detailed.

How exactly do we treat activities that are started-and-running at the point we reset to?

We'd fail them and reschedule new ones.

They have the same lifecycle: schedule, then start, then complete

True for lifecycle, Activity: scheduled->started->completed vs ChildWF: initiated->started->completed(The happy cases). But they are different in details in the server-client protocol, if you look at history events. The ChildWFStarted event carries the ChildWF runID. In cadence client experience, it's for providing RunID and telling parent workflow that the childWF has started. The code looks something like this:

childFuture, err := Workflow.ExecuteChildWorkflow(...)
runID := childFuture.getRunID() // returns when childWF started
results := childFuture.getResults() // returns when childWF completed

it seems like we can treat children-workflows the same way

That's exactly the Option 2 in my proposal. But the implementation could be more complicated than activity. One of the very basic design of reset is that resetting will result a new Run(and new RunID). (We have to do this way to be able to fork workflow history like Git branching).

So now, if we do reset in the same way as for Activity, the problem is that line of code:

runID := childFuture.getRunID() // returns when childWF started

becomes invalid.

So:

if we can regret and take away that feature, then problem solved. (I kinda leaning to this if we want to take shortcut, because I don't believe many people are using this runID for business). --- but this is a breaking change.
- a mitigate of that breaking changes could be that we provide runID when ChildWF completed. That will be much nicer.
Or we just annouce that childFuture.getRunID() is not guaranteed under reset. That will work. But just that feature becomes less useful.
Therefore my proposal is to make a new event type to update the RunID for parent workflow. This is the only solution if we have to keep runID. and this solution is also generic enough for updating RunID in ContinueAsNew case -- which is also broken today.

Apr 07 '21 05:04 longquanzheng

Retries make "started" ambiguous for activities, as it can happen multiple times, so it's a convenient thing that there is no "activity run ID" / activity-start future exposed.
But say there was: how would a reset affect that? I kinda suspect we wouldn't be agonizing over it for so long, and would just introduce a new api for reset-with-retry scenarios :) maybe as a channel of starts instead of a future, or a "GetRunIds()" that always returns the current knowledge of starts. (You could imagine this paired with a "get result for run id X", which is arguably useful as well)

I'm not quite sure what activities in this scenario would be either, but "all or nothing retries" seems reasonable. If after start, retry for the full retry spec. If before start or after complete, it doesn't matter.

... but child workflows don't have retries, right? So they're just long-running activities like we have now: failing an activity like that seems correct, so why not child workflows? The run id does not need to change, because it can't, and there's no desire to change it: it failed. Start another.

I'm half trying to convince myself that I've been over-thinking this for many months (I explicitly avoided using child workflows in my service, because I couldn't find a way to live without reset, and then spent more time trying to figure out how to monitor workflow failures efficiently, which requires child workflows and their futures/contexts...), and half think it's honestly probably a good idea. We can introduce child-workflow-retry-options to mitigate the pain of failures, same as on activities, and move on. (Maybe we need a duplicate reuse policy here too, though I think there's a chance we have that already)

And in v2, drop the concept in favor of generic external workflows, with simpler semantics. The migration will probably be pretty straightforward for simple uses, and unlocks much more complex uses (M:N observing, for example).

Apr 07 '21 06:04 Groxx

Retries make "started" ambiguous for activities, as it can happen multiple times, so it's a convenient thing that there is no "activity run ID" / activity-start future exposed.

hehe yeah, luckily. Otherwise our reset doesn't work today either.

but child workflows don't have retries, right?

They do have retry. (there was a bug that I fixed a few months ago: https://github.com/uber/cadence/issues/3351 ) so you can put a server side backoff retry policy on childWF.

ChildWF retry will make a new runID too, like ContinueAsNew -- meaning that it's also broken already for getting runID from childWfFuture.

I am not sure if getting results is also broken after retry...I believe so, since getting results from a continuedAsNew childWF has been broken for a long time. @meiliang86 correct me if I am wrong for this one.

Apr 07 '21 06:04 longquanzheng

The below approach only work for pending childWFs that are not closed. * But it doesn't work for case that resetting to pending childWFs that are already closed.

@Groxx After discussing with you I probably found a simple way to do reset with pending childWFs. We probably can zoom and talk about it. I may need @meiliang86 to confirm my idea as it require some details about client behavior.

The proposal is this: on the server, we only provide a slightly modified version of opt1 of my proposal -- which just allow bypassing the checking of pending childWFs, and also terminating all pending childWFs.

Then in the workflow code, ask the users to do this:

before resetting a workflow with childWFs, change the code from

childFuture, err := Workflow.ExecuteChildWorkflow(...)
runID := childFuture.getRunID() // returns when childWF started
results := childFuture.getResults() // returns when childWF completed

to

childFuture, err := Workflow.ExecuteChildWorkflow(...)
runID := childFuture.getRunID()
needStartNewChild := checkIfRunIsTerminatedByParentReset(runID) // yes, this must RPC call to Cadence directly!  Anti-pattern
if needStartNewChild {
      version := Workflow.getVersion("reset requires to start new childWF", 1 )
   
     // start childWF again because it's terminated by reset
     childFuture, err = Workflow.ExecuteChildWorkflow(...)
      runID = childFuture.getRunID()
}
results := childFuture.getResults() // returns when childWF completed
// and now it's safe to use the runID for business purpose

then do the reset
then restart all workflow workers!(important for the RPC call)
after a short while(all reset workflows have passed the new code block), remove the RPC code block in workflow because it could cause NDE issue:

childFuture, err := Workflow.ExecuteChildWorkflow(...)
runID := childFuture.getRunID()
 version := Workflow.getVersion("reset requires to start new childWF", 1 )
if version == 1 {
     // start childWF again because it's terminated by reset
     childFuture, err = Workflow.ExecuteChildWorkflow(...)
      runID = childFuture.getRunID()
}
results := childFuture.getResults() // returns when childWF completed
// and now it's safe to use the runID for business purpose

This will result the same as opt2

Apr 07 '21 06:04 longquanzheng

The run id does not need to change, because it can't, and there's no desire to change it: it failed

@Groxx I know why you thought this way. I had the same until I understand the design of Cadence. Workflow history is the key of the design(that's why we called our kernel service "history service"). It uses "event sourcing" data model: https://microservices.io/patterns/data/event-sourcing.html

Because of this event sourcing, the very basic design principal/invariance of Cadence: history must be immutable once written. (if following this, you will understand why we have to wait activity complete to write the activity started event, for activity with retry -- a lot of people feel this very hard to understand. And also we called the internal WF meta data "mutalbeState" because the history is "immutable").

Therefore, RunID must be changed after reset. Because reset means overriding history by nature. Moreover, we want to keep the history of the previous run, so that we can do reset again -- e.g. for a runIdA of 100 events, the first reset to event 50th, and users realizes this is not good, and they want to reset to event 60th. If we override the event 60th, then we will never be able to reset to 60th.

^ that's why I said our current reset is like "Git" branching. It allows user to fork history.

Apr 07 '21 16:04 longquanzheng

With the new proposal, the checkRunID invokes a remote call so it will be a (system) activity?

If we introduce a version in the start workflow, will it be easier to solve option2?

Apr 07 '21 18:04 yux0

Update: I have added below as the proposal of this issue

@yux0

With the new proposal, the checkRunID invokes a remote call so it will be a (system) activity?

No, it has to be a raw function to invoke RPC, without activity, local activity, or side effect. If we use any of activity, local activity, or side effect, the proposal won't work. Imagine this: when the parent is waiting for a childWF for the result, and then the parent is reset to a place where the childWF is started.

If using activity, local activity, or side effect, the parent before resetting will put a marker there, and we will never be able to remove the code.

However, after reflecting on that proposal, I realized that approach only work for pending childWFs that are open. But it doesn't work for case that resetting to pending childWFs that are already closed. Because there is on way to terminate a childWF that is already closed.

So I come up with another proposal by using StartChildWorkflowExecutionFailedEvent and ChildWorkflowExecutionFailedEvent https://github.com/uber/cadence/blob/3a1e1bcd79bb5bbe76c572e41be73fbdb458fe41/proto/public/uber/cadence/api/v1/history.proto#L77

When resetting a parent workflow to a point that has pending childWFs, disregarding the current or future of the childWF, always fail the childWFs using StartChildWorkflowExecutionFailedEvent/ChildWorkflowExecutionFailedEvent, with additional information that the failure reason = reset
- If the pending workflow hasn't started at the reset point, then use StartChildWorkflowExecutionFailedEvent
- If the pending workflow has started, then use ChildWorkflowExecutionFailedEvent
Reset will also terminate all pending workflows in the base run and current run(so that starting new ones won't fail).
Update client SDK to handle the error of failure reason = reset when starting childWF + getting results from childWF
Or we can ask user update workflow to handle the error of failure reason = reset

basically from

childFuture := Workflow.ExecuteChildWorkflow(...)
runID, err := childFuture.getRunID() // returns when childWF started or started fails
if err != nil{ return err}
results, err := childFuture.getResults() // returns when childWF completed

to

var childFuture 
var results
var err
var runID
for { // use a loop in case of multiple reset
    childFuture = Workflow.ExecuteChildWorkflow(...)
    runID, err = childFuture.getRunID() // returns when childWF started or started fails
    if err = ERR_FAILURE_BY_RESET{
          continue
    }else if err != nil{
          return err
    }
    results, err = childFuture.getResults() // returns when childWF completed
    if err = ERR_FAILURE_BY_RESET{
          continue
    }else if err != nil{
          return err
    }
    // finally get a workflow result and runID
    break
}

Apr 07 '21 21:04 longquanzheng

The run id does not need to change, because it can't, and there's no desire to change it: it failed

@Groxx I know why you thought this way. I had the same until I understand the design of Cadence. Workflow history is the key of the design(that's why we called our kernel service "history service"). It uses "event sourcing" data model: https://microservices.io/patterns/data/event-sourcing.html

Because of this event sourcing, the very basic design principal/invariance of Cadence: history must be immutable once written. (if following this, you will understand why we have to wait activity complete to write the activity started event, for activity with retry -- a lot of people feel this very hard to understand. And also we called the internal WF meta data "mutalbeState" because the history is "immutable").

Sure, for a running workflow that's clearly a reasonable choice. When resetting though we can copy-and-modify, which we already do - the first event now has more (meta)data, pointing to the previous run, and a new run-id. A reset is already not "just" a fork, even if we describe it as such.

On the server side: do we not copy perhaps? If it's literally just a pointer to previous + where to read up to, then yeah - this would be a moderately large change on the server. But it may be one worth making. Or we could event-source the change internally-to-the-server and that can achieve the exact same end result ¯\_(ツ)_/¯. I don't see how the client needs to be aware at all, it just gets a history and plays it.

basically from

// normal non-failure-tolerant "start child workflow" code

to

// normal failure-tolerant "start child workflow" code

Yeah, that's basically what I'm thinking. Any not-yet-completed children just return a new kind of error, workflow code needs to handle it like any other error. The exposed run-id does make configured-retries annoying though, as ^ doing that effectively means you're losing the code-simplifying niceties of configured-retries. We could perhaps make this behavior configurable somewhere, e.g. "I do not need the child run-id" -> you can use the simpler form safely.

The bigger complication would be if e.g. the child is set to abandon or cancel, and it does not terminate in time for the new one to start. Presumably that'd be a failure like it is now if you call 2x with args that do not allow that, but it does expose a surprising new source for them, as it seems impossible from the source code alone.

More generally: seems like child-run-id with retries is already flawed, as it'll give you the first run-id, but not the current one? If that's accurate, we could probably consider "child with retries + exposed runid" as a bug, and hide run-id going forward. The "started" future is still potentially relevant for making sure the request is accepted by the server though.

Apr 09 '21 03:04 Groxx

On the server side: do we not copy perhaps?

Yes, it is seriously hehe. see https://github.com/uber/cadence/blob/bd6c19c6930320f37fe2310121a5dc295c9ec2dc/service/history/reset/resetter.go#L323 and https://github.com/uber/cadence/blob/bd6c19c6930320f37fe2310121a5dc295c9ec2dc/common/persistence/cassandra/cassandraHistoryPersistence.go#L247

Copy is ineffcient for both storage and performance, so we made a big re-architecture on persistence for that(that's why we called it HistoryV2)

"I do not need the child run-id" -> you can use the simpler form safely

Agreed. I updated my proposal for the same idea.

Apr 09 '21 16:04 longquanzheng

Any new progress about this proposal now?

Mar 08 '22 06:03 zedongh