Validate build id and reschedule tasks when redirect rule applies
What changed?
- Matching sends BuildIdRedirectInfo to History on Record*TaskStarted calls, describing the intended redirect.
- History validates the redirect info against the current mutable state (MS) and fails the request if the redirect's source build ID is not equal to the workflow's currently assigned build ID.
- If the redirect is valid, the workflow is assigned to the new build ID and all pending but not yet started tasks are rescheduled so they are sent to the new build ID.
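
The validate-then-reschedule flow above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `RedirectInfo`, `WorkflowState`, `PendingTask`, `ErrStaleRedirect`, and `ApplyRedirect` are hypothetical stand-ins for BuildIdRedirectInfo, mutable state, and the History-side handling, and the real implementation handles more cases.

```go
package main

import (
	"errors"
	"fmt"
)

// RedirectInfo is a hypothetical stand-in for BuildIdRedirectInfo.
// Field names are assumptions made for this sketch.
type RedirectInfo struct {
	SourceBuildID string // build ID the redirect rule redirects from
	TargetBuildID string // build ID the redirect rule redirects to
}

// PendingTask is a hypothetical scheduled workflow or activity task.
type PendingTask struct {
	ID      string
	Started bool
	BuildID string // build ID the task is currently scheduled for
}

// WorkflowState is a hypothetical, heavily reduced view of mutable state.
type WorkflowState struct {
	AssignedBuildID string
	Tasks           []*PendingTask
}

// ErrStaleRedirect is returned when the redirect's source build ID does not
// match the build ID the workflow is currently assigned to.
var ErrStaleRedirect = errors.New("redirect source build ID does not match assigned build ID")

// ApplyRedirect validates the redirect against the current state. On success
// it reassigns the workflow and moves every not-yet-started task to the
// target build ID, returning the IDs of the rescheduled tasks.
func ApplyRedirect(ws *WorkflowState, r *RedirectInfo) ([]string, error) {
	if r == nil {
		return nil, nil // no redirect intended, nothing to do
	}
	if r.SourceBuildID != ws.AssignedBuildID {
		return nil, ErrStaleRedirect // fail the request, state is untouched
	}
	ws.AssignedBuildID = r.TargetBuildID
	var rescheduled []string
	for _, t := range ws.Tasks {
		if !t.Started {
			t.BuildID = r.TargetBuildID
			rescheduled = append(rescheduled, t.ID)
		}
	}
	return rescheduled, nil
}

func main() {
	ws := &WorkflowState{
		AssignedBuildID: "v1",
		Tasks: []*PendingTask{
			{ID: "activity-1", Started: true, BuildID: "v1"},
			{ID: "activity-2", Started: false, BuildID: "v1"},
		},
	}
	moved, err := ApplyRedirect(ws, &RedirectInfo{SourceBuildID: "v1", TargetBuildID: "v2"})
	fmt.Println(moved, err, ws.AssignedBuildID)
}
```

In this sketch a rejected redirect leaves the state unchanged, so a redirect that was computed from a deleted or not yet propagated rule cannot move the workflow back to an older build ID.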
Why?
To prevent the following problems from happening for workflows with concurrent tasks when redirect rules are deleted or not fully propagated yet:
- Assign wf back to an old build id after processing task using a newer build id
- Interleaved starts: for some duration, tasks are dispatched to a mix of old and new build IDs
- New activity output being fed to old wf
- Execution gets stuck after being (partially) redirected
How did you test it?
Functional test. More unit tests will be added in a follow-up PR.
Potential risks
As it stands, in rare situations where a redirect rule is applied to a workflow with concurrent activities and some of them are in a backoff period due to failure, we may schedule (and start) them on the newer build without waiting for the backoff to finish. This is planned to be improved in the future.
Documentation
None.
Is hotfix candidate?
No.
I didn't quite finish the review and had a hard stop. I would be calmer accepting this if we had ndc/xdc tests that show the build IDs being propagated via replication.
It would definitely be good to have a test for this. I don't think the changes can break anything that isn't using versioning, though (or at least if they do, it'll be obvious). So okay for now.