Validate build id and reschedule tasks when redirect rule applies

Open ShahabT opened this issue 1 year ago • 1 comments

What changed?

  • Send BuildIdRedirectInfo from Matching to History on Record*TaskStarted call containing information about redirect intention.
  • History validates the redirect info against the current MS and fails the request if the redirect's source build ID is not equal to the workflow's currently assigned build ID.
  • If the redirect is valid, the workflow is assigned to the new build ID and all pending but not yet started tasks are rescheduled to be sent to the new build ID.
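
The validate-then-reschedule flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `MutableState`, `PendingTask`, and `ApplyRedirect` names, the field layout of `BuildIdRedirectInfo`, and the `ErrStaleRedirect` error are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// BuildIdRedirectInfo is a hypothetical stand-in for the redirect info that
// Matching attaches to a Record*TaskStarted call. Field names are assumed.
type BuildIdRedirectInfo struct {
	SourceBuildId string // build ID the redirect rule starts from
	TargetBuildId string // build ID the task is being redirected to
}

// PendingTask is a simplified model of a scheduled workflow or activity task.
type PendingTask struct {
	ID      int64
	BuildId string // build ID the task is currently scheduled for
	Started bool   // true once a worker has picked the task up
}

// MutableState is a simplified model of the workflow's mutable state (MS).
type MutableState struct {
	AssignedBuildId string
	Tasks           []*PendingTask
}

// ErrStaleRedirect is a hypothetical error for a redirect whose source build
// ID no longer matches the workflow's assigned build ID.
var ErrStaleRedirect = errors.New("redirect source does not match assigned build ID")

// ApplyRedirect validates the redirect against the current state. On success
// it reassigns the workflow to the target build ID, moves every pending but
// not yet started task to that build ID, and returns the IDs of those tasks.
func (ms *MutableState) ApplyRedirect(info *BuildIdRedirectInfo) ([]int64, error) {
	if info == nil {
		return nil, nil // no redirect intended, nothing to do
	}
	if info.SourceBuildId != ms.AssignedBuildId {
		// Fail the request: the rule is stale or not fully propagated.
		return nil, fmt.Errorf("%w: source=%q assigned=%q",
			ErrStaleRedirect, info.SourceBuildId, ms.AssignedBuildId)
	}
	ms.AssignedBuildId = info.TargetBuildId
	var rescheduled []int64
	for _, t := range ms.Tasks {
		if t.Started {
			continue // tasks already started stay on their build
		}
		t.BuildId = info.TargetBuildId
		rescheduled = append(rescheduled, t.ID)
	}
	return rescheduled, nil
}
```

In this model, a second redirect that still names the old build ID as its source is rejected after the first one succeeds, which is the property that keeps a workflow from being assigned back to an old build.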

Why?

To prevent the following problems from happening for workflows with concurrent tasks when redirect rules are deleted or not fully propagated yet:

  1. Assign wf back to an old build id after processing task using a newer build id
  2. Interleaved starts: for some duration, tasks are dispatched to a mix of old and new build IDs
  3. New activity output being fed to old wf
  4. Execution gets stuck after being (partially) redirected

How did you test it?

Functional test. More unit tests to be added in a followup PR.

Potential risks

As it stands, in rare situations where a redirect rule is applied to a workflow with concurrent activities, some of which are in a backoff period after a failure, we may schedule (and start) those activities on the newer build without waiting for the backoff to finish. This is planned to be improved in the future.

Documentation

None.

Is hotfix candidate?

No.

ShahabT avatar Apr 08 '24 18:04 ShahabT

I didn't quite finish the review and had a hard stop. I would be calmer accepting this if we had ndc/xdc tests that show the build IDs being propagated via replication

It would definitely be good to have a test for this. I don't think the changes can break anything not using versioning though (or at least if it does, it'll be obvious). So okay for now

dnr avatar May 04 '24 00:05 dnr