CompleteUpdate message is sometimes not honored when in same WFT completion as ContinueAsNew
Expected Behavior
CompleteUpdate command should be honored when submitted in the same workflow task completion as ContinueAsNew.
Actual Behavior
Occasionally, there is a server error, the CompleteUpdate command is not honored, and the update is instead applied to the post-ContinueAsNew run.
I have a workflow that CANs in the same WFT as an update is completed. I ensure that the update is sent in the first WFT and, most of the time, I see what I expect:
- worker receives WFT: [doUpdate, startWorkflow]
- worker sends WFT completion: [accept&completeUpdate, CAN]
- update caller gets successful update response
- worker receives WFT after CAN: [startWorkflow]
- workflow completes
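For illustration, here is a minimal TypeScript sketch of a workflow shaped like the one described above (hypothetical names; not the actual test workflow from the repro branch). Because the doUpdate and startWorkflow jobs arrive in the same activation, the update handler completes and the continue-as-new is emitted in the same WFT completion:

import * as wf from '@temporalio/workflow';

// Hypothetical update definition, for illustration only.
export const myUpdate = wf.defineUpdate<string>('myUpdate');

export async function updateThenCANWorkflow(isPostCANRun = false): Promise<void> {
  let updateHandled = false;
  wf.setHandler(myUpdate, () => {
    updateHandled = true;
    return 'update-result';
  });
  if (!isPostCANRun) {
    // The handler runs in the same activation as startWorkflow, so the WFT
    // completion contains [accept&completeUpdate, CAN].
    await wf.condition(() => updateHandled);
    await wf.continueAsNew<typeof updateThenCANWorkflow>(true);
  }
  // The post-CAN run completes immediately.
}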
However, sometimes (~1/20) I see:
- <as above>
- <as above>
- server error when handling UpdateWorkflowExecution: unable to locate current workflow execution (update caller does not get a response)
- WFT after CAN contains the update for a second time, in addition to startWorkflow
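For reference, the caller side that 'does not get a response' here is essentially an executeUpdate call awaited against the first run; a rough, hypothetical sketch using the TypeScript client (not the exact test code):

import { Client } from '@temporalio/client';
import { myUpdate, updateThenCANWorkflow } from './workflows';

async function run(): Promise<void> {
  const client = new Client();
  const handle = await client.workflow.start(updateThenCANWorkflow, {
    taskQueue: 'example-task-queue',
    workflowId: 'example-workflow-id',
    args: [false],
  });
  // The test arranges for the update to be admitted before the first WFT is
  // dispatched, so it piggybacks on the first workflow task. In the unhappy
  // case this call never receives a response.
  const result = await handle.executeUpdate(myUpdate, { updateId: 'update-id' });
  console.log('✅ caller got update result', result);
  await handle.result();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});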
I assume what's happening is that the server error causes an internal retry (FE => History) of the UpdateWorkflowExecution, and this retry lands on the 2nd run, after CAN. I haven't yet dug into why the server error happens, but the error message appears to come from this check in the History service:
// for close workflow we need to check if it is still the current run
// since it's possible that the workflowID has a newer run before it's locked
wcache.GetCurrentRunID(...) != wfContext.GetWorkflowKey().RunID
Steps to Reproduce the Problem
This was discovered while writing tests for sdk-typescript. In lieu of a simpler repro, it can be reproduced as follows:
git clone git@github.com:temporalio/sdk-typescript.git
cd sdk-typescript
git fetch origin 1459-server-error-repro
git checkout 1459-server-error-repro
git submodule update --init --recursive
# Build temporal CLI against server 99f15c4dabf9a22763aaa13653c83a832bd0fffa
# Look at last commit, modify to use path to your temporal CLI executable
rustup update
npm install
npm run build
for i in `seq 1 100`; do (cd packages/test && npm run test -- -m 'unfinished update handler with continue-as-new waiting for all handlers to finish') || break; done
Log in happy case
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠 job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina
tionTypeUpdate","meta":{"updateId":"update-id","identity":"[email protected]"},"runValidator":true}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId
":"a12a0761-d96e-4111-973b-1e54027baaed","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"ImNvbnRpbnVlLWFzLW5ldyI="},{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"IndhaXQtYWxsLWhhbmRsZXJzLWZpb
mlzaGVkIg=="}],"randomnessSeed":"3270138216709017763","identity":"[email protected]","workflowTaskTimeout":{"seconds":"10"},"lastCompletionResult":{},"firstExecutionRunId":"8a25effa-1499-46b6-be24-0a3d23c1fb66","at
tempt":1,"cronScheduleToScheduleInterval":{},"memo":{},"startTime":{"seconds":"1722604980","nanos":463150000}}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"updateResponse":{"protoco
lInstanceId":"update-id","completed":{"metadata":{"encoding":{"0":106,"1":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":117,"2":112,"3":100,"4":97,"5":116,"6":101,"7":45,"8"
:114,"9":101,"10":115,"11":117,"12":108,"13":116,"14":34}}}},{"continueAsNewWorkflowExecution":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","arguments":[{"metadata":{"encoding":{"0":106,"1
":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":114,"2":101,"3":116,"4":117,"5":114,"6":110,"7":34}}],"headers":{},"taskQueue":"unfinished_update_handler_with_continue-as-ne
w_waiting_for_all_handlers_to_finish","versioningIntent":0}}]
✅ caller got update result
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId
":"a12a0761-d96e-4111-973b-1e54027baaed","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"InJldHVybiI="}],"randomnessSeed":"17192061066528283883","workflowRunTimeout":{},"workflowTaskTimeout":{"seco
nds":"10"},"continuedFromExecutionRunId":"8a25effa-1499-46b6-be24-0a3d23c1fb66","continuedInitiator":"CONTINUE_AS_NEW_INITIATOR_WORKFLOW","lastCompletionResult":{},"firstExecutionRunId":"8a25effa-1499-46b6-be24-0a3
d23c1fb66","attempt":1,"cronScheduleToScheduleInterval":{"nanos":501851000},"memo":{},"startTime":{"seconds":"1722604980","nanos":938805000}}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟢 activationCompletion: [{"completeWorkflowExecution":{"result":{"metadata":{"encoding":{"0":98,"1":105,"2":110,"3":97,"
4":114,"5":121,"6":47,"7":110,"8":117,"9":108,"10":108}}}}}]
Log in unhappy case
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina
tionTypeUpdate","meta":{"updateId":"update-id","identity":"[email protected]"},"runValidator":true}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId
":"f1f37cd3-1c11-4aa7-ba58-f12f8880f594","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"ImNvbnRpbnVlLWFzLW5ldyI="},{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"IndhaXQtYWxsLWhhbmRsZXJzLWZpb
mlzaGVkIg=="}],"randomnessSeed":"3129644504916450612","identity":"[email protected]","workflowTaskTimeout":{"seconds":"10"},"lastCompletionResult":{},"firstExecutionRunId":"d47f00b8-1462-4a45-821d-5e72c853f631","at
tempt":1,"cronScheduleToScheduleInterval":{},"memo":{},"startTime":{"seconds":"1722604985","nanos":101629000}}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"updateResponse":{"protoco
lInstanceId":"update-id","completed":{"metadata":{"encoding":{"0":106,"1":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":117,"2":112,"3":100,"4":97,"5":116,"6":101,"7":45,"8"
:114,"9":101,"10":115,"11":117,"12":108,"13":116,"14":34}}}},{"continueAsNewWorkflowExecution":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","arguments":[{"metadata":{"encoding":{"0":106,"1
":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":114,"2":101,"3":116,"4":117,"5":114,"6":110,"7":34}}],"headers":{},"taskQueue":"unfinished_update_handler_with_continue-as-ne
w_waiting_for_all_handlers_to_finish","versioningIntent":0}}]
time=2024-08-02T09:23:05.598 level=ERROR msg="service failures" operation=UpdateWorkflowExecution wf-namespace=default error="unable to locate current workflow execution"
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina
tionTypeUpdate","meta":{"updateId":"update-id","identity":"[email protected]"},"runValidator":true}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId
":"f1f37cd3-1c11-4aa7-ba58-f12f8880f594","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"InJldHVybiI="}],"randomnessSeed":"17700601131409760754","workflowRunTimeout":{},"workflowTaskTimeout":{"seco
nds":"10"},"continuedFromExecutionRunId":"d47f00b8-1462-4a45-821d-5e72c853f631","continuedInitiator":"CONTINUE_AS_NEW_INITIATOR_WORKFLOW","lastCompletionResult":{},"firstExecutionRunId":"d47f00b8-1462-4a45-821d-5e7
2c853f631","attempt":1,"cronScheduleToScheduleInterval":{"nanos":566360000},"memo":{},"startTime":{"seconds":"1722604985","nanos":513977000}}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"completeWorkflowExecution
":{"result":{"metadata":{"encoding":{"0":98,"1":105,"2":110,"3":97,"4":114,"5":121,"6":47,"7":110,"8":117,"9":108,"10":108}}}}}]
Specifications
- Version: Server built from latest main (99f15c4dabf9a22763aaa13653c83a832bd0fffa)
It seems that a total of 3 UpdateWorkflowExecution requests are made to the History service.
- The first request is delivered to the worker, but is ultimately aborted because the workflow is completing, and is retried by the Frontend service.
- The second fails immediately on entry to the History service and is retried by the Frontend service.
- The final request lands on the post-CAN run and thus is treated as a distinct update (executed a second time by a worker).
UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd" request:{namespace:"default" workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"} first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb" wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED} request:{meta:{update_id:"update-id" identity:"[email protected]"} input:{header:{} name:"unfinishedHandlersWorkflowTerminationTypeUpdate" args:{}}}}
<nil>, Workflow Update was aborted.
UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd" request:{namespace:"default" workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"} first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb" wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED} request:{meta:{update_id:"update-id" identity:"[email protected]"} input:{header:{} name:"unfinishedHandlersWorkflowTerminationTypeUpdate" args:{}}}}
<nil>, unable to locate current workflow execution
UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd" request:{namespace:"default" workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"} first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb" wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED} request:{meta:{update_id:"update-id" identity:"[email protected]"} input:{header:{} name:"unfinishedHandlersWorkflowTerminationTypeUpdate" args:{}}}}
<nil>, workflow execution already completed <-- this happens because of the nature of the TypeScript test being used.
Wouldn't the same happen with a signal?
The observable effect is somewhat similar, but the probability of facing it in real life is different, and I have a hard time judging which one is more likely.
Signals can be delivered twice when the event was successfully written into history by the History service and then it crashed without replying to the Frontend, or the Frontend service lost its connection to the History service; the workflow, while receiving and processing this signal, replies with CAN, and then the Frontend or the API caller retries and the second attempt lands on the second run. Something really bad has to happen on the server to trigger this.
Updates are different in the sense that any error with the workflow (in @dandavison's case it was the Matching service trying to start the workflow task a second time) leads to a lost update and triggers a Frontend retry. Retries for updates are part of the design: the server relies on retries to recreate the update after those errors. This sounds much more likely to happen. With one caveat, though: updates are mostly delivered with a speculative workflow task, and this task is also lost with any error. That means the server wouldn't process the update acceptance for the first run. An update can go on a normal workflow task only if it piggybacks on an existing workflow task created by something else (this is exactly what happened in @dandavison's case, where the update was delivered on the first workflow task). This fact, it seems to me, significantly reduces the chances of facing this issue in real life.