CompleteUpdate message is sometimes not honored when in same WFT completion as ContinueAsNew
Expected Behavior
CompleteUpdate command should be honored when submitted in the same workflow task completion as ContinueAsNew.
Actual Behavior
Occasionally, there is a server error, the CompleteUpdate command is not honored, and the update is instead applied to the post-ContinueAsNew run.
I have a workflow that CANs in the same WFT as an update is completed. I ensure that the update is sent in the first WFT and, most of the time, I see what I expect:
- worker receives WFT: [doUpdate, startWorkflow]
- worker sends WFT completion: [accept&completeUpdate, CAN]
- update caller gets successful update response
- worker receives WFT after CAN: [startWorkflow]
- workflow completes
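For illustration, here is a minimal TypeScript sketch of a workflow shaped like the one described above (hypothetical names; not the actual test workflow from the repro branch). Because the doUpdate and startWorkflow jobs arrive in the same activation, the update handler completes and the continue-as-new is emitted in the same WFT completion:

import * as wf from '@temporalio/workflow';

// Hypothetical update definition, for illustration only.
export const myUpdate = wf.defineUpdate<string>('myUpdate');

export async function updateThenCANWorkflow(isPostCANRun = false): Promise<void> {
  let updateHandled = false;
  wf.setHandler(myUpdate, () => {
    updateHandled = true;
    return 'update-result';
  });
  if (!isPostCANRun) {
    // The handler runs in the same activation as startWorkflow, so the WFT
    // completion contains [accept&completeUpdate, CAN].
    await wf.condition(() => updateHandled);
    await wf.continueAsNew<typeof updateThenCANWorkflow>(true);
  }
  // The post-CAN run completes immediately.
}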
However, sometimes (~1/20) I see:
- <as above>
- <as above>
- server error when handling UpdateWorkflowExecution: unable to locate current workflow execution (update caller does not get a response)
- WFT after CAN contains the update for a second time, in addition to startWorkflow
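For reference, the caller side that 'does not get a response' here is essentially an executeUpdate call awaited against the first run; a rough, hypothetical sketch using the TypeScript client (not the exact test code):

import { Client } from '@temporalio/client';
import { myUpdate, updateThenCANWorkflow } from './workflows';

async function run(): Promise<void> {
  const client = new Client();
  const handle = await client.workflow.start(updateThenCANWorkflow, {
    taskQueue: 'example-task-queue',
    workflowId: 'example-workflow-id',
    args: [false],
  });
  // The test arranges for the update to be admitted before the first WFT is
  // dispatched, so it piggybacks on the first workflow task. In the unhappy
  // case this call never receives a response.
  const result = await handle.executeUpdate(myUpdate, { updateId: 'update-id' });
  console.log('✅ caller got update result', result);
  await handle.result();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});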
I assume what's happening is that the server error causes an internal retry (FE => History) of the UpdateWorkflowExecution, and this retry lands on the 2nd run, after CAN. I haven't yet dug into why the server error happens, but the error message appears to come from this check in the History service:
// for close workflow we need to check if it is still the current run
// since it's possible that the workflowID has a newer run before it's locked
wcache.GetCurrentRunID(...) != wfContext.GetWorkflowKey().RunID
Steps to Reproduce the Problem
This was discovered while writing tests for sdk-typescript. In lieu of a simpler repro, it can be reproduced as follows:
git clone git@github.com:temporalio/sdk-typescript.git
cd sdk-typescript
git fetch origin 1459-server-error-repro
git checkout 1459-server-error-repro
git submodule update --init --recursive
# Build temporal CLI against server 99f15c4dabf9a22763aaa13653c83a832bd0fffa
# Look at last commit, modify to use path to your temporal CLI executable
rustup update
npm install
npm run build
for i in `seq 1 100`; do (cd packages/test && npm run test -- -m 'unfinished update handler with continue-as-new waiting for all handlers to finish') || break; done
Log in happy case
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠 job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina
tionTypeUpdate","meta":{"updateId":"update-id","identity":"[email protected]"},"runValidator":true}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId
":"a12a0761-d96e-4111-973b-1e54027baaed","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"ImNvbnRpbnVlLWFzLW5ldyI="},{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"IndhaXQtYWxsLWhhbmRsZXJzLWZpb
mlzaGVkIg=="}],"randomnessSeed":"3270138216709017763","identity":"[email protected]","workflowTaskTimeout":{"seconds":"10"},"lastCompletionResult":{},"firstExecutionRunId":"8a25effa-1499-46b6-be24-0a3d23c1fb66","at
tempt":1,"cronScheduleToScheduleInterval":{},"memo":{},"startTime":{"seconds":"1722604980","nanos":463150000}}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"updateResponse":{"protoco
lInstanceId":"update-id","completed":{"metadata":{"encoding":{"0":106,"1":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":117,"2":112,"3":100,"4":97,"5":116,"6":101,"7":45,"8"
:114,"9":101,"10":115,"11":117,"12":108,"13":116,"14":34}}}},{"continueAsNewWorkflowExecution":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","arguments":[{"metadata":{"encoding":{"0":106,"1
":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":114,"2":101,"3":116,"4":117,"5":114,"6":110,"7":34}}],"headers":{},"taskQueue":"unfinished_update_handler_with_continue-as-ne
w_waiting_for_all_handlers_to_finish","versioningIntent":0}}]
✅ caller got update result
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId
":"a12a0761-d96e-4111-973b-1e54027baaed","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"InJldHVybiI="}],"randomnessSeed":"17192061066528283883","workflowRunTimeout":{},"workflowTaskTimeout":{"seco
nds":"10"},"continuedFromExecutionRunId":"8a25effa-1499-46b6-be24-0a3d23c1fb66","continuedInitiator":"CONTINUE_AS_NEW_INITIATOR_WORKFLOW","lastCompletionResult":{},"firstExecutionRunId":"8a25effa-1499-46b6-be24-0a3
d23c1fb66","attempt":1,"cronScheduleToScheduleInterval":{"nanos":501851000},"memo":{},"startTime":{"seconds":"1722604980","nanos":938805000}}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟢 activationCompletion: [{"completeWorkflowExecution":{"result":{"metadata":{"encoding":{"0":98,"1":105,"2":110,"3":97,"
4":114,"5":121,"6":47,"7":110,"8":117,"9":108,"10":108}}}}}]
Log in unhappy case
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina
tionTypeUpdate","meta":{"updateId":"update-id","identity":"[email protected]"},"runValidator":true}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId
":"f1f37cd3-1c11-4aa7-ba58-f12f8880f594","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"ImNvbnRpbnVlLWFzLW5ldyI="},{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"IndhaXQtYWxsLWhhbmRsZXJzLWZpb
mlzaGVkIg=="}],"randomnessSeed":"3129644504916450612","identity":"[email protected]","workflowTaskTimeout":{"seconds":"10"},"lastCompletionResult":{},"firstExecutionRunId":"d47f00b8-1462-4a45-821d-5e72c853f631","at
tempt":1,"cronScheduleToScheduleInterval":{},"memo":{},"startTime":{"seconds":"1722604985","nanos":101629000}}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"updateResponse":{"protoco
lInstanceId":"update-id","completed":{"metadata":{"encoding":{"0":106,"1":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":117,"2":112,"3":100,"4":97,"5":116,"6":101,"7":45,"8"
:114,"9":101,"10":115,"11":117,"12":108,"13":116,"14":34}}}},{"continueAsNewWorkflowExecution":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","arguments":[{"metadata":{"encoding":{"0":106,"1
":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":114,"2":101,"3":116,"4":117,"5":114,"6":110,"7":34}}],"headers":{},"taskQueue":"unfinished_update_handler_with_continue-as-ne
w_waiting_for_all_handlers_to_finish","versioningIntent":0}}]
time=2024-08-02T09:23:05.598 level=ERROR msg="service failures" operation=UpdateWorkflowExecution wf-namespace=default error="unable to locate current workflow execution"
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina
tionTypeUpdate","meta":{"updateId":"update-id","identity":"[email protected]"},"runValidator":true}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId
":"f1f37cd3-1c11-4aa7-ba58-f12f8880f594","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"InJldHVybiI="}],"randomnessSeed":"17700601131409760754","workflowRunTimeout":{},"workflowTaskTimeout":{"seco
nds":"10"},"continuedFromExecutionRunId":"d47f00b8-1462-4a45-821d-5e72c853f631","continuedInitiator":"CONTINUE_AS_NEW_INITIATOR_WORKFLOW","lastCompletionResult":{},"firstExecutionRunId":"d47f00b8-1462-4a45-821d-5e7
2c853f631","attempt":1,"cronScheduleToScheduleInterval":{"nanos":566360000},"memo":{},"startTime":{"seconds":"1722604985","nanos":513977000}}}
[runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"completeWorkflowExecution
":{"result":{"metadata":{"encoding":{"0":98,"1":105,"2":110,"3":97,"4":114,"5":121,"6":47,"7":110,"8":117,"9":108,"10":108}}}}}]
Specifications
- Version: Server built from latest main (99f15c4dabf9a22763aaa13653c83a832bd0fffa)
It seems that a total of 3 UpdateWorkflowExecution requests are made to the History service.
- The first request is delivered to the worker, but is ultimately aborted because the workflow is completing, and is retried by the Frontend service.
- The second fails immediately on entry to the History service and is retried by the Frontend service.
- The final request lands on the post-CAN run and thus is treated as a distinct update (executed a second time by a worker).
UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd" request:{namespace:"default" workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"} first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb" wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED} request:{meta:{update_id:"update-id" identity:"[email protected]"} input:{header:{} name:"unfinishedHandlersWorkflowTerminationTypeUpdate" args:{}}}}
<nil>, Workflow Update was aborted.
UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd" request:{namespace:"default" workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"} first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb" wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED} request:{meta:{update_id:"update-id" identity:"[email protected]"} input:{header:{} name:"unfinishedHandlersWorkflowTerminationTypeUpdate" args:{}}}}
<nil>, unable to locate current workflow execution
UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd" request:{namespace:"default" workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"} first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb" wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED} request:{meta:{update_id:"update-id" identity:"[email protected]"} input:{header:{} name:"unfinishedHandlersWorkflowTerminationTypeUpdate" args:{}}}}
<nil>, workflow execution already completed <-- this happens because of the nature of the TypeScript test being used.
Wouldn't the same happen with a signal?
The observable effect is somewhat similar, but the probability of facing it in real life is different, and I have a hard time judging which one is more likely.
Signals can be delivered twice when the event was successfully written into history by the History service and then it crashed without replying to the Frontend, or the Frontend service lost its connection to the History service; the workflow, while receiving and processing this signal, replies with CAN, and then the Frontend or the API caller retries and the second attempt lands on the second run. Something really bad has to happen on the server to trigger this.
Updates are different in the sense that any error with the workflow (in @dandavison's case it was the Matching service trying to start the workflow task a second time) leads to a lost update and triggers a Frontend retry. Retries for updates are part of the design: the server relies on retries to recreate the update after those errors. This sounds much more likely to happen. With one caveat, though: updates are mostly delivered with a speculative workflow task, and this task is also lost with any error. That means the server wouldn't process the update acceptance for the first run. An update can go on a normal workflow task only if it piggybacks on an existing workflow task created by something else (this is exactly what happened in @dandavison's case, where the update was delivered on the first workflow task). This fact, it seems to me, significantly reduces the chances of facing this issue in real life.