sdk-go
sdk-go copied to clipboard
Canceling workflow can cause infinite replay attempts
Expected Behavior
Canceling a workflow should cancel the workflow without side effects or replays.
Actual Behavior
Canceling workflows works the majority of the time, but for some cases it returns this error:
Task processing failed with error Namespace project TaskQueue taskqueue WorkerID [email protected]@ WorkerType WorkflowWorker Error BadRequestCancelActivityAttributes: invalid history builder state for action: add-activitytask-cancel-requested-event
This causes Temporal to keep attempting replays with Attempt 1 until the workflow is terminated manually. The workflow is marked as running when it should be canceled. In the logs, the workflow is canceled at the same activity step (WaitQueryCompleteActivity), but one succeeds and other fails with the error and keeps replaying.
Steps to Reproduce the Problem
ExecuteQueryWorkflow (workflow.ActivityOptions{WaitForCancellation: true})
- workflow.ExecuteLocalActivity(CreateQueryActivity)
- workflow.ExecuteLocalActivity(RunQueryActivity)
- f(ctx, data) -- workflow.ExecuteActivity(ctx, WaitQueryCompleteActivity, engine).Get(ctx, nil)....
- client.CancelWorkflow(context.Background(), getQueryWorkflowID(queryId), "")
possibly related?: https://github.com/temporalio/sdk-go/issues/469
Specifications
- Version: -- go.temporal.io/api v1.4.1-0.20210420220407-6f00f7f98373 -- go.temporal.io/sdk v1.7.0
- Platform: -- Mac
Logs: query 118 success- 8546f839-3403-4841-af11-a25ba8ca91eb.json.txt query 119 succes - 5c3e2e65-44b7-4899-ab0d-fbf29ab21106.json.txt
@hungcs If you happen to have some minimal sample code that reproduces the issue that would be much appreciated
This is a result of the state machine in sdk-go completing an activity and concurrently a cancel request also issuing a cancel for the same activity.
In temporal-server, HasActivityFinishEvent is supposed to check for something like this, but when an activity completes, it calls DeleteActivity which clears all the related fields. It's not clear me to how HasActivityFinishEvent is meant to check for completed activities.
I will put together a sample when I get a chance, but I also need to confirm it's not an issue with the database I'm using.
I have created a sample that shows this issue.
Here is the worker: https://gist.github.com/azuisleet/e136167e54b92b33ea45f5d3f21ac7ea
And a command to dispatch the workflow, and then cancel it at the same time as the activity: https://gist.github.com/azuisleet/6bb9dc621e10db21dfeee12c5b46e22c
This works consistently regardless of the storage engine. The workflow becomes stuck trying to cancel a completed activity, with the server reporting the error message described above:
temporal | {"level":"warn","ts":"2021-08-13T13:17:20.025Z","msg":"invalid history builder state for action","service":"history","shard-id":1,"address":"172.24.0.3:7234","shard-item":"0xc000662280","component":"history-cache","wf-action":"add-activitytask-cancel-requested-event","wf-history-event-id":12,"error-type":"InvalidHistoryAction","bool":false,"wf-schedule-id":5,"wf-id":"b639b495-490f-4958-8ae9-013374c3b6fc","wf-run-id":"e3745b20-6361-4a79-a215-805bfe82b733","wf-namespace-id":"34e86ca4-28e1-445d-8a7e-1f9a6d863d9c","logging-call-at":"mutable_state_impl.go:4400"}
temporal | {"level":"info","ts":"2021-08-13T13:17:20.025Z","msg":"Failing the workflow task.","service":"history","shard-id":1,"address":"172.24.0.3:7234","shard-item":"0xc000662280","component":"history-engine","value":"BadRequestCancelActivityAttributes: invalid history builder state for action: add-activitytask-cancel-requested-event","wf-id":"b639b495-490f-4958-8ae9-013374c3b6fc","wf-run-id":"e3745b20-6361-4a79-a215-805bfe82b733","wf-namespace-id":"34e86ca4-28e1-445d-8a7e-1f9a6d863d9c","logging-call-at":"workflowTaskHandlerCallbacks.go:469"}
Should activities be allowed to complete if a WorkflowExecutionCancelRequested has already been persisted in the workflow history? (mutableState.IsCancelRequested() being true). It seems like this isn't allowed given that the mutable state removes all traces of the activity upon completion.
I'm also seeing this issue every time I run the cancellation go sample at https://github.com/temporalio/samples-go/tree/main/cancellation
@mnussbaum what version of ~~Go~~ Temporal are you using? This was supposed to be fixed in 1.9, wondering if you're still encountering this
Go version:
$ go version
go version go1.17.3 linux/amd64
When I run the cancellation example's cancellation/cancel/main.go the worker from that example starts printing these lines indefinitely:
2021/11/09 15:58:54 INFO heartbeating... Namespace default TaskQueue cancel-activity WorkerID 483128@tecopa@ ActivityID 5 ActivityType ActivityToBeCanceled Attempt 1 WorkflowType YourWo
rkflow WorkflowID workflowID-to-cancel RunID dd0b6b04-d04b-4859-9e66-5ceac0780548
2021/11/09 15:58:54 INFO context is cancelled Namespace default TaskQueue cancel-activity WorkerID 483128@tecopa@ ActivityID 5 ActivityType ActivityToBeCanceled Attempt 1 WorkflowType Y
ourWorkflow WorkflowID workflowID-to-cancel RunID dd0b6b04-d04b-4859-9e66-5ceac0780548
2021/11/09 15:58:54 INFO ActivityToBeCanceled returns I am canceled by Done, <nil> Namespace default TaskQueue cancel-activity WorkerID 483128@tecopa@ WorkflowType YourWorkflow Workflow
ID workflowID-to-cancel RunID dd0b6b04-d04b-4859-9e66-5ceac0780548 Attempt 1
2021/11/09 15:58:54 DEBUG ExecuteActivity Namespace default TaskQueue cancel-activity WorkerID 483128@tecopa@ WorkflowType YourWorkflow WorkflowID workflowID-to-cancel RunID dd0b6b04-d04
b-4859-9e66-5ceac0780548 Attempt 1 ActivityID 15 ActivityType ActivityToBeSkipped
2021/11/09 15:58:54 DEBUG RequestCancelActivity Namespace default TaskQueue cancel-activity WorkerID 483128@tecopa@ WorkflowType YourWorkflow WorkflowID workflowID-to-cancel RunID dd0b6b
04-d04b-4859-9e66-5ceac0780548 Attempt 1 ActivityID 15
2021/11/09 15:58:54 INFO Task processing failed with error Namespace default TaskQueue cancel-activity WorkerID 483128@tecopa@ WorkerType WorkflowWorker Error BadRequestCancelActivityAt
tributes: invalid history builder state for action: add-activitytask-cancel-requested-event
2021/11/09 15:59:04 DEBUG Cached state staled, new task has unexpected events Namespace default TaskQueue cancel-activity WorkerID 483128@tecopa@ WorkflowID workflowID-to-cancel RunID dd
0b6b04-d04b-4859-9e66-5ceac0780548 Attempt 2 CachedPreviousStartedEventID 14 TaskFirstEventID 1 TaskStartedEventID 20 PreviousStartedEventID 8
I believe this is the same issue based on the same invalid history builder state for action message in the original post, but my apologies if this is a different issue
sorry, I meant the temporal version
@mnussbaum - I may have been able to replicate and this may be showing us an issue we've been seeing. I will confirm and get back.
I'm running temporal locally via the latest version of https://github.com/temporalio/docker-compose, which I believe is running on 1.13.0
(Edited, my mistake) this issue occurred for me pre-1.10.0.
Cancelling a workflow concurrently with an activity completion creates an invalid state: The go state machine is trying cancel everything, even though it should actually replay the last activity completion event, due to the ordering of events in the workflow history, the cancel event appears before the completion event, hence the suggestion in https://github.com/temporalio/sdk-go/issues/481#issuecomment-898723337 that this may be a temporal service issue.
If the Go state machine were allowed to potentially replay "alternate branches" beyond the cancel event, it would also be able to un-stuck itself.
I have opened #623 to address this.
My mistake, I didn't check the version of the SDK when I was testing. I think the issue was actually mitigated with #504
When I upgrade to 1.11.0 and temporal 1.13.1, I cannot reproduce the issue, as the cancel always appears first in the event history:

edit: Updated my comment. I also checked v1.8.0, v1.9.0, and v1.10.0. The issue stops occurring in v1.10.0
The temporal samples-go repo looks like it's on sdk version 1.11.0 from https://github.com/temporalio/samples-go/blob/main/go.mod#L16. Do you think the error I'm seeing from the sample's cancellation example is a different issue?
2021/11/09 15:58:54 INFO Task processing failed with error Namespace default TaskQueue cancel-activity WorkerID 483128@tecopa@ WorkerType WorkflowWorker Error BadRequestCancelActivityAt
tributes: invalid history builder state for action: add-activitytask-cancel-requested-event
The issue created as part of this was solved.