temporal

Force complete activity when it is retrying

Open samarabbas opened this issue 4 years ago • 11 comments

Is your feature request related to a problem? Please describe.
A bug in an activity could produce an incorrect return type, causing another activity to fail continuously. Provide a mechanism to force-complete an activity that is retrying, without restarting the workflow.

Describe the solution you'd like
The RespondActivityTaskCompletedById API does not accept the retry attempt as an input argument. We also need a way to allow completion when the activity is backing off and has not started at all.

samarabbas avatar Nov 13 '20 17:11 samarabbas

+1

zengzilu avatar Aug 31 '22 03:08 zengzilu

+1

jdupl123 avatar Nov 01 '22 01:11 jdupl123

+1 For me, it would be a convenient way to “repair” a single blocked WF if an activity is periodically failing. In a way, completing an activity would let us “repair forward” (skip over a failing activity), just as reset already lets us “repair backwards” (go back to an earlier state of the workflow execution).

Of course it is not best practice to (manually) repair workflows all the time, but in some edge cases and incidents it would be great to have the possibility. Redeploying an updated activity worker might not always be a solution.

@mfateev also supported this idea.

Eventually, this could even be integrated into the WebUI, so that a support engineer, for example, could also repair a workflow.

p4p4 avatar Feb 20 '23 16:02 p4p4

Similarly, a tctl workflow complete command might also be handy for completing whole (child) workflows. What do you think?

p4p4 avatar Feb 21 '23 09:02 p4p4

Hi, I am interested in working on this issue.

Based on my understanding, if a workflow has multiple activities, the failure from one of them will result in the following activities' failure. Currently, the only way to fix this is to start the workflow from the beginning. We want to introduce a mechanism to retry the first failed activity when we find it is incorrect to prevent the cascading.

Is my understanding above correct? If so, how is this different from the retry options when we start a workflow with an activity here? We can handle the incorrect output from an activity with the retry options above.

Also, a way to reproduce this would be very helpful. Thank you!

alexseedkou avatar Mar 21 '24 22:03 alexseedkou

This is fairly easy to add.

First we need to understand how to expose this in the API. I would add a bool skip_started_check or bool force field in the RespondActivityTaskCompletedByIdRequest message and the other RPCs that resolve an activity (RespondActivityTaskFailedByIdRequest, RespondActivityTaskCanceledByIdRequest).

Then we need to relax this condition if the flag is set (in all corresponding APIs).
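To make the proposal above concrete, here is a minimal Go sketch of what relaxing the check could look like. It is not the server's actual code: `activityInfo`, `emptyEventID`, `checkResolvable`, and the `force` parameter are hypothetical names standing in for the server's activity state and the proposed `force` / `skip_started_check` request field.

```go
package main

import (
	"errors"
	"fmt"
)

// emptyEventID is a hypothetical sentinel meaning "no started event recorded",
// i.e. the activity is scheduled or backing off between retry attempts.
const emptyEventID int64 = 0

// activityInfo is a simplified, hypothetical stand-in for the server's
// per-activity state. It is not the real Temporal type.
type activityInfo struct {
	ActivityID     string
	StartedEventID int64
	Attempt        int32
}

var errActivityNotStarted = errors.New("activity not started")

// checkResolvable models the relaxed condition: an activity that has not
// started is rejected unless the caller set the (assumed) force flag.
func checkResolvable(ai activityInfo, force bool) error {
	if ai.StartedEventID == emptyEventID && !force {
		return errActivityNotStarted
	}
	return nil
}

func main() {
	// An activity on its third attempt that is currently backing off.
	backingOff := activityInfo{ActivityID: "charge-card", Attempt: 3}

	fmt.Println(checkResolvable(backingOff, false)) // rejected without the flag
	fmt.Println(checkResolvable(backingOff, true))  // accepted with the flag
}
```

In this sketch the same helper would be shared by the complete, fail, and cancel paths, so the flag would behave consistently across all three RPCs.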

bergundy avatar Mar 27 '24 17:03 bergundy

If you want to take this on, you should start by making a PR to the https://github.com/temporalio/api repo and if this is accepted, you can continue to implement in the server (this) repo.

bergundy avatar Mar 27 '24 17:03 bergundy

Hi @bergundy, thank you for your reply.

May I know if my understanding above is correct regarding this issue, and how this is different from the retry options when we start a workflow with an activity here?

Thank you in advance for your guidance.

alexseedkou avatar Mar 27 '24 21:03 alexseedkou

This issue is for allowing completing and failing activities that are currently backing off. Seems like that's not what you want @alexseedkou based on your comment here:

Based on my understanding, if a workflow has multiple activities, the failure from one of them will result in the following activities' failure. Currently, the only way to fix this is to start the workflow from the beginning. We want to introduce a mechanism to retry the first failed activity when we find it is incorrect to prevent the cascading.

IIUC, you could reset the workflow to just before the activity was scheduled. Does that address your need? Feel free to tag me on the Temporal community Slack to continue the discussion.

bergundy avatar Mar 28 '24 23:03 bergundy

An update on this issue:

The team has discussed this issue internally and decided to change the server behavior to accept activity completions by default, even if the activity is currently backing off.

alexseedkou avatar Apr 05 '24 21:04 alexseedkou

We'll need to update the API and SDK documentation to reflect the fact that RespondActivityTaskCompleted and RespondActivityTaskCompletedById have different behaviors.
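As a rough illustration of the difference that would need documenting, the Go sketch below models the two paths side by side. Everything here is an assumption made for illustration: `pendingActivity`, `taskToken`, `completeByToken`, and `completeByID` are hypothetical, and the real server's validation may differ. The sketch assumes the token-based call is tied to a specific started attempt, while the by-ID call only needs to find the pending activity.

```go
package main

import (
	"errors"
	"fmt"
)

// pendingActivity is a hypothetical, simplified view of an activity that
// has been scheduled but not yet resolved.
type pendingActivity struct {
	ID      string
	Attempt int32
	Started bool // false while the activity is backing off between attempts
}

// taskToken is a hypothetical stand-in for the token a worker receives
// when it picks up a specific attempt of an activity.
type taskToken struct {
	ActivityID string
	Attempt    int32
}

var (
	errStaleToken = errors.New("activity not started or token is stale")
	errNotFound   = errors.New("activity not found")
)

// completeByToken models the token-based path (assumed behavior): the
// token must match the attempt that is currently running.
func completeByToken(a pendingActivity, t taskToken) error {
	if !a.Started || t.ActivityID != a.ID || t.Attempt != a.Attempt {
		return errStaleToken
	}
	return nil
}

// completeByID models the by-ID path after the change described above:
// a pending activity is accepted even if it is backing off.
func completeByID(a pendingActivity, activityID string) error {
	if a.ID != activityID {
		return errNotFound
	}
	return nil
}

func main() {
	// Attempt 2 failed; attempt 3 has not started yet.
	a := pendingActivity{ID: "charge-card", Attempt: 3, Started: false}
	old := taskToken{ActivityID: "charge-card", Attempt: 2}

	fmt.Println("by token:", completeByToken(a, old))
	fmt.Println("by ID:", completeByID(a, "charge-card"))
}
```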

mjameswh avatar May 08 '24 14:05 mjameswh