Add capact action retry

Open lukaszo opened this issue 3 years ago • 1 comments

Description

When an Action fails, there is no way to fix it.

It's already possible to run argo retry, and the workflow will continue; capact action watch will show the progress. However, if the workflow finishes successfully after the retry, the status of the Action is still failed.

Reason

  • When an Action fails, there is no option to retry it. Currently, we need to delete it and create it again.
  • Deletion is fully manual and time-consuming.
  • Users waste time and resources executing the whole Action again even if almost everything went well, e.g. RDS was provisioned successfully, but the last step failed on a network timeout when uploading data to OCH.

(edited by @mszostok)

Options

A. A dummy CLI command which runs argo retry underneath and spawns a job with the Argo runner watching the same workflow [3MD]
B. Use the rollback feature (#502) to roll back to the previous state and run the Action again [3MD]
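
To make option A concrete, here is a minimal sketch of the "call argo retry underneath" part only. It is not Capact code: the helper names build_argo_retry_cmd and retry_workflow are hypothetical, and it assumes the argo CLI is on PATH and that the workflow name and namespace are already known. Spawning the Argo runner job and updating the Action status are not covered.

```python
import subprocess
from typing import Callable, List


def build_argo_retry_cmd(workflow: str, namespace: str) -> List[str]:
    """Build the argv for `argo retry <workflow> --namespace <namespace>`.

    Hypothetical helper: kept separate so the command can be inspected
    without actually invoking the argo binary.
    """
    return ["argo", "retry", workflow, "--namespace", namespace]


def retry_workflow(
    workflow: str,
    namespace: str,
    run: Callable = subprocess.run,
):
    """Retry a failed workflow by shelling out to the argo CLI.

    `run` is injectable so callers (and tests) can replace the real
    subprocess call. With the default, a non-zero exit code raises
    subprocess.CalledProcessError because of check=True.
    """
    cmd = build_argo_retry_cmd(workflow, namespace)
    return run(cmd, check=True)
```

Something still has to watch the retried workflow and set the Action status afterwards, which is the second half of option A.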

lukaszo avatar Aug 10 '21 22:08 lukaszo

I was thinking about adding that, but this is a bit more complicated than calling argo retry from our CLI and watching the Action CR once again.

Unfortunately, when we retry some steps, the generated values are not preserved, so, for example, the terraform runner will create a new instance on the GCP side instead of ensuring that the ones already created are up and running. On the other hand, if we preserve the values, then in the case of the helm runner, such a helm install will simply fail.

As a result, in some situations such a retry can do more harm than good, so we need to make sure we implement it correctly.

Maybe we should revisit this task after adding rollback, so that instead of a retry we execute destroy (if the Action has one; if not, it is a no-op) and then execute the Action once again. For example, OSB API does that for orphan mitigation.
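
A rough sketch of that destroy-then-execute flow, only to illustrate the ordering. The function name retry_with_cleanup and the execute/destroy callables are hypothetical stand-ins, not part of Capact or the rollback feature:

```python
from typing import Callable, Optional


def retry_with_cleanup(
    execute: Callable[[], None],
    destroy: Optional[Callable[[], None]] = None,
) -> None:
    """Re-run an Action from a clean state instead of retrying in place.

    If the Action defines a destroy step, run it first so that leftovers
    from the failed attempt are removed. If it has none, cleanup is a
    no-op. Then execute the Action again from scratch.
    """
    if destroy is not None:
        destroy()  # clean up resources created by the failed run
    execute()      # run the whole Action once again


# Usage with stand-in callables that only record the call order.
log = []
retry_with_cleanup(
    execute=lambda: log.append("execute"),
    destroy=lambda: log.append("destroy"),
)
# log is now ["destroy", "execute"]
```

This trades the time saved by a partial retry for a predictable starting state, which avoids the terraform and helm problems described above.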

/cc @platten

mszostok avatar Aug 11 '21 07:08 mszostok