retry code review bot's own tasks if they hit an exception

Open Archaeopteryx opened this issue 1 year ago • 7 comments

E.g. https://firefox-ci-tc.services.mozilla.com/tasks/CgHOBJ-oSVSHtrTl5-4XEw/runs/0/logs/public/logs/live.log is a task which hit an exception (e.g. it became unresponsive), and the machine got terminated without uploading the logs. There should be at least one attempt to retry the task, e.g. by setting it to auto-retry.

Archaeopteryx avatar Nov 29 '24 20:11 Archaeopteryx

and if a task fails, it should return the failure to phab :)

sylvestre avatar Nov 29 '24 20:11 sylvestre

This would be good to have because it increases the resiliency of the bot: a low-probability issue becomes a very unlikely one.

Archaeopteryx avatar Sep 11 '25 13:09 Archaeopteryx

On GCP, preemptible VMs get a signal, and it seems Taskcluster already supports it thanks to Jesse.

I do not think Taskcluster propagates that signal to the tasks themselves; it just sets them as Exception and retries them if retries are left.

This means we cannot run a cleanup action, but we could re-run through retry.
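
A minimal sketch of what relying on retries could look like, assuming the hook's task template is available as a plain dict and that the task definition accepts a top-level `retries` integer (as I understand the Taskcluster queue's task schema). The helper name `with_retries` and every value in `hook_task` are hypothetical placeholders, not the bot's actual configuration.

```python
import copy


def with_retries(task_definition: dict, retries: int = 1) -> dict:
    """Return a copy of a task definition with `retries` set.

    Assumption: `retries` is the top-level task field that controls how
    many times a run ending in Exception (e.g. the worker being shut
    down) is retried automatically.
    """
    if retries < 0:
        raise ValueError("retries must be non-negative")
    # Deep copy so the original template is left untouched.
    updated = copy.deepcopy(task_definition)
    updated["retries"] = retries
    return updated


# Hypothetical hook task template; all names and values are placeholders.
hook_task = {
    "provisionerId": "example-provisioner",
    "workerType": "example-worker",
    "payload": {
        "image": "example/code-review-bot",
        "command": ["run-bot"],
        "maxRunTime": 3600,
    },
}

task = with_retries(hook_task, retries=2)
```

This only covers runs that end in Exception; it does not give the task a chance to run cleanup before the VM is terminated.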

La0 avatar Sep 11 '25 13:09 La0

IIRC there's also a way to force reruns in some cases depending on the return code of the cmd of the task. CC @bhearsum

marco-c avatar Sep 11 '25 20:09 marco-c

If you're using run-task, retry-exit-status is available: https://github.com/taskcluster/taskgraph/blob/9b0f5fc2c59994c393bd5e7e87bf4462e9cb5adf/src/taskgraph/transforms/task.py#L548-L549

bhearsum avatar Sep 11 '25 23:09 bhearsum

We are not using run-task, but a hook that runs a Docker image directly.

La0 avatar Sep 12 '25 08:09 La0

I'm not aware of any built-in way to do this, in that case.

bhearsum avatar Sep 15 '25 12:09 bhearsum