
Better error handling

kichik opened this issue 2 years ago

Problematic scenarios

The following scenarios are not properly handled yet.

  1. User cancels the workflow. We currently don't recognize the cancellation, so if the user cancels the workflow before the runner boots up, the runner stays up until it times out or another job comes along. If the runner is eventually assigned another job, it took that job away from a runner that was started specifically for it. This can cause an endless cycle of wasted resources where a runner is always on even when no job is running. For Fargate, which has no time limit, the runner can effectively run forever.
  2. Runner failure, like configuration issues, missing capacity, or any random AWS failure. If the runner fails to even get the job, the job just sits there waiting for the next runner to boot. But since we only create one runner per job, that means the old job steals the runner meant for the new job. The old job had to wait for the new job to trigger a runner, and the new job will have to wait for whatever job comes after it. That can delay jobs for no reason. Our current solution is to cancel the workflow (sketched below) so the failure is clear and no job stealing occurs. However, this can lead back to scenario 1 if it happens fast enough.
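
For reference, cancelling the run is a single call against the GitHub API. A minimal sketch, assuming an authenticated Octokit (@octokit/rest) client and that the owner, repo, and run ID are already known; the helper name is made up for illustration:

```ts
import { Octokit } from '@octokit/rest';

// Hypothetical helper: cancel the whole workflow run when the runner fails,
// so the failure is visible and the stuck job can't steal a later runner.
async function cancelRunOnRunnerFailure(octokit: Octokit, owner: string, repo: string, runId: number) {
  await octokit.rest.actions.cancelWorkflowRun({ owner, repo, run_id: runId });
}
```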

Another complication to all of this is having to remove runners. There is a limited number of self-hosted runners allowed per repo, and if we don't clean them up, we can get stuck unable to add new runners. The runner usually removes itself, but error conditions like a provider timeout (e.g. Lambda running out of time) can prevent removal. That's why we always delete the runner through the API on error. But to be able to delete the runner, we first have to stop any job running on it. That's another reason why we sadly have to cancel the entire workflow.
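
A minimal sketch of that cleanup step, again assuming an authenticated Octokit client and that the runner was registered under a name we know; the helper is hypothetical, not the project's actual code:

```ts
import { Octokit } from '@octokit/rest';

// Hypothetical cleanup: find the self-hosted runner we registered for this job
// by name and remove it so it doesn't count against the per-repo runner limit.
async function deleteRunnerByName(octokit: Octokit, owner: string, repo: string, runnerName: string) {
  const { data } = await octokit.rest.actions.listSelfHostedRunnersForRepo({ owner, repo, per_page: 100 });
  const runner = data.runners.find((r) => r.name === runnerName);
  if (runner) {
    // Deletion is rejected while a job is still running on the runner,
    // which is why the workflow has to be cancelled first.
    await octokit.rest.actions.deleteSelfHostedRunnerFromRepo({ owner, repo, runner_id: runner.id });
  }
}
```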

Wishlist

If we can get GitHub to make some changes, the following would really help simplify the solution for these corner cases.

  1. GitHub runner should support a timeout configuration that allows us to tell it to only wait a certain amount of time before giving up on getting a job.
  2. GitHub runner should allow us to configure a runner for a specific job and only for that job.
  3. GitHub API should allow us to mark just one job as failed instead of the whole workflow.

Other solutions

Assuming we can't get our wishlist items, here are some incomplete ideas to help resolve these issues.

  • Have the step function monitor the job using /repos/{owner}/{repo}/actions/runs/{run_id}/jobs (or the variant that also includes attempt information). If we detect the job was cancelled, we can stop the runner (see the sketch after this list). This means we won't be able to use the .sync variants and will have to monitor the runners from the step function. It also won't work for Lambda runners, since you can't stop a Lambda execution, but Lambdas are limited to 15 minutes anyway, which is not too bad.
  • Monitor the job with the method above and stop the runner if the job hasn't started within 5 minutes (the sketch below includes this timeout check). That would mean the runner was stolen, the labels don't match, or there is some other issue. Will this work with multiple jobs running at the same time "stealing" runners from each other? We already create the runners in the repo scope to limit cross-job stealing.
  • Figure out the undocumented actions API so we can fail a single job. We have the credentials and connection information in the .runner and .credentials files. Another option is creating a special runner fork that fails a single job.
  • Some kind of monitor external to the step function that makes sure all jobs are behaving correctly and has the power to start more runners if needed, so stolen runners can be "fixed".
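
A rough sketch combining the first two ideas above, assuming an authenticated Octokit client and a provider-specific stopRunner callback (both the function and parameter names are made up for illustration). It polls the jobs endpoint and gives up once the run has finished or no job has started within the timeout:

```ts
import { Octokit } from '@octokit/rest';

const QUEUED_TIMEOUT_MS = 5 * 60 * 1000; // give up if no job starts within 5 minutes

// Hypothetical monitor loop: watch one workflow run and stop its runner when the
// run was cancelled/finished or when the job never started within the timeout.
async function monitorRun(
  octokit: Octokit,
  owner: string,
  repo: string,
  runId: number,
  startedAt: number, // epoch milliseconds when the runner was requested
  stopRunner: () => Promise<void>, // provider-specific: stop Fargate task, terminate EC2 instance, etc.
) {
  for (;;) {
    const { data } = await octokit.rest.actions.listJobsForWorkflowRun({ owner, repo, run_id: runId });

    // All jobs finished (e.g. the run was cancelled) -- the runner is no longer needed.
    if (data.jobs.length > 0 && data.jobs.every((job) => job.status === 'completed')) {
      await stopRunner();
      return;
    }

    // No job has started within the timeout: the runner was probably stolen,
    // the labels don't match, or something else went wrong.
    const anyStarted = data.jobs.some((job) => job.status === 'in_progress' || job.status === 'completed');
    if (!anyStarted && Date.now() - startedAt > QUEUED_TIMEOUT_MS) {
      await stopRunner();
      return;
    }

    await new Promise((resolve) => setTimeout(resolve, 30_000)); // poll every 30 seconds
  }
}
```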

kichik · Jul 03 '22 19:07