Actions time out after some amount of time

Kampe opened this issue 4 years ago • 24 comments

When utilizing the actions runner controller in our clusters within GKE (1.18.20), we've noticed that long-running jobs (usually ones over half an hour) will eventually kill themselves while in the process of completing and report no exit code. The job seemingly stops reporting to GitHub entirely. Is this known behavior, and is there anything we can do to alleviate it?

Kampe avatar Jul 12 '21 23:07 Kampe

@Kampe Perhaps you're affected by this? https://github.com/actions-runner-controller/actions-runner-controller/issues/466

Also, do you use preemptible instances on GKE for the worker nodes that are running your runner pods?

mumoshu avatar Jul 12 '21 23:07 mumoshu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 12 '21 00:08 stale[bot]

I also tend to see this issue and am running on GKE preemptible instances. These failures don't align with preemptible node reclaims, though.

FalconerTC avatar Aug 23 '21 18:08 FalconerTC

@FalconerTC Hey! Could you perhaps check whether the pods are being OOM-killed? I occasionally see this behavior when runner pods get OOM-killed.

mumoshu avatar Aug 24 '21 00:08 mumoshu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 23 '21 01:09 stale[bot]

Looking into this again with some more details. #466 does seem related. I often (once or twice a day, with maybe 100-200 various workflow runs a day) see the following in the UI:

The self-hosted runner: ubuntu-20.04-ppmfq-84ds4 lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

Looking at the pods, they don't appear to be hitting an OOM. Here are some logs from one pod that was killed:

Starting Runner listener with startup type: service
Started listener process
Started running service

√ Connected to GitHub

2021-10-06 15:59:04Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.283.1 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
Runner listener exited with error code 4
Runner listener exit with undefined return code, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
Unrecognized command-line input arguments: 'once'. For usage refer to: .\config.cmd --help or ./config.sh --help

√ Connected to GitHub

2021-10-06 16:00:09Z: Listening for Jobs
2021-10-06 16:00:16Z: Running job: Lint
2021-10-06 16:04:37Z: Job Lint completed with result: Succeeded
Runner listener exited with error code 0
Runner listener exit with 0 return code, stop the service, no retry needed.

In this case, it looks like a job completed successfully in a runner, the runner failed to restart correctly, was assigned a new workflow, and then the pod completed and that workflow run was abandoned. The runner was also marked offline in the GitHub UI.

I should also note I'm seeing this both on a RunnerGroup on preemptibles and on a RunnerGroup running on regular GKE nodes. I'll add more information as I find it.

FalconerTC avatar Oct 06 '21 16:10 FalconerTC

@FalconerTC Thanks for the update!

Which kind of runners are you using, persistent or ephemeral? If you're using persistent ones, I'd recommend trying ephemeral runners, especially since you seem to be using preemptible instances.

We also recently introduced support for the new GA version of ephemeral runners: https://github.com/actions-runner-controller/actions-runner-controller#ephemeral-runners. It addresses a known race condition where a runner tries to grab a job even while it's being stopped, which results in an error like the one you're seeing. If you haven't tried it, I'd recommend doing so.

Also, are you using cluster-autoscaler?

mumoshu avatar Oct 06 '21 23:10 mumoshu

I am using ephemeral runners, as well as the GKE cluster-autoscaler. I'm also using RunnerDeployments. I'll try out RUNNER_FEATURE_FLAG_EPHEMERAL and see if it improves things.
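
For reference, a minimal sketch of enabling that flag on a RunnerDeployment (assuming the flag is simply passed as an environment variable to the runner container; the names below are placeholders):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeployment   # placeholder name
spec:
  replicas: 2
  template:
    spec:
      repository: example-org/example-repo   # placeholder repository
      env:
        # Opt into the GA ephemeral-runner behavior (flag name as mentioned above)
        - name: RUNNER_FEATURE_FLAG_EPHEMERAL
          value: "true"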

FalconerTC avatar Oct 07 '21 17:10 FalconerTC

@FalconerTC Thanks for confirming. Yes, try the new ephemeral runners. It should improve things.

Also, when you're using cluster-autoscaler, you should block it from terminating any node on which a runner pod is running a job. I think that can be done by adding the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation to every runner pod, as sketched below.
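
A minimal sketch of what that could look like on a RunnerDeployment (assuming pod annotations set under spec.template.metadata are propagated to the runner pods; the names below are placeholders):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeployment   # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Ask cluster-autoscaler not to evict nodes hosting these runner pods mid-job
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      repository: example-org/example-repo   # placeholder repository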

mumoshu avatar Oct 07 '21 23:10 mumoshu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 07 '21 00:11 stale[bot]

@FalconerTC Hey! Did RUNNER_FEATURE_FLAG_EPHEMERAL work for your case?

mumoshu avatar Nov 07 '21 09:11 mumoshu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 07 '21 13:12 stale[bot]

We're still occasionally seeing this issue. I'll post more info on it when I have it but I haven't noticed anything useful about it recently.

FalconerTC avatar Dec 17 '21 17:12 FalconerTC

@FalconerTC Hey! Do you think this relates to #911? We've seen some long-running jobs get canceled due to the GitHub API oddly reporting busy runners as offline. actions-runner-controller garbage-collects any runner pods whose twins in GitHub API are offline for a while, which may terminate jobs running on such runners in the middle of the process.

It's fixed in #1009. If you could build a custom build of actions-runner-controller from the head of our main branch and see if it fixes the issue, that would be great.

mumoshu avatar Dec 18 '21 03:12 mumoshu

It could be yeah. I've also been seeing offline runners building up over time still.

actions-runner-controller garbage-collects any runner pods whose twins in GitHub API are offline for a while

Curious about this process. One of my pools has 10 offline runners now and they don't seem to ever disappear until I manually remove them.

I'll give #1009 a shot

FalconerTC avatar Dec 18 '21 15:12 FalconerTC

I'm seeing this behaviour too; the number of offline runners seems to creep up to about two per day.

I'm trying to get some more data on the phenomenon. I suspected this could be due to a small trigger duration, but that didn't help.

We'll try to run the patch too if we can.

sledigabel avatar Dec 18 '21 16:12 sledigabel

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 18 '22 02:01 stale[bot]

I remember that we recently fixed an issue where ARC treated offline-but-busy runners as removable. (That's odd, but the GitHub Actions API seems to report a runner as both offline and busy at the same time. When that happens, the runner is actually busy, so ARC shouldn't remove it.)

If that was the major cause of this issue, perhaps we can close this as fixed?

mumoshu avatar Feb 21 '22 10:02 mumoshu

Also, ARC currently assumes a runner needs to be GCed if it doesn't come online within 10 minutes of runner pod creation:

https://github.com/actions-runner-controller/actions-runner-controller/blob/7156ce040ec65dd80863f390f8397dbf01c8466d/controllers/runner_controller.go#L296

Perhaps this is too short for some environments, and we need to provide an option to tweak the timeout duration? If anyone could provide full logs from ARC and the runner pods, it would be possible to narrow this down to the actual cause of your issue.

mumoshu avatar Feb 21 '22 10:02 mumoshu

@mumoshu We are still having this issue. It does not have to be a long-running job, just a compute-intensive one, like builds. It looks like ARC is killing the runners while they are in the middle of a job. We are using stateful runners at version v2.294.0 and the controller chart is on the latest version, 0.20.2. It happens two or more times per day, and it's not because of OOM kills, since the runners have no limits and utilise only about 30% of the node's memory. Could it be that the runner can't respond to health checks because the job is taking too much CPU?
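
For reference, one way to rule out CPU starvation of the listener is to reserve CPU and memory for the runner container, roughly like this (a sketch only; it assumes the runner spec exposes a resources field, and the names and values below are placeholders; with a RunnerSet the equivalent would go on the runner container in the pod template):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeployment   # placeholder name
spec:
  template:
    spec:
      repository: example-org/example-repo   # placeholder repository
      # Reserve CPU/memory so the runner listener keeps responding during heavy builds
      resources:
        requests:
          cpu: "1"
          memory: 2Gi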

ognjenVlad avatar Aug 26 '22 09:08 ognjenVlad

@ognjenVlad I think it might be due to https://github.com/actions-runner-controller/actions-runner-controller/issues/1581#issuecomment-1228025441 or https://github.com/actions-runner-controller/actions-runner-controller/issues/1535. If it happens only for RunnerSet, perhaps it's happening when you trigger an update to the RunnerSet while one or more runner pods are running jobs. Could you confirm?

mumoshu avatar Aug 26 '22 09:08 mumoshu

@mumoshu Thanks for the fast reply.

We are not triggering any updates; it happens while build jobs are running. We do not have an HPA set up, just a stateful set with 3 pods. And safe-to-evict is not set, so the autoscaler is not rescheduling the pods. The pod is just killed and recreated immediately.

These are the logs I can see at that time:

ERROR actions-runner-controller.runnerpod Failed to update runner for finalizer removal
ERROR Reconciler error
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile

And before those errors there are these messages: "Runner pod has been stopped with a successful status." and "Runner pod is marked as already unregistered."

ognjenVlad avatar Aug 26 '22 10:08 ognjenVlad

@ognjenVlad Hey! Thanks for sharing. The last two messages indicate that the runner pods did exit successfully with status code 0, which should be returned only when actions/runner finished the job correctly. ARC does delete runner pods once it observes that they exited with code 0. So, if your observation is correct, this is more likely a bug in actions/runner.

mumoshu avatar Aug 26 '22 10:08 mumoshu

@mumoshu Okay, I will try to debug it more and open an issue on the runner repo. Thanks!

ognjenVlad avatar Aug 26 '22 10:08 ognjenVlad

This is a really terrible error to be getting, and the worst part is that it is completely random. There is no good retry logic in this tool to let it simply try to reconnect to the machine. Companies have to deal with connection resets by peer for these types of things, and this needs to be fixed; the error message itself is misleading and incorrect.

The tool is struggling to keep the Docker-in-Docker connection running, or whatever is happening on the backend of this tooling; everything is so opaque that it is hard for the open source world to help. GitLab has the same issue, but they have built-in redundancies that at least let the user retry the setup to confirm whether the problem is their code or GitLab's.

Here is an example: https://stackoverflow.com/questions/76430791/github-actions-self-hosted-runner-shuts-down-with-a-strange-error

Can we not have the GitHub runner retry its own connection? Are we always expecting a perfect connection, with no ability to reconnect when a connection is dropped? The Docker-in-Docker command will still be running in that case. Yes, that would be a lot of work, I completely agree, but telling users it's a CPU issue or that there aren't enough resources is just not correct.

JeremyDemlow avatar Oct 24 '23 19:10 JeremyDemlow

[screenshots of back-to-back workflow runs on self-hosted runners]

Here is an example of what we are talking about. This is the only open issue I see on this, but the people here are all running into similar things, and you can see that it happens at random: these are back-to-back runs using self-hosted runners. Sure, there is a possibility that the network on our end resets, but that is hard to know, since there is no retry logic for a job outside of a workflow, and if you are running a composite workflow you can't use a dual-workflow approach.

JeremyDemlow avatar Oct 24 '23 19:10 JeremyDemlow