github-actions-runner-operator

Pods and runner API not in sync, returning early

Open mattkim opened this issue 3 years ago • 6 comments

Hello again,

We've also noticed that every now and then we are getting this error from the operator.

Pods and runner API not in sync, returning early

It seems that this happens when there is a runner in the github repo with no corresponding pod.

Not sure how we get into this state, but is it possible to have the operator just automatically remove the unknown runner to keep the "pods and runner api" in sync?
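For reference, a rough sketch of what such a cleanup could look like, using the google/go-github client. This is not the operator's actual reconcile logic; the function, the owner/repo parameters, and the pod-name set are illustrative, and the exact option types depend on the go-github version in use:

```go
// Hypothetical sketch: remove repository runners that have no corresponding pod.
// Assumes a google/go-github client and a set of current runner pod names.
package runnercleanup

import (
	"context"
	"log"

	"github.com/google/go-github/v50/github"
)

func cleanupOrphanedRunners(ctx context.Context, client *github.Client, owner, repo string, podNames map[string]bool) error {
	runners, _, err := client.Actions.ListRunners(ctx, owner, repo, &github.ListOptions{PerPage: 100})
	if err != nil {
		return err
	}
	for _, r := range runners.Runners {
		// Only touch runners that are offline and have no matching pod, so
		// freshly spawned pods that haven't registered yet are left alone.
		if !podNames[r.GetName()] && r.GetStatus() == "offline" {
			log.Printf("removing orphaned runner %s (id %d)", r.GetName(), r.GetID())
			if _, err := client.Actions.RemoveRunner(ctx, owner, repo, r.GetID()); err != nil {
				return err
			}
		}
	}
	return nil
}
```

Filtering on "offline" status matters here: a runner that has registered but whose pod is briefly unknown to the operator should not be deleted while it is still online.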

mattkim avatar Dec 10 '21 18:12 mattkim

@mattkim this can happen in the [short] timeframe where pods have been spawned but have not yet registered with github; it should fix itself after the pods have started and registered.

davidkarlsen avatar Jan 03 '22 23:01 davidkarlsen

Hi @davidkarlsen thanks for the response

Is it possible to get into a state where the pod has registered the runner name with github, crashes, and on retry sees a duplicate runner with the same name?

I think this is one use-case we have observed.

mattkim avatar Jan 14 '22 17:01 mattkim

Any tips for debugging why a runner isn't starting up? I'm not seeing any logs in the operator that would indicate a problem.

brian-pickens avatar Apr 01 '22 20:04 brian-pickens

> Any tips for debugging why a runner isn't starting up? I'm not seeing any logs in the operator that would indicate a problem.

See the operator log. Also try kubectl describe on the runner pod and check its log.

davidkarlsen avatar Apr 03 '22 23:04 davidkarlsen

I'm faced with the same problem. After many weeks of running nicely and smoothly, a runner is not removed by the operator from the github runner API, and from that point on the scaling of runners no longer works. I keep getting this error in the log until I remove the runner by hand. Any advice, or anything else I can try to solve or debug this?

EDIT: Found that a call to the kube API server for the leader-election configmap failed with context deadline exceeded. After this the operator gets a shutdown signal and restarts in a new container; I think this causes the behaviour.

E0412 10:24:41.602036 1 leaderelection.go:330] error retrieving resource lock github-actions-runner/4ef9cd91.tietoevry.com: Get "https://10.253.0.1:443/api/v1/namespaces/github-actions-runner/configmaps/4ef9cd91.tietoevry.com": context deadline exceeded
I0412 10:24:41.602323 1 leaderelection.go:283] failed to renew lease github-actions-runner/4ef9cd91.tietoevry.com: timed out waiting for the condition
2022-04-12T10:24:41.602Z INFO controller.githubactionrunner Shutdown signal received, waiting for all workers to finish {"reconciler group": "garo.tietoevry.com", "reconciler kind": "GithubActionRunner"}
2022-04-12T10:24:41.602Z DEBUG events Normal {"object": {"kind":"ConfigMap","apiVersion":"v1"}, "reason": "LeaderElection", "message": "github-actions-runner-operator-5c5c5f584-8njpz_974449cb-ae6a-4d8e-9389-40d6264b5c87 stopped leading"}
2022-04-12T10:24:41.602Z INFO controller.githubactionrunner All workers finished {"reconciler group": "garo.tietoevry.com", "reconciler kind": "GithubActionRunner"}
2022-04-12T10:24:41.603Z ERROR setup problem running manager {"error": "leader election lost"}
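If API-server latency is the trigger, one thing that might help is loosening the leader-election timings when the controller-runtime manager is created. A minimal sketch is below; the option names come from sigs.k8s.io/controller-runtime, but the durations are only example values, not the operator's defaults, and the function itself is hypothetical:

```go
// Sketch: more forgiving leader-election timeouts for a controller-runtime
// manager, so a single slow configmap/lease call is less likely to cost the
// operator its leadership. Duration values are examples only.
package example

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	leaseDuration := 60 * time.Second // how long a non-leader waits before trying to take over
	renewDeadline := 45 * time.Second // how long the leader keeps retrying to renew the lease
	retryPeriod := 10 * time.Second   // wait between renewal attempts

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "4ef9cd91.tietoevry.com",
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
}
```

With a longer RenewDeadline and RetryPeriod, a single context deadline exceeded on the configmap read does not immediately translate into "leader election lost" and an operator restart.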

ckittelmann avatar Apr 13 '22 07:04 ckittelmann

Check the list of runners at github and forcefully delete any that are malfunctioning.

davidkarlsen avatar Apr 13 '22 11:04 davidkarlsen