github-actions-runner-operator

After a couple of hours of not using the runner, it will become offline

Open ni-aackerman opened this issue 1 year ago • 8 comments

So after everything is up and running, I did run some jobs and all was good, but then, after not using the runner for some hours, it becomes offline and I have to delete the runner pool and delete the runner from GitHub so that it gets recreated. Is there some setting like an idle timeout or something like that? I do have the minimum runners set to 1 in my CRD.

ni-aackerman avatar Apr 17 '23 10:04 ni-aackerman

Hello?

ni-aackerman avatar Jun 17 '23 22:06 ni-aackerman

What do the logs of the pod say? I am no longer working with tietoevry and no longer have write access to the repo, but I might be able to shed some light on what is wrong.

davidkarlsen avatar Jun 18 '23 20:06 davidkarlsen

Hi @davidkarlsen

we have these entries in the logs of the pod:

# Runner removal

Cannot connect to server, because config files are missing. Skipping removing runner from the server.
Does not exist. Skipping Removing .credentials
Does not exist. Skipping Removing .runner


--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication


√ Connected to GitHub

# Runner Registration




A runner exists with the same name
A runner exists with the same name runner-pool-pod-s4jd2.

Looks like it can't be removed, which causes a crash loop since a runner with the same name is already registered. Do you know where to look to find out why the removal is failing? The log only says:

Cannot connect to server, because config files are missing.

ni-skopp avatar Jun 19 '23 10:06 ni-skopp

Then there is already a runner registered with this name in the GH console; force-remove those. It might also be wise to set the pod restartPolicy to Never, then the pods won't reappear and end up in this state.
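
For reference, a stale runner can also be removed via the GitHub REST API instead of the console. A rough sketch, assuming org-level runners, the gh CLI, and placeholder ORG / RUNNER_ID values:

# list the registered self-hosted runners and note the id of the offline one
gh api /orgs/ORG/actions/runners

# remove it (use /repos/OWNER/REPO/actions/runners/RUNNER_ID for repo-level runners)
gh api -X DELETE /orgs/ORG/actions/runners/RUNNER_ID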

davidkarlsen avatar Jun 19 '23 11:06 davidkarlsen

BTW: Love traktor ;-)

davidkarlsen avatar Jun 19 '23 11:06 davidkarlsen

I think this is the problem: we always need to force-remove it and sometimes recreate the runner pool. I've added restartPolicy: Never, let's see if it helps:

apiVersion: garo.tietoevry.com/v1alpha1
kind: GithubActionRunner
metadata:
  name: runner-pool
  namespace: github-actions-runner-operator
spec:
  minRunners: 1
  maxRunners: 9
  reconciliationPeriod: 1m
  podTemplateSpec:
    metadata:
      annotations:
        "prometheus.io/scrape": "true"
        "prometheus.io/port": "3903"
    spec:
      restartPolicy: Never
      affinity:

ni-skopp avatar Jun 19 '23 13:06 ni-skopp

@davidkarlsen just wanted to confirm that we haven't seen any issues for a week now; looks like it helped, thank you!

ni-skopp avatar Jun 26 '23 15:06 ni-skopp

@davidkarlsen it happened again, even with the restartPolicy: Never.

In the meantime, until a fix is found, I ended up setting up a CronJob in Kubernetes that deletes the pod every day. That way a new one is spawned and automatically registered in GitHub. I know it's not a real solution; it just lets us focus on other, more urgent things. Here is the manifest for such a CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: delete-runner-pod
spec:
  schedule: "00 22 * * *"  # daily at 22:00 UTC, i.e. midnight Berlin time during CEST
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kubectl
            image: bitnami/kubectl  # Using a kubectl container image
            command:
            - /bin/sh
            - -c
            - kubectl get pods --no-headers=true | awk '/^runner-pool-pod-/ {print $1}' | xargs -I {} kubectl delete pod {} --grace-period=0 --force
          restartPolicy: OnFailure
      ttlSecondsAfterFinished: 172800 # delete job after 48 hours
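
Note that the kubectl calls in this CronJob only work if the Job's pod runs with a service account that is allowed to list and delete pods in the runner namespace; the default service account usually is not. A minimal sketch, assuming the CronJob lives in the github-actions-runner-operator namespace and using a hypothetical runner-pod-deleter account:

# service account the CronJob pod will run as (hypothetical name)
kubectl -n github-actions-runner-operator create serviceaccount runner-pod-deleter

# role allowing it to list and delete pods in that namespace
kubectl -n github-actions-runner-operator create role runner-pod-deleter --verb=get,list,delete --resource=pods

# bind the role to the service account
kubectl -n github-actions-runner-operator create rolebinding runner-pod-deleter --role=runner-pod-deleter --serviceaccount=github-actions-runner-operator:runner-pod-deleter

The Job's pod spec would then reference it with serviceAccountName: runner-pod-deleter.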

ni-aackerman avatar Jul 24 '23 10:07 ni-aackerman