
Runner not cleaned up after completion

Open · ajardan opened this issue 2 years ago

Checks

  • [X] I've already read https://github.com/actions-runner-controller/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.

Controller Version

v0.26.0

Helm Chart Version

0.21.0

CertManager Version

v1.9.1

Deployment Method

Helm

cert-manager installation

resource "helm_release" "cert-manager" { name = "cert-manager" repository = "https://charts.jetstack.io" chart = "cert-manager" version = "1.9.1" create_namespace = true namespace = "cert-manager"

set { name = "installCRDs" value = "true" } }

Checks

  • [X] This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is so critical that you need priority support)
  • [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: Runner
metadata:
  creationTimestamp: "2022-09-22T13:59:54Z"
  finalizers:
  - runner.actions.summerwind.dev
  generation: 1
  name: eks-runner-terraform-vsftp-security
  namespace: actions-runner-system
  resourceVersion: "1143184"
  uid: aa419171-bef9-4925-9c1c-e43f4d3296a3
spec:
  dockerdContainerResources: {}
  image: ""
  labels:
  - eks_runner
  - livetest
  - live
  repository: org/repo
  resources: {}
  serviceAccountName: eks-runner-livetest-live
status:
  phase: Running
  ready: false
  registration:
    expiresAt: "2022-09-22T14:59:54Z"
    labels:
    - eks_runner
    - livetest
    - live
    repository: org/repo
    token: XXXX

To Reproduce

1. Create the runner
2. Schedule a job
3. Observe that the runner finishes but the pod is stuck (see the watch command below)
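
To see the stuck state as it happens, watching the runner and pod objects together is enough; a minimal sketch, assuming the default actions-runner-system namespace:

kubectl get runner,pod -n actions-runner-system -w

The runner container reports Terminated/Completed while the pod stays NotReady (see Additional Context below).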

Describe the bug

A runner pod gets stuck after the job is done: the runner container exits, but the pod continues to run in a NotReady state

Describe the expected behavior

The runner pod is terminated, and a new one starts

Controller Logs

2022-09-22T14:01:24Z	DEBUG	actions-runner-controller.runner	Runner appears to have been registered and running.	{"runner": "actions-runner-system/eks-runner-terraform-vsftp-security", "podCreationTimestamp": "2022-09-22 13:59:54 +0000 UTC"}

Runner Pod Logs

2022-09-22 13:59:56.763  DEBUG --- Github endpoint URL https://github.com/
2022-09-22 13:59:57.212  DEBUG --- Passing --ephemeral to config.sh to enable the ephemeral runner.
2022-09-22 13:59:57.215  DEBUG --- Configuring the runner.

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication


√ Connected to GitHub

# Runner Registration




√ Runner successfully added
√ Runner connection is good

# Runner settings


√ Settings Saved.

2022-09-22 14:00:01.667  DEBUG --- Runner successfully configured.
{
  "agentId": 142,
  "agentName": "eks-runner-terraform-vsftp-security",
  "poolId": 1,
  "poolName": "Default",
  "ephemeral": true,
  "serverUrl": "https://pipelines.actions.githubusercontent.com/somelongid",
  "gitHubUrl": "https://github.com/org/repo",
  "workFolder": "/runner/_work"
}
2022-09-22 14:00:01.677  DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2022-09-22 14:00:01.678  DEBUG --- Waiting until Docker is available or the timeout is reached
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

√ Connected to GitHub

Current runner version: '2.296.2'
2022-09-22 14:00:03Z: Listening for Jobs
2022-09-22 14:00:35Z: Running job: Check PR
2022-09-22 14:01:22Z: Job Check PR completed with result: Succeeded
√ Removed .credentials
√ Removed .runner
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner...


Generating RSA private key, 4096 bit long modulus (2 primes)
........++++
........................++++
e is 65537 (0x010001)
Generating RSA private key, 4096 bit long modulus (2 primes)
...................................................................++++
.............................++++
e is 65537 (0x010001)
Signature ok
subject=CN = docker:dind server
Getting CA Private Key
/certs/server/cert.pem: OK
Generating RSA private key, 4096 bit long modulus (2 primes)
..............................................++++
.....................................................................................++++
e is 65537 (0x010001)
Signature ok
subject=CN = docker:dind client
Getting CA Private Key
/certs/client/cert.pem: OK
time="2022-09-22T14:00:00.250350969Z" level=info msg="Starting up"
time="2022-09-22T14:00:00.252608955Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
time="2022-09-22T14:00:00.253606299Z" level=info msg="libcontainerd: started new containerd process" pid=63
time="2022-09-22T14:00:00.253642630Z" level=info msg="parsed scheme: \"unix\"" module=grpc
time="2022-09-22T14:00:00.253659650Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
time="2022-09-22T14:00:00.253701962Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
time="2022-09-22T14:00:00.253716374Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
time="2022-09-22T14:00:00Z" level=warning msg="containerd config version `1` has been deprecated and will be removed in containerd v2.0, please switch to version `2`, see https://github.com/containerd/containerd/blob/main/docs/PLUGINS.md#version-header"
time="2022-09-22T14:00:00.274983011Z" level=info msg="starting containerd" revision=9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6 version=v1.6.8
time="2022-09-22T14:00:00.308012401Z" level=info msg="loading plugin \"io.containerd.content.v1.content\"..." type=io.containerd.content.v1
time="2022-09-22T14:00:00.308683945Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1
time="2022-09-22T14:00:00.324630578Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"ip: can't find device 'aufs'\\nmodprobe: can't change directory to '/lib/modules': No such file or directory\\n\"): skip plugin" type=io.containerd.snapshotter.v1
time="2022-09-22T14:00:00.324676835Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.btrfs\"..." type=io.containerd.snapshotter.v1
time="2022-09-22T14:00:00.324969917Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.btrfs\"..." error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs (xfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2022-09-22T14:00:00.325002973Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
time="2022-09-22T14:00:00.325024617Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
time="2022-09-22T14:00:00.325075702Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
time="2022-09-22T14:00:00.325160792Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
time="2022-09-22T14:00:00.325436978Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
time="2022-09-22T14:00:00.325660384Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2022-09-22T14:00:00.325691263Z" level=info msg="loading plugin \"io.containerd.metadata.v1.bolt\"..." type=io.containerd.metadata.v1
time="2022-09-22T14:00:00.325744240Z" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
time="2022-09-22T14:00:00.325763257Z" level=info msg="metadata content store policy set" policy=shared
time="2022-09-22T14:00:00.331575052Z" level=info msg="loading plugin \"io.containerd.differ.v1.walking\"..." type=io.containerd.differ.v1
time="2022-09-22T14:00:00.331608182Z" level=info msg="loading plugin \"io.containerd.event.v1.exchange\"..." type=io.containerd.event.v1
time="2022-09-22T14:00:00.331623475Z" level=info msg="loading plugin \"io.containerd.gc.v1.scheduler\"..." type=io.containerd.gc.v1
time="2022-09-22T14:00:00.331675318Z" level=info msg="loading plugin \"io.containerd.service.v1.introspection-service\"..." type=io.containerd.service.v1
time="2022-09-22T14:00:00.331706018Z" level=info msg="loading plugin \"io.containerd.service.v1.containers-service\"..." type=io.containerd.service.v1
time="2022-09-22T14:00:00.331747642Z" level=info msg="loading plugin \"io.containerd.service.v1.content-service\"..." type=io.containerd.service.v1
time="2022-09-22T14:00:00.331769135Z" level=info msg="loading plugin \"io.containerd.service.v1.diff-service\"..." type=io.containerd.service.v1
time="2022-09-22T14:00:00.332070527Z" level=info msg="loading plugin \"io.containerd.service.v1.images-service\"..." type=io.containerd.service.v1
time="2022-09-22T14:00:00.332090515Z" level=info msg="loading plugin \"io.containerd.service.v1.leases-service\"..." type=io.containerd.service.v1
time="2022-09-22T14:00:00.332110778Z" level=info msg="loading plugin \"io.containerd.service.v1.namespaces-service\"..." type=io.containerd.service.v1
time="2022-09-22T14:00:00.332132322Z" level=info msg="loading plugin \"io.containerd.service.v1.snapshots-service\"..." type=io.containerd.service.v1
time="2022-09-22T14:00:00.332151702Z" level=info msg="loading plugin \"io.containerd.runtime.v1.linux\"..." type=io.containerd.runtime.v1
time="2022-09-22T14:00:00.332293526Z" level=info msg="loading plugin \"io.containerd.runtime.v2.task\"..." type=io.containerd.runtime.v2
time="2022-09-22T14:00:00.332425401Z" level=info msg="loading plugin \"io.containerd.monitor.v1.cgroups\"..." type=io.containerd.monitor.v1
time="2022-09-22T14:00:00.332781793Z" level=info msg="loading plugin \"io.containerd.service.v1.tasks-service\"..." type=io.containerd.service.v1
time="2022-09-22T14:00:00.332817372Z" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.332835663Z" level=info msg="loading plugin \"io.containerd.internal.v1.restart\"..." type=io.containerd.internal.v1
time="2022-09-22T14:00:00.332879154Z" level=info msg="loading plugin \"io.containerd.grpc.v1.containers\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.332901754Z" level=info msg="loading plugin \"io.containerd.grpc.v1.content\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.332934445Z" level=info msg="loading plugin \"io.containerd.grpc.v1.diff\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.332949122Z" level=info msg="loading plugin \"io.containerd.grpc.v1.events\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.332964617Z" level=info msg="loading plugin \"io.containerd.grpc.v1.healthcheck\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.332985504Z" level=info msg="loading plugin \"io.containerd.grpc.v1.images\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.333000145Z" level=info msg="loading plugin \"io.containerd.grpc.v1.leases\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.333020447Z" level=info msg="loading plugin \"io.containerd.grpc.v1.namespaces\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.333042101Z" level=info msg="loading plugin \"io.containerd.internal.v1.opt\"..." type=io.containerd.internal.v1
time="2022-09-22T14:00:00.333211925Z" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.333238060Z" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.333255541Z" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
time="2022-09-22T14:00:00.333275067Z" level=info msg="loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." type=io.containerd.tracing.processor.v1
time="2022-09-22T14:00:00.333299976Z" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
time="2022-09-22T14:00:00.333322524Z" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
time="2022-09-22T14:00:00.333374814Z" level=error msg="failed to initialize a tracing processor \"otlp\"" error="no OpenTelemetry endpoint: skip plugin"
time="2022-09-22T14:00:00.337277294Z" level=info msg=serving... address=/var/run/docker/containerd/containerd-debug.sock
time="2022-09-22T14:00:00.337384211Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock.ttrpc
time="2022-09-22T14:00:00.337461354Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock
time="2022-09-22T14:00:00.337500359Z" level=info msg="containerd successfully booted in 0.063798s"
time="2022-09-22T14:00:00.343443667Z" level=info msg="parsed scheme: \"unix\"" module=grpc
time="2022-09-22T14:00:00.343469171Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
time="2022-09-22T14:00:00.343490468Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
time="2022-09-22T14:00:00.343536439Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
time="2022-09-22T14:00:00.344576449Z" level=info msg="parsed scheme: \"unix\"" module=grpc
time="2022-09-22T14:00:00.344595783Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
time="2022-09-22T14:00:00.344671525Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///var/run/docker/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}" module=grpc
time="2022-09-22T14:00:00.344685028Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
time="2022-09-22T14:00:00.364893834Z" level=warning msg="Your kernel does not support cgroup blkio weight"
time="2022-09-22T14:00:00.364912852Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
time="2022-09-22T14:00:00.365109632Z" level=info msg="Loading containers: start."
time="2022-09-22T14:00:00.463692617Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
time="2022-09-22T14:00:00.542943451Z" level=info msg="Loading containers: done."
time="2022-09-22T14:00:00.555967430Z" level=info msg="Docker daemon" commit=e42327a graphdriver(s)=overlay2 version=20.10.18
time="2022-09-22T14:00:00.556092973Z" level=info msg="Daemon has completed initialization"
time="2022-09-22T14:00:00.635477377Z" level=info msg="API listen on /var/run/docker.sock"
time="2022-09-22T14:00:00.646423616Z" level=info msg="API listen on [::]:2376"

Additional Context

The pod status:

NAME                                  READY   STATUS     RESTARTS   AGE
eks-runner-terraform-vsftp-security   1/2     NotReady   0          2m8s

runner container status:

Containers:
  runner:
    Container ID:  docker://9b258234e9e289b251c4f14889ca37f995955e5c3a55b3236f696b3300cdeca2
    Image:         summerwind/actions-runner:latest
    Image ID:      docker-pullable://summerwind/actions-runner@sha256:771a21d0c6f4ce2c403aa52fe2524b8a1a83dd70430ae6468cef9e9fa3095ea5
    Port:
    Host Port:
    State:         Terminated
      Reason:      Completed
      Exit Code:   0
      Started:     Thu, 22 Sep 2022 15:59:56 +0200
      Finished:    Thu, 22 Sep 2022 16:01:22 +0200

ajardan avatar Sep 22 '22 14:09 ajardan

@ajardan Hey! This can't be investigated further with the provided information. It's working fine for me so this might be due to some edge-cases coming from your configuration or your environment. Can you provide full logs from ARC and full kubectl describe or kubectl get output for the pods and runners?

mumoshu avatar Sep 22 '22 23:09 mumoshu
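
For reference, the diagnostics mumoshu asks for can be gathered roughly like this; a sketch, assuming the controller was installed under the default release name and namespace:

# Full controller logs plus the state of the runners and their pods
kubectl -n actions-runner-system logs deploy/actions-runner-controller > controller.log
kubectl -n actions-runner-system describe runner,pod eks-runner-terraform-vsftp-security
kubectl -n actions-runner-system get runner,pod -o yaml > state.yaml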

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Oct 23 '22 02:10 github-actions[bot]

@mumoshu

I am running into a similar issue wherein the runner container exits with code 0 and then the pod sticks around until ARC comes and cleans it up. Previously, the pods would self-terminate when the runner container exited with code 0, and now they don't.

We have changed a few things:

  • We are trying to install ARC through ArgoCD (helm) now
  • We updated our runner image from a pinned SHA to :latest.

Where would you recommend we look for configuration errors that are preventing the pods from exhibiting the correct behavior?

jrkarnes avatar Nov 02 '22 16:11 jrkarnes

The same here

brnpimentel avatar Nov 17 '22 10:11 brnpimentel

Previously, the pods would self-terminate when the runner container exited with code 0

@jrkarnes What do you mean by "self-terminate" here? The pod does not terminate on its own; it needs to be deleted via a K8s DELETE Pod API call. Usually, ARC should detect the terminated container(s) in a runner pod and react by calling the delete-pod API. Perhaps that isn't working for you for whatever reason? 🤔 Unfortunately, I can't debug further with the provided information. Please file a dedicated issue linking to this one, and do share the complete controller logs for investigation. A one-line excerpt from the controller log doesn't help (and that's what's provided in this bug report). "The same here" doesn't help either, because there's no way for me to tell whether it's actually the same issue or a completely different one that just looks similar.

mumoshu avatar Nov 17 '22 11:11 mumoshu

In case it helps some of you folks here: we were experiencing the same issue on our side, and it was due to a missing CRD upgrade after upgrading from 0.20.xx to 0.26.0.

There was a change of behaviour after 0.21.xx that requires a CRD upgrade for this functionality (cleaning up runners after they terminate) to keep working. If you don't upgrade the CRDs, you'll have to rely on the sync to clean up the pods (and you're probably also setting yourself up for a bunch of other problems).

@mumoshu describes the problem and the fix here https://github.com/actions-runner-controller/actions-runner-controller/issues/1291#issuecomment-1085243956 and here https://github.com/actions-runner-controller/actions-runner-controller/issues/1291#issuecomment-1085293928

edit: ping @ajardan @jrkarnes @brnpimentel

kuuji avatar Nov 29 '22 21:11 kuuji
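
For reference, the CRD refresh kuuji describes usually boils down to something like this; a sketch, assuming the ARC repository matching your target version is checked out locally and the controller runs in the default namespace (see charts/actions-runner-controller/docs/UPGRADING.md for the authoritative procedure):

# helm upgrade does not update CRDs after the first install; re-apply them manually
kubectl replace -f charts/actions-runner-controller/crds/
kubectl -n actions-runner-system rollout restart deploy/actions-runner-controller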

I am seeing the same issue with v0.26.0 of actions-runner-controller running on EKS 1.23. I tried adjusting the syncPeriod via values of the helm chart, and even confirmed that the pod yaml showed the setting. Yet it seemed to have no effect. I also watched a Runner pod hang in NotReady for over 10 minutes.

Downgrading to v0.21.0 of actions-runner-controller, and 0.16.1 of the helm chart works. I see a near instant termination of the pod instead of NotReady.

edgan avatar Dec 05 '22 18:12 edgan
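
For context, the sync interval edgan adjusted is exposed as a chart value; a minimal sketch, assuming the chart's default release name and namespace (note that, per the report above, shortening it did not resolve the stuck pods):

helm upgrade --install actions-runner-controller \
  actions-runner-controller/actions-runner-controller \
  --namespace actions-runner-system --set syncPeriod=1m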

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jan 05 '23 01:01 github-actions[bot]

/remove stale

dschunack avatar Jan 06 '23 09:01 dschunack

Hi,

we noticed the same problem with the latest version on EKS 1.23. Any update or progress on this?

dschunack avatar Jan 06 '23 09:01 dschunack

I have the same setup as the first post: the Helm chart with the same values. The pod is not cleaned up:

√ Removed .credentials
√ Removed .runner
"Runner listener exit with 0 return code, stop the service, no retry needed."
"Exiting runner..."

Even though it was a clean install with installCRD=true, I replaced the CRDs following the recommended procedure and deleted the pods. The result was the same.

I am running on AKS version 1.23.12

It is only happening on Windows runners. The runners have this issue with version 2.299.1 and with the latest one, 2.300.2.

I have tried downgrading to Helm chart 1.17.0, which has app version 0.21.1, and going back again to 0.21.1, but the issue is the same on both versions.

The Docker image for Windows runners worked as it should on a previous AKS v1.19 cluster with summerwind/actions-runner-controller:v0.19.0.

mazilu88 avatar Jan 09 '23 15:01 mazilu88

Any advice on how to debug this is highly appreciated.

The message from the controller is:

INFO  actions-runner-controller.runnerpod  Runner pod is annotated to wait for completion, and the runner container is not restarting

mazilu88 avatar Jan 10 '23 07:01 mazilu88

@mazilu88 I migrated my configuration to a RunnerDeployment, and everything seems to work well this way.

ajardan avatar Jan 10 '23 09:01 ajardan
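
For anyone attempting the same migration, a RunnerDeployment equivalent to the Runner at the top of this issue might look like the following; a sketch, with the metadata name being illustrative and the other fields copied from the original resource definition:

kubectl apply -f - <<'EOF'
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: eks-runner-livetest
  namespace: actions-runner-system
spec:
  replicas: 1
  template:
    spec:
      repository: org/repo
      labels:
        - eks_runner
        - livetest
        - live
      serviceAccountName: eks-runner-livetest-live
EOF

With a RunnerDeployment the controller owns the pod lifecycle and replaces each ephemeral runner after its job completes.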

Thank you for the reply.

I am using an RD (RunnerDeployment) and an HRA (HorizontalRunnerAutoscaler), so sadly that is not the fix for me

mazilu88 avatar Jan 10 '23 09:01 mazilu88

The solution for me was to configure the entrypoint as per the documentation: https://github.com/actions/actions-runner-controller/blob/3ede9b5a0159a5e0703ccae6eebfdc89defe2b8f/docs/configuring-windows-runners.md

In the initial setup, I had ENTRYPOINT ["pwsh", "-c", "./configure.ps1"] and called ./run.cmd from inside configure.ps1.

I changed it to ENTRYPOINT ["pwsh", "-c", "./configure.ps1; ./run.cmd"]

mazilu88 avatar Jan 10 '23 11:01 mazilu88

I think this issue may be fixed in the latest version. This is what I ran on my K8s:

# Install cert-manager
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Install chart
helm install --wait --create-namespace --namespace cert-manager cert-manager jetstack/cert-manager --version v1.3.0 --set installCRDs=true

# Install actions-runner-controller
helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller

# Install chart
helm upgrade --install --namespace actions-runner-system --create-namespace --set=authSecret.create=true --set=authSecret.enabled=true --set=authSecret.github_token="token_with_repo_admin:org_goes_here" --wait actions-runner-controller actions-runner-controller/actions-runner-controller

And now the pods disappear very quickly after a job finishes.

actions-runner-controller app version: v0.27.0, chart version: 0.22.0

emmahsax avatar Feb 08 '23 16:02 emmahsax

We are still seeing this in v0.27.0

selenium-e2e-jpl4t-8442l                                          1/1     Terminating   0          3h22m
selenium-e2e-jpl4t-9r89j                                          1/1     Terminating   0          3h21m
selenium-e2e-jpl4t-cwqzl                                          1/1     Terminating   0          3h19m
selenium-e2e-jpl4t-jnl6t                                          1/1     Terminating   0          3h21m

The node the pod belongs to no longer appears under kubectl get nodes, the pod is stuck terminating with a finalizer that prevents it from closing, and the only message from the controller is:

kubectl logs actions-runner-controller-6b77bf7bf6-l2cpm | grep selenium-e2e-jpl4t-cwqzl | tail


2023-02-14T23:22:13Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}
2023-02-14T23:22:35Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}
2023-02-14T23:23:18Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}
2023-02-14T23:23:35Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}
2023-02-14T23:24:23Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}
2023-02-14T23:24:35Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}
2023-02-14T23:25:28Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}
2023-02-14T23:25:35Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}
2023-02-14T23:26:34Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}
2023-02-14T23:26:35Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "gh-runner/selenium-e2e-jpl4t-cwqzl"}

cep21 avatar Feb 14 '23 23:02 cep21
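
Not a fix, but for unwedging pods like the above once their node is gone, a generic Kubernetes last resort is to clear the finalizer by hand so the delete can complete; the namespace and pod name below are taken from the logs above, so use with care:

# Removing the finalizer lets the API server finish the pending delete
kubectl -n gh-runner patch pod selenium-e2e-jpl4t-cwqzl \
  --type=merge -p '{"metadata":{"finalizers":null}}'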

We are still seeing this in v0.27.0

Then it must be some combination of other versioning issues, such as the K8s cluster version or the chart version. I have not seen any of this, and I've now set up two new clusters with no issues.

emmahsax avatar Feb 15 '23 15:02 emmahsax

And now the pods disappear very quickly after a job finishes.

@emmahsax can you share which k8s version you are using?

brunomiranda-hotmart avatar Mar 06 '23 18:03 brunomiranda-hotmart

We are using Kubernetes 1.24 with AWS EKS.

emmahsax avatar Mar 06 '23 19:03 emmahsax

@emmahsax Assuming you're using the horizontal autoscaler, take a look at RUNNER_GRACEFUL_STOP_TIMEOUT and terminationGracePeriodSeconds in the docs. It seems to have solved the issue for me.

garbelini avatar Mar 08 '23 01:03 garbelini
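
For reference, wiring the two settings garbelini mentions together looks roughly like this in a RunnerDeployment; a sketch with illustrative numbers, where terminationGracePeriodSeconds must exceed RUNNER_GRACEFUL_STOP_TIMEOUT so the runner can deregister before the pod is killed:

kubectl apply -f - <<'EOF'
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: org/repo
      # Pod-level grace period, longer than the runner's graceful stop timeout
      terminationGracePeriodSeconds: 110
      env:
        # Seconds the runner waits for a running job before stopping
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "90"
EOF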

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 07 '23 01:04 github-actions[bot]