actions-runner-controller

Runner pods are being terminated randomly, and the logs do not explain why

Open S-G0D opened this issue 1 year ago • 2 comments

Checks

  • [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I'm not using a custom entrypoint in my runner image

Controller Version

0.27.4

Helm Chart Version

No response

CertManager Version

No response

Deployment Method

Kustomize

cert-manager installation

Yes

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is so critical that you need priority support)
  • [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • [X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gh-runner-pvc
  namespace: actions-runner-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
  storageClassName: standard
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: abc-k8s-action-runner
  namespace: actions-runner-system
spec:
  template:
    spec:
      organization: abc
      labels: ["abc-k8s-action-runner"]
      imagePullPolicy: IfNotPresent
      tolerations:
      - key: "gh_runner_only"
        operator: "Exists"
        #value: true
        effect: "NoSchedule"
      nodeSelector:
        role: action_runner
      resources:
        requests:
          memory: 2Gi
          cpu: 1
        limits:
          memory: 4Gi
          cpu: 2
      volumes:
      - name: docker
        persistentVolumeClaim:
          claimName: gh-runner-pvc
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: abc-autoscaled-action-runner
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: abc-k8s-action-runner      # Name of RunnerDeployment
  minReplicas: 0
  maxReplicas: 6
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
    - repo-abc

To Reproduce

Run a job that lasts more than 1 hour.
See that the runner gets terminated without a cancellation or any other external event.

Describe the bug

The pod is being shut down, and the controller-manager logs the messages shown under "Whole Controller Logs". The job log shows: docker: Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?.

Describe the expected behavior

The pod should remain up throughout the job. There are some jobs running longer than 1 hour. We started seeing this behaviour recently.

Whole Controller Logs

2023-09-11T09:49:05Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:05Z	INFO	runnerpod	This runner pod seems to have been deleted directly, bypassing the parent Runner resource. Marking the runner for deletion to not let it recreate this pod.	{"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:05Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:05Z	INFO	runner	Removed finalizer	{"runner": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:05Z	INFO	runnerpod	Runner pod is annotated to wait for completion, and the runner container is not restarting	{"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
2023-09-11T09:49:34Z	INFO	runnerpod	Runner pod has been stopped with a successful status.	{"runnerpod": "actions-runner-system/abc-k8s-action-runner-fnc4w-75bmp"}
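
The second log line above says the pod was deleted directly rather than through its parent Runner resource, which suggests that something outside ARC removed it. One possibility, and it is only an assumption here, is a node scale-down by Cluster Autoscaler on the runner node pool. If that applies, a minimal sketch of the same RunnerDeployment with a pod annotation asking the autoscaler not to evict runner pods might look like the following. The annotation is only honored by Cluster Autoscaler; it would not prevent other kinds of deletion such as spot/preemptible node reclamation or a manual `kubectl delete pod`.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: abc-k8s-action-runner
  namespace: actions-runner-system
spec:
  template:
    metadata:
      annotations:
        # Assumption: the cluster runs Cluster Autoscaler and it is the
        # component deleting the pod. This asks it not to evict runner pods
        # when it considers removing a node.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      organization: abc
      labels: ["abc-k8s-action-runner"]
      # ...remaining fields unchanged from the Resource Definitions above
```

If the terminations continue with this annotation in place, the namespace events and node lifecycle around the deletion timestamp (09:49:05Z in the log above) are probably the next place to look.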

Whole Runner Pod Logs

Executing:  77%|███████▋  | 43/56 [04:18<02:02,  9.40s/cell]docker: Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.

Additional Context

No response

S-G0D, Sep 11 '23 10:09

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot], Sep 11 '23 10:09

I am seeing the same behavior as well. My runner pod keeps restarting after 30+ minutes have passed. I'm using 0.27.3 with cert-manager 1.9.1, deployed with Helm.

azdanielna, Dec 14 '23 15:12