Runners keep throwing docker Daemon running

Open sravula84 opened this issue 1 year ago • 13 comments

Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

actions-runner-controller-0.22.0

Deployment Method

Helm

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1) Installed ARC with a GitHub token
2) Configured a RunnerDeployment with 10 replicas
3) Configured a HorizontalRunnerAutoscaler with min 10, max 50
4) Runners are in the Running state, but after some time new runners fail with the "Is the docker daemon running?" error and jobs wait in the queue for a runner to pick them up

This happens for all new runners, and then the pods go into the Error state.
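
For anyone trying to reproduce this, steps 2 and 3 might correspond to manifests along the following lines. This is only a sketch, not the reporter's actual configuration: the resource names, namespace, organization, and the PercentageRunnersBusy metric are taken from the controller logs in this issue, while the threshold and factor values are illustrative assumptions. Note that the logs report max 20, whereas the steps above say max 50.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-action-np              # name as it appears in the controller logs
  namespace: actions-runner-systems
spec:
  replicas: 10                        # step 2
  template:
    spec:
      organization: prosperllc        # organization as it appears in the logs
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
  namespace: actions-runner-systems
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: github-action-np
  minReplicas: 10                     # step 3
  maxReplicas: 50                     # step 3 (the logs show 20)
  metrics:
  - type: PercentageRunnersBusy
    # Assumed values for illustration; the issue does not state them
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpFactor: '1.4'
    scaleDownFactor: '0.7'
```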

Describe the bug

Runners are in the Running state, but after some time new runners fail with the "Is the docker daemon running?" error and jobs wait in the queue for a runner to pick them up.

This happens for all new runners, and then the pods go into the Error state.

Describe the expected behavior

Based on the load, the horizontal autoscaler should scale the runners, but instead new runners fail with the "Is the docker daemon running?" error.

Additional Context

❯ helm get values actions-runner-controller
USER-SUPPLIED VALUES:
authSecret:
  create: true
  github_token: ""  # supplied GitHub token

Controller Logs

a2844b0833", "allowed": true}
2024-01-12T21:31:30Z	INFO	runner	Failed to create pod due to AlreadyExists error. Probably this pod has been already created in previous reconcilation but is still not in the informer cache. Will retry on pod created. If it doesn't repeat, there's no problem	{"runner": "actions-runner-systems/github-action-np-h6z4z-gf9pg"}
2024-01-12T21:31:31Z	DEBUG	runner	Runner appears to have been registered and running.	{"runner": "actions-runner-systems/github-action-np-h6z4z-gf9pg", "podCreationTimestamp": "2024-01-12 21:31:30 +0000 UTC"}
2024-01-12T21:31:36Z	INFO	runnerpod	Failed to delete pod within 1m0s. This is typically the case when a Kubernetes node became unreachable and the kube controller started evicting nodes. Forcefully deleting the pod to not get stuck.	{"runnerpod": "actions-runner-systems/github-action-np-h6z4z-qjc6z", "podDeletionTimestamp": "2024-01-12 21:30:25 +0000 UTC", "currentTime": "2024-01-12T21:31:36Z", "configuredDeletionTimeout": "1m0s"}
2024-01-12T21:31:36Z	INFO	runnerpod	Forcefully deleted runner pod	{"runnerpod": "actions-runner-systems/github-action-np-h6z4z-qjc6z", "repository": ""}
2024-01-12T21:31:36Z	DEBUG	events	Forcefully deleted pod 'github-action-np-h6z4z-qjc6z'	{"type": "Normal", "object": {"kind":"Pod","namespace":"actions-runner-systems","name":"github-action-np-h6z4z-qjc6z","uid":"3be68602-3db3-4803-a2b7-9ae0ec52df94","apiVersion":"v1","resourceVersion":"605720"}, "reason": "PodDeleted"}
2024-01-12T21:31:39Z	DEBUG	horizontalrunnerautoscaler	Suggested desired replicas of 10 by PercentageRunnersBusy	{"replicas_desired_before": 10, "replicas_desired": 10, "num_runners": 10, "num_runners_registered": 9, "num_runners_busy": 6, "num_terminating_busy": 0, "namespace": "actions-runner-systems", "kind": "runnerdeployment", "name": "github-action-np", "horizontal_runner_autoscaler": "example-runner-deployment-autoscaler", "enterprise": "", "organization": "prosperllc", "repository": ""}
2024-01-12T21:31:39Z	DEBUG	horizontalrunnerautoscaler	Calculated desired replicas of 10	{"horizontalrunnerautoscaler": "actions-runner-systems/example-runner-deployment-autoscaler", "suggested": 10, "reserved": 0, "min": 10, "max": 20}
2024-01-12T21:32:19Z	DEBUG	runner	Runner appears to have been registered and running.	{"runner": "actions-runner-systems/github-action-np-h6z4z-9f5cp", "podCreationTimestamp": "2024-01-12 21:25:58 +0000 UTC"}
2024-01-12T21:32:19Z	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "1e47c97c-345a-4b39-825f-67129cf2201d", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2024-01-12T21:32:19Z	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "1e47c97c-345a-4b39-825f-67129cf2201d", "allowed": true}
2024-01-12T21:32:19Z	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "89c6d115-f817-472c-b629-0489fd90e10e", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2024-01-12T21:32:19Z	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "89c6d115-f817-472c-b629-0489fd90e10e", "allowed": true}

Runner Pod Logs

"https://pipelinesghubeus21.actions.githubusercontent.com/tMTkzAKYleoidiHAI9FjPaHPkEkp2s7TIoUW3BW1740YmeFlFo/",
  "gitHubUrl": "https://github.com/prosperllc",
  "workFolder": "/runner/_work"
2024-01-12 21:34:47.510  DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2024-01-12 21:34:47.512  DEBUG --- Waiting until Docker is available or the timeout of 120 seconds is reached
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Cannot connect to the Docker daemon at tcp://localhost:2376. Is the docker daemon running?

sravula84 avatar Jan 12 '24 21:01 sravula84

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Jan 12 '24 21:01 github-actions[bot]

This is the message we are seeing:

2024-01-12 22:19:43.778 DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2024-01-12 22:19:43.780 DEBUG --- Waiting until Docker is available or the timeout of 120 seconds is reached
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Cannot connect to the Docker daemon at tcp://localhost:2376. Is the docker daemon running?

sravula84 avatar Jan 12 '24 22:01 sravula84

Hi Team, can anyone advise on this issue?

Thanks, Sridhar

sravula84 avatar Jan 13 '24 19:01 sravula84

Hi Team,

Any suggestions on the above?

Thanks, Sridhar

sravula84 avatar Jan 18 '24 01:01 sravula84

Hi @nikola-jokic, can you please advise on the above ticket?

sravula84 avatar Jan 22 '24 18:01 sravula84

We just had a similar issue in one of our clusters today. We tracked the root cause to the latest v25.0.0 release of docker:dind and ended up setting the Helm value image.dindSidecarRepositoryAndTag to docker:24.0.7-dind, which solved the issue. Maybe give that a try.

emilwangaa avatar Jan 22 '24 20:01 emilwangaa

@emilwangaa So do I need to upgrade my actions-runner-controller chart to 25.0.0 and update the docker dind version?

prosper-sre avatar Jan 24 '24 21:01 prosper-sre

> @emilwangaa So do I need to upgrade my actions-runner-controller chart to 25.0.0 and update the docker dind version?

@prosper-sre I don't think you need to update your ARC chart unless it doesn't support setting the dind version. The default setting for the ARC chart is to pull the latest version of dind, which caused issues for us.

emilwangaa avatar Jan 27 '24 09:01 emilwangaa

Hello @emilwangaa, could you share how you changed image.dindSidecarRepositoryAndTag: docker:24.0.7-dind? I tried

helm upgrade --install -f custom-values.yaml --namespace actions-runner-system --create-namespace --wait actions-runner-controller actions-runner-controller/actions-runner-controller --set image.dindSidecarRepositoryAndTag=docker:24.0.7-dind --version ${CHART_VERSION}

but I still see the image dind:dind configured on the re-deployed runners

gera-aldama avatar Feb 03 '24 00:02 gera-aldama

> Hello @emilwangaa, could you share how you changed image.dindSidecarRepositoryAndTag: docker:24.0.7-dind? I tried
>
> helm upgrade --install -f custom-values.yaml --namespace actions-runner-system --create-namespace --wait actions-runner-controller actions-runner-controller/actions-runner-controller --set image.dindSidecarRepositoryAndTag=docker:24.0.7-dind --version ${CHART_VERSION}
>
> but I still see the image dind:dind configured on the re-deployed runners

We use Terraform to install the chart, but your method looks right. Which version of the chart are you using? And have you tried setting it in the custom-values.yaml file that you specify instead?
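
Setting it in the values file instead of via --set might look like the fragment below. This is a minimal sketch, assuming custom-values.yaml is the same file passed with -f in the command above; the key and the tag are the ones mentioned earlier in this thread.

```yaml
# custom-values.yaml (fragment)
image:
  # Pin the dind sidecar to a fixed tag instead of relying on the chart default
  dindSidecarRepositoryAndTag: "docker:24.0.7-dind"
```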

emilwangaa avatar Feb 03 '24 07:02 emilwangaa

I have done the upgrade but am still seeing the same issue. It looks like one of the workflows is causing this problem, and it is in a monorepo. Something is wrong with it; I need to figure it out.

I think it is purely a resource issue. Are there any recommended specifications for CPU and memory?
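
For context, the thread gives no recommended numbers. As far as I know, the RunnerDeployment spec takes separate resource settings for the runner container (resources) and for the dind sidecar (dockerdContainerResources), so a sketch of where they would go could look like this. The CPU and memory values are placeholders, not recommendations, and the resource names are copied from the controller logs.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-action-np
  namespace: actions-runner-systems
spec:
  template:
    spec:
      organization: prosperllc
      # Runner container; placeholder values, tune for your workflows
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
      # dind sidecar container; placeholder values
      dockerdContainerResources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
```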

sravula84 avatar Feb 05 '24 17:02 sravula84

Thanks for the response, @emilwangaa. I can now see docker:24.0.7-dind, but the issue still persists. I'm using version 0.23.7.

gera-aldama avatar Feb 06 '24 21:02 gera-aldama

@emilwangaa I also did a version update and am experiencing the same issue, maybe it was fixed after updating to that version?

95jinhong avatar Feb 27 '24 08:02 95jinhong