
Error `worker.beacon-runner.beacon.forward-conn.failed-to-dial`

tac-tics opened this issue 4 years ago · 16 comments

I'm getting hundreds of these errors per second in my worker pod with a fairly minimal configuration of this chart:

{"timestamp":"2019-12-19T10:10:49.596891570Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.4"}}

The configuration is:

web:
  replicas: 1
  ingress:
    annotations:
      kubernetes.io/ingress.class: "nginx"
      nginx.ingress.kubernetes.io/proxy-body-size: "0"
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    enabled: true
    tls:
    - secretName: tls-secret
      hosts:
      - XXXXXXX
    hosts:
    - XXXXXXX

concourse:
  web:
    tsa:
      heartbeatInterval: 120s
    kubernetes:
      createTeamNamespaces: false
    externalUrl: "https://XXXXXXX"
    localAuth:
      enabled: true
    auth:
      mainTeam:
        localUser: XXXXXXX

worker:
  replicas: 1
  emptyDirSize: 20Gi

secrets:
  create: false

persistence:
  worker:
    size: 256Gi

I'm working off of commit 8c45b70dc559e65fd0a0a2953254873ee222a49a (which is tag v8.4.1) with a minor modification to the stateful set:

diff --git a/templates/worker-statefulset.yaml b/templates/worker-statefulset.yaml
index 80c5bb0..dd35bb2 100644
--- a/templates/worker-statefulset.yaml
+++ b/templates/worker-statefulset.yaml
@@ -56,7 +56,7 @@ spec:
           {{- end }}
           imagePullPolicy: {{ .Values.imagePullPolicy | quote }}
           securityContext:
-            privileged: true
+            allowPrivilegeEscalation: true
           command:
             - /bin/bash
           args:
@@ -280,7 +280,7 @@ spec:
 {{ toYaml .Values.worker.resources | indent 12 }}
 {{- end }}
           securityContext:
-            privileged: true
+            allowPrivilegeEscalation: true
           volumeMounts:
             - name: concourse-keys
               mountPath: {{ .Values.worker.keySecretsPath | quote }}

tac-tics avatar Dec 19 '19 10:12 tac-tics

Have you found a workaround?

Bluesboy avatar Feb 18 '20 23:02 Bluesboy

No. I sank a few days into trying to get it to work and gave up on the project.

tac-tics avatar Feb 29 '20 01:02 tac-tics

I have the same problem. Any ideas?

mathias-ewald avatar Mar 04 '20 18:03 mathias-ewald

When I switched drivers, the issue for me was with the persistent storage volumes. No guarantees, but you could try the steps below (a rough kubectl sketch follows):

- ~~set worker statefulset replicas=0~~
- ~~delete worker persistent storage volumes~~
- ~~set worker statefulset replicas back~~

@Bluesboy suggested: persistence.enabled: false

I haven't seen the issue since I did this and switched to overlay, but I'll report back if it happens on pod cycles.
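For anyone trying this, the steps above look roughly like this with kubectl; the release name, namespace, and PVC names here are illustrative and will differ per install:

# scale the workers down so nothing holds or recreates the volumes
kubectl -n concourse scale statefulset concourse-worker --replicas=0
# find and delete the worker PVCs (one per replica); fresh ones are created on scale-up
kubectl -n concourse get pvc
kubectl -n concourse delete pvc <worker-pvc-name>
# scale back up
kubectl -n concourse scale statefulset concourse-worker --replicas=1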

tjhiggins avatar Mar 04 '20 18:03 tjhiggins

I echoed {} into /concourse-work-dir/garden-properties.json, scaled the StatefulSet down to 0 and back up, and all workers are running 🤷‍♂
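For reference, a rough sketch of those steps (pod and StatefulSet names are illustrative):

# reset Garden's properties file on the worker's persistent work dir
kubectl -n concourse exec concourse-worker-0 -- sh -c 'echo "{}" > /concourse-work-dir/garden-properties.json'
# then recycle the workers
kubectl -n concourse scale statefulset concourse-worker --replicas=0
kubectl -n concourse scale statefulset concourse-worker --replicas=1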

mathias-ewald avatar Mar 04 '20 18:03 mathias-ewald

When I switched drivers the issue for me was with the persistent storage volumes. No guarantees, but you could try:

  • set worker statefulset replicas=0
  • delete worker persistent storage volumes
  • set worker statefulset replicas back

I haven't seen the issue since I did this and switched to overlay, but I'll report back if it happens on pod cycles.

I'm also using roughly the same method, but instead of redeploying the whole chart I just delete the worker pods and, while they are being recreated by Helm, delete the PersistentVolumeClaims tied to those pods. After the pods are recreated they create new PVCs and work well for a while. Unfortunately it doesn't solve the problem: after some time has passed it happens again. Looking forward to a fix.

Bluesboy avatar Mar 14 '20 01:03 Bluesboy

@tjhiggins Actually, to avoid this issue completely you can just turn off persistence for the worker pods by setting persistence.enabled: false. If we're deleting the PVCs anyway, why bother with them at all?
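Something like this should apply that to an existing release; the release and chart names are illustrative, and --reuse-values keeps the rest of the existing configuration:

helm upgrade concourse concourse/concourse --reuse-values --set persistence.enabled=false

With persistence off, the chart should fall back to an emptyDir work dir (sized by worker.emptyDirSize, which is already set in the values above).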

Bluesboy avatar Mar 14 '20 01:03 Bluesboy

Realized that afterwards. FYI, this looks like it will be fixed in Concourse 6. We have been testing the latest RC and haven't seen this issue yet.

https://github.com/concourse/concourse/issues/5281

tjhiggins avatar Mar 14 '20 02:03 tjhiggins

Is anyone still having this issue with newer versions of the chart and using Concourse 6.0 or higher?

taylorsilva avatar May 29 '20 15:05 taylorsilva

We started seeing this error with one of our workers after the 6.7.0 release

vigneshvpra avatar Nov 13 '20 16:11 vigneshvpra

I see this error on 6.7.2

sam701 avatar Dec 08 '20 14:12 sam701

We need more context on how to reproduce this. We run a bunch of integration tests with the Helm chart and never get this error, which makes me think it might be a setup error. If someone who's having this error can share a values.yaml that reproduces the issue, we can fix it.

taylorsilva avatar Dec 14 '20 22:12 taylorsilva

@taylorsilva In our case the problem was too little memory assigned to the workers 😄 🤲 Now everything is working fine. Sorry for the misleading report.
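For anyone else hitting memory pressure on the workers, the chart exposes worker.resources (rendered via .Values.worker.resources in the statefulset shown above); the amounts here are only an example, not a recommendation:

helm upgrade concourse concourse/concourse --reuse-values \
  --set worker.resources.requests.memory=4Gi \
  --set worker.resources.limits.memory=4Gi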

sam701 avatar Dec 15 '20 08:12 sam701

I am running 7.1.0 and I have been struggling with this issue for days. Any progress or suggestions?

eddytnk avatar Jan 18 '22 02:01 eddytnk

I think I've been able to trace this issue. It started to occur randomly on different workers. In our specific case it is happening on LKE and is related to an update to, I believe, the Docker daemon. The root cause is:

{"timestamp":"2022-02-23T04:55:01.240477901Z","level":"error","source":"guardian","message":"guardian.starting-guardian-backend","data":{"error":"bulk starter: mounting subsystem 'cpuset' in '/s
bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': operation not permitted                                                                                                      bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': operation not permitted 

I believe this is related to some change in the way the daemon manages cgroups (perhaps it has switched to v2).

I know this is the problem because it happened on different Linode LKE worker nodes, and as I recycled them and the Concourse workers moved to the new nodes, they eventually all stopped working.

Right now, at least, I have no idea how to fix this. I'll try and find out the version of Docker daemon running on the nodes.
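If it helps anyone else debugging this, a generic way to check whether a node has switched to the unified cgroups v2 hierarchy is to look at the filesystem type mounted at /sys/fs/cgroup (run on the node itself, e.g. over SSH or from a debug pod):

stat -fc %T /sys/fs/cgroup
# prints "cgroup2fs" for cgroups v2 and "tmpfs" for the legacy v1 hierarchy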

It's worth noting that I also only get this far with the overlay driver. With btrfs, I can't even start the worker:

 error: failed to create btrfs filesystem: exit status 1 

Will try overlay2 as stated here: https://docs.docker.com/engine/security/rootless/.

None of the baggage claim drivers work.

meezaan avatar Feb 23 '22 04:02 meezaan

So I was able to fix this by changing the runtime to containerd in the worker: section of the helm values file. All fixed now, but only with the overlay driver, which is good enough!
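In case it helps anyone else, these settings live under the Concourse worker configuration in the values file; the exact key paths below are what recent chart versions appear to use, so double-check them against your chart's values.yaml:

helm upgrade concourse concourse/concourse --reuse-values \
  --set concourse.worker.runtime=containerd \
  --set concourse.worker.baggageclaim.driver=overlay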

meezaan avatar Feb 23 '22 09:02 meezaan