concourse-chart
Error `worker.beacon-runner.beacon.forward-conn.failed-to-dial`
I'm getting hundreds of these errors per second in my worker pod with a fairly minimal configuration of this chart:
```json
{"timestamp":"2019-12-19T10:10:49.596891570Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7777","error":"dial tcp 127.0.0.1:7777: connect: connection refused","network":"tcp","session":"4.1.4"}}
```
The configuration is:
```yaml
web:
  replicas: 1
  ingress:
    annotations:
      kubernetes.io/ingress.class: "nginx"
      nginx.ingress.kubernetes.io/proxy-body-size: "0"
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    enabled: true
    tls:
      - secretName: tls-secret
        hosts:
          - XXXXXXX
    hosts:
      - XXXXXXX
concourse:
  web:
    tsa:
      heartbeatInterval: 120s
    kubernetes:
      createTeamNamespaces: false
    externalUrl: "https://XXXXXXX"
    localAuth:
      enabled: true
    auth:
      mainTeam:
        localUser: XXXXXXX
worker:
  replicas: 1
  emptyDirSize: 20Gi
secrets:
  create: false
persistence:
  worker:
    size: 256Gi
```
I'm working off of commit 8c45b70dc559e65fd0a0a2953254873ee222a49a (which is tag v8.4.1) with a minor modification to the stateful set:
```diff
diff --git a/templates/worker-statefulset.yaml b/templates/worker-statefulset.yaml
index 80c5bb0..dd35bb2 100644
--- a/templates/worker-statefulset.yaml
+++ b/templates/worker-statefulset.yaml
@@ -56,7 +56,7 @@ spec:
         {{- end }}
         imagePullPolicy: {{ .Values.imagePullPolicy | quote }}
         securityContext:
-          privileged: true
+          allowPrivilegeEscalation: true
         command:
         - /bin/bash
         args:
@@ -280,7 +280,7 @@ spec:
{{ toYaml .Values.worker.resources | indent 12 }}
         {{- end }}
         securityContext:
-          privileged: true
+          allowPrivilegeEscalation: true
         volumeMounts:
         - name: concourse-keys
           mountPath: {{ .Values.worker.keySecretsPath | quote }}
```
Have you found a workaround?
No. I sunk a few days into trying to get it to work, and gave up on the project.
I have the same problem. Any ideas?
When I switched drivers the issue for me was with the persistent storage volumes. No guarantees, but you could try:
~~- set worker statefulset replicas=0~~
~~- delete worker persistent storage volumes~~
~~- set worker statefulset replicas back~~
@Bluesboy suggested:
persistence.enabled: false
I haven't seen the issue since I did this and switched to overlay, but I'll report back if it happens on pod cycles.
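As a values.yaml fragment, that suggestion is simply the top-level `persistence` block (per this chart's values; with persistence off, workers fall back to emptyDir scratch space):

```yaml
persistence:
  enabled: false   # no PVCs for workers; scratch space comes from emptyDir
```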
I echoed `{}` into /concourse-work-dir/garden-properties.json, scaled the set down to 0 and back up, and all workers are running 🤷‍♂️
> When I switched drivers the issue for me was with the persistent storage volumes. No guarantees, but you could try:
> - set worker statefulset replicas=0
> - delete worker persistent storage volumes
> - set worker statefulset replicas back
>
> I haven't seen the issue since I did this and switched to overlay, but I'll report back if it happens on pod cycles.
I'm also using more or less the same method, but instead of redeploying the whole chart I just delete the worker pods, and while Helm is recreating them I delete the PersistentVolumeClaims tied to those pods. After the pods come back they create new PVCs and work fine for a while. Unfortunately that doesn't solve the problem: after some time has passed it happens again. Looking forward to a fix.
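The manual cleanup described above can be sketched as follows (resource names are illustrative and assume the chart's default `concourse-work-dir` volume claim template; substitute the names from your Helm release):

```shell
# Delete a stuck worker pod; the StatefulSet recreates it.
kubectl delete pod concourse-worker-0

# While it restarts, remove the PVC bound to that pod so a
# fresh volume is provisioned on recreation.
kubectl delete pvc concourse-work-dir-concourse-worker-0
```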
@tjhiggins Actually, to completely avoid this issue you can just turn off persistence for worker pods by setting `persistence.enabled: false`.
If we're deleting the PVCs anyway, why bother with them at all?
Realized that after. FYI, this looks like it will be fixed in Concourse 6. We have been testing the latest RC and haven't seen this issue yet.
https://github.com/concourse/concourse/issues/5281
Is anyone still having this issue with newer versions of the chart and using Concourse 6.0 or higher?
We started seeing this error with one of our workers after the 6.7.0 release
I see this error on 6.7.2
We need more context on how to reproduce this. We run a bunch of integration tests with the Helm chart and never get this error, which makes me think it might be a setup error. If someone who's hitting this can share a values.yaml that reproduces the issue, we can fix it.
@taylorsilva In our case the problem was too little memory assigned to the workers 😄 🤲 Now everything is working fine. Sorry for misleading.
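If under-provisioned workers are the cause, giving them explicit resource requests and limits may help. The chart passes `worker.resources` straight through to the pod spec (as seen in the statefulset template above); the numbers below are illustrative, not chart defaults:

```yaml
worker:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      memory: 4Gi
```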
I am running 7.1.0 and i have been struggling with this issue for days. Any progress or suggestions?
I think I've been able to trace this issue. It started occurring randomly on different workers. In our specific case it's happening on LKE and is related to an update to, I believe, the Docker daemon. The root cause is:
```json
{"timestamp":"2022-02-23T04:55:01.240477901Z","level":"error","source":"guardian","message":"guardian.starting-guardian-backend","data":{"error":"bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': operation not permitted"}}
```
I believe this is related to some change in the way the daemon manages cgroups (perhaps it has switched to v2).
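One way to check that theory is to look at the filesystem type mounted at /sys/fs/cgroup on the node: `cgroup2fs` means the unified cgroups v2 hierarchy, while `tmpfs` means the legacy v1 hierarchy.

```shell
# Print the filesystem type mounted at /sys/fs/cgroup.
# "cgroup2fs" => cgroups v2 (unified hierarchy)
# "tmpfs"     => cgroups v1 (legacy hierarchy)
stat -fc %T /sys/fs/cgroup
```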
I know this is the problem because it happened on different Linode LKE worker nodes, and as I recycled them and the Concourse workers moved to the new nodes, they eventually all stopped working.
Right now, at least, I have no idea how to fix this. I'll try and find out the version of Docker daemon running on the nodes.
It's worth noting that I also only get this far with the overlay driver. With btrfs, I can't even start the worker:
```
error: failed to create btrfs filesystem: exit status 1
```
Will try overlay2 as stated here: https://docs.docker.com/engine/security/rootless/.
None of the baggage claim drivers work.
So I was able to fix this by changing the runtime to containerd in the worker: section of the helm values file. All fixed now, but only with the overlay driver, which is good enough!
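For reference, that change looks roughly like this in the values file. The exact keys vary between chart versions, so check your chart's values.yaml before applying; this sketch assumes the `concourse.worker.runtime` and `concourse.worker.baggageclaim.driver` keys found in recent charts:

```yaml
concourse:
  worker:
    runtime: containerd      # switch from the guardian runtime to containerd
    baggageclaim:
      driver: overlay        # the only driver that worked in this setup
```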