traefik-helm-chart
Traefik stops being monitorable during graceTimeout
Welcome!
- [X] Yes, I've searched similar issues on GitHub and didn't find any.
- [X] Yes, I've searched similar issues on the Traefik community forum and didn't find any.
What version of the Traefik Helm Chart are you using?
v32.0.0
What version of Traefik are you using?
default from the v32.0.0 helm chart
What did you do?
I noticed that Prometheus marks Traefik as down (up == 0) while kube_pod_status_ready{condition="true"} == 0.
I traced it to the Helm chart not setting requestAcceptGraceTimeout for the metrics entrypoint/port (docs: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle). It should be set by default here:
https://github.com/traefik/traefik-helm-chart/blob/7a13fc8a61a6ad30fcec32eec497dab9d8aea686/traefik/values.yaml#L707-L719
The default value for graceTimeout is 10 seconds according to https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle, which means most people don't notice this bug. However, we needed to increase graceTimeout for long-lived connections, so Traefik becomes completely unmonitored for long periods and appears down (up == 0) to our Prometheus.
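Until the chart sets this by default, a possible workaround (an untested sketch; the `1h` value is only illustrative) is to pass the flag through the chart's `additionalArguments`:

```yaml
# Untested values.yaml override: keep the metrics entrypoint accepting new
# connections (i.e. Prometheus scrapes) for up to 1h after shutdown starts,
# to match a long graceTimeout on the traffic entrypoints.
additionalArguments:
  - "--entryPoints.metrics.transport.lifeCycle.requestAcceptGraceTimeout=1h"
```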
What did you see instead?
I saw a bug
What is your environment & configuration?
Traefik Helm chart + Kubernetes + Prometheus + long-lived connections, with graceTimeout set to a long value (hours).
Additional Information
No response
At first glance, this seems more like a configuration enhancement or a warning to display than a real bug, but let's dig into it. Would you please share the values showing the issue you encountered?
Can you confirm whether these statements are true? I might be missing something:
- A) By default in Traefik, graceTimeout is 10 seconds (source)
- B) By default in Traefik's Helm chart, the metrics entrypoint does not set requestAcceptGraceTimeout (source), so it gets the default of 0s (source)
- C) A + B means the default Helm chart causes Traefik to be unmonitorable for 10 seconds while shutting down
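To illustrate A + B, this is roughly the effective entrypoint lifecycle configuration I believe the chart produces (a sketch, not the chart's literal output; addresses abbreviated):

```yaml
# Sketch of the effective static configuration implied by A + B
entryPoints:
  web:
    address: ":8000"
    transport:
      lifeCycle:
        graceTimeOut: 10s   # Traefik default: time given to in-flight requests on shutdown
  metrics:
    address: ":9100"
    # No lifeCycle settings from the chart, so requestAcceptGraceTimeout stays
    # at 0s: the metrics listener stops accepting new connections (Prometheus
    # scrapes) as soon as shutdown begins, even though the pod may keep running
    # for the whole graceTimeOut window while draining requests.
```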
I tried to reproduce this issue, without success:
Steps attempted to reproduce
Deploy k3d
k3d cluster create traefik-hub --port 80:80@loadbalancer --port 443:443@loadbalancer --port 8000:8000@loadbalancer --k3s-arg "--disable=traefik@server:0"
Install traefik with localhost config
helm repo add --force-update traefik https://traefik.github.io/charts
helm install traefik -n traefik --wait --version v32.1.0 --set ingressClass.enabled=false --set ingressRoute.dashboard.enabled=true --set ingressRoute.dashboard.matchRule='Host(`dashboard.docker.localhost`)' --set ingressRoute.dashboard.entryPoints={web} --set ports.web.nodePort=30000 --set ports.websecure.nodePort=30001 traefik/traefik
Port forward to check metrics port availability
kubectl port-forward -n traefik pod/traefik-6f6f7f6bfb-8qfxr 9100:9100
Check metrics availability with curl
# In one terminal window
curl localhost:9100/metrics
# In the other
kubectl rollout restart -n traefik deployment/traefik
I observed that I was able to curl the metrics endpoint during the 10s switchover.
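One possible reason the gap did not show up here: if nothing is in flight, the terminating pod may exit well before graceTimeOut elapses, so there is almost no window in which the metrics listener is closed but the pod still exists. A setup closer to the reported one might be needed; this is an untested sketch and the durations are only illustrative:

```yaml
# Untested values sketch: lengthen the drain window so the gap is observable.
deployment:
  # Keep Kubernetes from SIGKILLing the pod before the drain finishes.
  terminationGracePeriodSeconds: 3700
additionalArguments:
  # Give in-flight requests on the web entrypoint up to 1h to finish.
  - "--entryPoints.web.transport.lifeCycle.graceTimeOut=1h"
```

With a long-lived request held open on `web` during the rollout restart, the old pod should then stay in Terminating for a long time while curls to its metrics port fail.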
Are you able to provide reproducible test instructions for this issue?
I'm closing this issue since there is no answer from @brianbraunstein. Feel free to re-open it, or open a new one if you need to, with steps detailing how to reproduce.