traefik-helm-chart

Traefik stops being monitorable during graceTimeout

Open brianbraunstein opened this issue 1 year ago • 3 comments

Welcome!

  • [X] Yes, I've searched similar issues on GitHub and didn't find any.
  • [X] Yes, I've searched similar issues on the Traefik community forum and didn't find any.

What version of the Traefik's Helm Chart are you using?

v32.0.0

What version of Traefik are you using?

default from the v32.0.0 helm chart

What did you do?

I noticed that Prometheus reports up == 0 for Traefik while kube_pod_status_ready{condition="true"} == 0, that is, while the pod is shutting down.

I traced it to the Helm chart not setting requestAcceptGraceTimeout for the metrics entrypoint/port (docs: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle). I think it should be set by default here: https://github.com/traefik/traefik-helm-chart/blob/7a13fc8a61a6ad30fcec32eec497dab9d8aea686/traefik/values.yaml#L707-L719

The default value for graceTimeout is 10 seconds according to https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle, which means most people don't notice this bug. However, we needed to increase graceTimeout for long-lived connections. As a result, Traefik goes completely unmonitored for long periods and appears down (up == 0) to our Prometheus.
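For illustration, here is a minimal sketch of the kind of override that could be tried. It has not been verified against this chart version. It assumes the chart's default entrypoint names (web and metrics), passes the Traefik lifecycle options through the chart's additionalArguments value, and uses a hypothetical file name and a placeholder duration of 2h.

```shell
# Hypothetical values override: the file name and the 2h durations are
# illustrative assumptions, not values taken from the report above.
cat > metrics-lifecycle-values.yaml <<'EOF'
additionalArguments:
  # Give in-flight requests on "web" a long time to finish during shutdown.
  - "--entryPoints.web.transport.lifeCycle.graceTimeOut=2h"
  # Keep "metrics" accepting new requests (scrapes) for the same duration.
  - "--entryPoints.metrics.transport.lifeCycle.requestAcceptGraceTimeout=2h"
EOF

# Apply it to an existing release (release and namespace names are assumptions):
# helm upgrade traefik traefik/traefik -n traefik -f metrics-lifecycle-values.yaml
```

The pod's termination grace period would also need to be at least as long as these durations, otherwise Kubernetes kills the container before they elapse.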

What did you see instead?

I saw a bug

What is your environment & configuration?

Traefik helm chart + kube + prometheus + long lived connections with graceTimeout set to a long value (hours).

Additional Information

No response

brianbraunstein avatar Sep 27 '24 21:09 brianbraunstein

At first glance, this looks more like a configuration enhancement, or a warning we should display, than a real bug, but let's dig into it. Could you please share the values that trigger the issue you encountered?

mloiseleur avatar Oct 04 '24 09:10 mloiseleur

Can you confirm whether these statements are true? I might be missing something:

  • A) By default in traefik, graceTimeout is 10 seconds source
  • B) By default in traefik's helm chart, the metrics endpoint does not set requestAcceptGraceTimeout source and so gets the default of 0s source
  • C) A + B means the default helm chart causes traefik to be unmonitorable for 10 seconds while shutting down

brianbraunstein avatar Oct 04 '24 10:10 brianbraunstein

I tried to reproduce this issue, without success:

steps attempted to reproduce

Deploy k3d

k3d cluster create traefik-hub --port 80:80@loadbalancer --port 443:443@loadbalancer --port 8000:8000@loadbalancer --k3s-arg "--disable=traefik@server:0"

Install traefik with localhost config

helm repo add --force-update traefik https://traefik.github.io/charts
helm install traefik -n traefik --wait \
  --version v32.1.0 \
  --set ingressClass.enabled=false \
  --set ingressRoute.dashboard.enabled=true \
  --set ingressRoute.dashboard.matchRule='Host(`dashboard.docker.localhost`)' \
  --set ingressRoute.dashboard.entryPoints={web} \
  --set ports.web.nodePort=30000 \
  --set ports.websecure.nodePort=30001 \
  traefik/traefik

Port forward to check metrics port availability

kubectl port-forward -n traefik pod/traefik-6f6f7f6bfb-8qfxr 9100:9100

Check metrics availability with curl

# In one terminal window
curl localhost:9100/metrics
# In the other
kubectl rollout restart -n traefik deployment/traefik

I observed I was able to curl the metrics endpoint during the 10s switch.
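A single curl can miss a short window, so a polling loop may make the check easier to repeat. The sketch below is not part of the steps above. The function name probe_metrics and its arguments are hypothetical, and it only assumes that curl is installed and that the port-forward from the previous step is still running.

```shell
# Poll a metrics URL and log, per attempt, whether it answered.
# Usage: probe_metrics URL ATTEMPTS [INTERVAL_SECONDS]
probe_metrics() {
  url="$1"
  attempts="$2"
  interval="${3:-1}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP errors; --max-time bounds each try.
    if curl --silent --fail --max-time 2 --output /dev/null "$url"; then
      echo "$(date +%T) up"
    else
      echo "$(date +%T) DOWN"
    fi
    i=$((i + 1))
    sleep "$interval"
  done
}

# Example: probe once per second for a minute while the rollout restarts.
# probe_metrics http://localhost:9100/metrics 60
```

Any DOWN lines logged while the old pod is terminating would mark the period in which a scrape would fail.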

Are you able to provide reproducible test instructions for this issue?

mloiseleur avatar Oct 11 '24 09:10 mloiseleur

I'm closing this issue since there has been no answer from @brianbraunstein. Feel free to re-open it, or open a new one if needed, with steps detailing how to reproduce.

mloiseleur avatar Nov 26 '24 16:11 mloiseleur