
Getting 503 first time a workload is woken up

Open jturpin82 opened this issue 3 years ago • 23 comments

Thank you for that beautiful project. It's very useful to me!

I'm using a managed Kubernetes, GKE 1.21.5, traefik 2.5.6 (installed with helm chart) and using many workloads in the default namespace.

Using traefik-ondemand-plugin 1.2.0-beta.3 along with traefik-ondemand-service 1.7, it's quite straightforward to scale some small workloads (like nginx) up and down (to zero).

But a problem appears when trying to wake up bigger workloads (images around 400-500 MB). I can see the workload (pod) wake up and become ready, but Traefik ends up getting a 503 from the backend service (presumably because this is the expected behavior when using providers.kubernetesingress.allowEmptyServices=true). If I immediately refresh the page (or use a Traefik plugin to handle the 503, like errorpages), I can access my web page.

But I would really like to avoid the 503, since I also wake the workload through API calls and not only through "web" calls.

My assumption is the following: When the workload is waking up, then an endpoint is generated and should finally become available to the kubernetes service. But in this case, Traefik is trying to access the kubernetes service while the endpoint is not totally ready (could be ready after some milliseconds). It's when the endpoint is "empty" on the following screenshot:

[screenshot: image004[51]]

Traefik is getting the 503 when the endpoint is still at the empty stage.

Looking at the Go code here, it seems that the service is considered up as soon as the deployment is ready. Could we maybe consider checking the endpoint status? Or the service? I don't know whether it would help or whether it's really the root cause here...

Traefik is configured with the following options:

  • --experimental.plugins.traefik-ondemand-plugin.modulename=github.com/acouvreur/traefik-ondemand-plugin
  • --experimental.plugins.traefik-ondemand-plugin.version=v1.2.0-beta.3
  • --providers.kubernetesingress.allowEmptyServices=true

Here is how I annotated ingresses: traefik.ingress.kubernetes.io/router.middlewares: default-ondemand-kfnqt8n476wgq28@kubernetescrd

Basically, the traefik-ondemand-service is configured as described here: KUBERNETES.md

Tell me if you need me to provide more conf files.

And also, thank you for your help!

--

jturpin82 avatar Jan 22 '22 22:01 jturpin82

I had the same behavior on a Docker Swarm setup. So it might be related to how quickly the service is determined to be available/healthy.

acouvreur avatar Jan 22 '22 23:01 acouvreur

From the documentation https://doc.traefik.io/traefik/getting-started/faq/#502-bad-gateway

502 Bad Gateway

Traefik returns a 502 response code when an error happens while contacting the upstream service.

503 Service Unavailable

Traefik returns a 503 response code when a Router has been matched but there are no servers ready to handle the request.

This situation is encountered when a service has been explicitly configured without servers, or when a service has healthcheck enabled and all servers are unhealthy.

I think I got a "Bad Gateway" response in my case.

acouvreur avatar Jan 23 '22 12:01 acouvreur

Maybe the 503 could be a consequence of allowEmptyServices, according to the doc?

jturpin82 avatar Jan 23 '22 14:01 jturpin82

Thanks @acouvreur I will test that on Kubernetes!

jturpin82 avatar Apr 22 '22 07:04 jturpin82

My change is only related to Swarm. I tried to get some metadata that could help me consider a service healthy for more than 5 seconds, but couldn't.

If you do, please share it with me as I'll fix it right away.

acouvreur avatar Apr 22 '22 07:04 acouvreur

Could the age (now - creationTimestamp) of the Endpoint be considered?

Example below with a simple nginx service:

  kubectl create deploy nginx --image nginx
  kubectl expose deploy nginx --port 80
  kubectl get endpoints nginx -o=jsonpath='{.metadata.creationTimestamp}'

2022-04-22T07:43:32

jturpin82 avatar Apr 22 '22 07:04 jturpin82

The creationTimestamp can be considered when there is no healthcheck. But when there's a healthcheck, is the creationTimestamp set to the time of the first successful health check?

acouvreur avatar Apr 22 '22 07:04 acouvreur

Yes, you're right. creationTimestamp cannot be considered a reliable indicator of the service's full availability.

I think the only way to be sure the service is healthy is to check whether the endpoint is bound to at least one IP, like the following:

kubectl get endpoints nginx -o=jsonpath='{.subsets[*].addresses[*].ip}'

What do you think? Is this something that can be checked on your side? Thank you for your hard work on this plugin!
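For illustration, the same check can be sketched in Go. This is only a minimal sketch, not the plugin's actual code: it assumes the Endpoints object is available as JSON (for example from kubectl get endpoints nginx -o json), and the type endpoints and the function readyIPs are hypothetical names. It reads the same path as the jsonpath above, subsets[*].addresses[*].ip, and ignores notReadyAddresses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpoints mirrors only the fields of a Kubernetes Endpoints object
// that this check needs. Hypothetical helper type, not from the plugin.
type endpoints struct {
	Subsets []struct {
		Addresses []struct {
			IP string `json:"ip"`
		} `json:"addresses"`
	} `json:"subsets"`
}

// readyIPs returns the IPs found under subsets[*].addresses[*].ip.
// Pods that are not ready are listed under notReadyAddresses and are
// therefore not returned.
func readyIPs(raw []byte) ([]string, error) {
	var ep endpoints
	if err := json.Unmarshal(raw, &ep); err != nil {
		return nil, err
	}
	var ips []string
	for _, s := range ep.Subsets {
		for _, a := range s.Addresses {
			if a.IP != "" {
				ips = append(ips, a.IP)
			}
		}
	}
	return ips, nil
}

func main() {
	// Sample payloads (made up for this sketch): an Endpoints object
	// with no subsets yet, and one bound to a single pod IP.
	empty := []byte(`{"metadata":{"name":"nginx"}}`)
	bound := []byte(`{"metadata":{"name":"nginx"},"subsets":[{"addresses":[{"ip":"10.0.0.12"}]}]}`)

	for _, raw := range [][]byte{empty, bound} {
		ips, err := readyIPs(raw)
		if err != nil {
			panic(err)
		}
		fmt.Println("has at least one IP:", len(ips) > 0)
	}
}
```

With this approach the service would be treated as available only when readyIPs returns at least one address.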

jturpin82 avatar Apr 22 '22 08:04 jturpin82

That might be a good solution, I'll look into it

acouvreur avatar Apr 23 '22 11:04 acouvreur

Thank you!

jturpin82 avatar Apr 24 '22 06:04 jturpin82

I'm having an issue implementing it, endpoints can be created with a different name than the service/deployment right? So if you have any lead on this... Right now I can't find a good solution for it.

acouvreur avatar Apr 29 '22 08:04 acouvreur

The endpoint associated with a Service must always have the same name as the Service. Kubernetes will automatically create an Endpoints object with the same name as the Service.

So maybe, if we want to consider a service fully available, we could look at the IPs from the endpoint with the same name. Does that make sense?

jturpin82 avatar Apr 29 '22 08:04 jturpin82

I'll create this feature as an experimental flag

acouvreur avatar Apr 29 '22 12:04 acouvreur

You can try it out on #29

Feedback welcome.

acouvreur avatar May 07 '22 23:05 acouvreur

Sure. Let me test and get back to you!

jturpin82 avatar May 10 '22 21:05 jturpin82

Could you please add a complementary rule to the RBAC example, as below?:

  - apiGroups:
      - ""
    resources:
      - endpoints
    verbs:
      - get

Otherwise users will get an error like: level=error msg="endpoints "XXXXXXXXX" is forbidden: User "system:serviceaccount:default:traefik-ondemand-service" cannot get resource "endpoints" in API group "" in the namespace "default""

jturpin82 avatar May 11 '22 13:05 jturpin82

That makes sense. Does adding it resolve the issue?

acouvreur avatar May 11 '22 13:05 acouvreur

Just deployed on the test env. I'll wait a few more days to see whether this is fully working. Getting back to you soon...

jturpin82 avatar May 11 '22 13:05 jturpin82

Unfortunately, I'm still getting some 503s the first time. Let me gather some evidence.

jturpin82 avatar May 16 '22 20:05 jturpin82

Image used: ghcr.io/acouvreur/traefik-ondemand-service:fix-wait-for-k8s-endpoint-to-have-one-ip

jturpin82 avatar May 16 '22 20:05 jturpin82

I'm also getting the same error with ghcr.io/acouvreur/traefik-ondemand-service:fix-wait-for-k8s-endpoint-to-have-one-ip

tomaszduda23 avatar May 23 '22 18:05 tomaszduda23

I'll take a look in a few days

acouvreur avatar May 23 '22 18:05 acouvreur

See https://github.com/acouvreur/sablier/issues/62

acouvreur avatar Oct 13 '22 19:10 acouvreur