
Progressive traffic increase for new Pods

Open costimuraru opened this issue 5 years ago • 14 comments

We have a JVM-based web app behind Contour/Envoy/NLB, with horizontal pod auto scaling in place. When a new pod gets created due to auto scaling, Contour/Envoy directs a proportional amount of traffic on that new pod. However, because the app is cold, we're seeing consistent timeouts until it warms up.

[Screenshot 2020-02-28 17 36 47]

We tried the same scenario using a Service of type LoadBalancer in EKS (with an Elastic Load Balancer in front) and we don't see the issue there. This seems to be because the ELB does a progressive traffic increase on the new pod, as shown in the graph below.

[Screenshot 2020-02-28 17 34 56]

Is there any plan to support something similar in Contour? I see we have the possibility to set weights for different services in an IngressRoute. Would it be worth considering setting weights at the pod level for a given service, based on their age? (Or is something like this available today?)

costimuraru avatar Feb 28 '20 15:02 costimuraru

Thanks for logging this issue.

This sounds like a time where health checks from Contour or readiness checks from Kubernetes would help.

Kubernetes supports pod readiness checks, and Contour supports endpoint health checks. Either could ensure that traffic does not reach an instance until it has warmed up, as long as your application can somehow indicate that it's ready.

Contour's endpoint health checks are only available in the HTTPProxy object (and the now-deprecated IngressRoute), however. Pod readiness checks are available in any recent version of Kubernetes.
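For reference, a minimal sketch of a Contour active health check, assuming a hypothetical app named `my-app` exposing a `/healthz` endpoint (the name, port, path, and timings are all illustrative):

```yaml
# Sketch: HTTPProxy with an active health check on the backing service.
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: my-app
spec:
  virtualhost:
    fqdn: my-app.example.com
  routes:
    - services:
        - name: my-app
          port: 80
      healthCheckPolicy:
        path: /healthz            # hypothetical health endpoint
        intervalSeconds: 5        # probe every 5s
        timeoutSeconds: 2
        unhealthyThresholdCount: 3
        healthyThresholdCount: 1
```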

youngnick avatar Mar 01 '20 23:03 youngnick

Thanks, @youngnick. This sounds like we'd need to warm up the new pods ourselves. The issue was asking whether this could be handled by Contour/Envoy itself, by progressively increasing traffic to the new pod(s) and hence warming up the instance.

costimuraru avatar Mar 02 '20 13:03 costimuraru

I agree with what @youngnick suggested. You could have your readiness probe call an endpoint which would trigger the app to warm up, but put an initial delay that matches the time your app needs to spin up.
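A sketch of the suggestion above, assuming a hypothetical `/warmup` endpoint and a ~60-second spin-up time (endpoint, port, and timings are assumptions, not part of any real app):

```yaml
# Sketch: readiness probe whose initial delay covers the app's spin-up,
# hitting an endpoint that also triggers the app to warm itself up.
containers:
  - name: my-app
    image: my-app:latest
    readinessProbe:
      httpGet:
        path: /warmup   # hypothetical endpoint that triggers warm-up
        port: 8080
      initialDelaySeconds: 60   # matches the app's spin-up time
      periodSeconds: 5
```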

Additionally, you could look at adding a retry to the requests, so if the request does fail, then it would get retried by Envoy.
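A sketch of the retry suggestion, using HTTPProxy's per-route retry policy (the retry count and per-try timeout here are illustrative):

```yaml
# Sketch: Envoy retries failed requests instead of surfacing them,
# which softens the impact of a cold pod dropping early requests.
routes:
  - services:
      - name: my-app
        port: 80
    retryPolicy:
      count: 3              # retry up to 3 times
      perTryTimeout: 500ms  # budget per attempt
```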

I'm going to close this out, but please re-open if you have further questions on this @costimuraru !

stevesloka avatar Mar 02 '20 18:03 stevesloka

Thanks for the response, @stevesloka

have your readiness probe call an endpoint which would trigger the app to warm up

I think we might not be on the same page regarding the warm-up. The warm-up is not about the application being slow to start or anything like that; it's about the app warming up by processing (real) HTTP requests.

The scenario right now with Contour is:

  1. the app starts on the new pod and is ready to handle requests (this happens quite fast)
  2. Contour/Envoy sends a lot of requests to the new pod
  3. the app, being in a cold state, can't handle this many requests at once and crashes

This problem is well known, and other load balancers have implemented algorithms to mitigate it. For example, see this announcement for the AWS Application Load Balancer: https://aws.amazon.com/about-aws/whats-new/2018/05/application-load-balancer-announces-slow-start-support/

Application Load Balancers now support a slow start mode that allows you to add new targets without overwhelming them with a flood of requests. With the slow start mode, targets warm up before accepting their fair share of requests based on a ramp-up period that you specify.

This issue is asking for exactly this kind of behavior: Contour supporting a slow start mode so that new pods are not overwhelmed with requests.

costimuraru avatar Mar 02 '20 19:03 costimuraru

Hey, @youngnick, @stevesloka,

Any thoughts on the above?

Appreciate the feedback.

costimuraru avatar Mar 10 '20 14:03 costimuraru

Hi @costimuraru, currently, Contour does minimal configuration of Envoy aside from what it's directed to do by Kubernetes objects.

If I understand what you're asking for (having Contour detect new endpoint pods and gradually shift traffic to them), this would be a very large departure from Contour's current model of using Envoy: Contour would have to track the health of every endpoint of the service and gradually adjust each endpoint's weight over a given period.

I will speak to the team about this idea; we will need to double-check whether Envoy has any feature that would make adding this to Contour easier.

youngnick avatar Mar 15 '20 23:03 youngnick

In addition, I think what @stevesloka and I were trying to suggest earlier is having the readiness check do some common requests to the app itself to warm the caches before marking the pod as ready for traffic.

youngnick avatar Mar 15 '20 23:03 youngnick

Thanks for the detailed answer, @youngnick!

In addition, I think what @stevesloka and I were trying to suggest earlier is having the readiness check do some common requests to the app itself to warm the caches before marking the pod as ready for traffic.

We tried this, but the number of requests is just too low to do any real warming (we're trying to warm up from 0 to ~4000 requests per second for each pod). We also tried adding a PostStart lifecycle hook on the Pod, where we run an HTTP generator process that sends requests to the app (via localhost), but this is also problematic. The warm-up takes quite a bit of time (e.g. ~2 minutes), during which the Pod is not receiving any external traffic. Even if we add tens of pods during a spike, we are not able to process the extra requests until this warm-up period finishes (so we're back in the VM world, where it takes minutes to spin up a new machine). It's also quite hard to generate requests that map to real-life use cases, as these get updated frequently. All in all, these warm-up workarounds add quite a lot of work and don't yield the best results.
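For completeness, the PostStart workaround described above could be sketched as below; the load-generator image and command are entirely hypothetical:

```yaml
# Sketch: run a load generator against localhost before the pod is
# considered started, so the app warms up on synthetic traffic.
containers:
  - name: my-app
    image: my-app:latest
    lifecycle:
      postStart:
        exec:
          command:
            - /bin/sh
            - -c
            # hypothetical warm-up client; ~2 minutes of local traffic
            - warmup-client --target http://localhost:8080 --duration 120s
```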

costimuraru avatar Mar 16 '20 14:03 costimuraru

@costimuraru - this is more an Envoy issue in my mind (Contour could leverage that feature of course, once implemented in Envoy). Have you considered filing the issue in the Envoy project instead?

lrouquette avatar Apr 30 '20 19:04 lrouquette

Thanks, @lrouquette. Created the issue in Envoy: https://github.com/envoyproxy/envoy/issues/11050

costimuraru avatar May 04 '20 21:05 costimuraru

This is available in Envoy now so Contour could adopt the feature!

From slack convo:

We'd just need to plan out the API for it. Probably we'd add a slow-start configuration to the services struct: https://github.com/projectcontour/contour/blob/main/apis/projectcontour/v1/httpproxy.go#L627
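A sketch of how such a per-service setting could look in HTTPProxy, mirroring Envoy's slow start knobs (ramp-up window, aggression curve, minimum weight); the field names here are illustrative of the proposed API, so check the released Contour docs for the actual shape:

```yaml
# Sketch: a slow-start policy on a route's service, so a newly healthy
# endpoint ramps from a small weight up to its fair share over a window.
routes:
  - services:
      - name: my-app
        port: 80
        slowStartPolicy:
          window: 3s           # ramp-up period after the endpoint becomes healthy
          aggression: "1.0"    # 1.0 = linear ramp; >1.0 ramps up faster
          minWeightPercent: 10 # floor so the new endpoint gets some traffic
```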

stevesloka avatar Dec 13 '21 18:12 stevesloka

cc @CrossingTheRiverPeole

skriss avatar Dec 16 '21 16:12 skriss

Added the help wanted label here if anyone is interested in picking up this issue!

skriss avatar Dec 16 '21 16:12 skriss

It would be very useful for us to have support for this new Envoy feature in Contour.

costimuraru avatar Dec 20 '21 14:12 costimuraru

Thanks a lot for this !!

tailrecur avatar Oct 17 '22 14:10 tailrecur

@skriss If I understand the Compatibility matrix correctly, this means that this change would get rolled in the next major release (1.23.0 ??) and the minimum supported K8s version for this release will be 1.23. Is this correct?

tailrecur avatar Oct 18 '22 19:10 tailrecur

Yes, that is correct.

sunjayBhatia avatar Oct 18 '22 19:10 sunjayBhatia