
Peak EWMA load balancing

Open mbyio opened this issue 2 years ago • 33 comments

Title: I would like envoy to support peak EWMA load balancing, in addition to the existing load balancing algorithms.

Description:

Please correct me if this is already supported. :)

We often want least request/least loaded load balancing, which is supported by envoy. But I don't believe envoy has a way to provide load balancing that takes into account the latency of the targets. Peak EWMA load balancing is one algorithm that automatically prefers sending requests to targets with lower latency. This can sometimes reduce p99 latency without requiring any other changes, and it has other benefits, such as automatically preferring to send requests to targets closer to the source (because network round-trip time will be lower so latency will be lower).

In my specific use case, I have an application distributed across zones, and sometimes traffic has to pass through a load balancer to pass from one zone to another, and sometimes not. So the various latencies between points are all over the place. I'd like the load balancer to handle this for me automatically.
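For context, here is a rough sketch of the per-endpoint cost that the linkerd and ingress-nginx implementations linked below compute (names and structure here are illustrative, not Envoy code): the smoothed latency decays toward new samples over time, but a sample above the current estimate takes over immediately, which is what makes it "peak" EWMA.

```cpp
#include <chrono>
#include <cmath>

// Illustrative peak-EWMA latency cost for one endpoint (not Envoy code;
// loosely modeled on the linkerd / ingress-nginx approach linked below).
struct PeakEwma {
  using Clock = std::chrono::steady_clock;

  double cost_ms{0.0};                            // smoothed latency estimate
  Clock::time_point last_update{Clock::now()};
  std::chrono::milliseconds decay_window{10000};  // how quickly old samples fade

  void observe(double rtt_ms) {
    const auto now = Clock::now();
    const double elapsed_ms =
        std::chrono::duration<double, std::milli>(now - last_update).count();
    last_update = now;

    if (rtt_ms > cost_ms) {
      // "Peak" behavior: a slow response is taken at face value immediately.
      cost_ms = rtt_ms;
    } else {
      // Otherwise decay the old estimate toward the new sample over time.
      const double w =
          std::exp(-elapsed_ms / static_cast<double>(decay_window.count()));
      cost_ms = cost_ms * w + rtt_ms * (1.0 - w);
    }
  }
};
```

The load balancer then prefers endpoints with a lower cost, typically via a power-of-two-choices pick rather than a full sort.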

[optional Relevant Links:]

https://linkerd.io/2016/03/16/beyond-round-robin-load-balancing-for-latency/ https://servd.host/blog/intelligent-load-balancing https://faun.pub/adaptive-load-balancing-algorithm-and-implementation-6f13ccb61bea https://github.com/kubernetes/ingress-nginx/blob/main/rootfs/etc/nginx/lua/balancer/ewma.lua

mbyio avatar Apr 20 '22 18:04 mbyio

This has come up before. Would be happy to see this implemented. This will be a little bit trickier than a simple load balancer, as there will need to be feedback from other parts of the system to drive it, so it might need to be a load balancer and filter pair or something like that.

mattklein123 avatar Apr 20 '22 23:04 mattklein123

This will be a little bit trickier than a simple load balancer, as there will need to be feedback from other parts of the system to drive it, so it might need to be a load balancer and filter pair or something like that.

Is that because it would need information about the latency?

mbyio avatar Apr 20 '22 23:04 mbyio

This will be a little bit trickier than a simple load balancer, as there will need to be feedback from other parts of the system to drive it, so it might need to be a load balancer and filter pair or something like that.

Is that because it would need information about the latency?

I think so, since it needs the response times for each worker, computes a moving average, and then uses that value as an inverse weighting when deciding which instance to send traffic to. I am not sure whether we can get it directly from stats (metrics), or whether we would need to record values on each thread and dynamically calculate them to get the value used for dispatch.

daixiang0 avatar Apr 21 '22 01:04 daixiang0

Yes, we will need to keep rolling per-host latencies. This is certainly possible but is not a beginner item. Once we have the latencies, the rest is fairly straightforward.

Note also that there is an open issue on latency-based outlier detection, so the same latency tracking system would feed that as well.

mattklein123 avatar Apr 21 '22 04:04 mattklein123

I see, thank you for the information! I was hoping this would be something I could take on, but this is sounding like it might be too difficult, given that it involves multiple systems.

mbyio avatar Apr 21 '22 04:04 mbyio

Yes, we will need to keep rolling per-host latencies. This is certainly possible but is not a beginner item. Once we have the latencies, the rest is fairly straightforward.

Note also that there is an open issue on latency-based outlier detection, so the same latency tracking system would feed that as well.

I had seen temporal latency mentioned in the outlier detection docs, but had yet to see corresponding code. Guess I know why now. I know we're hoping to use latency in conjunction with our custom consistent hashing load balancing policy by marking latency based outliers as degraded. So I'll definitely be keeping an eye on this.

jtway avatar Apr 21 '22 11:04 jtway

Here is the issue tracking latency-based outlier detection: https://github.com/envoyproxy/envoy/issues/288.

All of the pieces are there in the code base to do this at this point, including built-in histograms. They just need to be put together in a way that makes sense. Potentially the best way to do this is to directly add the ability for the outlier detection system to track host latency on a per-host basis (configurable for perf reasons). Latency would be fed in by the router. At that point we could use it for direct outlier detection and also build an LB that consults this data.
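To make that shape a bit more concrete, here is a purely hypothetical sketch of the wiring described above (none of these interfaces exist in Envoy today; the names are made up): the router reports per-host latency into a tracker, and both outlier detection and a latency-aware load balancer consult it.

```cpp
#include <chrono>

// Hypothetical interface only -- sketching "router feeds latency in,
// outlier detection stores it per host, LB reads it back out".
class HostLatencyTracker {
public:
  virtual ~HostLatencyTracker() = default;

  // Called by the router when a response (or timeout) for this host completes.
  virtual void recordLatency(std::chrono::milliseconds rtt, bool success) = 0;

  // Consulted by outlier detection and by a latency-aware load balancer.
  virtual double smoothedLatencyMs() const = 0;
  virtual double errorRate() const = 0;
};
```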

mattklein123 avatar Apr 21 '22 14:04 mattklein123

Maybe there is a point to pay attention to:

When implementing latency-based load balancing (or PeakEWMA), a server may return an error quickly (for example, because its connection to MySQL or Redis has dropped), resulting in a fast response time even though it is actually the worst choice.

Therefore, it is worth considering whether the failure rate (5xx) should also be used as a factor in the calculation.

jizhuozhi avatar Apr 02 '23 18:04 jizhuozhi

Hi

Is this currently under development? We have large services with slow pods and fast pods: the slower pods are deployed on slow or busy instances, and the fast pods are deployed on new-generation servers.

We migrated from Finagle, which used Power of Two Choices (P2C) + Peak EWMA, to running with Envoy using Istio. Because of how Envoy performs the load balancing (without Peak EWMA), since the migration we've had a split in CPU and latency between the slow and the fast pods.

We're considering moving back to Finagle if we can't find any other solution. Even a workaround using an EnvoyFilter that we manage with an operator might be fine for us, but we couldn't make it work.

liorfranko avatar Jan 08 '24 21:01 liorfranko

Hello, I want to try to implement the PeakEWMA algorithm in envoy. I have implemented an improved PeakEWMA in mosn (https://github.com/mosn/mosn/issues/2252) and it performs well.

But my main language is not C++, so this may take a long time. Thanks.

jizhuozhi avatar Mar 15 '24 07:03 jizhuozhi

I'm all for introducing a more sophisticated load balancing technique, but we should really understand why what we have today does not address your specific use case. Before getting into the weeds of the implementation in https://github.com/envoyproxy/envoy/pull/32942, can someone shed light on why the current LEAST_REQUEST load balancer doesn't work for you? Why is the active request count not a good signal, and why is EWMA preferred?

I looked at the Linkerd article, but there is no investigation or discussion of why EWMA performs better than the "least loaded" algorithm. All it does is show a graph, and it's unclear what the selection mechanism is (P2C, etc.) once they derive the endpoint weights, or how they derive the endpoint weights in the first place. The servd article gives even less information.

This matters because Envoy has things like the active_request_bias in the LeastRequestLbConfig, which affects the behavior of the load balancer in ways that impact observed latencies. Are we sure that this knob won't give you the desired behavior and that it performs worse than the proposed EWMA approach?
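For reference, the effective weight LEAST_REQUEST uses when host weights are unequal is roughly the following (a paraphrase of the formula in the Envoy docs, not the exact implementation):

```cpp
#include <cmath>
#include <cstdint>

// Paraphrase of the documented LEAST_REQUEST weight scaling: a larger
// active_request_bias makes outstanding requests discount a host's
// configured weight more aggressively.
double effectiveWeight(double load_balancing_weight, uint64_t active_requests,
                       double active_request_bias) {
  return load_balancing_weight /
         std::pow(static_cast<double>(active_requests + 1), active_request_bias);
}
```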

I'd prefer a more rigorous discussion about the merits of EWMA, the implications of relying on observed latencies, and the behavior of the load balancer under some common overload scenarios. If there's a more in-depth blog post or paper you can drop here, that would be great too.

tonya11en avatar Mar 19 '24 18:03 tonya11en

We use Envoy as a sidecar running on Kubernetes with the Istio service mesh. We have services with ~500 pods sending requests to other services with ~500 pods. The environment is a low-latency environment. Istio sets the weight of each endpoint to 1.

We run on varied hardware, and some endpoints are slower than others. We noticed that the Envoy LEAST_REQUEST algorithm doesn't work well when some pods are slower than others. The reason is that when the source pod chooses a destination pod, the active_request queues are mostly empty or at a size of 1 or 2, so the algorithm effectively works as round-robin and splits the requests evenly across all pods.

We see that all the pods in the service run at the same RPS but at very different response times.

We migrated those services from running outside Kubernetes with Finagle + PeakEWMA, and after the migration we noticed this performance degradation.

liorfranko avatar Mar 26 '24 08:03 liorfranko

The reason is that when the source pod chooses a destination pod, the active_request queues are mostly empty or at a size of 1 or 2, so the algorithm effectively works as round-robin and splits the requests evenly across all pods.

How exactly are you measuring this? Is it expected that you have no in-flight requests?

If the active request counts are ever non-zero, there should be a difference in the effective weights of those endpoints. If not, then this is a deviation from expected behavior that we need to understand.

We see that all the pods in the service run at the same RPS but at very different response times.

This is surprising to me. I would expect this to be reflected in the number of outstanding requests to each pod, since the average size of the [effective] active request queues would be a function of the RPS and request latencies. Can you share more details? Specifically, the actual RPS and latency numbers. It would be great to also see the exact distribution of RPS across your pods, so we can better understand this situation.

tonya11en avatar Mar 27 '24 04:03 tonya11en

How exactly are you measuring this? Is it expected that you have no in-flight requests?

Yes, we're running low-latency services (up to 30ms response time).

Here is a graph showing the RPS per pod, about ~135. Query: sum(irate(istio_requests_total{reporter="destination",destination_canonical_service="$workload"})) by (pod)
[Screenshot: RPS per pod]

Here is a graph showing the response time per pod. Query: histogram_quantile(0.5, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination",destination_canonical_service="$workload"}[1m])) by (le,pod))
[Screenshot: p50 response time per pod]

When looking at Envoy:

istio-proxy@******:/$ curl http://localhost:15000/clusters | grep ******.svc.cluster.local | grep 8083 | grep -E "cx_active|weight"
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1
outbound|8083||******.svc.cluster.local::******:8083::cx_active::2
outbound|8083||******.svc.cluster.local::******:8083::weight::1

liorfranko avatar Mar 27 '24 16:03 liorfranko

This is great. Thanks for the extra information.

So, the cx_active stats are showing us the number of active connections. Envoy's LEAST_REQUEST load balancer normalizes endpoint weights by the number of active requests, so we need to understand what those values are if we want to get to the bottom of this. Unfortunately, we're only going to have visibility at the level of an Envoy cluster- the only stat exposed is upstream_rq_active which is per-cluster.

We can try to make an educated guess via a hand-wavy application of Little's law, but that only applies to averages, not the p50s you've provided. However, the RPS to each pod looks identical and I'll assume the p50 is close enough to the average to hazard a guess.

You've got p50 latencies ranging from ~20ms to ~30ms. So:

130 rq/s * 0.02s (latency) ~= 3 rq
130 rq/s * 0.03s (latency) ~= 4.5 rq

These active request numbers are close enough that I would feel comfortable claiming it's "balanced". Also, it would be better to look at the tail latencies (p95/p90) instead of the medians, since those are the ones affected most by imbalanced load.

Let me know your thoughts.

tonya11en avatar Mar 27 '24 17:03 tonya11en

Thanks for the detailed explanation! All the above makes perfect sense, and when Envoy balances according to the current upstream_rq_active, I agree that it's perfectly balanced.

Here is the P50 of a specific service:
[Screenshot: p50 latency per pod]
And here is the P95 of the same service at the same time:
[Screenshot: p95 latency per pod]

The fastest pod runs at a flat 40ms response time, while the slowest pod runs at +180ms

We think that by applying different weights based on historical response time, the spread of the tail latencies should decrease.

liorfranko avatar Apr 01 '24 16:04 liorfranko

Thanks for the extra graphs, that helps in seeing the differences between backends.

We think that by applying different weights based on historical response time, the spread of the tail latencies should decrease.

My understanding of this approach is that it assumes that sending excess load to a backend causes its latency to increase, but I'm unsure about how you'd know when to send less load. We'd need to know what an "unloaded" response time would be for a backend to derive the new weight based on a latency measurement; otherwise we wouldn't know what to do with the information from latency measurements. How would that work, exactly, and how does it compare to the current approach?

tonya11en avatar Apr 01 '24 19:04 tonya11en

I think that success latency + error rate can provide a good measurement for weight.

liorfranko avatar Apr 02 '24 06:04 liorfranko

I think that success latency + error rate can provide a good measurement for weight.

I'm not opposed to the idea, we just need to be more specific on what exactly this means. How does it look in practice? The level of detail I'm looking for is something like a formula for an endpoint's weight based on success latency and error rate.

We should also spell out how exactly one modifies a weight based on some observed latency, because it's not obvious from the previous comments.

tonya11en avatar Apr 02 '24 20:04 tonya11en

I'm not quite sure how to provide the formula. Maybe @jizhuozhi can share his experience as he did for mosn https://github.com/mosn/mosn/issues/2252?

liorfranko avatar Apr 04 '24 06:04 liorfranko

Hello @tonya11en and @liorfranko, my original implementation (https://github.com/mosn/mosn/pull/2253) just uses PeakEWMA as linkerd and kubernetes-ingress do, without an "unloaded" condition, and it was optimized in https://github.com/mosn/mosn/issues/2295 (with a formula proof).

In mosn, we considered both 4xx and 5xx errors with different biases, but this is not verified. In the production environment we found that for 4xx errors the weight adjustment ends up consistent across all servers after load balancing, so there is no need to track a separate 4xx error rate (rate limiting may need a circuit breaker to solve), so I think just the 5xx error rate is enough in envoy.

jizhuozhi avatar Apr 04 '24 17:04 jizhuozhi

@jizhuozhi can you just briefly describe how all of this works here? I don't know how mosn works and would appreciate it if you could clearly describe how this new load balancing algorithm you're proposing would work for Envoy. We won't accept a change like this without associated documentation, so you'd have to write up an explanation anyway.

I can't seem to get a clear answer from anyone on what exactly you have in mind for how this would work. For example, let's assume you have historical latency information for all endpoints in the load balancer's host set- how do you derive the endpoint weights?

Also, can you speak to why the current LEAST_REQUEST load balancer (LRLB) is insufficient here? It uses the number of in-flight requests, which is a function of request latency and request rate, so it's unclear why this doesn't work for your use case.

tonya11en avatar Apr 04 '24 20:04 tonya11en

Hello @tonya11en , I'm sorry that I misunderstood what you meant. Let me fully explain the work I did.

First, the LEAST_REQUEST load balancer (LRLB) effectively estimates the expected completion time of all outstanding work based on the number of active requests. It assumes that all servers have the same processing time, but in fact different servers have different processing times (limited by the processor model, NUMA layout, network distance, etc.). Therefore, what PeakEWMA actually does is make that expectation more accurate by recording historical response times, and it reduces the weight of older data on decision-making via a decaying average (EWMA).

Secondly, as we discussed, "unexpected unloaded" behavior can make any intelligent load balancing strategy make wrong decisions, which is really another topic: the service circuit breaker. However, a service circuit breaker has a cost, and before it kicks in we still have to face the impact of wrong decisions. Therefore, in mosn and in the current implementation, we use a decaying failure rate to correct the "error delay" caused by this kind of load reduction (in mosn, we also support configuring a bias to enhance error perception).

Finally, we did not use weighted load balancing in the traditional sense, but rather the randomized method (P2C), where the weight is just a factor used to compute weighted active requests. The intention is the same as mentioned above: we want to avoid errors caused by overloading a particular instance. In P2C, the probability of each instance being selected is not proportional to its weight within the set, but depends on the position of its score in the set: the better the position, the greater the probability, and instances in the same position have the same probability (I am not a math major, so forgive me for not being able to give a specific probability distribution).
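To make the selection step concrete, here is a minimal sketch of the P2C pick described above (illustrative only, not mosn or Envoy code): the peak-EWMA cost scales the outstanding-request count into a score, and the endpoint with the lower score wins.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Illustrative P2C selection using a latency-scaled load score.
struct Endpoint {
  double peak_ewma_ms;   // smoothed latency (see the earlier PeakEwma sketch)
  uint64_t outstanding;  // in-flight requests from this client
};

// Lower is better: latency acts as a multiplier on the queue depth.
double score(const Endpoint& e) {
  return e.peak_ewma_ms * static_cast<double>(e.outstanding + 1);
}

// Pick two endpoints at random and keep the one with the lower score.
// Assumes a non-empty host list.
size_t pickP2C(const std::vector<Endpoint>& hosts, std::mt19937& rng) {
  std::uniform_int_distribution<size_t> dist(0, hosts.size() - 1);
  const size_t a = dist(rng);
  const size_t b = dist(rng);
  return score(hosts[a]) <= score(hosts[b]) ? a : b;
}
```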

jizhuozhi avatar Apr 05 '24 03:04 jizhuozhi

First, the LEAST_REQUEST load balancer (LRLB) effectively estimates the expected completion time of all outstanding work based on the number of active requests. It assumes that all servers have the same processing time, but in fact different servers have different processing times (limited by the processor model, NUMA layout, network distance, etc.).

This isn't quite right- take a look at the LRLB's documented behavior. If all endpoint weights are identical, we use a P2C selection based on the number of active requests. If endpoint weights are heterogeneous, it's only then that we scale the weights based on the number of active requests as described in the formula in the docs (we also have a bias parameter).

The LRLB doesn't exactly calculate an expected completion time of anything or assume that the nodes are homogeneous. You might be mistaking this for the virtual time used in the EDF scheduler, which is used to decide which node to select next in the schedule. If a node cannot keep up with the amount of traffic being sent to it (high processing times), the average number of in-flight requests is expected to increase, which would result in a decrease in traffic sent to that node. Assuming the global requests per second is unchanging, average request latency and in-flight requests (concurrency) should be proportional. This is what we'd expect based on Little's Law.

I'd recommend taking a look at our blog post on how some of the LB algorithms work in Envoy.

Secondly, as we discussed, "unexpected unloaded" behavior can make any intelligent load balancing strategy make wrong decisions, which is really another topic: the service circuit breaker. However, a service circuit breaker has a cost, and before it kicks in we still have to face the impact of wrong decisions. Therefore, in mosn and in the current implementation, we use a decaying failure rate to correct the "error delay" caused by this kind of load reduction (in mosn, we also support configuring a bias to enhance error perception).

I'm not sure what you mean by "unexpected unloaded" or service circuit breakers here. Are you referring to the same circuit breakers you can configure in the Envoy cluster config? We already have mechanisms for shedding load in the presence of errors, such as outlier detection and admission control.

Can you be explicit about what you mean when you refer to:

  • "unexpected unloaded"
  • service circuit breaker
  • error delay

I'm also a bit confused about whether you want to use endpoint latencies here or error rates. It doesn't appear like you and @liorfranko are proposing the same thing, so I'd like to request 2 things from you:

  1. Please write documentation for this proposed load balancing algorithm first as if it were to go in the Envoy docs' supported load balancer section. That's roughly the level of detail I'm expecting here.
  2. Given that the active request count and request latencies are proportional, where and how is the existing LEAST_REQUEST load balancer in Envoy falling short and how will this new algorithm address its shortcomings?

I'll conclude by saying that I'm not opposed to introducing a new load balancing algorithm here! We just need to clearly articulate how the existing set of algorithms is insufficient for the case you presented (where there is a heterogeneous set of hosts with different capabilities). I'll also mention that even if we determine this new algorithm isn't necessary, you can always write a load balancer extension and use it in your own deployments (analogous to writing a custom filter).

tonya11en avatar Apr 05 '24 16:04 tonya11en

Hello @tonya11en, I'm glad to receive your reply. I understand the current issues and recognize that this is a necessary discussion process to ensure the healthy development of the community.

What I need to clarify is that I understand how Least Request is implemented in envoy, but what I want to express is that Least Request is usually abstracted as a service queue, and we want to find the shortest queue to ensure that the current request is completed as soon as possible (which is also how it is abstracted in nginx's blog). So I say Least Request calculates the expected end time assuming that all instances have the same processing time.

Can you be explicit about what you mean when you refer to:

"unexpected unloaded" service circuit breaker error delay

As for "unexpected unloaded", this is literal. For example, most modern application services need to read data from the database before performing complex calculations. However, if a single instance problem causes continuous failure to read database data, then it will no subsequent complex calculations be performed. At this time, the performance of this instance is "unexpected unloaded", and the delay of the request at this time is the "error delay". This "error delay" should not be calculated as normal service latency, and the instance with the problem should be avoided from being selected to trigger "service circuit breaker" (therefore, we provide a separate bias for the error rate).

Given that the active request count and request latencies are proportional, where and how is the existing LEAST_REQUEST load balancer in Envoy falling short and how will this new algorithm address its shortcomings?

As for the comparison between Least Request and PeakEWMA, I cannot share our online metrics, but I can give the approximate curves of our online services using different load balancing algorithms when a single-instance problem occurred. The service runs on about two thousand 16c32G instances (including Intel and AMD processors of different models and production batches, and hosts with different workloads, etc.), and single-instance problems cause requests to that instance to frequently time out (from real alarms).

[Graph: approximate latency curves under different load balancing algorithms]

We have gained this experience from our online metrics: although the number of active requests differs per instance, and some may be higher than others, modern services execute requests in parallel on multiple processors, so as long as the number of active requests stays below some constant N the latency does not increase with the number of active requests; the historical latency, however, does reflect the performance trend of the instance. Therefore, the PeakEWMA algorithm can forward requests to a lower-latency instance as much as possible without triggering queuing.

jizhuozhi avatar Apr 06 '24 05:04 jizhuozhi

What I need to clarify is that I understand how Least Request is implemented in envoy, but what I want to express is that Least Request is usually abstracted as a service queue, and we want to find the shortest queue to ensure that the current request is completed as soon as possible (which is also how it is abstracted in nginx's blog). So I say Least Request calculates the expected end time assuming that all instances have the same processing time.

It's fine to think of the number of outstanding requests from the Least Request balancer (LRLB) point-of-view as a queue for this discussion, but just note that the observed behavior depends on how the applications handle concurrent requests. Given this, I understand why you would say that the LRLB assumes all instances have the same processing time for a request, but it's not quite accurate to think this way. We make no assumptions about an instance's processing time.

The LRLB simply favors endpoints with fewer in-flight requests via P2C selection. This means we are balancing the number of in-flight requests to each instance from each LB's perspective. I keep bringing up Little's Law in my comments because it is useful for understanding the relationship between the avg queue size, avg request latency, and avg RPS:

avg_queue_size = avg_latency * avg_RPS

The LRLB is trying to keep the avg_queue_size balanced across the endpoints in the host set. This means if some of the endpoints have elevated avg_latency relative to others, the queue size will grow if the rate of requests to that endpoint stays the same. So to keep things balanced, fewer requests will be sent to the high-latency endpoints to keep the in-flight requests balanced.

Errors are accounted for via outlier detection, which should be on by default.

As for the comparison between Least Request and PeakEWMA, I cannot share our online metrics, but I can give the approximate curves of our online services using different load balancing algorithms when a single-instance problem occurred. The service runs on about two thousand 16c32G instances (including Intel and AMD processors of different models and production batches, and hosts with different workloads, etc.), and single-instance problems cause requests to that instance to frequently time out (from real alarms).

I'm not sure I understand this as it's missing several important details:

  • Is there a unique set of hosts being used for each LB algorithm in the graphs?
  • Is the composition of the fleet the same across each LB algorithm in the graph? For example, 20% slow hosts 80% fast, etc.
  • What were the in-flight request counts and RPS graphs showing for the instances across these tests?
  • Were the instances for all tests running the same workloads/service?

This also doesn't really explain how EWMA LBs would perform better. Can you just provide a simple scenario where the EWMA LB is expected to work better than LR LB? For example, let's say you have 3 hosts and you want to balance requests across them. What characteristics would these hosts have that Least Request will not work for you?

Also, please remember from my last comment that you can write a load balancer extension and experiment with any algorithm you want. If the algorithm you implement there is shown to behave better in your Envoy deployment, that data would be very valuable for this discussion.

tonya11en avatar Apr 08 '24 19:04 tonya11en

Not the original commenter, but I wanted to chime in with some of my opinions.

The "Load is not what you should balance: Introducing Prequal" paper goes into more detail about where latency-based (or combined latency and in-flight request) load balancing outperforms the power of two choices, especially at high load. Section 5.2, Replica selection rule, does a good job of comparing the effect of different load balancing algorithms on tail latency.

One interesting idea from the paper is to combine in-flight requests and latency into an approach called "hot and cold". To quote directly from the paper:

To minimize both latency and RIF, Prequal selects replicas using a hot-cold lexicographic (HCL) rule that labels probes in the pool as either hot or cold based on their RIF value. Prequal clients maintain an estimate of the distribution of RIF across replicas, based on recent probe responses. They classify pool elements as hot if their RIF exceeds a specified quantile (QRIF) of the estimated distribution; otherwise, cold. In replica selection, if all probes in the pool are hot, then the one with the lowest RIF is chosen; otherwise, the cold probe with the lowest latency is chosen. Our experiments (§5.2) suggest that QRIF ∈ [0.6, 0.9] is a good choice, although even 0 is effective (i.e., RIF-only control).

I think the above paragraph does a good job of explaining the situations where one might want to use latency as a signal for load balancing rather than active requests.

When the system is in steady state (meaning there is no queuing, or close to zero queuing, in the application stack), it's rare to see the majority of backends having non-zero active requests unless the load balancing is done via a central load balancer (in which case client in-flight requests == server in-flight requests) or the system is pushed to very high utilization. This becomes even more likely when the backends are small (for example: Python services running with uWSGI, which are often configured with a large replica count but a small number of workers per replica, like 4 or 8). So when the LB is presented with a choice between 2 backends with zero in-flight requests, choosing the one with lower latency could end up being better.
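For illustration, here is a minimal sketch of the hot-cold lexicographic rule described in that quote (my own reading of the paper, with made-up names; RIF is requests in flight and the threshold corresponds to the QRIF quantile):

```cpp
#include <cstddef>
#include <vector>

// Sketch of Prequal's hot-cold lexicographic (HCL) selection rule.
struct Probe {
  double rif;         // requests in flight reported by the replica
  double latency_ms;  // recent latency reported by the replica
};

// Assumes a non-empty probe pool.
size_t selectHCL(const std::vector<Probe>& pool, double rif_threshold) {
  size_t best = 0;
  bool found_cold = false;
  for (size_t i = 0; i < pool.size(); ++i) {
    const bool cold = pool[i].rif <= rif_threshold;  // below the QRIF quantile
    if (cold) {
      // Among cold replicas, pick the one with the lowest latency.
      if (!found_cold || pool[i].latency_ms < pool[best].latency_ms) {
        best = i;
        found_cold = true;
      }
    } else if (!found_cold && pool[i].rif < pool[best].rif) {
      // While everything seen so far is hot, fall back to the lowest RIF.
      best = i;
    }
  }
  return best;
}
```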

Let me know what your thoughts are on this. Personally, I would love to help with this.

Buffer0x7cd avatar May 22 '24 22:05 Buffer0x7cd

That's some convincing data! I'll need to do a more in-depth read, but from my quick pass the data in section 5.2 makes a good case for implementing the Prequal algorithm.

Rather than reinventing the wheel w.r.t. the probes, I'd recommend figuring out how the Load Reporting Service can be used. Let me know your thoughts.

@Buffer0x7cd if you are willing to take the lead on this, I'd be happy to provide guidance and review. Are you familiar enough with the codebase to talk through how you imagine the implementation to work?

tonya11en avatar May 22 '24 23:05 tonya11en

Yeah, LRS was the first thing that came to my mind when thinking about an alternative to implementing probes, although I have a few questions about including load in the feedback loop.

Currently, I can think of two approaches to include load info:

  1. As you mentioned, using LRS to propagate load to a management server, which can implement the correct algorithm and push the updates to envoy servers via EDS. Looking at some of the past issues, I came across a document which suggests a similar design. I am thinking we can use the updates from EDS to implement what they refer to in the paper as a probing pool. Each Envoy server receives a list of endpoints which it can use for host selection.

  2. Piggybacking the load information on the response in the form of a header and using the data in that header to adjust the weight of the endpoints (a rough sketch of this follows below). This seems like a popular choice for solving load balancing issues that arise from the client's view of load vs. the server's view of load, especially on systems where the number of clients can be close to the number of backend tasks (like sidecars). Some blogs I came across from Uber, Dropbox, and Netflix discuss similar strategies. In this design, we can offload the calculation of load to the servers (based on in-flight requests and latency in the last X seconds). Although I'm not sure about the implications of changing endpoint weights after each request (I am still trying to wrap my head around the load balancing implemented in Envoy, but based on some cursory reading, it seems like the weights are implemented as a min-heap when all weights are unequal), so if we try to update the weights dynamically (assuming we implement this as an Envoy filter that updates the weight as part of the response cycle), could that create issues?
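To illustrate the second approach, here is a deliberately hypothetical sketch (the header name, scaling, and smoothing are all made up; this is not an existing Envoy API): the upstream reports its own load on the response, and the client smooths it into a weight so a single sample doesn't cause thrashing.

```cpp
#include <string>

// Hypothetical: adjust a host's weight from a load value the server
// piggybacks on the response (e.g. a utilization figure in [0, 1]).
struct HostWeight {
  double weight{1.0};
};

void onResponse(HostWeight& host, const std::string& load_header_value,
                double smoothing = 0.3) {
  const double reported_load = std::stod(load_header_value);
  // Higher reported load -> lower target weight; smooth toward it gradually.
  const double target = 1.0 / (1.0 + reported_load);
  host.weight = host.weight * (1.0 - smoothing) + target * smoothing;
}
```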

Are you familiar enough with the codebase to talk through how you imagine the implementation to work?

I haven't worked with C++ since it hasn't been needed for my work for a long time, so I will probably spend a few weeks trying to get familiar with the code base.

Let me know your thoughts.

Buffer0x7cd avatar May 23 '24 18:05 Buffer0x7cd

@Buffer0x7cd don't worry about the C++ details for now. Let's first just flesh out a design for how probes would work and clarify the algorithm. I can help with the load balancer implementation details if you can take point on figuring out how probes would work.

I haven't gotten around to giving the paper more attention yet, but one of the things I want to make sure of is that this makes sense to implement in Envoy's load balancer (as opposed to it being a control plane feature). We just can't have a scenario where the control plane goes down and the load balancing algorithm stops working. If there's no way to avoid that, we should look at possibly implementing this in Istio or thinking about another way to get this kind of functionality into Envoy.

tonya11en avatar May 24 '24 17:05 tonya11en