
least_request LB strategy causes full TPS drop when upstream endpoint hang

Open jiangzhchyeah opened this issue 5 months ago • 7 comments

Is your feature request related to a problem?

Yes. The least_request load balancing strategy can cause a complete TPS drop when a single upstream endpoint hangs. This occurs due to two primary factors:

  1. Long request timeouts (like 30 seconds or more) make this much worse.
  2. As shown in LeastRequestLoadBalancer$ReadyPicker.nextChildToUse(), the N_CHOICES selection method randomly picks two endpoints and may select the same unhealthy endpoint twice instead of two distinct endpoints (see the sketch after this list).

When this occurs, all traffic is routed to the hung endpoint, causing a full service degradation, which is unacceptable.
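
For context, here is a minimal, hypothetical sketch of a power-of-two-choices selection loop (not the actual LeastRequestLoadBalancer code; the Endpoint type and field names are made up for illustration). Because the two draws are independent, nothing prevents them from landing on the same endpoint:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

final class P2cSketch {
  static final class Endpoint {
    final String name;
    final int outstandingRequests;

    Endpoint(String name, int outstandingRequests) {
      this.name = name;
      this.outstandingRequests = outstandingRequests;
    }
  }

  // With 3 endpoints, both independent draws hit the same (possibly hung)
  // endpoint with probability 1/3 * 1/3 = 1/9, and that endpoint then wins
  // the comparison by default.
  static Endpoint pick(List<Endpoint> endpoints, int choiceCount) {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    Endpoint best = null;
    for (int i = 0; i < choiceCount; i++) {
      Endpoint candidate = endpoints.get(rnd.nextInt(endpoints.size()));
      if (best == null || candidate.outstandingRequests < best.outstandingRequests) {
        best = candidate;
      }
    }
    return best;
  }
}
```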

Describe the solution you'd like

  1. Support for the FULL_SCAN mode of the xDS LEAST_REQUEST load balancer policy, which would check all endpoints before picking one.
  2. Adjust the N_CHOICES algorithm to prevent it from picking the same endpoint twice, for example by recording which endpoints were already chosen.

jiangzhchyeah commented on Jul 21 '25

  1. FULL_SCAN mode is not necessary in the majority of cases, as stated in the envoyproxy documentation for the policy that gRPC also implemented.

  2. Regarding avoiding picking the same endpoint twice, this is a problem acknowledged in gRFC A84 for LRS, which mentions that Envoy had a PR attempting this that was closed in the end. Potential performance hits and alternatives were discussed in that PR, but it's not clear why it wasn't followed through to completion. We could look into this more from the gRPC side.

kannanjgithub commented on Jul 22 '25

a complete TPS drop when a single upstream endpoint hangs

Nonsense. Only the individual RPCs that are unlucky enough to randomly select that one endpoint twice are impacted. Nothing near "all traffic"

Describe the solution you'd like

Both solutions offered come with performance costs that we don't want to accept. Instead, you can consider:

  1. Increasing the N choices used. Bumping it to 3 makes selecting the same bad endpoint twice less likely than with 2, and 5 less likely still, with diminishing returns.
  2. Detect and deal with the broken endpoint. You don't describe what is wrong with the endpoint, but for certain failures enabling keepalive on the channel will detect when connections are bad and disconnect them, which will also cause the LB policy to avoid that endpoint until it is working again (a configuration sketch follows this list).
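
For reference, enabling keepalive on a grpc-java channel looks roughly like the sketch below; the target and timing values are illustrative placeholders, not recommendations:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

final class KeepaliveSketch {
  static ManagedChannel buildChannel() {
    return ManagedChannelBuilder.forTarget("dns:///backend.example.com:50051")  // placeholder target
        .keepAliveTime(10, TimeUnit.SECONDS)      // send an HTTP/2 ping after this much transport idle time
        .keepAliveTimeout(20, TimeUnit.SECONDS)   // drop the connection if the ping is not acknowledged
        .keepAliveWithoutCalls(true)              // keep probing even when no RPCs are active
        .build();
  }
}
```

With a frozen (SIGSTOPped) backend the kernel may still ACK at the TCP level while HTTP/2 pings go unanswered, so it is the keepalive timeout that eventually marks the connection as dead.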

ejona86 commented on Jul 24 '25

Hi @ejona86, here are my reproduction steps; they might explain why there is a complete TPS drop when a single upstream endpoint hangs.

Reproduction Steps:

  1. Set up
    • Run 1 gRPC client with least_request load balancing policy.
    • Connect it to 3 upstream gRPC servers.
    • Configure:
      • Request timeout: 120 seconds
      • Keepalive: 10s interval + 20s timeout
  2. Load test:
    • Continuously send blocking requests at a fixed concurrency of 100. (All servers respond quickly and normally at this stage; a minimal client sketch follows this list.)
  3. Simulate failure:
    • Send kill -19 (SIGSTOP) to freeze one backend server
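
A rough sketch of the client side of this setup, under stated assumptions: the target string is a placeholder, the stub types come from the grpc-java hello-world example and stand in for any blocking stub, and the least_request policy selection (normally delivered via xDS configuration) is omitted:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.StatusRuntimeException;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloRequest;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Closed-loop load generator: 100 worker threads issue blocking RPCs back to
// back, each with a 120 s deadline, mirroring steps 1 and 2 above.
final class ReproClientSketch {
  public static void main(String[] args) {
    ManagedChannel channel = ManagedChannelBuilder
        .forTarget("dns:///backend.example.com:50051")  // placeholder target
        .keepAliveTime(10, TimeUnit.SECONDS)
        .keepAliveTimeout(20, TimeUnit.SECONDS)
        .build();
    ExecutorService pool = Executors.newFixedThreadPool(100);  // fixed concurrency
    for (int i = 0; i < 100; i++) {
      pool.execute(() -> {
        while (true) {
          try {
            GreeterGrpc.newBlockingStub(channel)
                .withDeadlineAfter(120, TimeUnit.SECONDS)  // long request timeout
                .sayHello(HelloRequest.newBuilder().setName("ping").build());
          } catch (StatusRuntimeException e) {
            // count as a failed request and keep looping
          }
        }
      });
    }
  }
}
```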

What Happens:

  1. The client's TPS drops to ZERO in seconds because:
    • The load balancer gets "stuck" sending all traffic to the frozen server after several rounds of requests complete.
    • Healthy servers receive no requests.
  2. After about 20-30 seconds, all traffic is routed to the two healthy servers and the client's TPS recovers to normal.

This also happens in Envoy (mentioned in https://github.com/envoyproxy/envoy/issues/39737).

jiangzhchyeah commented on Jul 25 '25

From your observation, what is the reason the two healthy servers were not receiving any traffic, since picking the target endpoint is done on a per-RPC basis?

kannanjgithub commented on Aug 05 '25

@kannanjgithub, after many rounds, all the concurrent RPCs end up stuck waiting on the bad server. Each "round", 1/9 of the RPCs get assigned to the bad server (with 3 endpoints, both random choices land on the same bad one with probability 1/3 × 1/3 = 1/9). I expect the RPC completion time was much faster than the 120s RPC timeout. (It's because the application is closed-loop with a fixed concurrency. The application gets hung given enough time and poor enough RPC timeouts. The channel isn't hung.)
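
A back-of-the-envelope sketch of that absorption effect (not from the thread): assuming 3 endpoints and independent choices, each completed request has a 1/9 chance of being reassigned to the hung backend and waiting there, so the expected fraction of workers still cycling after k completions is (8/9)^k:

```java
// Prints how quickly a fixed-concurrency, closed-loop client is expected to
// get absorbed by the hung backend when each completion carries a 1/9 risk.
public final class AbsorptionSketch {
  public static void main(String[] args) {
    double stillFree = 1.0;
    for (int completed = 1; completed <= 60; completed++) {
      stillFree *= 8.0 / 9.0;  // survive one more reassignment
      if (completed % 15 == 0) {
        System.out.printf("after %d requests per worker: ~%.1f%% still unstuck%n",
            completed, stillFree * 100);
      }
    }
  }
}
```

With sub-second completion times, that is consistent with the reported TPS collapsing within seconds while the stuck RPCs sit on the 120 s deadline.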

ejona86 commented on Aug 05 '25

@ejona86, exactly. The combination of blocking calls + fixed concurrency + long RPC timeouts + 1/9 of the RPCs getting assigned to the bad server leads to this issue.

jiangzhchyeah commented on Aug 07 '25

Both solutions offered come with performance costs that we don't want to accept. Instead, you can consider:

@ejona86 If it's fine to special-case when N_CHOICES == 2, we could adapt the picker to pick two distinct endpoints the same way finagle does it. The question then becomes why one would pick any other value for N_CHOICES 😄
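
For illustration, a minimal sketch of that distinct-pair idea (names are made up; this is not a proposed patch): draw the first index uniformly, then draw the second from the remaining n - 1 slots by offsetting, so the two candidates can never coincide.

```java
import java.util.concurrent.ThreadLocalRandom;

final class DistinctP2cSketch {
  // Returns the index of the chosen endpoint given per-endpoint outstanding
  // request counts. The second draw covers only the other n - 1 indices, so
  // it can never equal the first.
  static int pickTwoDistinct(int[] outstanding) {
    int n = outstanding.length;
    if (n == 1) {
      return 0;
    }
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    int first = rnd.nextInt(n);
    int second = (first + 1 + rnd.nextInt(n - 1)) % n;
    return outstanding[first] <= outstanding[second] ? first : second;
  }
}
```

This stays O(1) for the two-choice case, but it doesn't generalize cleanly to N_CHOICES > 2 without rejection sampling or a partial shuffle, which is presumably where the performance-cost concern comes in.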

tommyulfsparre commented on Aug 09 '25