
Linkerd GRPC load balancing doesn't work with Calico eBPF

Open nstrashevskii opened this issue 1 year ago • 2 comments

In https://github.com/projectcalico/calico/issues/6908 it was recommended to set the parameters bpfConnectTimeLoadBalancing=Disabled and bpfHostNetworkedNATWithoutCTLB=Enabled (docs). After applying these settings, gRPC load balancing still does not work. Can you tell me what else could be wrong?

Expected Behavior

Uniform balancing of gRPC requests across backend pods.

Current Behavior

Uneven load distribution between servers during gRPC requests.

Steps to Reproduce (for bugs)

```yaml
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
spec:
  bpfConnectTimeLoadBalancing: Disabled
  bpfEnabled: true
  bpfExternalServiceMode: DSR
  bpfHostNetworkedNATWithoutCTLB: Enabled
  bpfLogLevel: ""
  logSeverityScreen: Info
  prometheusMetricsEnabled: true
  reportingInterval: 0s
  vxlanEnabled: true
  vxlanPort: 4789
  vxlanVNI: 4096
```

Context

Your Environment

  • Calico version: 3.27
  • Orchestrator version: Kubernetes 1.22.15

nstrashevskii avatar Mar 18 '24 14:03 nstrashevskii

I am not quite sure how gRPC load balancing in linkerd works, but it seems you still have connectivity and what you observe is a non-uniform distribution. Do you see some connections failing? Is there an easy how-to to reproduce the issue?

tomastigera avatar Mar 18 '24 21:03 tomastigera

We have 6 pods in our test environment receiving gRPC load; the screenshot shows that the distribution is not uniform. Settings are as in the description above. (screenshot)

> I am not quite sure how gRPC load balancing in linkerd works, but it seems you still have connectivity and what you observe is a non-uniform distribution. Do you see some connections failing? Is there an easy how-to to reproduce the issue?

bvbvr avatar Mar 28 '24 08:03 bvbvr

Again, do you see connections failing?

I assume that linkerd is responsible for the uniform distribution. Calico eBPF does not guarantee a uniform distribution when resolving services, if linkerd relies on it in any way (I do not think it does).

tomastigera avatar Apr 01 '24 18:04 tomastigera

Linkerd load balances gRPC per request, selecting the back end to use based on an exponentially-weighted moving average of latency -- so it'll tend to pick low-latency endpoints, but makes no guarantee of a uniform distribution. Linkerd expects workloads to make requests to Service IP addresses, and Linkerd will itself always route requests to endpoint IPs -- if a workload sends a request directly to an endpoint IP, by default Linkerd will honor that rather than load balancing (you can change this behavior, but the default is to assume that the workload knows what it's doing).

So, like @tomastigera, I'm curious whether connections are failing, and curious about how things look over longer time windows, but I'd need more information to be sure that something is actually broken.

kflynn avatar Apr 04 '24 15:04 kflynn
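
To make the behavior described above concrete, here is a rough, illustrative sketch of latency-biased endpoint selection using an exponentially-weighted moving average. This is not Linkerd's actual implementation (its peak-EWMA algorithm and data structures differ); the names and parameters here are hypothetical:

```python
import random

class EwmaBalancer:
    """Illustrative latency-aware balancer: tends to pick the endpoint
    whose exponentially-weighted moving average (EWMA) latency is lowest.
    Hypothetical sketch, not Linkerd's real algorithm."""

    def __init__(self, endpoints, alpha=0.3):
        self.alpha = alpha                       # EWMA smoothing factor
        self.ewma = {ep: 0.0 for ep in endpoints}

    def record(self, endpoint, latency_ms):
        # Fold an observed request latency into the endpoint's moving average.
        prev = self.ewma[endpoint]
        self.ewma[endpoint] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def pick(self):
        # Power-of-two-choices: sample two endpoints, take the lower EWMA.
        a, b = random.sample(list(self.ewma), 2)
        return a if self.ewma[a] <= self.ewma[b] else b

balancer = EwmaBalancer(["pod-1", "pod-2", "pod-3"])
balancer.record("pod-1", 5.0)    # fast backend
balancer.record("pod-2", 50.0)   # slow backend
balancer.record("pod-3", 50.0)   # slow backend
```

Note that a balancer like this deliberately skews traffic toward low-latency backends, so an uneven RPS distribution across pods is expected behavior, not necessarily a bug.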

But the problem is still present. Some pods are under load, others are not (looking at the RPS distribution). What other settings can we try for gRPC load balancing? Which Calico logs and which Linkerd logs should I provide?

bvbvr avatar Apr 07 '24 14:04 bvbvr

@bvbvr It seems that neither linkerd nor calico gives you a guarantee of uniform load distribution. There is no knob that would help you. Do you see a similar load distribution when using calico in iptables mode, or is it uniform in that case? If it is uniform with iptables, we may start looking at why eBPF mode makes a difference, why some backends have higher latency than others, etc. We do not provide support for Linkerd. So far it seems like the dataplane delivers your connections.

I am closing the issue for now, but feel free to reopen it if you can pinpoint the problem in calico. We are happy to help.

tomastigera avatar Apr 08 '24 18:04 tomastigera