Linkerd gRPC load balancing doesn't work with Calico eBPF
In https://github.com/projectcalico/calico/issues/6908 it was recommended to set bpfConnectTimeLoadBalancing=Disabled and bpfHostNetworkedNATWithoutCTLB=Enabled (docs). After applying those settings, gRPC load balancing still does not work. What else could be wrong?
Expected Behavior
Uniform balancing of gRPC requests across backend pods
Current Behavior
Uneven load distribution between server pods for gRPC requests
Steps to Reproduce (for bugs)
```yaml
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default   # cluster-wide Felix settings live in the resource named "default"
spec:
  bpfConnectTimeLoadBalancing: Disabled
  bpfEnabled: true
  bpfExternalServiceMode: DSR
  bpfHostNetworkedNATWithoutCTLB: Enabled
  bpfLogLevel: ""
  logSeverityScreen: Info
  prometheusMetricsEnabled: true
  reportingInterval: 0s
  vxlanEnabled: true
  vxlanPort: 4789
  vxlanVNI: 4096
```
Context
Your Environment
- Calico version 3.27
- Orchestrator version: Kubernetes 1.22.15
I am not quite sure how gRPC load balancing in Linkerd works, but it seems you still have connectivity and you observe non-uniform distribution. Do you see some connections failing? Is there an easy how-to to reproduce the issue?
We have 6 pods in our test environment receiving gRPC load; the screenshot shows that the distribution is not uniform. The settings are as in the description above.
> I am not quite sure how gRPC load balancing in Linkerd works, but it seems you still have connectivity and you observe non-uniform distribution. Do you see some connections failing? Is there an easy how-to to reproduce the issue?
Again, do you see connections failing?
I assume that Linkerd is responsible for the uniform distribution. Calico eBPF does not guarantee uniform distribution when resolving services, if Linkerd relies on it in any way (I do not think it does).
Linkerd load balances gRPC per request, selecting the back end to use based on an exponentially-weighted moving average of latency -- so it'll tend to pick low-latency endpoints, but makes no guarantee of a uniform distribution. Linkerd expects workloads to make requests to Service IP addresses, and Linkerd will itself always route requests to endpoint IPs -- if a workload sends a request directly to an endpoint IP, by default Linkerd will honor that rather than load balancing (you can change this behavior, but the default is to assume that the workload knows what it's doing).
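To illustrate why that can look uneven, here is a minimal sketch of the general idea (power-of-two-choices selection over an EWMA of observed latency). It is not Linkerd's actual implementation; the addresses, smoothing factor, and simulated latencies below are invented for the example:

```python
# Illustrative sketch only, not Linkerd's code: pick a backend per request using
# "power of two choices" over an exponentially weighted moving average (EWMA)
# of observed latency. Addresses and constants are invented for the example.
import random
from collections import Counter


class Endpoint:
    def __init__(self, addr, alpha=0.3):
        self.addr = addr
        self.alpha = alpha            # smoothing factor (Linkerd actually decays by time)
        self.ewma_latency = None      # smoothed latency estimate, seconds

    def record(self, latency_s):
        """Fold a new latency sample into the moving average."""
        if self.ewma_latency is None:
            self.ewma_latency = latency_s
        else:
            self.ewma_latency += self.alpha * (latency_s - self.ewma_latency)

    def cost(self):
        return self.ewma_latency if self.ewma_latency is not None else 0.0


def pick(endpoints):
    """Power of two choices: sample two endpoints, keep the lower-latency one."""
    a, b = random.sample(endpoints, 2)
    return a if a.cost() <= b.cost() else b


if __name__ == "__main__":
    backends = [Endpoint(f"10.0.0.{i}:50051") for i in range(1, 7)]  # 6 pods, as in this issue
    hits = Counter()
    for _ in range(10_000):
        ep = pick(backends)
        hits[ep.addr] += 1
        # Pretend one backend is slower; it quickly loses most power-of-two
        # contests, so the final counts are deliberately far from uniform.
        ep.record(0.005 if ep.addr == "10.0.0.1:50051" else 0.001)
    for ep in backends:
        print(ep.addr, hits[ep.addr])
```

Once one backend is observed to be slower, it loses almost every pairwise comparison, so its request count falls well below 1/6 of the traffic even though nothing is actually broken.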
So, with @tomastigera, I'm curious whether connections are failing, and curious about how things look over longer time windows, but I'd need more information to be sure that something is actually broken.
But the problem is still present. Some pods are under load, others are not (if you look at the RPS distribution). What other settings can be tested for gRPC load balancing? Which Calico logs and which Linkerd logs should I provide?
@bvbvr It seems that neither Linkerd nor Calico gives you a guarantee of uniform load distribution. There is no knob that would help you. Do you see a similar load distribution when using Calico in iptables mode, or do you see uniform distribution in that case? If it is uniform with iptables, we may start looking at why eBPF mode makes a difference, why some backends have higher latency than others, etc. We do not provide support for Linkerd. So far it seems like the dataplane delivers your connections.
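For that comparison, a minimal sketch of switching back to the iptables dataplane is to flip bpfEnabled in the same FelixConfiguration (note: if kube-proxy was disabled when eBPF mode was enabled, it needs to be re-enabled first, otherwise Service traffic will break):

```yaml
# Sketch for A/B comparison only: switch the dataplane back to iptables.
# Assumes kube-proxy is running again before eBPF mode is turned off.
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  bpfEnabled: false
```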
I am closing the issue for now, but feel free to reopen if you can pinpoint the problem to Calico. We are happy to help.