investigation: traffic distribution in kcp-front-proxy to shard replica connections
We have had a report of traffic imbalance between front-proxy and kcp shard replicas. While we have no details at this time, it makes sense to investigate this further.
Current Situation
The typical way to deploy kcp-front-proxy in front of multiple replicas of the same shard is the Helm chart. In the Helm chart, we've configured front-proxy routing to point at the Service DNS name of the kcp root shard. This means that front-proxy doesn't know about individual shard replicas and sends traffic to the Service's virtual cluster IP.
Given that front-proxy doesn't do any load balancing here but offloads it to the Service, this might not be ideal for a component that is primarily concerned with being the "front door" of a (global) kcp setup. In short, front-proxy distributes load by shard, but not by shard replica.
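To make the routing situation concrete, here is a minimal Go sketch of shard-level routing as described above. The routing table, the `backendForShard` helper and the Service name `root.kcp.svc.cluster.local` are hypothetical and not kcp's actual front-proxy configuration; the point is only that the lookup ends at one URL per shard, so the choice of replica is left to the Service.

```go
package main

import (
	"fmt"
	"net/url"
)

// shardBackends is a hypothetical routing table with one backend URL per shard.
// With the Helm chart setup, that URL is the shard's Service DNS name, so all
// replicas of "root" hide behind a single cluster IP.
var shardBackends = map[string]string{
	"root": "https://root.kcp.svc.cluster.local:6443", // hypothetical Service name
}

// backendForShard resolves the upstream for a shard. There is no replica-level
// decision here: front-proxy picks a shard, and which replica serves the
// request is decided when the connection to the cluster IP is established.
func backendForShard(shard string) (*url.URL, error) {
	raw, ok := shardBackends[shard]
	if !ok {
		return nil, fmt.Errorf("unknown shard %q", shard)
	}
	return url.Parse(raw)
}

func main() {
	u, err := backendForShard("root")
	if err != nil {
		panic(err)
	}
	fmt.Println(u.Host) // root.kcp.svc.cluster.local:6443
}
```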
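There's some reason to believe that the type of incoming connections makes it hard for Kubernetes' Service load balancing to do a good job. Service load balancing picks an endpoint once per TCP connection, not per request, so one likely problem is watch calls: they are kept open and therefore result in long-lived connections that stay pinned to whichever replica they first landed on.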
Investigation
The person picking up this ticket should investigate the problem, check whether we can get meaningful traffic metrics out of the shard replicas, and think about what could be changed to improve the situation. One of the suggestions from the community call was to look into how the Kubernetes aggregation layer proxies requests, since it appears to resolve the Service to individual endpoints and distribute load itself. Maybe we can reuse that logic.