
investigation: traffic distribution in kcp-front-proxy to shard replica connections

embik opened this issue 4 months ago

We have had a report of traffic imbalance between front-proxy and kcp shard replicas. While we have no details at this time, it makes sense to investigate this further.

Current Situation

The typical deployment of kcp-front-proxy with multiple replicas for the same shard is via the Helm chart. There, we've configured front-proxy routing to go to the Service DNS name for the kcp root shard. This means that front-proxy doesn't know about individual shard replicas and sends traffic to the Service's virtual cluster IP.

Given that front-proxy doesn't really do load balancing here but offloads it to the Service, this might not be ideal for a component that is primarily concerned with being the "front door" of a (global) kcp setup. Essentially, front-proxy does load balancing by shard, but not by shard replica.

There is some reason to believe that the type of incoming connections makes it hard for Kubernetes' Service load balancing to do a good job. One likely problem is watch calls, which are kept open and are therefore long-lived connections; since a Service only balances when a connection is established, such connections can pin a lot of traffic to a single replica.

Investigation

The person picking up this ticket should investigate, see if we can get meaningful traffic metrics out of the shard replicas, and think about what could be changed to improve the situation. One suggestion from the community call was to look into how the Kubernetes aggregation layer proxies requests, since it appears to resolve DNS to individual endpoints and distribute load itself. Maybe we can reuse that logic.

embik avatar Jul 31 '25 15:07 embik

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kcp-ci-bot avatar Nov 02 '25 20:11 kcp-ci-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kcp-ci-bot avatar Dec 02 '25 20:12 kcp-ci-bot

/remove-lifecycle rotten

ntnn avatar Dec 02 '25 20:12 ntnn