routing-release

Index Duplication in Gorouter Leading to Routing to Stale Endpoints

Open Mrizwanshaik opened this issue 1 month ago • 2 comments

Current behavior

Description: Recently, we encountered an issue in Gorouter where a new instance endpoint was registered on the same index as an existing stale endpoint. This resulted in Gorouter routing requests to the unhealthy endpoint, leading to multiple 502 errors.

Details:

Route-emitter was down and not sending unregister messages. Because of route integrity, Gorouter retained the endpoint information and did not prune the stale endpoints.

In the meantime, Diego recreated a new instance on another cell with a new instance_id and canonical address (IP:Port). Gorouter treated this as a new endpoint and added it to the routing pool on the same index where the stale endpoint already existed. The current implementation in Gorouter does not validate the index number but only considers the canonical address and instance_id when adding endpoints.
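To illustrate the behaviour described above, here is a minimal sketch of a pool that identifies endpoints by canonical address only. It is a simplified model, not Gorouter's actual code: the `Endpoint` and `Pool` types and their methods are hypothetical, and the instance_id comparison is left out for brevity.

```go
package main

// Endpoint is a simplified, hypothetical stand-in for a Gorouter pool entry.
type Endpoint struct {
	Addr       string // canonical address, "IP:Port"
	InstanceID string
	Index      string // app instance index; stored but never compared
}

// Pool keys endpoints by canonical address only (assumption for this sketch).
type Pool struct {
	byAddr map[string]Endpoint
}

func NewPool() *Pool {
	return &Pool{byAddr: make(map[string]Endpoint)}
}

// Put adds or refreshes an endpoint. The index is not consulted, so a new
// address is always treated as a new endpoint.
func (p *Pool) Put(e Endpoint) {
	p.byAddr[e.Addr] = e
}

// CountAtIndex reports how many endpoints in the pool claim the given index.
func (p *Pool) CountAtIndex(index string) int {
	n := 0
	for _, e := range p.byAddr {
		if e.Index == index {
			n++
		}
	}
	return n
}
```

Registering a stale endpoint and then its replacement on another cell, both with index "0", leaves two entries at that index, which is the duplication reported here.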

Since mTLS is enabled for Gorouter-to-app container traffic, Gorouter does not prune stale endpoints unless they match one of the prunableClassifiers. Additionally, the requests were non-idempotent, so Gorouter did not retry them on the healthy endpoint. Eventually, we observed the prune-endpoint-failed log in Gorouter, but by then, it was too late.

Desired behavior

We propose introducing new logic in Gorouter to check if an endpoint already exists in the pool for the same index. If a new registration message is received for the same index, the existing endpoint should be updated or replaced to prevent duplicate registrations.

Affected Version

0.351.0

Mrizwanshaik, Nov 18 '25 14:11

Minor addendum for the proposal, to keep performance at the same level:

  • check if the endpoint entry exists based on the canonical address (current logic)
  • only when it does not exist, check in the endpoint map if the same app index already exists
    • delete that (stale) entry with appropriate log and add the new endpoint information.

This assumes that the app index is supposed to be unique. As the app index is managed via Cloud Controller, it should be semantically unique.
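The steps above could look roughly like the following sketch. It is not Gorouter's implementation: the `Pool`, `Endpoint`, and `slot` types are hypothetical, and keying the secondary map by app ID plus index (rather than index alone) is an assumption made here so that endpoints of different apps never collide. The sketch also assumes a known address keeps its index while it is in the pool.

```go
package main

import "log"

// Endpoint is a simplified, hypothetical stand-in for a Gorouter pool entry.
type Endpoint struct {
	Addr       string // canonical address, "IP:Port"
	InstanceID string
	AppID      string
	Index      string // app instance index
}

// slot identifies one instance position of one app (assumed to be unique).
type slot struct{ appID, index string }

type Pool struct {
	byAddr map[string]Endpoint
	bySlot map[slot]string // slot -> canonical address currently holding it
}

func NewPool() *Pool {
	return &Pool{byAddr: map[string]Endpoint{}, bySlot: map[slot]string{}}
}

// Put registers an endpoint and returns the address of the stale endpoint it
// replaced, or "" if nothing was replaced.
func (p *Pool) Put(e Endpoint) (evicted string) {
	// Step 1 (current logic): look the endpoint up by canonical address.
	if _, known := p.byAddr[e.Addr]; !known {
		// Step 2: only for unknown addresses, check whether the slot is taken.
		s := slot{e.AppID, e.Index}
		if staleAddr, taken := p.bySlot[s]; taken {
			// Step 3: drop the stale entry and log the replacement.
			delete(p.byAddr, staleAddr)
			log.Printf("replaced stale endpoint %s with %s (app %s, index %s)",
				staleAddr, e.Addr, e.AppID, e.Index)
			evicted = staleAddr
		}
		p.bySlot[s] = e.Addr
	}
	p.byAddr[e.Addr] = e
	return evicted
}

// Has reports whether the pool currently holds the given canonical address.
func (p *Pool) Has(addr string) bool {
	_, ok := p.byAddr[addr]
	return ok
}
```

Because the index lookup happens only when the address is unknown, the common case of a periodic re-registration still costs a single map lookup, which is the performance point of the addendum.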

If an unregister message is still received after this index-based replacement has taken place, the target address will not be there anymore and the message should be ignored with a debug-level log (current implementation).
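As a rough sketch of that unregister path: a late message for an address that was already replaced finds nothing to remove and is only logged at debug level. The `Pool` type and the log message text are hypothetical, chosen for illustration; the pool is reduced to a set of canonical addresses.

```go
package main

import "log/slog"

// Pool is a hypothetical, simplified pool: a set of canonical addresses.
type Pool struct {
	addrs map[string]struct{}
}

func NewPool() *Pool {
	return &Pool{addrs: make(map[string]struct{})}
}

// Put adds a canonical address to the pool.
func (p *Pool) Put(addr string) {
	p.addrs[addr] = struct{}{}
}

// Remove handles an unregister message. It returns true if an endpoint was
// removed and false if the address was not in the pool, for example because
// it had already been replaced by a newer endpoint at the same index.
func (p *Pool) Remove(addr string) bool {
	if _, ok := p.addrs[addr]; !ok {
		// Nothing to do: ignore the message, leaving only a debug-level trace.
		slog.Debug("unregister for unknown endpoint ignored", "address", addr)
		return false
	}
	delete(p.addrs, addr)
	return true
}

// Len returns the number of endpoints in the pool.
func (p *Pool) Len() int {
	return len(p.addrs)
}
```

The late unregister therefore cannot remove the healthy replacement, since the lookup is by the stale endpoint's own address.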

peanball, Nov 19 '25 10:11

We may have seen this issue yesterday, too. We had broken backends, and Gorouter did not resume normal operation after the backends came back up: it still logged backend endpoint failures. This issue would explain the strange behaviour.

marcohelmerich, Nov 21 '25 13:11

@Mrizwanshaik - Thank you for writing this up so clearly. I think this direction makes sense. Please ping the Slack channel when you have a PR ready for review.

ameowlia, Dec 04 '25 15:12