
Resolve service mirroring delay caused by RepairEndpoints

Open jeremychase opened this issue 1 year ago • 3 comments

RepairEndpoints runs periodically to reconcile services in a multicluster environment. When the number of exported services approaches 500, RepairEndpoints runs nearly continuously, which delays the mirroring of services. We need to investigate whether we still need to run RepairEndpoints periodically and whether there are other synchronization methods we could use.

jeremychase avatar Sep 08 '22 16:09 jeremychase

I looked into how we use RepairEndpoints, and we do need to retain its behavior; however, it currently runs much more frequently than necessary.

Important findings:

  • RepairEndpoints is called at startup.
  • RepairEndpoints is called after a link is re-established.
  • RepairEndpoints is called periodically. This behavior was added in PR #4588 for endpoints exposed as DNS hostnames whose target IP addresses change.

Potential ways to improve:

  • Do not schedule periodic RepairEndpoints if endpoints are not DNS hostnames.
  • From within repairEndpoints, return early if the endpoint's DNS target is the same as in the previous invocation.
  • Use the DNS TTL to schedule the next invocation of RepairEndpoints.
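A minimal sketch of the first two ideas, assuming simplified types; this is not the service mirror's actual code. The names `endpoint`, `needsPeriodicRepair`, `sameTargets`, `repairer`, and `repairIfChanged` are hypothetical, and the real reconciliation is replaced by a counter. Only `net.ParseIP` and `sort.Strings` are real library calls.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// endpoint is a hypothetical stand-in for a mirrored gateway address.
type endpoint struct {
	Address string // either an IP literal or a DNS hostname
}

// needsPeriodicRepair reports whether any endpoint is a DNS hostname
// (not an IP literal), since only those can resolve to new targets later.
func needsPeriodicRepair(eps []endpoint) bool {
	for _, ep := range eps {
		if net.ParseIP(ep.Address) == nil {
			return true
		}
	}
	return false
}

// sameTargets compares two sets of resolved addresses, ignoring order.
func sameTargets(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// repairer remembers the targets seen on the previous invocation.
type repairer struct {
	lastTargets map[string][]string
	resolve     func(host string) ([]string, error) // assumption: e.g. net.LookupHost
	repairs     int                                 // counts how often the stubbed repair ran
}

// repairIfChanged resolves host and runs the stubbed repair only when the
// resolved targets differ from the previous call. It reports whether it ran.
func (r *repairer) repairIfChanged(host string) (bool, error) {
	targets, err := r.resolve(host)
	if err != nil {
		return false, err
	}
	if prev, ok := r.lastTargets[host]; ok && sameTargets(prev, targets) {
		return false, nil // unchanged: skip the expensive reconciliation
	}
	r.lastTargets[host] = targets
	r.repairs++ // placeholder for the real endpoint reconciliation
	return true, nil
}

func main() {
	// Fake resolver: the target changes only on the third lookup.
	answers := [][]string{{"10.0.0.1"}, {"10.0.0.1"}, {"10.0.0.2"}}
	call := 0
	r := &repairer{
		lastTargets: map[string][]string{},
		resolve: func(string) ([]string, error) {
			a := answers[call]
			call++
			return a, nil
		},
	}
	fmt.Println(needsPeriodicRepair([]endpoint{{Address: "10.0.0.1"}}))
	for range answers {
		ran, _ := r.repairIfChanged("gateway.example.com")
		fmt.Println(ran)
	}
}
```

The third idea is left out of the sketch: Go's standard `net` resolver returns addresses without record TTLs, so TTL-based scheduling would likely need a DNS client that exposes them.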

jeremychase avatar Sep 13 '22 21:09 jeremychase

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 15 '22 01:12 stale[bot]

This is worth keeping open. We have a few good solutions listed above that would be worth exploring.

kleimkuhler avatar Dec 15 '22 15:12 kleimkuhler