linkerd2
Resolve service mirroring delay caused by RepairEndpoints
RepairEndpoints runs periodically to reconcile services in a multicluster environment. When the number of exported services approaches 500, RepairEndpoints runs nearly continuously, which delays service mirroring. We need to investigate whether we still need to run RepairEndpoints periodically and whether there are other synchronization methods we could use.
I looked into how we use RepairEndpoints, and we do need to retain its behavior; however, it currently runs much more frequently than necessary.
Important findings:
- RepairEndpoints is called at startup.
- RepairEndpoints is called after a link is re-established.
- RepairEndpoints is called periodically. This behavior was added in PR #4588 for endpoints exposed as DNS hostnames whose target IP addresses change.
Potential ways to improve:
- Do not schedule periodic RepairEndpoints if endpoints are not DNS hostnames.
- From within `repairEndpoints`, return if the endpoint DNS target is the same as the previous invocation.
- Use the DNS TTL to schedule the next invocation of RepairEndpoints.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
This is worth keeping open. We have a few good solutions listed above that would be worth exploring.