thanos
Implement endpoint groups
This draft PR illustrates how an HA group of endpoints can be configured in Thanos Query.
Looking for feedback on the approach!
Fixes https://github.com/thanos-io/thanos/issues/5335
- [ ] I added CHANGELOG entry for this change.
- [ ] Change is not relevant to the end user.
Changes
Add the flags `endpoint-group` and `endpoint-group-strict` to Thanos Query. Requests to an endpoint group are load balanced across the group's targets instead of being fanned out to every target.
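To make the fanout versus load-balancing distinction concrete, here is a minimal, stdlib-only Go sketch. It is not the code from this PR: the `fanout` and `roundRobin` names and the replica addresses are hypothetical, and the real balancing is delegated to gRPC. The point is only that a regular endpoint list is queried in full on every request, while a group behaves like one logical endpoint whose requests rotate over its targets.

```go
package main

import "fmt"

// fanout returns every target: each request is sent to all endpoints.
func fanout(targets []string) []string {
	return append([]string(nil), targets...)
}

// roundRobin models an endpoint group: each request goes to one target.
type roundRobin struct {
	targets []string
	next    int
}

// pick returns the next target in rotation.
func (r *roundRobin) pick() string {
	t := r.targets[r.next%len(r.targets)]
	r.next++
	return t
}

func main() {
	replicas := []string{"store-0:10901", "store-1:10901"} // hypothetical addresses
	fmt.Println("fanout:", fanout(replicas))

	group := &roundRobin{targets: replicas}
	for i := 0; i < 3; i++ {
		fmt.Println("group request", i, "->", group.pick())
	}
}
```

With HA replicas that hold the same data, picking one target per request avoids querying the same series several times.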
Verification
One downside of using load balancing from gRPC is that DNS resolution is done once during startup and the result is cached for a very long time, potentially forever. Addresses are only re-resolved when a downstream target goes away.
This can be problematic if the endpoint group is scaled out and new targets are added, because the new targets are not picked up. The recommended workaround is to set `max_connection_age` on the targets to something like 5m. This causes periodic DNS resolution, but it also recreates all connections to the endpoint group.
cc @saswatamcode @SrushtiSapkale
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
I am not sure what the status of the LFX project is, but this could be a short term solution that gets us unblocked until we have a better implementation.
Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days.
Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.
Currently very interested in this work. What are the blockers for this being merged?
There was an LFX project that was a superset of this functionality: https://github.com/thanos-io/thanos/pull/5505
Maybe @saswatamcode or @matej-g have some information on how far that got and whether it makes sense to continue working on this PR.
Thanks for the link!
Yes, we made some progress on that end, but I think there were some more items left as well. This is also related to https://github.com/thanos-io/thanos/issues/2600
I'll try to organize them a bit, and move this forward! 🙂
I see that HA endpoints are out of scope of that proposal: https://github.com/thanos-io/thanos/pull/5505/files#diff-5dad1d444b473dcd0b72f4770b3ba03089499cfaf027205c83914f63124644e5R38. Are you looking to cover them in your work?
@SuperQ brought this feature up yesterday during contributor hours. Do we want to proceed with something like this until we have more extensive endpoint configuration? It would be great not to have to run Envoy for load balancing gRPC connections, since it is another component in the query path that can break. cc @saswatamcode @bwplotka
I marked these flags as experimental, since there could be some hidden dragons that we'll uncover over time.