thanos Implement endpoint groups

This draft PR illustrates how an HA group of endpoints can be configured in Thanos Query.

Looking for feedback on the approach!

Fixes https://github.com/thanos-io/thanos/issues/5335

[ ] I added CHANGELOG entry for this change.
[ ] Change is not relevant to the end user.

Changes

Add the endpoint-group and endpoint-group-strict flags to Thanos Query. Requests to endpoints configured through these flags are load balanced across the group instead of being fanned out to every endpoint.
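
To make the difference between fanout and load balancing concrete, here is a minimal Go sketch. It is a toy model and not code from this PR: the `fanout` and `roundRobin` helpers and the replica addresses are hypothetical, and a real querier issues gRPC requests rather than returning address lists.

```go
package main

import "fmt"

// fanout models the default behaviour: every target receives the query,
// and the caller is expected to merge the results.
func fanout(targets []string) []string {
	out := make([]string, len(targets))
	copy(out, targets)
	return out
}

// roundRobin models load balancing across an HA group: each call returns
// a single target, cycling through the group in order.
// It returns "" when the group is empty.
func roundRobin(targets []string) func() string {
	next := 0
	return func() string {
		if len(targets) == 0 {
			return ""
		}
		t := targets[next%len(targets)]
		next++
		return t
	}
}

func main() {
	// Hypothetical replica addresses, used only for illustration.
	group := []string{"store-0:10901", "store-1:10901", "store-2:10901"}

	fmt.Println("fanout sends one query to all of:", fanout(group))

	pick := roundRobin(group)
	for i := 0; i < 4; i++ {
		fmt.Println("load balanced query", i, "goes to", pick())
	}
}
```

With fanout the cost of a query grows with the number of replicas, while with load balancing each query reaches exactly one member of the group, which is only appropriate when the members serve the same data.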

Verification

Jul 28 '22 08:07 fpetkovski

One downside of using gRPC's load balancing is that DNS resolution is done once during startup and the result is cached for a very long time, potentially forever. Addresses are re-resolved when a downstream target goes away.

This can be problematic if the endpoint group is scaled out and new targets are added. The recommended workaround is to set max_connection_age to something like 5m, which causes periodic DNS resolution but also a complete recreation of all connections to the endpoint group.

Jul 28 '22 09:07 fpetkovski

cc @saswatamcode @SrushtiSapkale

Jul 29 '22 11:07 fpetkovski

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Oct 01 '22 10:10 stale[bot]

I am not sure what the status of the LFX project is, but this could be a short-term solution that unblocks us until we have a better implementation.

Nov 02 '22 05:11 fpetkovski

Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days. Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.

Jan 07 '23 21:01 stale[bot]

Currently very interested in this work. What are the blockers for this being merged?

Jan 23 '23 14:01 trevorriles

There was an LFX project that was a superset of this functionality: https://github.com/thanos-io/thanos/pull/5505

Maybe @saswatamcode or @matej-g have some information on how far that got and whether it makes sense to continue working on this PR.

Jan 23 '23 14:01 fpetkovski

> There was an LFX project that was a superset of this functionality: #5505
>
> Maybe @saswatamcode or @matej-g have some information on how far that got and whether it makes sense to continue working on this PR.

Thanks for the link!

Jan 23 '23 14:01 trevorriles

Yes, we made some progress on that end, but I think there were some more items left as well. This is also related to https://github.com/thanos-io/thanos/issues/2600

I'll try to organize them a bit, and move this forward! 🙂

Jan 24 '23 05:01 saswatamcode

I see that HA endpoints are out of scope of that proposal: https://github.com/thanos-io/thanos/pull/5505/files#diff-5dad1d444b473dcd0b72f4770b3ba03089499cfaf027205c83914f63124644e5R38. Are you looking to cover them in your work?

Jan 24 '23 07:01 fpetkovski

@SuperQ brought this feature up yesterday during contributor hours. Do we want to proceed with something like this until we have more extensive endpoint configuration? It would be great not to have to run Envoy for load balancing gRPC connections, since it's another component in the query path that can break. cc @saswatamcode @bwplotka

Feb 03 '23 17:02 fpetkovski

I marked these flags as experimental, since there could be some hidden dragons that we'll uncover over time.

Feb 09 '23 08:02 fpetkovski
