
Ruler, Querier: Increase frequency of gRPC healthchecks

Open dimitarvdimitrov opened this issue 3 years ago • 9 comments

Signed-off-by: Dimitar Dimitrov [email protected]

What this PR does

This uses the same frequency and timeout that the querier uses for its connections to the query-frontends. This will help in cases where a store-gateway becomes unresponsive for a prolonged period of time: the querier will remove the connection to that store-gateway more quickly, so fewer new queries will go to it and existing queries will fail faster.

Does the same for the ruler-ruler clients (used to get state of running rules in other rulers) for good measure.

Which issue(s) this PR fixes or relates to

Fixes NA

Checklist

  • NA Tests updated
  • NA Documentation added
  • [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

dimitarvdimitrov avatar Oct 07 '22 18:10 dimitarvdimitrov

The querier will remove the connection to that store-gateway more quickly, so fewer new queries will go to it and existing queries will fail faster.

The querier uses the store-gateway ring to find which store-gateways should be queried. If a gRPC connection doesn't exist in the pool, the querier will try to create one (all of that happens in this function). I don't quite see how closing connections faster will change the querier's logic for finding which store-gateways to query.
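The lazy-creation behaviour described above can be sketched roughly as follows. This is a minimal, hypothetical stand-in for the dskit client pool, not the real implementation; all names are illustrative. The point is that closing a connection only forces a re-dial on the next lookup, it does not change which store-gateways the ring selects:

```go
package main

import (
	"fmt"
	"sync"
)

// clientPool is a toy sketch of a lazily-populated client pool:
// a client is created only on first use for a given address.
type clientPool struct {
	mu      sync.Mutex
	clients map[string]string // addr -> client (a string stands in for a gRPC client)
	dials   int               // counts how many times we had to dial
}

func (p *clientPool) getClientFor(addr string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.clients[addr]; ok {
		return c // reuse the pooled connection
	}
	p.dials++ // dial only when no pooled connection exists
	c := "conn-to-" + addr
	p.clients[addr] = c
	return c
}

func main() {
	p := &clientPool{clients: map[string]string{}}
	p.getClientFor("store-gateway-0:9095")
	p.getClientFor("store-gateway-0:9095") // reused, no second dial
	fmt.Println(p.dials)                   // prints 1
}
```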

pstibrany avatar Oct 10 '22 07:10 pstibrany

Leaving technical approval to an engineer. Approved from a docs perspective to get this off my review list. :)

osg-grafana avatar Oct 10 '22 08:10 osg-grafana

The querier will remove the connection to that store-gateway more quickly, so fewer new queries will go to it and existing queries will fail faster.

The querier uses the store-gateway ring to find which store-gateways should be queried. If a gRPC connection doesn't exist in the pool, the querier will try to create one (all of that happens in this function). I don't quite see how closing connections faster will change the querier's logic for finding which store-gateways to query.

Yeah, that's a good point. The querier will only be forced to reestablish the connection, and dialing a new connection is unfortunately not bounded by a timeout. We can use DialContext with a context that has an expiration and also use WithBlock() here: https://github.com/grafana/mimir/blob/da636a69aa0401f6f590563d98ee2510d0636dc8/pkg/querier/store_gateway_client.go#L45

WDYT @pstibrany ?

dimitarvdimitrov avatar Oct 10 '22 10:10 dimitarvdimitrov

We can use DialContext with a context with an expiration and also use WithBlock() here:

This sounds like a good idea, but it will require some changes to the client pool. We already get a similar benefit from using a time-bound context for individual calls (if the connection doesn't exist yet, it is created first, during the call to e.g. Series or other methods on the StoreGatewayClient interface).

pstibrany avatar Oct 10 '22 11:10 pstibrany

This sounds like a good idea, but will require some changes to the client pool.

The client factory is within this package, so there's no need to change dskit if that's what you meant. I pushed a change in https://github.com/grafana/mimir/pull/3168/commits/9847f9da2bbe0c2124f19800b1d151f01793ce64

dimitarvdimitrov avatar Oct 10 '22 11:10 dimitarvdimitrov

Previously the blocksStoreReplicationSet.GetClientsFor function would return fast, returning a map of clients to ULIDs. But not all store-gateways will necessarily be called by the caller of this function -- store-gateways are queried randomly, so a user may be lucky and not hit the "bad" store-gateway.

Now blocksStoreReplicationSet.GetClientsFor will try to establish connections to the store-gateways first. By doing that, all users will suffer if a connection to a store-gateway cannot be established in time, while previously some users might have been lucky and still gotten a response.

pstibrany avatar Oct 11 '22 06:10 pstibrany

Previously the blocksStoreReplicationSet.GetClientsFor function would return fast, returning a map of clients to ULIDs. But not all store-gateways will necessarily be called by the caller of this function -- store-gateways are queried randomly, so a user may be lucky and not hit the "bad" store-gateway.

Now blocksStoreReplicationSet.GetClientsFor will try to establish connections to the store-gateways first. By doing that, all users will suffer if a connection to a store-gateway cannot be established in time, while previously some users might have been lucky and still gotten a response.

This sounds like we sacrifice the redundancy of store-gateways.

So we added WithBlock and DialContext with a timeout context so that we don't try to reestablish new connections all the time after the health check fails. How about we continue using DialContext but remove WithBlock? This way we will effectively retry establishing a connection to a failed store-gateway every $timeout seconds[^1] until the store-gateway also fails its ring healthcheck.

[^1]: Assuming a continuous flow of queries

dimitarvdimitrov avatar Oct 11 '22 10:10 dimitarvdimitrov

How about we continue using DialContext but remove WithBlock?

This combination doesn't have the desired effect. From the DialContext documentation:

// In the non-blocking case, the ctx does not act against the connection. It
// only controls the setup steps.

We need to configure the timeout on the context that will be used to create the connection, which for non-blocking clients is the first call to the client's methods.

pstibrany avatar Oct 11 '22 10:10 pstibrany

I couldn't find any other way to bound the connection time with the gRPC libs.

We can override the behaviour on storeGatewayClient to take care of dialing with WithBlock and a context on the first call to the remote, instead of in dialStoreGatewayClient.


I'm trying to think about what we're trying to solve with these dial timeouts. I went back to your previous comment @pstibrany

I don't quite see how closing connections faster will change querier's logic for finding which store-gateways to query.

It will fail requests to the store-gateways faster, which means that they will be retried faster - the querier excludes store-gateways it has already tried from retries. So instead of waiting 30s and then trying again, it will wait 5s and then try again, and 3 tries take a total of 15s, not 1m30s (on average).

If all of this is correct, then why do we need to bound the dial time? The client pool closes and removes a connection after it exceeds its healthcheck timeout. It will do the same for a not-yet-initialized connection, which will also break any outstanding calls using the connection (like a hung Series call).

dimitarvdimitrov avatar Oct 11 '22 12:10 dimitarvdimitrov

@pstibrany @pracucci can you take another look please?

dimitarvdimitrov avatar Oct 28 '22 08:10 dimitarvdimitrov

@pstibrany, @pracucci can you take another look? I changed the PR to only have the gRPC client timeouts, as we discussed IRL.

We also spoke about reducing the heartbeat timeout in the ring. I will try this out internally first before changing the defaults.

dimitarvdimitrov avatar Nov 21 '22 09:11 dimitarvdimitrov

Done, also added a changelog entry

dimitarvdimitrov avatar Nov 21 '22 11:11 dimitarvdimitrov