
store-gateway: reduce latency impact due to index header lazy-loading

Open dimitarvdimitrov opened this issue 2 years ago • 2 comments

Context

The store-gateway can lazily load the index header of a block the first time the querier requests that block. The store-gateway loses these loaded index headers upon restart and also unloads them when they haven't been queried for an idle period (3h by default). Loading one index header can take anywhere from seconds to minutes.

Related to https://github.com/grafana/mimir/issues/4762

Problem

When the store-gateway crashes, is rescheduled on a new node, or is rolled out with a new version, it loses its loaded index headers. This means that subsequent queries for the blocks of these index headers will suffer a latency increase. At the same time, store-gateways in other zones may already have the same index headers loaded.

Proposal

Change the querier to run a pre-request check against the store-gateways to find out which replicas already have the requested blocks loaded.

  • upon executing a query the querier loads the list of all 3 (= replication factor) store-gateways that can serve a block
  • the querier sends a TriggerBlockLoad([]ULID) map[ULID]bool (naming suggestions welcome) RPC to each of the 3 store-gateways concurrently
  • the result of TriggerBlockLoad is a map from blockID to whether the block is already loaded; if a block isn't already loaded the store-gateway starts lazy-loading its index header
  • the querier tries to request each block from a replica which already has it loaded; if there aren't any, then it falls back to a random selection

Alternatives

  • https://github.com/grafana/mimir/issues/4762
  • Disable lazy-loading.

dimitarvdimitrov, Apr 18 '23 10:04

  • the querier sends a TriggerBlockLoad([]ULID) map[ULID]bool (naming suggestions welcome) RPC to each of the 3 store-gateways concurrently

Maybe we could move this up a level to make a call to all store-gateways that will be involved in the query (instead of making a request to 3 of them for each block)?

Naming: I've heard "pre-flight checks" used for something similar in the past.

  • to be less aggressive the store-gateway can enqueue the blocks which we are lazy-loading. This can also be controlled via blocks-storage.bucket-store.meta-sync-concurrency

Did you mean -blocks-storage.bucket-store.index-header.lazy-loading-concurrency ?

56quarters, May 08 '24 13:05

  • the querier sends a TriggerBlockLoad([]ULID) map[ULID]bool (naming suggestions welcome) RPC to each of the 3 store-gateways concurrently

Maybe we could move this up a level to make a call to all store-gateways that will be involved in the query (instead of making a request to 3 of them for each block)?

Today only one store-gateway per block is involved in the query. I think we'd need to involve all 3 replicas for every block in the query. That way we can choose the most ready store-gateway.

  • to be less aggressive the store-gateway can enqueue the blocks which we are lazy-loading. This can also be controlled via blocks-storage.bucket-store.meta-sync-concurrency

Did you mean -blocks-storage.bucket-store.index-header.lazy-loading-concurrency ?

Yes. This has already been done since this issue was opened.

dimitarvdimitrov, May 13 '24 11:05