cortex Alternative to scaling down ingesters

Alternative to scaling down ingesters

Open danielblando opened this issue 6 months ago • 3 comments

Is your feature request related to a problem? Please describe.

Today, scaling down ingesters is a complicated and highly manual process. As described on the website, it requires ensuring that blocks are flushed to storage and that queries are served from the stored data by setting .query-store-after to 0s. This approach does not suit every use case, because in some scenarios we also want to query the ingesters directly to improve request performance.

Describe the solution you'd like

Automating the scale down of ingesters is not a trivial task. It is desirable for ingesters to have a mechanism that allows users to scale them down gradually without missing data.

A proposed solution is to introduce a new state for ingesters called READONLY. In this state, ingesters cannot receive data, meaning all Push requests would fail, but they can still serve queries. Cortex would use the ring operation to filter ingesters by state, so that the distributor (for writes) and the querier and ruler (for reads) each use the appropriate set of ingesters.

Write = NewOp([]InstanceState{ACTIVE})
Read = NewOp([]InstanceState{ACTIVE, PENDING, LEAVING, JOINING, READONLY})
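To illustrate how the two operations above would route traffic, here is a minimal, self-contained sketch. The `InstanceState`, `Operation`, `Instance`, and `Filter` definitions are simplified stand-ins written for this example and are not the actual Cortex ring types; only the `Write` and `Read` state lists come from the proposal.

```go
package main

import "fmt"

// InstanceState is a simplified stand-in for the ring's instance state enum.
type InstanceState int

const (
	ACTIVE InstanceState = iota
	PENDING
	JOINING
	LEAVING
	READONLY // the proposed new state
)

// Operation records which instance states may serve a given ring operation.
type Operation struct {
	allowed map[InstanceState]bool
}

// NewOp builds an Operation from the list of states allowed to serve it.
func NewOp(states []InstanceState) Operation {
	allowed := make(map[InstanceState]bool, len(states))
	for _, s := range states {
		allowed[s] = true
	}
	return Operation{allowed: allowed}
}

// Allows reports whether an instance in state s may serve this operation.
func (op Operation) Allows(s InstanceState) bool { return op.allowed[s] }

var (
	Write = NewOp([]InstanceState{ACTIVE})
	Read  = NewOp([]InstanceState{ACTIVE, PENDING, LEAVING, JOINING, READONLY})
)

// Instance is a hypothetical ring member used only for this example.
type Instance struct {
	ID    string
	State InstanceState
}

// Filter returns the IDs of the instances eligible for op, preserving order.
func Filter(instances []Instance, op Operation) []string {
	var ids []string
	for _, in := range instances {
		if op.Allows(in.State) {
			ids = append(ids, in.ID)
		}
	}
	return ids
}

func main() {
	ring := []Instance{
		{ID: "ingester-0", State: ACTIVE},
		{ID: "ingester-1", State: READONLY},
		{ID: "ingester-2", State: LEAVING},
	}
	fmt.Println("write:", Filter(ring, Write))
	fmt.Println("read: ", Filter(ring, Read))
}
```

In this sketch a READONLY ingester drops out of the write set but stays in the read set, which is the behavior the proposal relies on.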

To enable users to set an ingester to READONLY mode, ingesters would have a new API that allows them to transition to READONLY or ACTIVE. It would be permissible for an ingester to return to ACTIVE mode as a way to cancel a scale down if needed.

a.indexPage.AddLink(SectionDangerous, "/ingester/mode", "Change Ingester mode on ring")

Furthermore, to allow ingesters to be safely removed from the ring, they would also have a new API that lets users know which blocks an ingester has loaded. The idea is that when an ingester has deleted all blocks, it can be stopped.

a.indexPage.AddLink(SectionDangerous, "/ingester/blocks", "List blocks on ingesters")

This approach introduces a new READONLY state for ingesters, enabling a controlled scale down process without data loss. Users can transition ingesters to READONLY mode, preventing new data ingestion while allowing queries on existing data. Once an ingester has deleted all its blocks, it can be safely stopped and removed from the ring.

Describe alternatives you've considered

Using the LEAVING state as READONLY. This was discarded because the LEAVING state already carries several behaviors and assumptions about why the pod is in that state, so overloading it could make the code more confusing.
Not having the /ingester/blocks endpoint and using the .query-store-after configuration to scale down ingesters. While this can still work, it adds complexity for the user as they would need to track the time, ensure the configuration hasn't changed, and account for failures in ingesters pushing blocks to storage.


Aug 07 '24 18:08 danielblando
