
receive: Hashring Update Improvements

squat opened this issue 4 years ago • 17 comments

Currently, any change to the hashring configuration file triggers all Thanos Receive nodes to flush their multi-TSDBs, causing them to enter an unready state until the flush is complete. This unavailability during a flush allows for a clear state transition; however, it can result in downtime on the order of five minutes for every configuration change. Moreover, during configuration changes, the hashring goes through an even longer period of partial unreadiness, where some nodes begin and finish flushing earlier than others. During this partial unreadiness, the hashring can expect high internal request failure rates, which cause clients to retry their requests, resulting in even higher load. Therefore, when the hashring configuration is changed due to automatic horizontal scaling of a set of Thanos Receivers, the system can expect higher-than-normal resource utilization, which can create a positive feedback loop that continuously scales the hashring.

We propose modifying how the Thanos Receive component re-configures itself after the hashring configuration file has changed so that the system experiences no downtime. Our plan is for Thanos Receive to create a new multi-TSDB instance to replace the multi-TSDB instance it is using to ingest data. Once the swap has been completed in a concurrency-safe manner, the old multi-TSDB can be flushed. This live swap has the benefit of eliminating the unready state that would otherwise have occurred due to the configuration change. Furthermore, any partial unreadiness in the hashring will be shortened and limited exclusively to the window in which some nodes have loaded the new configuration before others. The duration of this configuration discrepancy can be further reduced in cloud native environments by using sidecars that watch an API for configuration updates and apply them to disk as soon as a change is identified.
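To make this concrete, here is a minimal sketch of the swap in Go, assuming hypothetical names (`receiver`, `multiTSDB`) rather than the actual Thanos internals:

```go
// Minimal sketch only: receiver and multiTSDB are stand-ins, not the
// actual Thanos types.
package main

import "sync"

type multiTSDB struct{} // stands in for the per-tenant TSDB set

func (m *multiTSDB) Flush() error { return nil } // placeholder
func (m *multiTSDB) Close() error { return nil } // placeholder

type receiver struct {
	mtx    sync.RWMutex
	active *multiTSDB // instance currently used for ingestion
}

// onHashringChange swaps in a fresh multi-TSDB and flushes the old one in
// the background, so the node never reports unready during the flush.
func (r *receiver) onHashringChange() {
	fresh := &multiTSDB{}

	r.mtx.Lock()
	old := r.active
	r.active = fresh // ingestion continues against the new instance
	r.mtx.Unlock()

	go func() {
		// Lock() above waited for in-flight appends, so the old instance
		// is quiescent and can be flushed without affecting readiness.
		_ = old.Flush()
		_ = old.Close()
	}()
}

// write resolves the active instance under a read lock, so a request can
// never observe a half-completed swap.
func (r *receiver) write() {
	r.mtx.RLock()
	defer r.mtx.RUnlock()
	_ = r.active // append incoming samples to r.active here
}

func main() {
	r := &receiver{active: &multiTSDB{}}
	r.write()
	r.onHashringChange()
}
```

The key property is that the flush happens entirely off the ingestion path, so readiness never flips during a configuration change.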

A major benefit of avoiding unreadiness during the application of configuration changes is that the generation of the configuration itself can now safely be based upon the readiness of the individual nodes in the hashring without causing a feedback loop. This means that as a hashring is incrementally scaled up, only nodes that are finished starting up will be considered for membership in the hashring, avoiding black holes in the internal request forwarding logic.
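For instance, a config generator could include only nodes that pass their readiness probe. The following is a hedged sketch: the hostnames, ports, and output path are made up (though `/-/ready` and the JSON shape follow Thanos conventions), and a real deployment would more likely watch pod readiness via the Kubernetes API:

```go
// Hypothetical generator that rebuilds the hashring file from ready nodes.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// hashringConfig mirrors the JSON shape of Thanos Receive's hashring file.
type hashringConfig struct {
	Hashring  string   `json:"hashring"`
	Endpoints []string `json:"endpoints"`
}

// isReady probes a node's HTTP readiness endpoint.
func isReady(hostport string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s/-/ready", hostport))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Candidate receiver hostnames; made up for illustration. By Thanos
	// convention the HTTP (readiness) port is 10902 and the remote-write
	// gRPC port is 10901, but both are configurable.
	hosts := []string{"receive-0", "receive-1", "receive-2"}

	ready := []string{}
	for _, h := range hosts {
		if isReady(h + ":10902") {
			ready = append(ready, h+":10901") // gRPC endpoint joins the hashring
		}
	}

	// Only nodes that finished starting up become hashring members,
	// avoiding black holes in request forwarding.
	cfg := []hashringConfig{{Hashring: "default", Endpoints: ready}}
	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("hashrings.json", out, 0o644); err != nil {
		panic(err)
	}
}
```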

A downside of this multi-multi-TSDB approach is that the resource utilization of the Receive is now dependent on the frequency with which the configuration is changed, as frequent updates to the configuration would mean many multi-TSDB instances are open concurrently. This is likely a safe trade-off, given that short-lived multi-TSDB instances will hold very little data in memory and will require relatively few resources to flush and close.

cc @thanos-io/thanos-maintainers cc @brancz

squat avatar Sep 08 '20 22:09 squat

Hi! Can I take this up as part of my Community Bridge program?

jaybatra26 avatar Sep 09 '20 15:09 jaybatra26

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Nov 20 '20 16:11 stale[bot]


Still needed.

jmichalek132 avatar Jan 20 '21 09:01 jmichalek132


Our plan is for Thanos Receive to create a new multi-TSDB instance to replace the multi-TSDB instance it is using to ingest data. Once the swap has been completed in a concurrency-safe manner, the old multi-TSDB can be flushed.

Could you elaborate on why we need to swap the TSDB before we can flush it?

yashrsharma44 avatar May 31 '21 19:05 yashrsharma44

Could you elaborate on why we need to swap the TSDB before we can flush it?

When we are flushing a TSDB instance, it can't ingest any new samples. This means that while we are flushing the TSDB, the Receiver becomes unready. To avoid this, we can start a new multi-TSDB and switch ingestion to it, while we flush the old multi-TSDB in the background.

onprem avatar Jun 01 '21 04:06 onprem

When we are flushing a TSDB instance, it can't ingest any new samples. This means that while we are flushing the TSDB, the Receiver becomes unready. To avoid this, we can start a new multi-TSDB and switch ingestion to it, while we flush the old multi-TSDB in the background.

So effectively we are switching to a new multi-TSDB rather than swapping data; the original statement was a little misleading.

yashrsharma44 avatar Jun 01 '21 04:06 yashrsharma44

Let's be careful with our words here: I don't think there is anything "misleading" in the text, as that implies negative intent.

"Our plan is for Thanos Receive to create a new multi-TSDB instance to replace the multi-TSDB instance it is using to ingest data."

To me, this says exactly what you paraphrased from Prem. It never mentions swapping data, only swapping, i.e. replacing, TSDBs.

Maybe it was unclear to you? Or perhaps the word "swap" is confusing because of its use in memory management? Could you share which part of the text in your mind suggests copying data?

squat avatar Jun 01 '21 07:06 squat

Sure, I didn't mean that the statement was "misleading", more that it was "unclear"; I should have used the right adjective 😅.

Regarding the swap, I got confused and thought we were swapping the data of the old TSDB into the new TSDB. In particular, this statement -

Once the swap has been completed in a concurrency-safe manner,

suggests that we might be moving data or switching to a new TSDB, and it isn't clear which, hence the confusion 😛

yashrsharma44 avatar Jun 01 '21 07:06 yashrsharma44

Our plan is for Thanos Receive to create a new multi-TSDB instance to replace the multi-TSDB instance it is using to ingest data.

Regarding the newTSDB, how are we planning to switch to it in a concurrency-safe manner? Should we proceed as follows?

  1. Get a reference to the oldMultiTSDB and start flushing it using that reference.
  2. Create a newMultiTSDB and store its reference in place of the old one.
  3. We might need an RWLock while we perform step 2.

Ideas? A rough sketch of one option is below.
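For illustration, one way to keep this concurrency-safe is to publish the new instance first and only flush the old one afterwards, e.g. with an atomic pointer instead of the RWLock from step 3 (all names hypothetical, not Thanos code):

```go
// Hypothetical names throughout; illustrative, not Thanos code.
package main

import "sync/atomic"

type multiTSDB struct{}

func (m *multiTSDB) Flush() {} // placeholder

// active is the multi-TSDB that ingestion currently targets.
var active atomic.Pointer[multiTSDB] // atomic.Pointer requires Go 1.19+

// reconfigure publishes a fresh instance first, then flushes the old one.
func reconfigure() {
	old := active.Swap(&multiTSDB{}) // steps 1+2 as one atomic swap
	if old != nil {
		// Note: unlike an RWLock, the swap does not wait for in-flight
		// appends, so a real implementation would drain them first
		// (e.g. with a per-instance sync.WaitGroup) before flushing.
		go old.Flush()
	}
}

func write() {
	m := active.Load() // sees either the old or the new instance, never both
	_ = m              // append incoming samples to m here
}

func main() {
	active.Store(&multiTSDB{})
	write()
	reconfigure()
}
```

Either primitive could work; the essential ordering is swap first, flush second.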

yashrsharma44 avatar Jun 03 '21 21:06 yashrsharma44


Closing for now as promised, let us know if you need this to be reopened! 🤗

stale[bot] avatar Oct 30 '21 17:10 stale[bot]
