
Blocks storage unable to ingest samples older than 1h after an outage

Open pracucci opened this issue 4 years ago • 54 comments

TSDB doesn't allow appending samples whose timestamp is older than the last block cut from the head. Given that a block is cut from the head up to roughly the head's max timestamp minus 50% of the block range, and given the default block range period is 2h, the blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.
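To make the arithmetic concrete, here's a tiny Go sketch of that bound with the default 2h block range (illustrative numbers and names only, not the actual Prometheus code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	blockRange := 2 * time.Hour
	headMaxT := time.Date(2020, 3, 31, 9, 0, 0, 0, time.UTC) // newest sample currently in the head

	// Samples more than blockRange/2 (1h) older than the head's newest
	// timestamp are rejected as "out of bounds".
	minValidT := headMaxT.Add(-blockRange / 2)

	// A Prometheus server that fell 90 minutes behind tries to catch up.
	lagging := headMaxT.Add(-90 * time.Minute)
	fmt.Println("sample accepted:", !lagging.Before(minValidT)) // prints: sample accepted: false
}
```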

Let's consider this scenario:

  • Multiple Prometheus servers remote writing to the same Cortex tenant
  • Some Prometheus servers stop remote writing to Cortex (for any reason, e.g. a networking issue) and fall behind by more than 1h
  • When those Prometheus servers come back online, Cortex discards any sample whose timestamp is older than 1h, because the max timestamp in the TSDB head is close to "now" (due to the healthy Prometheus servers that never stopped writing series) while the failing ones are trying to catch up by writing samples older than 1h

We recently had an outage in our staging environment which triggered this condition and we should find a way to solve it.

@bwplotka You may be interested, given I think this issue affects Thanos receive too.

pracucci avatar Mar 31 '20 09:03 pracucci

@codesome Given we have vertical compaction in TSDB and we can have overlapping blocks, what would be the implications of allowing "out of bounds" samples to be written into the TSDB head?

pracucci avatar Mar 31 '20 09:03 pracucci

Noting other possible options (not necessarily good options):

  • extend block range (would lead to higher ingester memory usage, and longer WAL replays)
  • ~change the cutting formula – I was trying to refer to 50% limit, but it's not related to cutting.~

pstibrany avatar Mar 31 '20 09:03 pstibrany

blocks storage doesn't allow to append a sample whose timestamp is older than 1h compared to the most recent timestamp in the head.

Really? Can we find the code path in Prometheus which does it?

bwplotka avatar Mar 31 '20 11:03 bwplotka

Also, clock skew can cause it.

bwplotka avatar Mar 31 '20 11:03 bwplotka

blocks storage doesn't allow to append a sample whose timestamp is older than 1h compared to the most recent timestamp in the head.

Really? Can we find the code path in Prometheus which does it?

A new block is cut from the head when the head (in-memory data) covers more than 1.5x the block range. For a 2h block range, this means the head needs to have 3h of data before a block is cut. The block "start time" is always the minimum sample time in the head, while the block "end time" is aligned to a block range boundary. A new block always covers a single "block range" period. Data stored into the block is then removed from the head. That means that after cutting the block, the head will still have at least 1h of data, and possibly more.

When writing new data via the appender, the minimum time limit is computed as Max(minT, maxT − 0.5 × block range), where minT and maxT are the minimum/maximum sample times in the head. The limit is then enforced in the Add and AddFast methods.

When the head covers less than half the block range (<1h for a 2h block range), samples cannot be older than the min time in the head. When the head covers more than half the block range (>1h for a 2h block range), samples cannot be older than half a block range before the max time in the head.
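As a rough sketch of that formula (illustrative names, not the exact Prometheus identifiers):

```go
// minValidTime mirrors Max(minT, maxT - 0.5*blockRange): when the head spans
// less than half a block range, samples may not be older than the head's min
// time; otherwise they may not be older than half a block range behind the
// head's max time. Timestamps are milliseconds, as in TSDB.
func minValidTime(headMinT, headMaxT, blockRange int64) int64 {
	if limit := headMaxT - blockRange/2; limit > headMinT {
		return limit
	}
	return headMinT
}
```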

pstibrany avatar Mar 31 '20 11:03 pstibrany

Thanks for the explanation. I think we are getting into the backfilling area.

I think opening a side block and using vertical compaction would solve it.

bwplotka avatar Mar 31 '20 12:03 bwplotka

We could start a Prometheus discussion for it as well, to enable that behavior when vertical compaction is enabled for TSDB itself.

bwplotka avatar Mar 31 '20 12:03 bwplotka

I think we are getting into the backfilling area.

It depends. How far back is considered backfilling? From the Cortex (or Thanos receive) perspective, I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

pracucci avatar Mar 31 '20 12:03 pracucci

Maybe it means that remote write is not good enough. In this scenario you only have up to 2 hours while that WAL is still around, at which point it would be cut into a block. Maybe we need to discuss back-filling with blocks as a remote write extension. Just thinking out loud. (I think I briefly discussed this with @csmarchbanks once before)

brancz avatar Mar 31 '20 12:03 brancz

I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

Where is the boundary?

bwplotka avatar Mar 31 '20 12:03 bwplotka

I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

Where is the boundary?

From the pure UX side, if it's about a Prometheus server catching up after an outage then I wouldn't consider it backfilling.

pracucci avatar Mar 31 '20 12:03 pracucci

I guess the boundary is then 1.5x block size (once we exceed WAL (3h))?

bwplotka avatar Mar 31 '20 12:03 bwplotka

Actually the user can change that, so we could even have a 2-year WAL. If someone doesn't upload anything for 2y and suddenly wants to push a 2y-old sample, would that still not be backfilling :thinking:?

bwplotka avatar Mar 31 '20 12:03 bwplotka

Sounds like concretely for this, we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic.

brancz avatar Mar 31 '20 12:03 brancz

Sounds like concretely for this, we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic.

What do you mean? When would you cut then? You don't know upfront what writes you expect to see but don't see, no?

bwplotka avatar Mar 31 '20 13:03 bwplotka

The heuristic of allowing inserts up to 0.5x the timespan of the head block is based on the assumption that we can safely and correctly cut blocks at that time; I'm wondering what other strategies there might be. Clearly other databases do different things, and time-based things are actually kind of weird in the first place. What I'm trying to say is, if we remove that requirement, then we might be able to think of ways to improve this situation (potentially combined with vertical compaction?).

brancz avatar Mar 31 '20 16:03 brancz

Maybe we need to discuss back-filling with blocks as a remote write extension

I think I have some code sitting around somewhere that does this (I was using it to populate datasets from Prometheus into various backends that supported remote write). If there is interest I'd be happy to dig it up again.

we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic

Yes, that would be great. There were some ideas around this when we were discussing how to limit Prometheus memory usage, weren't there? I remember at least something around a space-based head block.

csmarchbanks avatar Mar 31 '20 19:03 csmarchbanks

Catching up with emails now :) looks like I missed some discussions

@codesome Given we have vertical compaction in TSDB and we can have overlapping blocks, what would be the implications to allow to write samples "out of bounds" in the TSDB head?

While "out of bound" in TSDB would work fine, it needs some more discussion if it has to be upstream. Also, talking w.r.t. cortex, you will have an unexpected rise in memory consumption because Head block gets bigger than expected. (Additionally, vertical queries and compactions are a tad bit more expensive in terms of CPU and Memory)

I think opening a side block and use vertical compaction would solve it.

Is this idea for upstream Prometheus or Thanos/Cortex? But anyway, do we have any requirement that data is available for querying soon after ingesting?

extend block range (would lead to higher ingester memory usage, and longer WAL replays)

With the m-mapping work that is going on, the memory usage can be taken care of. And if this partial chunks work looks good to maintainers (a follow-up of the m-map work), that would also take care of WAL replays :). This would mean Cortex could increase its block range, but the default in upstream Prometheus would need to be changed too so that the WAL is kept around longer.

codesome avatar Apr 06 '20 06:04 codesome

I would try as much as possible to avoid adding samples older than the Head minT and bringing vertical compaction into play in upstream Prometheus, because (1) the code is already complex enough (is that a valid argument? :P), (2) if not used correctly, users will silently lose/corrupt data, and (3) unexpected spikes in CPU/memory (maybe these should be expected?).

If this could be an optional flag (just like for overlapping data), we can forget about points 2 and 3.

codesome avatar Apr 06 '20 06:04 codesome

Also, with respect to Cortex, you will see an unexpected rise in memory consumption because the Head block gets bigger than expected.

Is this true even after the m-mapping work is complete?

I think opening a side block and using vertical compaction would solve it. But anyway, do we have any requirement that data is available for querying soon after ingesting?

Yes, we should be able to immediately query back samples pushed to the ingesters, like it works for the chunks storage (this issue affects only the blocks storage).

If this could be an optional flag (just like for overlapping data), we can forget about points 2 and 3.

I was thinking about an optional flag (like overlapping data).

pracucci avatar Apr 06 '20 07:04 pracucci

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 05 '20 07:06 stale[bot]

This is still valid. @codesome has some ideas he wants to experiment with.

pracucci avatar Jun 05 '20 08:06 pracucci

Can you share those ideas Ganesh?

bwplotka avatar Jun 05 '20 09:06 bwplotka

Can you share those ideas Ganesh?

It was one of my TODOs for today :) I will share with you all once I have written it down.

codesome avatar Jun 05 '20 09:06 codesome

This is an idea off the top of my head and needs more thought.

The root problem

These 2 checks, this and this

Solution

Enable vertical compaction and the vertical querier, and make that check optional - upstream will keep the check enabled, whereas Cortex and Thanos can turn it off.

Why does this work?

Because we are not discarding out-of-bounds samples; the sample just gets added to its series. Out-of-order samples within a series are still discarded.

If a series lagging behind in time causes overlap with data on disk, the vertical querier will take care of deduping.

After compaction of head, the vertical blocks are given top priority and they will get merged.

Any gotchas? Yes

  • Head compaction is based on the minT and maxT of the Head. One could generate synthetic data adding just 2 samples per series spanning 3h and cause lots of unnecessary Head compactions. This can be taken care of by manually calling Compact() with some logic around it, instead of calling it very often.

  • It is possible to add samples spanning a very large time range - this can end up compacting all blocks into one. One possible solution: instead of removing the out-of-bounds check entirely, make the minValidTime configurable so that you can decide how far back in time you want to accept samples (a rough sketch follows below).
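To make the configurable check concrete, a rough sketch of what it could look like (the parameter name and shape are made up for illustration, not an existing TSDB option):

```go
// Sketch only: replace the hard-coded maxT - blockRange/2 bound with a
// configurable window. A zero window keeps today's behaviour; a larger
// window lets lagging writers catch up further back in time.
func configurableMinValidTime(headMaxT, blockRange, outOfBoundsWindow int64) int64 {
	if outOfBoundsWindow > 0 {
		return headMaxT - outOfBoundsWindow
	}
	return headMaxT - blockRange/2
}
```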

codesome avatar Jun 05 '20 14:06 codesome

Thanks @codesome for sharing your idea!

As you stated, a downside of this approach is that you may end up with a single large block, which will impact the compaction. This is a solvable problem: we could replace the TSDB compactor planner with a custom one (i.e. we could exclude those not-aligned blocks from compaction, or compact them together without impacting correctly-aligned blocks).

The longer the time range of such blocks, the more problematic they are at query time, so, as you stated, we may add a limit on the oldest timestamp we allow to ingest (now - threshold). This would practically make this system unsuitable for backfilling purposes (because the threshold would be in terms of hours, not days/months), but it may solve the problem described in this issue.

After compaction of head, the vertical blocks are given top priority and they will get merged.

Ideally we don't want any vertical compaction to occur in the ingesters. Vertical compaction will be done by the Cortex compactor later on. What do you think?

pracucci avatar Jun 08 '20 06:06 pracucci

Sorry, I am missing what the solution is here...

Enable vertical compaction and the vertical querier, and make that check optional - upstream will keep the check enabled, whereas Cortex and Thanos can turn it off.

Can we elaborate?

Is this essentially what we proposed for backfilling? Start another TSDB for each out-of-band request (and keep it for some time until cut)? That would be quite neat. I don't get how it can produce large blocks - it's exactly the same as you would have in total with all the out-of-band data included :thinking:

bwplotka avatar Jun 08 '20 07:06 bwplotka

Can we elaborate?

Here the "check" is checking of min valid time for the samples - where we discard samples before 1h. When I say make that check optional, we remove that particular check in Cortex/Thanos and allow any timestamp as long as it is not out-of-order within the same series. Later in the comment, I suggested having configurable min valid time instead of removing the check completely.

So it is not the same as what was proposed for backfilling. In my solution there is no new TSDB running on the side; it's the main TSDB which will allow samples back in time. Now, with this in mind, you can read my above comment https://github.com/cortexproject/cortex/issues/2366#issuecomment-639505032 again and hopefully it will be clear this time :)

This would practically make this system unsuitable for backfilling purposes

We could have another route for backfilling, instead of going via the main ingest path. That would be the thing Bartek is mentioning above - start a TSDB on the side specially to take care of backfilling. My solution only addresses ingesting old samples which might be lagging because of an outage at Cortex or because Prometheus remote write fell behind.

I don't get how it can produce large blocks

In the backfilling work happening in Prometheus, we are making sure that the new blocks don't cause an overlap between 2 blocks. An example reference is here for the new blocks created. But in the case of the proposed solution, if we accept any timestamp, the head block could have a minT which is less than the minT of the oldest block. Hence, after head compaction, the block will overlap all the blocks on disk and cause a vertical compaction with all of them, ending up with a single big block.

codesome avatar Jun 10 '20 06:06 codesome

Starting another TSDB on the side for non-backfilling purposes is a lot of complexity for the normal ingestion path. So I suggest we have a separate API for backfilling which will explicitly start another TSDB and merge it with the main TSDB later. And for lagging samples in the usual ingestion path, I suggest the above solution. WDYT?

codesome avatar Jun 10 '20 06:06 codesome

I agree we should experiment with @codesome's proposal. I also agree that opening parallel TSDBs adds complexity which I would like to avoid if possible. I think we're fine being able to ingest samples only up to X hours old for this use case; if the solution doesn't come with excessive complexity, then it would significantly relax the issue with low effort.

As I mentioned previously, we'll probably have to make some changes to the compactor planner to get these larger blocks compacted together (i.e. if a block time range is over the max compaction time range), but it's something we've already experimented with and we know it's doable.
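For what it's worth, the planner change could be as simple as filtering out oversized blocks before planning a compaction; the sketch below is only illustrative and doesn't use the real TSDB planner interfaces:

```go
// blockMeta is a stand-in for the block metadata the planner inspects.
type blockMeta struct {
	MinTime, MaxTime int64 // block time range in milliseconds
}

// excludeOversized drops blocks whose time range already exceeds the largest
// configured compaction range, so a single wide overlapping block can't drag
// every other block into one huge compaction.
func excludeOversized(metas []blockMeta, maxCompactionRange int64) []blockMeta {
	var out []blockMeta
	for _, m := range metas {
		if m.MaxTime-m.MinTime <= maxCompactionRange {
			out = append(out, m)
		}
	}
	return out
}
```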

pracucci avatar Jun 10 '20 06:06 pracucci