
receive: Support backfilling data through remote write (in case the origin got disconnected)

Open sepich opened this issue 5 years ago • 20 comments

Thanos, Prometheus and Golang version used: v0.10.1

Object Storage Provider: AWS S3

What happened: We use thanos-receive to get data from multiple locations. Due to https://github.com/thanos-io/thanos/issues/1624 we have to use a short tsdb.block-duration=15m. So, if any issue happens and thanos-receive is down for >15m, the remote prometheuses accumulate data. When connectivity to thanos-receive is reestablished, the prometheuses start to send the delayed data - which is great. But thanos-receive won't accept it, as the timestamps fall into an already-closed chunk in the past. So, we're getting holes in graphs.

What you expected to happen: Thanos-receive is able to upload old data to S3.

Note that we cannot do both - upload to S3 from the remote locations and also send via remote-write - because thanos-compact cannot deduplicate such blocks, which leads to sum() returning 2x the real value, etc.

These remote prometheuses run inside k8s on emptyDir. Switching to the plain thanos-sidecar (S3 upload) model would again lead to holes in graphs when a prometheus pod is restarted, because the sidecar only uploads completed blocks and we lose the WAL in emptyDir on restart. Another option is to use a PVC, but that would bind prometheus to a single AZ, since EBS volumes are a per-AZ entity - and a single AZ could go down.

That's why the remote-write model looks so compelling for us.

How to reproduce it (as minimally and precisely as possible): Shut down a working thanos-receive for >2h, then turn it back on. Observe that the prometheuses send the stale data, but thanos-receive rejects it.

sepich avatar Feb 10 '20 13:02 sepich

What do you mean exactly by:

What you expected to happen: Thanos-receive is able to upload old data to S3.

Thanos receive is already able to upload old data to S3. Each instance of Thanos receive in the hashring should use a different replica label. The Thanos querier will correctly deduplicate this data when querying, even when using aggregations like sum, so no queries should be incorrect.
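
A minimal sketch of that setup (the replica values here are illustrative; check the flag spelling against --help for your version):

# each receiver in the hashring carries its own replica label
thanos receive --label='replica="receive-0"' ...
thanos receive --label='replica="receive-1"' ...

# the querier deduplicates on that label at query time
thanos query --query.replica-label=replica ...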

This sounds like we are trying to solve a symptom of the actual issue in #1624. I would prefer to focus on fixing that, so that everyone can use block durations of 2h and avoid anything like what you are seeing. What do you think?

squat avatar Feb 10 '20 13:02 squat

What do you mean exactly by: Thanos receive is already able to upload old data to S3

Sorry, I meant errors like these, seen after thanos-receive has been down for some time and then comes back up:

level=debug ts=2020-02-06T14:51:27.634126008Z caller=writer.go:69 component=receive component=receive-writer msg="Out of bounds metric" lset="{__name__=\"container_processes\", instance=...}" sample="timestamp:1581000525667 "
level=warn ts=2020-02-06T14:51:27.635264235Z caller=writer.go:83 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=2

sepich avatar Feb 10 '20 13:02 sepich

aha, ok so you mean something like:

What you expected to happen: Thanos-receive is able to continue ingesting data even after closing and uploading a block without seeing out of bounds errors.

I still think that this is really addressing a symptom rather than the root of the issue. I am not against investigating this avenue but I think it will be more valuable and solve more people's issues if we go to the root and prevent receive from OOMing while replaying the WAL. This way you can have longer block durations and as a result see fewer out of bounds errors.

squat avatar Feb 10 '20 14:02 squat

I think there is space for something like this, in some shape or form, in the Prometheus remote write protocol. Essentially, even if the backend was down for >2h, the protocol itself should be able to backfill data that has already been flushed from the head block/WAL to a block. I probed some other Prometheus team members and it did seem like there were other parties interested in such a thing.

brancz avatar Feb 10 '20 14:02 brancz

I still think that this is really addressing a symptom rather than the root of the issue.

Unfortunately, even with the default tsdb.block-duration=2h on the thanos-receive side, the remote-write setup is unable to recover after being down for >2h. Logs on the prometheus v2.15.2 side:

ts=2020-02-14T08:38:06.999Z caller=dedupe.go:111 component=remote level=error remote_name=f9a55a url=https://xxx/api/v1/receive msg="non-recoverable error" count=100 err="server returned HTTP status 409 Conflict: 3 errors: failed to non-fast add 28 samples: out of bounds: conflict; 409; 409"
ts=2020-02-14T08:38:11.921Z caller=dedupe.go:111 component=remote level=warn remote_name=f9a55a url=https://xxx/api/v1/receive msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1581657554 minSendTimestamp=1581669481

And lastSendTimestamp stays the same in every log event; it is not growing. On the thanos-receive side (3 hashring endpoints with replica=1):

level=warn ts=2020-02-14T08:42:27.873428299Z caller=writer.go:83 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=35
level=error ts=2020-02-14T08:42:27.874889268Z caller=handler.go:347 component=receive component=receive-handler msg="storing locally" err="failed to non-fast add 35 samples: out of bounds: conflict" endpoint=http://thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive
level=error ts=2020-02-14T08:42:27.884528922Z caller=handler.go:395 component=receive component=receive-handler msg="forwarding returned non-200 status" err=409 endpoint=http://thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive
level=error ts=2020-02-14T08:42:27.887348733Z caller=handler.go:395 component=receive component=receive-handler msg="forwarding returned non-200 status" err=409 endpoint=http://thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive

The thanos_receive_forward_requests_total metric dropped after receive was down. It seems the prometheuses keep sending the same old data over and over and are unable to move on to new data. I have to manually restart all prometheuses, deleting the /wal/* folder contents, to recover from this.

sepich avatar Feb 14 '20 08:02 sepich

There are some caveats:

  1. The receive component will only ingest data up to 1 hour old.
  2. Prometheus will stop retrying to send data once it is no longer available in the WAL, and from then on will only attempt to send data newly appearing in the WAL; so data older than ~2h cannot be ingested/backfilled at all, and in practice only data within the last 1 hour can be.

brancz avatar Feb 14 '20 10:02 brancz

Thank you for such important information! I believe this should be mentioned in the docs. I can confirm that setting --storage.tsdb.min-block-duration=15m --storage.tsdb.max-block-duration=15m on the prometheus side makes the data acceptable to thanos-receive (full example below).

  1. Is there a reason why it is 1h, and not 2h or more for the default 2h prometheus WAL?
  2. Maybe I misunderstand your sentence - but once the issue starts happening, it does not fix itself. Every 2h I do see in the prometheus logs that the WAL is rotated and a block is written, but lastSendTimestamp stays the same and the data has not been accepted by thanos-receive for 6h.
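
For reference, roughly what the prometheus arguments look like with that change (a sketch - the config path, storage path and remaining flags are placeholders for our actual setup):

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.min-block-duration=15m \
  --storage.tsdb.max-block-duration=15m \
  ...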

sepich avatar Feb 14 '20 12:02 sepich

  1. The heuristic is 0.5x of the smallest block produced, and since the default smallest block is 2h, the heuristic works out to 1h. Further explained: this is because the WAL accumulates 1.5x the smallest block's worth of data, and since it stops accepting writes older than 0.5x, it can safely flush 1.0x of the head block to disk without having to worry about appends from earlier than that (see the worked example below).
  2. Your logs say the receive component returned status code 409. On 4xx errors the Prometheus remote write client drops the request it is trying to make and moves on to the next one. Could you enable debug logs on Prometheus?
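
As a rough worked example of that heuristic (assuming it is driven by the block duration configured on the receiver itself): with the default 2h blocks, the receiver accepts samples up to 2h * 0.5 = 1h old; with the 15m blocks from the original report, the acceptance window shrinks to roughly 7.5 minutes, which is why even a short outage produces out-of-bounds errors there.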

brancz avatar Feb 14 '20 18:02 brancz

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale[bot] avatar Mar 15 '20 19:03 stale[bot]

Wish this wasn't stale - this would be useful to have. Prometheus attempts to write the samples, so shouldn't Receive accept them?

Kampe avatar Oct 15 '20 17:10 Kampe

Still valid, and a big chunk of work to make it happen.

One idea already presented in the past to the Prometheus and Cortex teams: start a second, separate WAL just for older series.

bwplotka avatar Jun 04 '21 08:06 bwplotka

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Aug 04 '21 22:08 stale[bot]

Love to see the stale tag removed <3

Kampe avatar Aug 24 '21 20:08 Kampe

Still valid!

bwplotka avatar Sep 01 '21 16:09 bwplotka

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Jan 09 '22 11:01 stale[bot]

Valid

yeya24 avatar Jan 10 '22 08:01 yeya24

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Apr 17 '22 04:04 stale[bot]

Seems there's work being done upstream in Prometheus to support this: https://github.com/grafana/mimir-prometheus/compare/out-of-order

Kampe avatar Apr 18 '22 23:04 Kampe

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Sep 21 '22 02:09 stale[bot]

https://github.com/prometheus/prometheus/pull/11075 - this has finally been merged into Prometheus, so we could probably give it a try.

yeya24 avatar Sep 21 '22 18:09 yeya24

Since the OOO feature has been merged into main, I will close this issue. Feel free to reopen if you think it is not addressed.
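
For anyone landing here later, a minimal sketch of turning this on for the receive path - the flag name and minimum supported version are an assumption here, so please verify against thanos receive --help for your release:

# illustrative value: accept remote-write samples up to ~30m older than the newest ingested one
# (assumed flag name; receive embeds the Prometheus TSDB, which gained this feature in the PR above)
thanos receive \
  --tsdb.out-of-order.time-window=30m \
  ...

Samples older than the configured window are still rejected, so this covers late-arriving data within that window rather than full historical backfill.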

yeya24 avatar Nov 13 '22 03:11 yeya24