
Remote write should best-effort start from a checkpoint instead of time.Now

Open cstyan opened this issue 4 years ago • 9 comments

Right now remote write starts sending samples with timestamps after (roughly) the time.Now() at which Prometheus started. On restart or config reload, if remote write wasn't able to cleanly flush its pending samples, or was behind for some reason, there will be samples missing from the remote storage. If remote write always started from a checkpoint, stored in some kind of checkpoint file keyed by its queue name/config hash, we could avoid that situation.

see: https://github.com/prometheus/prometheus/pull/7710

cstyan avatar May 10 '21 19:05 cstyan

I think the "best-effort" here would be to check whether the last data sent has not yet been compacted away, i.e., it still lies in the (TSDB) checkpoint and segments.

If the last send was from a segment and that segment still exists, resume from the next record using the offset (segment number + offset within that segment); this needs to be stored in some file as the last successful send. Otherwise, start from the beginning of the checkpoint, as we have no means (AFAIK) of knowing where the last send left off. But this can lead to too much duplicate data being sent to remote storage, so I'm not sure we'd be happy with that.

WDYT?

harkishen avatar May 26 '21 08:05 harkishen

I'm actually working on this at the moment, based on Nicole's original branch.

The main thing is that if the checkpoint file is not present, or if the segment mentioned in the checkpoint file no longer exists, we start from time.Now() as we do currently.

cstyan avatar Jun 09 '21 20:06 cstyan

relevant "feature" requirements: https://github.com/prometheus/prometheus/pull/9862#issuecomment-979471947

cstyan avatar Nov 25 '21 21:11 cstyan

Here is an idea of how we could do it: for alerting rules, we already use TSDB to store the "for" state, and after a restart we use those samples to restore it. Note that this metric is not exposed; it is written directly into TSDB. Similarly to how we expose the highest remote-written timestamp, we could regularly ingest something like it into TSDB (the highest timestamp remote-written, or the WAL segment number and offset up to which it was written) and use it during startup to resume remote write from where we left off. With only the highest remote-written timestamp we might still lose some samples (for example, a smaller timestamp appearing later in the WAL).

PS: it's probably not a good idea to remove the highest-timestamp metric from the /metrics output, so if we go this way we would need to ingest the same value under a different name into TSDB.
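The mechanism above (write a progress marker as an ordinary series, read the latest sample back on restart) can be sketched with an in-memory stand-in for TSDB. The series name and all types here are hypothetical; the real implementation would go through the storage appender, as the alerting "for" state does.

```go
package main

import "fmt"

// Hypothetical internal series name, mirroring how ALERTS_FOR_STATE is
// ingested without being exposed on /metrics.
const progressSeries = "prometheus_remote_write_wal_progress"

type sample struct {
	ts int64   // ingestion timestamp
	v  float64 // e.g. highest remote-written timestamp, or an encoded segment/offset
}

// tsdb is an in-memory stand-in for the real storage, for illustration only.
type tsdb struct{ series map[string][]sample }

// recordProgress appends the current remote-write progress as a regular,
// non-exposed sample.
func (db *tsdb) recordProgress(ts int64, v float64) {
	db.series[progressSeries] = append(db.series[progressSeries], sample{ts, v})
}

// restoreProgress returns the last recorded value after a restart; ok=false
// means no marker exists, so remote write falls back to time.Now().
func (db *tsdb) restoreProgress() (float64, bool) {
	s := db.series[progressSeries]
	if len(s) == 0 {
		return 0, false
	}
	return s[len(s)-1].v, true
}

func main() {
	db := &tsdb{series: map[string][]sample{}}
	db.recordProgress(1000, 995)  // highest remote-written ts so far
	db.recordProgress(2000, 1990) // updated on a later scrape interval
	v, ok := db.restoreProgress()
	fmt.Println(ok, v) // true 1990
}
```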

codesome avatar Feb 03 '23 13:02 codesome

@codesome We've been calling this feature "remote write checkpointing", so I'll refer to it as that here. Regardless of the mechanism we use for deciding what to write (I still think either just the segment number, or a bytes/record offset into a segment, is the most accurate), the main source of complexity is going to be signalling from remote write to the checkpointing code that everything up to that offset has been successfully written to the remote endpoint. IMO, right now the only boundaries we can easily delineate on are segment files, and because data coming into remote write is split across all shards, we would need to accept a signal from the Watcher saying "I've sent you all the data for segment X" and then ensure that all the samples buffered up to that point in time have been batched within a shard and successfully sent.

cstyan avatar Feb 03 '23 18:02 cstyan

I would suggest extending this idea to allow a queue to prevent deletion or truncation of the WAL, to avoid data loss; I wasn't aware myself that it was this easy to skip/lose data in remote write. Postgres has a feature called a replication "slot" which, when active, ensures the sender does not delete WAL that has not yet been received: https://www.postgresql.org/docs/16/warm-standby.html#STREAMING-REPLICATION-SLOTS

machine424 avatar Apr 09 '24 08:04 machine424

Remote write has no ownership over the WAL, so that would have to be okayed by the TSDB maintainers. If we were to have something like that, it would need some kind of time or size limit; the WAL should not be allowed to consume disk endlessly.

cstyan avatar Apr 09 '24 18:04 cstyan

Yes, Postgres ended up adding safeguards for that, e.g. max_slot_wal_keep_size.
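A slot-style retention rule with such a safeguard could look like this: keep the WAL back to the oldest segment any queue still needs, but never more than a configured number of segments behind the head, in the spirit of Postgres' max_slot_wal_keep_size. All names are illustrative.

```go
package main

import "fmt"

// truncateBefore returns the first WAL segment that must be kept, given the
// head segment, the oldest segment each queue ("slot") still needs, and a cap
// on how many segments slots may pin before being overridden.
func truncateBefore(head int, slotMins []int, maxKeepSegments int) int {
	keep := head // with no slots, only the head segment is needed
	for _, m := range slotMins {
		if m < keep {
			keep = m
		}
	}
	// Safeguard: never let slots pin more than maxKeepSegments behind head,
	// even if a badly lagging queue then loses data and must resync.
	if head-keep > maxKeepSegments {
		keep = head - maxKeepSegments
	}
	return keep
}

func main() {
	// Head segment 100; one queue is stuck at segment 10, another at 97.
	fmt.Println(truncateBefore(100, []int{10, 97}, 20)) // cap wins: 80
	fmt.Println(truncateBefore(100, []int{97, 98}, 20)) // slot min wins: 97
}
```

The trade-off the cap encodes is exactly the one raised above: bounded disk usage in exchange for accepting data loss on a queue that falls too far behind.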

machine424 avatar Apr 09 '24 19:04 machine424

@codesome @bwplotka @jesusvazquez are TSDB maintainers; they have final say on anything that would change the WAL behaviour.

cstyan avatar Apr 12 '24 00:04 cstyan

Hello from the bug scrub: this seems like an important issue considering it can lead to data loss. @cstyan @codesome @bwplotka @jesusvazquez @rfratto do you agree it can lead to data loss? If so, we should find an owner to push this.

krajorama avatar Jun 18 '24 11:06 krajorama

I agree this can cause data loss. The most obvious instances are due to a crash or restart, but a "restart" can also include surprising cases like Kubernetes moving a pod to a new node, where a user may not expect data loss to occur.

rfratto avatar Jun 18 '24 14:06 rfratto

> Hello from the bug scrub: this seems like an important issue considering it can lead to data loss. @cstyan @codesome @bwplotka @jesusvazquez @rfratto do you agree it can lead to data loss? If so, we should find an owner to push this.

Yes, 100% it can cause data loss. Making an improvement here is most relevant for agent mode, Grafana Alloy/Agent, and components like Loki rulers, since in those modes there is no persistent TSDB storage, only the WAL, and the WAL is not useful after a restart of the process. This has led to less than ideal operations when on-call, like just deleting the entire WAL when the disk is nearing capacity or the process is OOM crashing.

I would be more than happy to guide someone through an implementation here, but I may not be able to until after the summer since I have already committed to a mentorship each for GSoC and LFX. I'd appreciate it if anyone else could step in in the meantime if someone does express interest in working on this.

cstyan avatar Jun 18 '24 19:06 cstyan

> I would be more than happy to guide someone through an implementation here, but I may not be able to until after the summer since I have already committed to a mentorship each for GSoC and LFX. I'd appreciate it if anyone else could step in in the meantime if someone does express interest in working on this.

@cstyan I'd be more than happy to learn and collaborate on this one. I haven't been very active in the conversation since being selected as a Grafana Champion, as I've had other commitments, but I'll do my best to make time to contribute to this.

Sheikh-Abubaker avatar Jun 19 '24 06:06 Sheikh-Abubaker

Seems like this directly affects Prometheus Agent mode and any configuration with remote write (the DaemonSet model we are working on over at Prometheus-Operator, for example). I am willing to contribute as well!

mviswanathsai avatar Jul 05 '24 11:07 mviswanathsai

It would impact Prometheus in either mode when using remote write (the agent is really only used for remote write), as well as Grafana Agent + Grafana Alloy, or anything else that imports and uses Prometheus' remote write code.

Anyone who wants to work on this is welcome to. Ping me here or in the #prometheus-dev channel on the CNCF slack if you have questions.

cstyan avatar Jul 05 '24 17:07 cstyan

> Anyone who wants to work on this is welcome to. Ping me here or in the #prometheus-dev channel on the CNCF slack if you have questions.

@cstyan I would like to work on this

Sheikh-Abubaker avatar Jul 05 '24 17:07 Sheikh-Abubaker

And I can help push this. I'll start a doc summarizing what has been tried before (I can see at least three closed PRs that tried to address this in different ways) and initiating a discussion about how we want to proceed.

EDIT: And as @cstyan stated, this issue (making the watcher less amnesiac) is just the first step; it will mean less "data loss", but we'll still need the queues to guide the watcher, since the watcher doesn't really know whether the data it read was actually sent (that data is kept in memory and can easily be lost). After that, we can think about a PG slot-like feature (here) to allow the queues/watcher to retain WAL segments on disk.

machine424 avatar Jul 08 '24 10:07 machine424

I believe this is impacting us on our largest Prometheus instances, not when we're restarting Prometheus, but when we're reloading configuration and some remote_write detail has changed (most likely relabeling rule changes).

Here, WAL replay can easily take 2+ minutes for our largest instances, and we see a gap in shipped metrics roughly matching the WAL replay duration.

djluck avatar Aug 05 '24 05:08 djluck

I pushed https://github.com/prometheus/prometheus/pull/14829 (for the first step: having a stateful watcher)

I’ll share the doc soon that explains the details of this change and outlines the next steps.

machine424 avatar Sep 04 '24 15:09 machine424