Metrics: GC stale series separately from truncating WAL
Background
Today, Grafana Agent performs a metrics "garbage collection" every 60 minutes (when metrics collection is enabled). The garbage collection process does the following for a specific point in time (the "GC timestamp"):
- Mark in-memory series that haven't received a write since the last GC for deletion.
- Remove in-memory series that were marked for deletion during the previous GC.
- Cut the current WAL segment and create a new one.
- Create a new WAL checkpoint from the lower 2/3rds of segment files, composed of all active series records and samples newer than the GC timestamp.
- Delete the lower 2/3rds of WAL segment files which composed the checkpoint.
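For orientation, the sketch below shows roughly how these steps fit together in a periodic loop. All type and method names here are hypothetical stand-ins, not the agent's actual code:

```go
package metrics

import (
	"context"
	"time"
)

// seriesSet and walWriter are hypothetical interfaces standing in for the
// agent's in-memory series map and its WAL; they exist only to make the
// sketch self-contained.
type seriesSet interface {
	MarkInactiveForDeletion(ts int64) // series with no write since the last GC
	RemovePreviouslyMarked()          // series marked during the previous GC
}

type walWriter interface {
	NextSegment()        // cut the current segment, start a new one
	Checkpoint(ts int64) // checkpoint the lower 2/3rds of segments, then delete them
}

type storage struct {
	series seriesSet
	wal    walWriter
}

// gcLoop runs the process described above once per interval (60 minutes today).
func (s *storage) gcLoop(ctx context.Context, interval time.Duration, gcTimestamp func() int64) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			ts := gcTimestamp()
			s.series.MarkInactiveForDeletion(ts)
			s.series.RemovePreviouslyMarked()
			s.wal.NextSegment()
			s.wal.Checkpoint(ts)
		}
	}
}
```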
The details of how the GC timestamp is determined are out of scope for this proposal.
When Grafana Agent was first released, this GC process ran every minute, but the interval was raised to 60 minutes because of the high constant IOPS introduced by such frequent GCs.
Performing the GC every 60 minutes has the following effects compared to the original 60-second frequency:
- Average IOPS is dramatically lower, since checkpoints are created 60x less often
- Memory usage increases, since stale series are kept in memory 60x longer
Proposal
The proposal is to supplement the current process with the following two additions:
- Use the staleness marker added by Prometheus' scraper as a marker for deletion.
- Delete marked in-memory series every 5 minutes in a background goroutine. Only series that have been marked for deletion for at least 5 minutes should be deleted.
5 minutes is chosen somewhat arbitrarily: it needs to be large enough to allow flapping series to be unmarked for deletion, and 4 minutes is the maximum recommended scrape interval for any target.
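Below is a minimal sketch of the proposed behaviour, built around a hypothetical tracker sitting next to the WAL appender; the only real API used is Prometheus' staleness-marker check (value.IsStaleNaN, found in model/value in recent releases, pkg/value in older ones):

```go
package metrics

import (
	"sync"
	"time"

	"github.com/prometheus/prometheus/model/value" // pkg/value in older Prometheus releases
)

// staleTracker is a hypothetical helper showing the proposed behaviour:
// series that receive a staleness marker are marked for deletion, and a
// background goroutine drops anything that has stayed marked for at
// least 5 minutes.
type staleTracker struct {
	mu     sync.Mutex
	marked map[uint64]time.Time // series ref -> time it was marked
}

// observe is called for every appended sample. A staleness marker (a
// special NaN written by the Prometheus scraper) marks the series; any
// real sample un-marks it, which handles flapping series.
func (t *staleTracker) observe(ref uint64, v float64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if value.IsStaleNaN(v) {
		if _, ok := t.marked[ref]; !ok {
			t.marked[ref] = time.Now()
		}
		return
	}
	delete(t.marked, ref)
}

// runDeleteLoop deletes marked series every 5 minutes, but only those
// that have been marked for at least 5 minutes.
func (t *staleTracker) runDeleteLoop(deleteSeries func(ref uint64)) {
	const period = 5 * time.Minute
	for range time.Tick(period) {
		cutoff := time.Now().Add(-period)
		t.mu.Lock()
		for ref, markedAt := range t.marked {
			if markedAt.Before(cutoff) {
				deleteSeries(ref)
				delete(t.marked, ref)
			}
		}
		t.mu.Unlock()
	}
}
```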
This change will reduce the overall memory consumption of Grafana Agent in environments that experience frequent series churn by allowing stale series to be removed within roughly 10 minutes instead of up to 2 hours.
Trade-offs
Pros
- Lowers total memory usage in environments that have frequently changing targets or series churn
Cons
- May cause more duplicate series records in the WAL: new series records will be created for targets which have flapping series that are scraped very infrequently. Being able to disable this enhancement may be a suitable workaround.
Memory Impact
Let's imagine a Prometheus target that returns a unique set of 1000 series every 60 seconds, and is scraped every 15 seconds.
Today, when the GC runs after the first hour, there will be 60k active series, and roughly all of them will be marked for deletion. During the second GC, after another hour, there will be a total of 120k active series, and the marked series will be deleted, bringing the total back down to 60k. The active series will continue fluctuating between 60k and 120k for this target for the process lifetime.
With this proposal, series will be marked as stale on subsequent scrapes, and deleted within 10 minutes. After 5 minutes, there will be 5,000 active series. The active series will continue fluctuating between 5,000 and 10,000 for this target for the process lifetime.
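For concreteness, the arithmetic behind these figures can be reproduced with the churn rate from the example above:

```go
package main

import "fmt"

func main() {
	const newSeriesPerMinute = 1000 // the target churns 1k unique series every minute

	// Today: stale series survive until the GC after the one that marks
	// them, i.e. between one and two 60-minute GC cycles.
	fmt.Println("current min:", 60*newSeriesPerMinute)  // 60,000
	fmt.Println("current max:", 120*newSeriesPerMinute) // 120,000

	// Proposed: series are marked stale on the next scrape and removed by
	// the 5-minute background loop, i.e. within roughly 5 to 10 minutes.
	fmt.Println("proposed min:", 5*newSeriesPerMinute)  // 5,000
	fmt.Println("proposed max:", 10*newSeriesPerMinute) // 10,000
}
```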
Summary
We are not able to directly correlate specific memory improvements to this change; too many factors determine the cost of an individual series. Similarly, we are not able to say whether this would improve agent memory usage in the general case.
We will only be able to say that this change deletes stale series roughly 12x faster than before, which may lower memory usage in some cases.
Upstreaming to Prometheus Agent
This change is something that Prometheus Agent can also benefit from. While we are still working on unifying the code and using Prometheus Agent directly, I propose we implement this downstream first to experiment, and then upstream once we know whether it's a good idea.
How strong is the "4 minutes is the maximum recommended scrape interval for any target" statement?
🤷 That comes from Prometheus, since after 4 minutes Prometheus queriers automatically inject an implicit staleness marker.
IIRC the recommendation is even shorter: "This interval should not exceed 2m to avoid issues with staleness, and the caveats above around multiple intervals still apply."
https://www.robustperception.io/keep-it-simple-scrape_interval-id/
And the recommendation is strong and will likely not change; staleness is important to the actual TSDB as well.
May cause more duplicate series records in the WAL: new series records will be created for targets which have flapping series that are scraped very infrequently. Being able to disable this enhancement may be a suitable workaround.
This could be mitigated with a shorter default checkpointing interval, say 15 minutes.