feat: add capability to purge old histogram data
What
- add purge_timeout option to PrometheusBuilder
- run a purger that purges based on the purge_timeout
Implements the third approach prescribed here to purge old histogram data:
- update the builder to generate a future which both drives the Hyper server future as well as a call to get_recent_metrics on an interval
Fixes https://github.com/metrics-rs/metrics/issues/245
As I read back through what I originally wrote... I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.

I think what we'd actually have to do here is split out the logic from get_recent_metrics that handles draining the histograms. Essentially, take this code and stick it into a new method on the struct -- let's call it run_upkeep for this example -- and trim it down so it only iterates through each histogram, finds/creates the entry in self.distributions, and then drains samples into the entry. After that, we'd add a call to that method right before this block, and in that block we'd remove the logic where it drains the histograms.
Does that make sense?
Thanks for the update!
I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.
Does this refer to the self.recency.should_store_* call where we check the validity of a metric? If so, then I think eagerly calling this method would be fine. It would be helpful to understand your concern with an example.
(Sorry for the delayed response here.)
Yes, you're right that this would be the crux of the concern.
The example would be something like:
- the exporter is configured with an idle timeout of 10 seconds, and a purger interval of 5s
- the exporter is scraped at t=0, which includes metric A
- metric A is updated at t=5 (meaning it will go idle at t=15 if not updated by then)
- the purger runs at t=5 and t=10 and since nothing is idle yet, all it ends up doing is draining the histogram buckets
- the purger again runs at t=15, but now, metric A is considered idle, which removes metric A from the registry
- the exporter is scraped at t=30, which now no longer includes metric A
Despite metric A being updated between the two scrapes at t=0 and t=30, we've actually missed the update to metric A, which if we were only rendering during scrapes, would have been included in the second scrape.
Despite metric A being updated between the two scrapes at t=0 and t=30, we've actually missed the update to metric A, which if we were only rendering during scrapes, would have been included in the second scrape.
Ah, this makes sense once I checked the documentation for PrometheusBuilder::idle_timeout:
This behavior is driven by requests to generate rendered output, and so metrics will not be removed unless a request has been made recently enough to prune the idle metrics.
I have separated out the purge action so it only drains the histograms now.