
Add the ability to expire internal metrics.

Open tobz opened this issue 2 years ago • 9 comments

This epic represents the work required to add the ability to control the cardinality of our internal metrics by expiring old ones.

Context

Currently, internal metrics are captured via the metrics crate, using a customized Recorder implementation that forwards them to a bare-bones Registry that stores them; on an interval, we collect the registered metrics and forward them into the topology.
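
As a rough, illustrative sketch of that flow (the types below are made up for the example, not Vector's or the metrics crate's actual ones): the recorder writes into a shared map of atomic handles, and a periodic task observes every registered value and forwards it into the topology:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the real Recorder/Registry types.
#[derive(Clone, Hash, PartialEq, Eq)]
struct MetricKey(String);

#[derive(Default)]
struct SimpleRegistry {
    counters: Mutex<HashMap<MetricKey, Arc<AtomicU64>>>,
}

impl SimpleRegistry {
    // Called by the recorder on every counter update.
    fn increment_counter(&self, key: MetricKey, value: u64) {
        let mut counters = self.counters.lock().unwrap();
        counters.entry(key).or_default().fetch_add(value, Ordering::Relaxed);
    }

    // Called on an interval: observe every registered metric and forward it
    // into the topology as an absolute value. Nothing is ever removed here,
    // which is the slow cardinality/memory growth described above.
    fn observe_all(&self, mut emit: impl FnMut(&MetricKey, u64)) {
        for (key, handle) in self.counters.lock().unwrap().iter() {
            emit(key, handle.load(Ordering::Relaxed));
        }
    }
}
```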

While this works well from a performance/simplicity standpoint, we currently do not expire metrics that are no longer used. This can lead to issues, such as the one described in #11821, where high-cardinality label values can generate many unique metrics in the registry over time, which adds up to a slow memory leak that requires restarting Vector itself.

Additionally, this can also result in ever-growing outputs from metrics sinks, where now-stale-and-no-longer-used metrics continue to be emitted needlessly. We've addressed this in the past for specific sinks, such as the prometheus_exporter sink in #9769, but there should be a more general solution.

Proposal

We should upgrade our usage of metrics in order to provide the ability to expire internal metrics such that old/stale metrics eventually get removed from the registry entirely, and are no longer reported.

To do so involves a few steps that should be achievable in an incremental fashion:

  • upgrade to metrics-0.18.x/metrics-util-0.12.x to bring in support for metric handles and a more customizable Registry
  • bolt on the Recency module (part of metrics-util) so that we can begin tracking the generation of each metric to properly understand when a metric has or hasn't been updated since a particular observation of it
  • create our own storage implementation that melds the required generational tracking of Recency with the ability to consume the value of counters (gauges and histograms need no change here) such that we can switch to emitting incremental counter metrics rather than absolute ones

Upgrading metrics/metrics-util

This is required due to the introduction of "metric handles" and the more flexible Registry/Storage primitives, which will allow us to interact with the metric values such that we can mutate them rather than simply observe them.
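
As an illustration of the storage idea (the trait and type names below are invented for this sketch, not the exact metrics-util API): the registry delegates creation of the underlying values to a pluggable storage, and both the recorder and the collector hold cheap, clonable handles to the same value, so the collector can mutate it rather than only read it:

```rust
use std::sync::atomic::AtomicU64;
use std::sync::Arc;

// Illustrative sketch of "storage" behind metric handles: the registry asks
// the storage to create the underlying value, and callers share handles to it.
trait CounterStorage<K> {
    type Counter: Clone;

    fn counter(&self, key: &K) -> Self::Counter;
}

// A plain atomic-backed storage: handles are Arc'd atomics that the recorder
// (writing) and the collector (reading or consuming) can both hold.
struct AtomicCounterStorage;

impl<K> CounterStorage<K> for AtomicCounterStorage {
    type Counter = Arc<AtomicU64>;

    fn counter(&self, _key: &K) -> Self::Counter {
        Arc::new(AtomicU64::new(0))
    }
}
```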

Bolting on Recency support

This one is sort of self-explanatory: Recency provides the metric storage wrapper and logic to detect metrics that have truly not been updated within a given time period. This is required because a gauge could technically have the same value when observed at two different times, even if it was in fact changed in between those two observations. Recency deals with this by tracking a "generation" for each metric: any operation on the metric at all increments an internal counter -- the generation -- so that for the two hypothetical observations mentioned above you can tell either "the generation is identical, so it hasn't been updated" or "the generations are different, so it's definitely still being updated".
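
A minimal sketch of the generation idea, using made-up types rather than the actual Recency implementation: every write bumps a generation counter alongside the value, and the collector only treats a metric as stale if the generation has not moved for longer than the idle timeout:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

// Illustrative generational gauge; the real Recency module in metrics-util
// is more involved, but the core idea is the same.
#[derive(Default)]
struct GenerationalGauge {
    value: AtomicU64,      // gauge value, stored as raw f64 bits
    generation: AtomicU64, // bumped on every write
}

impl GenerationalGauge {
    fn set(&self, value: f64) {
        self.value.store(value.to_bits(), Ordering::Release);
        self.generation.fetch_add(1, Ordering::AcqRel);
    }

    fn observe(&self) -> (f64, u64) {
        let value = f64::from_bits(self.value.load(Ordering::Acquire));
        (value, self.generation.load(Ordering::Acquire))
    }
}

// Collector-side bookkeeping: the gauge is stale only if its generation has
// not changed for longer than the idle timeout, even if the observed value
// happens to be identical between two observations.
struct LastSeen {
    generation: u64,
    at: Instant,
}

impl LastSeen {
    fn is_stale(&mut self, current_generation: u64, idle_timeout: Duration) -> bool {
        if current_generation != self.generation {
            self.generation = current_generation;
            self.at = Instant::now();
            return false;
        }
        self.at.elapsed() > idle_timeout
    }
}
```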

Creating customized storage for the metrics

This one feeds into being able to emit incremental counters.

Currently, we emit absolute metrics for anything we scrape from the internal metrics registry. This is fine for gauges, and even for histograms, since we end up merging their samples together anyway. This is bad for counters, though.

It's bad because many metrics sinks only emit incrementalized metrics downstream, and in order to incrementalize an absolute metric, you must observe it at least twice to begin tracking the delta between observations. This means that if we expired a counter and it then came back to life -- maybe it's updated very infrequently -- we would potentially lose any increments to that counter between the first time we see it after it comes back and the second time we observe it.

By writing our own customized storage, we could provide a hook to consume a counter's value -- essentially, swap the value to 0 and take the previous value -- such that we could emit the consumed value as an incremental update. This would ensure that even if a metric is expired but comes back, we're always consuming its entire value so we never miss an update.

This would work, specifically, because we would only do expiration when iterating over the metrics and observing them, not automatically in the background, so we would always have a chance to observe the metric (and potentially consume its value if it's a counter) before removing it from the registry.
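
A minimal sketch of that "consume" hook, assuming a plain atomic-backed counter (illustrative types, not the actual storage implementation):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of consuming a counter's value so it can be emitted as an
// incremental update rather than an absolute one.
struct ConsumableCounter {
    value: AtomicU64,
}

impl ConsumableCounter {
    fn increment(&self, amount: u64) {
        self.value.fetch_add(amount, Ordering::Relaxed);
    }

    // Swap the stored value to 0 and return what was there: the returned
    // delta is emitted as an incremental counter update. Because expiration
    // only happens during this observation pass, any increments made since
    // the last pass are consumed before the metric could be removed.
    fn consume(&self) -> u64 {
        self.value.swap(0, Ordering::AcqRel)
    }
}
```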

Caveats

One main thing we'll want to pay attention to is the change, if any, in performance. The changes in metrics-0.18.x were designed primarily to benefit static metric keys and callers who can use metric handles directly, neither of which currently applies to us.

This would likely mean some potential performance regressions in areas where we emit many metrics, or emit metrics tied to the rate of events flowing through the topology or a specific component.

We've handled these types of issues before by doing local aggregation or making some other change to reduce the impact, but this change is at a high level across all metrics that we currently emit, so we'd want to pay attention to any regressions, e.g. as surfaced by soak tests.

Tasks

  • [x] #13625
  • [x] #13865

Future Work

  • [ ] #13863

tobz avatar Mar 25 '22 20:03 tobz

@tobz, I'd like to help execute this plan step by step because our Kubernetes cluster installation suffers from this problem of growing metric counts.

I have experience doing the same thing by customizing metrics storage for the Prometheus client_go, and I believe we'll cope with this issue under your guidance. What do you think? Is it ok for me to put the plan into action?

nabokihms avatar May 23 '22 09:05 nabokihms

Yeah, we're happy to have someone interested in working on this, and I'll be happy to mentor/guide you where I can.

Happy to chat on Vector's Discord (text or voice, alias is tobz) to explain things more in depth, etc.

tobz avatar May 23 '22 13:05 tobz

Hello,

Does anyone know if there is any workaround to deal with this issue? I have a memory leak reported in an issue that was closed and referenced here, and it is blocking the move of our data pipeline from Logstash to Vector.

Currently I'm dealing with it by restarting the vector process every week, but I'm not sure whether adding more pipelines into vector will make it leak memory faster and require more frequent restarts, so this is blocking further adoption of vector.

I'm not using the internal metrics for anything; is there any possibility to disable them and avoid this memory leak?

leandrojmp avatar Jul 19 '22 17:07 leandrojmp

Yes, this issue is really painful 😞

@jszwedko has a branch with the problematic metrics removed, which I am using now: https://github.com/vectordotdev/vector/tree/custom-build-without-high-cardinality-metrics

fpytloun avatar Jul 19 '22 17:07 fpytloun

I'm about to open a PR, but apparently it will only be a part of the future v0.24 release.

nabokihms avatar Jul 19 '22 18:07 nabokihms

@nabokihms This issue is on our roadmap to pursue in the near future. You have indicated that you have some changes in hand to help with this effort. Before we duplicate your efforts, could you let us know what approach you are taking on this problem?

bruceg avatar Jul 21 '22 15:07 bruceg

Hi @nabokihms ! Apologies for pestering. We just wanted to check in to see if you still had plans to work on this or had already completed some work towards this so we can decide whether to move forward with this as we want to include it in the upcoming 0.24.0 release if we can.

jszwedko avatar Jul 28 '22 16:07 jszwedko

@jszwedko @bruceg I did not manage to get far with it and got stuck on how to detect staleness (I also lost time updating the dependencies, which Bruce has already committed).

Sorry for being a snail 😅 It looks like I cannot keep up the pace here. I hope I will be able to help the project in another way.

nabokihms avatar Jul 28 '22 17:07 nabokihms

Hey @nabokihms ! No worries 😄 We are happy to pick it up, we just didn't want to step on your toes. We appreciate all of your contributions to Vector thus far.

jszwedko avatar Jul 28 '22 21:07 jszwedko

Hello @jszwedko,

I saw that a new option, expire_metrics_secs, was implemented. Would it help solve the memory leak from this issue? Currently I need to restart vector twice a week to avoid it being killed by an OOM error.

Also the expire_metrics_secs has this note:

Be careful to set this value high enough to avoid expiring critical but infrequently updated internal counters.

What would be the impact? What are those internal metrics used for? I do not have any monitoring on the vector process, just use it to ship data to Kafka.

leandrojmp avatar Oct 01 '22 17:10 leandrojmp

Be careful to set this value high enough to avoid expiring critical but infrequently updated internal counters.

What would be the impact? What are those internal metrics used for? I do not have any monitoring on the vector process, just use it to ship data to Kafka.

Hi @leandrojmp ! The internal metrics are exported by the internal_metrics source and used to drive vector tap (via Vector's API). If you don't use either of those, you can safely set that number pretty low. It will resolve the issue mentioned in https://github.com/vectordotdev/vector/issues/11025

As you note, this issue is resolved so I'll close it. Thanks for bringing my attention to it.

jszwedko avatar Oct 04 '22 17:10 jszwedko

Hello @jszwedko,

Unfortunately this change in 0.24.1 did not solve the memory leak issue from #11025.

In reality I think it looks worse; I now have to restart the service daily to avoid an OutOfMemory kill of the process.

leandrojmp avatar Oct 08 '22 15:10 leandrojmp