client_rust Custom collector for multiple metrics

Hi! I've been looking at implementing a Prometheus collector for the recently announced tokio-metrics crate. On every scrape, I'd like to gather runtime metrics for the current Tokio runtime. The problem is that doing so requires a non-trivial amount of up-front work to aggregate all of the stats across the N workers in the runtime, which I'd rather not do in every metric's encode function (following the custom metric example).

Instead I think it'd be ideal if there was a way to do something similar to the client_python Custom Collector example, which allows custom collectors to record values for multiple metrics at each scrape time - that'd avoid me having to duplicate work (non-atomically) on every scrape. Do you think such an API would be possible?

Alternatively if you know of another pattern to get around this, I'd love to hear it!

Feb 28 '22 21:02 sd2k

As you stated above, a clean solution is not possible today, i.e. would require some changes within prometheus-client. I am not opposed to supporting this use-case, though I need to put more thoughts into the design of a clean abstraction. Design suggestions are most welcome.

In the meantime would sharing some state between the custom metrics be an option? That shared state would be updated with a new aggregate by a single custom metric per scrape. Consecutive custom metrics within the same scrape don't need to aggregate themselves, but instead access the shared state.

Mar 01 '22 13:03 mxinden

Thanks for the reply! Yep, I thought as much, no worries - I'll have a think about potential APIs.

In the meantime would sharing some state between the custom metrics be an option? That shared state would be updated with a new aggregate by a single custom metric per scrape. Consecutive custom metrics within the same scrape don't need to aggregate themselves, but instead access the shared state.

Yeah, I've got a solution where each custom metric shares a reference to some shared state, but there's a few things about it that are suboptimal:

- for the metrics to be 'static they can't hold references, so I'm having to use an Rc<RefCell<State>> instead (this isn't too bad though)
- the only way I can determine whether one metric's encode call is within the same scrape as another metric's encode call is by either:
  - storing the Instant that the state was last updated (and only updating the state if it's more than some Duration ago), or
  - assuming that metrics are scraped in the same order that they're registered, so that I can set a flag in the state in the first metric's encode and unset it in the final metric's encode. I'm not super happy relying on that assumption though since it could change in future!
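
For anyone following along, here is a rough sketch of the first option (the Instant-based staleness check). It uses only std; State, SharedMetric, the two read_* helpers and the 100 ms threshold are all made up for illustration, and the "aggregation" is a stand-in rather than anything from tokio-metrics or prometheus-client.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::time::{Duration, Instant};

/// Hypothetical aggregate shared by every custom metric.
#[derive(Default)]
struct State {
    park_count: u64,
    busy_secs: f64,
    last_refresh: Option<Instant>,
    refresh_count: u32, // only here to make the behaviour observable
}

impl State {
    /// Re-run the expensive aggregation only if the cached values are
    /// older than `max_age` (or have never been computed).
    fn refresh_if_stale(&mut self, now: Instant, max_age: Duration) {
        let stale = self
            .last_refresh
            .map_or(true, |at| now.duration_since(at) > max_age);
        if stale {
            // Stand-in for the real aggregation across the runtime's workers.
            self.park_count += 10;
            self.busy_secs += 0.5;
            self.last_refresh = Some(now);
            self.refresh_count += 1;
        }
    }
}

fn read_park_count(state: &State) -> f64 {
    state.park_count as f64
}

fn read_busy_secs(state: &State) -> f64 {
    state.busy_secs
}

/// One custom metric; several of these share the same `State`.
struct SharedMetric {
    state: Rc<RefCell<State>>,
    read: fn(&State) -> f64,
}

impl SharedMetric {
    /// What the metric's encode implementation would call to get its value.
    fn value(&self, now: Instant) -> f64 {
        let mut state = self.state.borrow_mut();
        state.refresh_if_stale(now, Duration::from_millis(100));
        (self.read)(&*state)
    }
}
```

Whichever metric is encoded first in a scrape pays for the refresh, and the others read the cached values. The obvious weakness is the threshold: two scrapes closer together than max_age would share one aggregate.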

Mar 01 '22 16:03 sd2k

the only way I can determine whether one metric's encode call is within the same scrape as another metric's encode call is by either:

We could give each Prometheus scrape an ID and then make that ID available through the Encoder passed in EncodeMetric::encode. Not yet sure how much I would consider that to be a hack.

I think something worth exploring is the ability to register a Collector with a Registry. The Collector would have a method collect which returns an Iterator<Item = (Description, Metric)>. The text encode function could iterate that Iterator and proceed as it does with any other metric.
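
To make the idea a bit more concrete, here is a minimal std-only sketch. None of these types are the real prometheus-client ones: Description, Metric, Collector, Registry and encode_text are simplified stand-ins, and the metric names are invented. It only shows the shape: a collector does its work once in collect, and the text encoder walks whatever it yields.

```rust
use std::fmt::Write;

/// Simplified stand-in for a metric description (assumption, not the crate's type).
struct Description {
    name: &'static str,
    help: &'static str,
}

/// Simplified stand-in for an already-collected metric value.
enum Metric {
    Counter(u64),
    Gauge(f64),
}

/// Hypothetical trait: yields every metric the collector knows about.
trait Collector {
    fn collect(&self) -> Box<dyn Iterator<Item = (Description, Metric)> + '_>;
}

struct Registry {
    collectors: Vec<Box<dyn Collector>>,
}

impl Registry {
    fn register(&mut self, collector: Box<dyn Collector>) {
        self.collectors.push(collector);
    }
}

/// Iterates each collector's output and writes it in a text-exposition-like form.
fn encode_text(registry: &Registry) -> String {
    let mut out = String::new();
    for collector in &registry.collectors {
        for (desc, metric) in collector.collect() {
            let (kind, value) = match metric {
                Metric::Counter(v) => ("counter", v.to_string()),
                Metric::Gauge(v) => ("gauge", v.to_string()),
            };
            writeln!(out, "# HELP {} {}", desc.name, desc.help).unwrap();
            writeln!(out, "# TYPE {} {}", desc.name, kind).unwrap();
            writeln!(out, "{} {}", desc.name, value).unwrap();
        }
    }
    out
}

/// Example collector holding values that were aggregated elsewhere.
struct RuntimeCollector {
    park_count: u64,
    busy_secs: f64,
}

impl Collector for RuntimeCollector {
    fn collect(&self) -> Box<dyn Iterator<Item = (Description, Metric)> + '_> {
        // Both metrics are produced by a single call, so any shared
        // aggregation would only need to happen once per scrape.
        Box::new(
            vec![
                (
                    Description { name: "tokio_park_count", help: "Worker parks." },
                    Metric::Counter(self.park_count),
                ),
                (
                    Description { name: "tokio_busy_seconds", help: "Worker busy time." },
                    Metric::Gauge(self.busy_secs),
                ),
            ]
            .into_iter(),
        )
    }
}
```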

Mar 02 '22 10:03 mxinden

The problem is that [collecting metrics] requires a non-trivial amount of up-front work to aggregate all of the stats across the N workers in the runtime

Can you say more about this? What work is involved?

Mar 25 '22 04:03 08d2

I've now changed the implementation so that comment isn't quite correct, although my current implementation still feels a bit suboptimal.

Previously I'd thrown together a cumulative method for RuntimeMonitor, which basically iterated over the workers in the pool and calculated each of the metrics on RuntimeMetrics. I would have had to call cumulative for every metric since their encode calls can't share state, though.

I do something a bit different now - I've pushed it up to http://github.com/sd2k/tokio-metrics-prometheus in case it's of any use. I'd like to get it up to crates.io soon but want to figure out #57 first!

Mar 25 '22 19:03 sd2k

I'm going to describe some fundamental assumptions of Prometheus, because it's not totally clear to me if these crates abide by those requirements. But I'm no expert and it's entirely possible that the issues you're talking about have nothing to do with this stuff. If that's true, or if I'm telling you stuff you already know, my apologies!

Prometheus operates in a "pull" model whereby Prometheus servers scrape targets on a regular interval. Scrape means make an HTTP GET which should yield the current state of all metrics (timeseries) known to the process. But reading the current state of a metric should — with some exceptions — not be an expensive operation. The expectation is that each timeseries value is maintained as a simple primitive value, which is cheap to both read and write.

(The exception is "func" metrics, like gauge funcs, which are implemented as functions that get called during scrapes, and return values. These can be expensive! But shouldn't be. Scrapes are expected to be fast.)

So...

. . . iterated over the workers in the pool and calculated each of the metrics on RuntimeMetrics . . .

When Prometheus performs a scrape, the expectation is that the HTTP handler is doing a bunch of relatively cheap, likely atomic, reads of simple primitive values. Which is to say that any "calculation" is expected to be done ahead of time.

edit: Another way of saying all of this is that encode should be encoding already-calculated values. Or, maybe, that metrics are expected to be long-lived values in a program, which are written-to by code that would update them, and read-from by the call chain that starts from a Prometheus scrape. Metrics can be modeled as functions that get called with each scrape, but this is an exceptional case that should be used only when necessary.
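
As a rough illustration of that model (std only, with made-up names such as RuntimeStats, record_interval and snapshot): the code that observes the program writes to long-lived atomics, and the scrape path does nothing but atomic loads.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Long-lived metric values, written by the program and read by scrapes.
#[derive(Default)]
struct RuntimeStats {
    park_count: AtomicU64,
    busy_micros: AtomicU64,
}

impl RuntimeStats {
    /// Called by whatever observes the runtime (for example a periodic
    /// sampler), never by the scrape handler.
    fn record_interval(&self, parks: u64, busy_micros: u64) {
        self.park_count.fetch_add(parks, Ordering::Relaxed);
        self.busy_micros.fetch_add(busy_micros, Ordering::Relaxed);
    }

    /// Called on the scrape path: two atomic loads and one division.
    fn snapshot(&self) -> (u64, f64) {
        let parks = self.park_count.load(Ordering::Relaxed);
        let busy_secs = self.busy_micros.load(Ordering::Relaxed) as f64 / 1_000_000.0;
        (parks, busy_secs)
    }
}
```

The two loads are not taken together atomically, so a snapshot can mix values from neighbouring updates; for monotonic counters that is usually acceptable.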

Is that not the case here?

Mar 26 '22 18:03 08d2

the only way I can determine whether one metric's encode call is within the same scrape as another metric's encode call is by either: . . .

A scrape will indeed invoke encode on a set of metrics, but encode shouldn't know anything about scrapes. Rather, encode should operate on an immutable snapshot, i.e. a copy, of a metric value, which scrapes should capture.

Mar 26 '22 19:03 08d2

Indeed, we're in agreement on most things here I think! In my specific case I'm trying to instrument an external crate so I don't have the ability to increment counters or anything when it's actually happening. Instead, the tokio_metrics::RuntimeMonitor::intervals method is the only way to get hold of the state I need and it comes in the form of an iterator, which I need to advance at the start of each scrape to get the current value.

The quoted part of your comment:

. . . iterated over the workers in the pool and calculated each of the metrics on RuntimeMetrics . . .

is a (poor) workaround I had to use due to the lack of such an API 🙂 I've since switched to a much less expensive method, but that still requires sharing state between multiple custom metrics which isn't a super clean implementation.

The aim of this issue is to provide a convenient API that allows efficient scraping of multiple metrics that represent some state that's out of my control, without resorting to complex state sharing.

What I'd like is to be able to:

- create a struct that represents multiple metrics (in this case, each of the tokio-metrics metrics)
- implement a trait provided by the prometheus_client crate (say EncodeMetrics - note the plural)
- write an implementation of something like EncodeMetrics::encode_metrics, which is similar to EncodeMetric::encode but is permitted to encode multiple metrics, not just one (which isn't permitted for the existing trait - it's assumed to be encoding a single metric)

That would allow me to do something like:

struct TokioRuntimeCollector {
    // The iterator to get data since the last scrape
    runtime_monitor_iter: Box<dyn Iterator<Item = tokio_metrics::RuntimeMetrics>>,

    // The metrics we care about
    park_count: Counter,
    busy_duration: Counter<f64, AtomicU64>,
    // ...other metrics here
}

impl EncodeMetrics for TokioRuntimeCollector {
    fn encode_metrics(&mut self, encoder: &mut Encoder) -> Result<(), std::io::Error> {
        // Get data since the last scrape.
        let new = self.runtime_monitor_iter.next().unwrap();

        // Update state.
        self.park_count.inc_by(new.park_count);
        self.busy_duration.inc_by(new.busy_duration.as_secs_f64());

        // Encode new state.
        self.park_count.encode(encoder)?;
        self.busy_duration.encode(encoder)?;
        // etc

        Ok(())
    }
}

(Note: I'm not proposing that this is an API that would actually work, but hopefully it conveys my meaning).

Mar 26 '22 20:03 sd2k

The Collectors example in the Golang client docs explains this in better detail than I possibly could, too.

Mar 26 '22 20:03 sd2k

What I'd like is to be able to:

create a struct that represents multiple metrics (in this case, each of the tokio-metrics metrics)

Check.

implement a trait provided by the prometheus_client crate (say EncodeMetrics - note the plural)

Ah! So this shouldn't be EncodeMetrics, but rather CollectMetrics. I think the disconnect here may be that client_rust doesn't currently provide a well-defined abstraction layer between collection and encoding. A registry is something that holds long-lived mutable metric values which can be mutated and collected; a collector is typically a trait implemented by a registry which yields a "snapshot" of each metric which can be encoded; and an encoder is something that encodes those fixed metrics for scraping.

tl;dr: collecting != encoding
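
A small sketch of that split, applied to the interval-iterator case from earlier in the thread. Everything here is hypothetical: CollectMetrics is only the name suggested above, Sample and Interval are stand-ins (Interval loosely plays the role of tokio_metrics::RuntimeMetrics), and the metric names are invented. The point is that collect is the only place that mutates state, and encode is a pure function of the snapshot.

```rust
/// Immutable snapshot of one metric, as handed to the encoder.
#[derive(Debug, Clone, PartialEq)]
struct Sample {
    name: &'static str,
    value: f64,
}

/// Hypothetical trait: one call yields snapshots for several metrics.
trait CollectMetrics {
    fn collect(&mut self) -> Vec<Sample>;
}

/// Stand-in for the per-interval data an external crate hands out.
struct Interval {
    park_count: u64,
    busy_secs: f64,
}

struct RuntimeCollector<I> {
    intervals: I,
    park_count: u64,
    busy_secs: f64,
}

impl<I: Iterator<Item = Interval>> CollectMetrics for RuntimeCollector<I> {
    fn collect(&mut self) -> Vec<Sample> {
        // Advance the iterator exactly once per scrape and fold the
        // delta into the running totals.
        if let Some(delta) = self.intervals.next() {
            self.park_count += delta.park_count;
            self.busy_secs += delta.busy_secs;
        }
        vec![
            Sample { name: "tokio_park_count", value: self.park_count as f64 },
            Sample { name: "tokio_busy_seconds", value: self.busy_secs },
        ]
    }
}

/// Encoding only sees the snapshot; it knows nothing about scrapes or state.
fn encode(samples: &[Sample]) -> String {
    samples
        .iter()
        .map(|s| format!("{} {}\n", s.name, s.value))
        .collect()
}
```

With this shape there is no need for scrape IDs or timestamps: the scrape handler calls collect once and passes the result to encode.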

The aim of this issue is to provide a convenient API that allows efficient scraping of multiple metrics that represent some state that's out of my control, without resorting to complex state sharing.

I think not scraping but collecting? Which I think could be solved by defining a new collector trait?

Mar 26 '22 20:03 08d2

Yep, I think that's accurate! (I did find it strange that I was implementing something called encode - there's nothing encoding specific about what I'm doing.)

Mar 26 '22 20:03 sd2k

Cross referencing proposal here https://github.com/prometheus/client_rust/pull/82

Aug 29 '22 04:08 mxinden
