client_rust src/collector: Introduce Collector abstraction

The Collector abstraction allows users to provide additional metrics and their description on each scrape.

See also:

https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#hdr-Custom_Collectors_and_constant_Metrics
https://github.com/prometheus/client_rust/issues/49
https://github.com/prometheus/client_rust/issues/29

I am opening this up as a "Draft" for early feedback.

Aug 29 '22 04:08 mxinden

Also //CC @sd2k

Aug 29 '22 04:08 mxinden

And //CC @08d2

Aug 29 '22 04:08 mxinden

Looks like it will work for the process collector use case! Thanks.

Aug 29 '22 05:08 dovreshef

Am I understanding correctly that the user of the lib is expected to reimplement the Collector trait such that calling collect() will do whatever is necessary to produce the metrics, possibly querying some external system?

If so, I think there should be a way to provide an async collector; this will probably require implementing some alternative AsyncRegistry with matching AsyncCollector trait.

Aug 29 '22 08:08 vladvasiliu

Am I understanding correctly that the user of the lib is expected to reimplement the Collector trait such that calling collect() will do whatever is necessary to produce the metrics, possibly querying some external system?

If so, I think there should be a way to provide an async collector; this will probably require implementing some alternative AsyncRegistry with matching AsyncCollector trait.

For context, collect() is called when Prometheus does an HTTP GET of the metrics endpoint for an instance of a job.

That call is subject to a few layers of (typically sub-second) timeouts and failsafes, and is expected to return more or less immediately. An implementation of collect() which blocked in any way that would benefit from async would almost certainly be a design error.

If the information returned by collect() needed to come from an external source, then it's I guess a de facto requirement that the retrieval is decoupled from the collecting.

Aug 29 '22 19:08 08d2

That's what I was thinking, although this example seems to be reading stuff (memory info) from an external system (though, in this case, it's probably unlikely that it should take long).

I have to think a bit more about this; it seems to me that once the collector is added to the registry, there's no easy way to only update it such that it doesn't share state between possibly concurrent calls to the /metrics endpoint, all the while updating said state outside collect().

Aug 29 '22 19:08 vladvasiliu

although this example seems to be reading stuff (memory info) from an external system

callLongGetter? Yeah, it's not obvious to me what that does, exactly... but I wouldn't take too much inspiration from the Java client, which hasn't been updated in a long while, and was never particularly idiomatic. Stick with the Go client, or maybe look at the more active exporters hosted in the Prometheus org.

there's no easy way to only update it such that it doesn't share state between possibly concurrent calls to the /metrics endpoint, all the while updating said state outside collect().

The nature of the problem demands shared state, for sure. (Or a "clever" workaround to avoid it, which, well.) But does shared state imply async, either necessarily or even ideally? I don't think so, but I could be wrong.

Aug 29 '22 19:08 08d2

callLongGetter? Yeah, it's not obvious to me what that does, exactly... but I wouldn't take too much inspiration from the Java client, which hasn't been updated in a long while, and was never particularly idiomatic. Stick with the Go client, or maybe look at the more active exporters hosted in the Prometheus org.

I was thinking more about collectMemoryMetricsLinux. The reason I gave this example was because it's the one mentioned on the docs.

The nature of the problem demands shared state, for sure. (Or a "clever" workaround to avoid it, which, well.) But does shared state imply async, either necessarily or even ideally? I don't think so, but I could be wrong.

I wouldn't say shared state implies async. I think the issue was the way I was thinking about somehow getting the external (non-shared) state as a collector of the shared state, which would have then required collect() to cover async collection.

This could probably be implemented the other way around, with the shared state added as a sub-collector to the non-shared one before returning. Which, if collect() is guaranteed to return quickly, could be called from an async context.

Aug 30 '22 08:08 vladvasiliu

I've been looking a bit more over this, and I think that, basically, a Registry is a Collector: it holds a bunch of metrics, right? That's how the Python client is implemented.

I think that if the lib provided an implementation of Collector for Registry, this would solve the majority of use cases.

Aug 30 '22 14:08 vladvasiliu

Am I understanding correctly that the user of the lib is expected to reimplement the Collector trait such that calling collect() will do whatever is necessary to produce the metrics, possibly querying some external system?

Correct.

If so, I think there should be a way to provide an async collector; this will probably require implementing some alternative AsyncRegistry with matching AsyncCollector trait.

Can you be more concrete? Can you describe a use-case where one would need an async collector?

That call is subject to a few layers of (typically sub-second) timeouts and failsafes, and is expected to return more or less immediately.

Prometheus does not expect the scrape call to return immediately. E.g. from the Prometheus scheduling section:

The default scrape timeout for Prometheus is 10 seconds.

If the information returned by collect() needed to come from an external source, then it's I guess a de facto requirement that the retrieval is decoupled from the collecting.

By default the collection should not be decoupled from the collection. Citing the Prometheus scheduling section:

Metrics should only be pulled from the application when Prometheus scrapes them, exporters should not perform scrapes based on their own timers. That is, all scrapes should be synchronous.

I think that if the lib provided an implementation of Collector for Registry, this would solve the majority of use cases.

Registry could implement Collector, though I have refrained from that thus far to reduce complexity, i.e. not to mix concepts. I expect a user to have a single Registry but many `Collectors.

I don't follow how Registry implementing Collector would solve an issue here. Would you mind expanding on that?

Aug 31 '22 06:08 mxinden

Also note that the encode function takes an immutable reference to a Registry, thus two Prometheus servers (e.g. for fault tolerance) may scrape the same endpoint concurrently.

Aug 31 '22 06:08 mxinden

Can you be more concrete? Can you describe a use-case where one would need an async collector?

My use case is exporting metrics that are gathered via multiple HTTP calls. If my HTTP client is async, collect() should be able to handle that.

Prometheus does not expect the scrape call to return immediately. E.g. from the Prometheus scheduling section:

Well, the question is whether collect() itself should return immediately. The other operations could be done outside it, while staying inside the global scrape.

I don't follow how Registry implementing Collector would solve an issue here. Would you mind expanding on that?

See the other point: this is a confusion about whether collect() should be guaranteed to be quick or not and whether a collector should be different from a registry.

If it's not, and random long operations can be done inside collect(), then I can see how a Collector can be different from a Registry, and thus providing an default implementation for Registry doesn't help, since every collector is expected to be different and do its own thing.

However, in the second situation, when the actual gathering of the metrics should be done outside of the collect() method, then basically this becomes generating two separate registries, and combining them before calling collect(). The idea wouldn't be to do the scraping outside of the call from prometheus, but only outside the call to collect().

Aug 31 '22 08:08 vladvasiliu

Looking at the Go implementation, specifically the example collector, it's clear that the collect() method is not expected to return "immediately".

This means that there's no particular reason to provide an implementation of Collector for Registry, but it can be useful to provide an alternative, async Collector, in case the actual job of collection is done asynchronously.

Aug 31 '22 08:08 vladvasiliu

That example describes how a collector might be "bolted on" to a legacy system. It's not indicative of best practice. But the point is largely moot, I think.

Sep 02 '22 05:09 08d2

My current examples why a collector would need to support async are:

collecting metrics from devices via e.g. Bluetooth
collecting container metrics comparable to cadvisor where the list of metrics is basically generated from a current state that is retrieved from the filesystem, containerd APIs etc.

While both approaches could also refresh the data in the background, having it directly connected to the scrape and being able to fetch fresh data on a scrape provides more reliable resolution than scraping previously scraped data.

Now, the data refresh could actually be handled in the http handler, but still it seems like an unnecessary detour.

Sep 02 '22 06:09 kwiesmueller

As a first iteration, I suggest we support synchronous Collector::collect only.

For those that need to call async methods in their Collector::collect call, I suggest you spawn the encode call in a new OS thread (or in a blocking-aware runtime task) and call any async methods via block_on in your Collector::collect implementation.
For those that need concurrency within your Collector::collect implementation (e.g. reaching out to two remote machines concurrently) I suggest you spawn two or more OS threads within your Collector::collect implementation and join the threads again.
For those that need the Collector::collect calls of two distinct Collector implementations to run concurrently, you would be out of luck for now.

In case folks see a large performance hit, e.g. due to thread spawns (even though that should be < 10 microseconds), or (1) and (2) add too much complexity in your Collector::collect implementation or folks need (3) I suggest we design an async Collector::collect as a next step.

Sep 04 '22 07:09 mxinden

@kwiesmueller Thanks for sharing. Let me know in case the threading approach is good enough for you for now, i.e. can collect from the many datasources you are targeting within the Prometheus scrape timeout.

If not, as mentioned above, let's design an async Collector::collect as a next step.

Sep 04 '22 07:09 mxinden

I rebased this pull request onto https://github.com/prometheus/client_rust/pull/100. Everything compiles, tests succeed.

To me the above validates that we should move forward with https://github.com/prometheus/client_rust/pull/100. That said, I do think the signatures in this pull request need more work (e.g. the workaround with MaybeOwned).

Oct 02 '22 11:10 mxinden

Very interested in this work are you looking for help or what would help move this forward.

Dec 03 '22 21:12 hexfusion

Very interested in this work

Thanks for the interest. Sorry for the silence.

I rebased this pull request on https://github.com/prometheus/client_rust/pull/105. I want to try out the new API with one of my projects (e.g. https://github.com/mxinden/kademlia-exporter/) before merging #105 and this pull request.

are you looking for help or what would help move this forward.

I would appreciate feedback on #105 and this pull request. You would be of great help by testing this pull request (it is on top of #105) within your application in need of the Collector abstraction @hexfusion.

Dec 06 '22 12:12 mxinden

I have some cycles today will take a look, thanks!

Dec 06 '22 12:12 hexfusion

Tested this patch with rust-libp2p and kademlia-exporter. While the Collector trait is still quite verbose (with Cow and MaybeOwned), it does function as expected and allows ad-hoc cheap metric generation.

https://github.com/mxinden/kademlia-exporter/pull/209

Dec 18 '22 20:12 mxinden