src/collector: Introduce Collector abstraction
The Collector abstraction allows users to provide additional metrics
and their description on each scrape.
See also:
- https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#hdr-Custom_Collectors_and_constant_Metrics
- https://github.com/prometheus/client_rust/issues/49
- https://github.com/prometheus/client_rust/issues/29
I am opening this up as a "Draft" for early feedback.
Also //CC @sd2k
And //CC @08d2
Looks like it will work for the process collector use case! Thanks.
Am I understanding correctly that the user of the lib is expected to reimplement the Collector trait such that calling collect() will do whatever is necessary to produce the metrics, possibly querying some external system?
If so, I think there should be a way to provide an async collector; this will probably require implementing some alternative AsyncRegistry with matching AsyncCollector trait.
Am I understanding correctly that the user of the lib is expected to reimplement the `Collector` trait such that calling `collect()` will do whatever is necessary to produce the metrics, possibly querying some external system? If so, I think there should be a way to provide an async collector; this will probably require implementing some alternative AsyncRegistry with matching AsyncCollector trait.
For context, collect() is called when Prometheus does an HTTP GET of the metrics endpoint for an instance of a job.
That call is subject to a few layers of (typically sub-second) timeouts and failsafes, and is expected to return more or less immediately. An implementation of collect() which blocked in any way that would benefit from async would almost certainly be a design error.
If the information returned by collect() needed to come from an external source, then it's I guess a de facto requirement that the retrieval is decoupled from the collecting.
That's what I was thinking, although this example seems to be reading stuff (memory info) from an external system (though, in this case, it's probably unlikely that it should take long).
I have to think a bit more about this; it seems to me that once the collector is added to the registry, there's no easy way to only update it such that it doesn't share state between possibly concurrent calls to the /metrics endpoint, all the while updating said state outside collect().
although this example seems to be reading stuff (memory info) from an external system
`callLongGetter`? Yeah, it's not obvious to me what that does, exactly... but I wouldn't take too much inspiration from the Java client, which hasn't been updated in a long while, and was never particularly idiomatic. Stick with the Go client, or maybe look at the more active exporters hosted in the Prometheus org.
there's no easy way to only update it such that it doesn't share state between possibly concurrent calls to the /metrics endpoint, all the while updating said state outside collect().
The nature of the problem demands shared state, for sure. (Or a "clever" workaround to avoid it, which, well.) But does shared state imply async, either necessarily or even ideally? I don't think so, but I could be wrong.
`callLongGetter`? Yeah, it's not obvious to me what that does, exactly... but I wouldn't take too much inspiration from the Java client, which hasn't been updated in a long while, and was never particularly idiomatic. Stick with the Go client, or maybe look at the more active exporters hosted in the Prometheus org.
I was thinking more about `collectMemoryMetricsLinux`. The reason I gave this example is that it's the one mentioned in the docs.
The nature of the problem demands shared state, for sure. (Or a "clever" workaround to avoid it, which, well.) But does shared state imply async, either necessarily or even ideally? I don't think so, but I could be wrong.
I wouldn't say shared state implies async. I think the issue was the way I was thinking about somehow getting the external (non-shared) state as a collector of the shared state, which would have then required collect() to cover async collection.
This could probably be implemented the other way around, with the shared state added as a sub-collector to the non-shared one before returning. Which, if collect() is guaranteed to return quickly, could be called from an async context.
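One possible shape of the decoupling discussed here, sketched with std-only primitives (all names and the value `42` are illustrative): a background thread refreshes shared state, and the scrape path only takes a quick lock to read the cached value.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// A background thread refreshes shared state; the scrape path (a
// stand-in for collect()) only takes a quick lock to read it.
fn start_refresher(state: Arc<Mutex<u64>>) {
    thread::spawn(move || loop {
        // Imagine an expensive (or async) retrieval happening here.
        *state.lock().unwrap() = 42;
        thread::sleep(Duration::from_secs(5));
    });
}

// The collect() equivalent: a cheap, non-blocking read of cached state.
fn collect_cached(state: &Arc<Mutex<u64>>) -> u64 {
    *state.lock().unwrap()
}

fn main() {
    let memory_bytes = Arc::new(Mutex::new(0u64));
    start_refresher(Arc::clone(&memory_bytes));
    thread::sleep(Duration::from_millis(50));
    assert_eq!(collect_cached(&memory_bytes), 42);
}
```

Note this is exactly the "refresh on a timer" pattern the Prometheus guidelines discourage below; it is shown only to make the trade-off under discussion concrete.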
I've been looking a bit more over this, and I think that, basically, a Registry is a Collector: it holds a bunch of metrics, right? That's how the Python client is implemented.
I think that if the lib provided an implementation of Collector for Registry, this would solve the majority of use cases.
Am I understanding correctly that the user of the lib is expected to reimplement the `Collector` trait such that calling `collect()` will do whatever is necessary to produce the metrics, possibly querying some external system?
Correct.
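To make that concrete, here is a minimal sketch of the idea. The trait shape below is hypothetical and does not match the crate's actual signatures; it only illustrates "produce fresh values on every scrape":

```rust
// Hypothetical sketch of the Collector idea: a type that produces
// fresh metric values on every scrape. This trait shape is
// illustrative only, not the crate's real API.
trait Collector {
    // Called on every scrape; returns (name, help, value) triples.
    fn collect(&self) -> Vec<(String, String, f64)>;
}

// Example: expose process uptime, computed at scrape time.
struct UptimeCollector {
    start: std::time::Instant,
}

impl Collector for UptimeCollector {
    fn collect(&self) -> Vec<(String, String, f64)> {
        vec![(
            "process_uptime_seconds".into(),
            "Seconds since the process started.".into(),
            self.start.elapsed().as_secs_f64(),
        )]
    }
}

fn main() {
    let c = UptimeCollector { start: std::time::Instant::now() };
    let metrics = c.collect();
    assert_eq!(metrics[0].0, "process_uptime_seconds");
}
```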
If so, I think there should be a way to provide an async collector; this will probably require implementing some alternative AsyncRegistry with matching AsyncCollector trait.
Can you be more concrete? Can you describe a use-case where one would need an async collector?
That call is subject to a few layers of (typically sub-second) timeouts and failsafes, and is expected to return more or less immediately.
Prometheus does not expect the scrape call to return immediately. E.g. from the Prometheus scheduling section:
The default scrape timeout for Prometheus is 10 seconds.
If the information returned by `collect()` needed to come from an external source, then it's I guess a de facto requirement that the retrieval is decoupled from the collecting.
By default the retrieval should not be decoupled from the collection. Citing the Prometheus scheduling section:
Metrics should only be pulled from the application when Prometheus scrapes them, exporters should not perform scrapes based on their own timers. That is, all scrapes should be synchronous.
I think that if the lib provided an implementation of `Collector` for `Registry`, this would solve the majority of use cases.
`Registry` could implement `Collector`, though I have refrained from that thus far to reduce complexity, i.e. not to mix concepts. I expect a user to have a single `Registry` but many `Collector`s.
I don't follow how Registry implementing Collector would solve an issue here. Would you mind expanding on that?
Also note that the encode function takes an immutable reference to a Registry, thus two Prometheus servers (e.g. for fault tolerance) may scrape the same endpoint concurrently.
Can you be more concrete? Can you describe a use-case where one would need an async collector?
My use case is exporting metrics that are gathered via multiple HTTP calls. If my HTTP client is async, collect() should be able to handle that.
Prometheus does not expect the scrape call to return immediately. E.g. from the Prometheus scheduling section:
Well, the question is whether collect() itself should return immediately. The other operations could be done outside it, while staying inside the global scrape.
I don't follow how Registry implementing Collector would solve an issue here. Would you mind expanding on that?
See the other point: this is a confusion about whether collect() should be guaranteed to be quick or not and whether a collector should be different from a registry.
If it's not, and random long operations can be done inside collect(), then I can see how a Collector can be different from a Registry, and thus providing a default implementation for Registry doesn't help, since every collector is expected to be different and do its own thing.
However, in the second situation, when the actual gathering of the metrics should be done outside of the collect() method, then basically this becomes generating two separate registries, and combining them before calling collect(). The idea wouldn't be to do the scraping outside of the call from Prometheus, but only outside the call to collect().
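The "two registries combined before collect()" idea could be sketched roughly like this; the data shapes, names, and values are purely illustrative:

```rust
// Sketch: static metrics live in one collection, freshly gathered
// (possibly async-retrieved) values in another, merged just before
// encoding. Names and values are illustrative.
fn main() {
    let static_metrics = vec![("requests_total".to_string(), 10.0)];
    // Gathered just before encoding, outside the collect() call itself.
    let fresh_metrics = vec![("memory_bytes".to_string(), 123.0)];

    let combined: Vec<(String, f64)> =
        static_metrics.into_iter().chain(fresh_metrics).collect();
    assert_eq!(combined.len(), 2);
}
```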
Looking at the Go implementation, specifically the example collector, it's clear that the collect() method is not expected to return "immediately".
This means that there's no particular reason to provide an implementation of Collector for Registry, but it can be useful to provide an alternative, async Collector, in case the actual job of collection is done asynchronously.
That example describes how a collector might be "bolted on" to a legacy system. It's not indicative of best practice. But the point is largely moot, I think.
My current examples why a collector would need to support async are:
- collecting metrics from devices via e.g. Bluetooth
- collecting container metrics comparable to cadvisor where the list of metrics is basically generated from a current state that is retrieved from the filesystem, containerd APIs etc.
While both approaches could also refresh the data in the background, having the refresh directly tied to the scrape, so that fresh data is fetched on each scrape, provides more reliable resolution than serving previously gathered data.
Now, the data refresh could actually be handled in the http handler, but still it seems like an unnecessary detour.
As a first iteration, I suggest we support synchronous Collector::collect only.
- For those that need to call `async` methods in their `Collector::collect` call, I suggest you spawn the `encode` call in a new OS thread (or in a blocking-aware runtime task) and call any `async` methods via `block_on` in your `Collector::collect` implementation.
- For those that need concurrency within your `Collector::collect` implementation (e.g. reaching out to two remote machines concurrently), I suggest you spawn two or more OS threads within your `Collector::collect` implementation and join the threads again.
- For those that need the `Collector::collect` calls of two distinct `Collector` implementations to run concurrently, you would be out of luck for now.
In case folks see a large performance hit, e.g. due to thread spawns (even though that should be < 10 microseconds), or (1) and (2) add too much complexity to your `Collector::collect` implementation, or folks need (3), I suggest we design an async `Collector::collect` as a next step.
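Suggestion (2) above can be sketched with plain OS threads; `fetch_from` is a placeholder for a real network call, and the metric names are made up:

```rust
use std::thread;

// Sketch: perform two retrievals concurrently inside a synchronous
// collect() by spawning OS threads and joining them.
// `fetch_from` stands in for a real (possibly slow) network call.
fn fetch_from(source: &'static str) -> (String, f64) {
    (format!("remote_value{{source=\"{source}\"}}"), 1.0)
}

fn collect() -> Vec<(String, f64)> {
    let a = thread::spawn(|| fetch_from("a"));
    let b = thread::spawn(|| fetch_from("b"));
    // Both retrievals run in parallel; collect() still returns
    // synchronously once the slower of the two finishes.
    vec![a.join().unwrap(), b.join().unwrap()]
}

fn main() {
    let metrics = collect();
    assert_eq!(metrics.len(), 2);
    assert_eq!(metrics[0].0, "remote_value{source=\"a\"}");
}
```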
@kwiesmueller Thanks for sharing. Let me know in case the threading approach is good enough for you for now, i.e. can collect from the many datasources you are targeting within the Prometheus scrape timeout.
If not, as mentioned above, let's design an async Collector::collect as a next step.
I rebased this pull request onto https://github.com/prometheus/client_rust/pull/100. Everything compiles, tests succeed.
To me the above validates that we should move forward with https://github.com/prometheus/client_rust/pull/100. That said, I do think the signatures in this pull request need more work (e.g. the workaround with MaybeOwned).
Very interested in this work. Are you looking for help, or what would help move this forward?
Very interested in this work
Thanks for the interest. Sorry for the silence.
I rebased this pull request on https://github.com/prometheus/client_rust/pull/105. I want to try out the new API with one of my projects (e.g. https://github.com/mxinden/kademlia-exporter/) before merging #105 and this pull request.
Are you looking for help, or what would help move this forward?
I would appreciate feedback on #105 and this pull request. You would be of great help by testing this pull request (it is on top of #105) within your application in need of the Collector abstraction @hexfusion.
I have some cycles today will take a look, thanks!
Tested this patch with rust-libp2p and kademlia-exporter. While the Collector trait is still quite verbose (with Cow and MaybeOwned), it does function as expected and allows ad-hoc cheap metric generation.
https://github.com/mxinden/kademlia-exporter/pull/209