prom-client
Aggregation of t-digest types broken by default
In practice, summary data in Prometheus can be hard to aggregate, because the precomputed quantiles lose information about the underlying observations. I've been using a custom IPC setup to push summary observations generated on workers into a master's summary metric, to avoid having to aggregate at all. However, this is cumbersome and I'd like to use the default metrics aggregator provided by prom-client. Can you speak to how this works in practice? It's a bit difficult for me to figure out.
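For reference, my current custom setup looks roughly like this (a simplified sketch; the 'summary-observation' message shape, the metric name and the observeLatency helper are all just my own conventions, not anything from prom-client):

const cluster = require('cluster')
const Prometheus = require('prom-client')

if (cluster.isMaster) {
  // The master owns the "real" summary and folds in observations
  // forwarded by the workers over IPC.
  const latency = new Prometheus.Summary({
    name: 'request_duration_seconds',
    help: 'request duration observed on workers'
  })

  for (let i = 0; i < 2; i++) {
    const worker = cluster.fork()
    worker.on('message', msg => {
      if (msg && msg.type === 'summary-observation') {
        latency.observe(msg.value)
      }
    })
  }

  // /metrics is then served from the master's default registry only.
} else {
  // Workers never hold a Summary of their own; they just forward
  // raw observations to the master.
  function observeLatency(value) {
    process.send({ type: 'summary-observation', value })
  }

  observeLatency(0.25)
}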
I'm not sure what, concretely, you're asking. Are you asking how the aggregation works in this library before the scrape by Prometheus?
I'm also not sure what "worker" and "master" are here. A child process? Cluster? The new Worker in Node 10?
Yes, sorry for not writing that as well as I could have. I am asking about Node's cluster module, and specifically about how the summary instrument is aggregated in response to a Prometheus scrape. It seems like the cluster master sends an IPC request to each of the workers, but I'm not sure how the math works out. Other instrument types have straightforward ways to be aggregated, so I'm just curious whether anything special needs to be done for summaries.
I was motivated because, after I switched to using the default aggregator registry for my summary metric, I noticed my metrics looked very different (several times higher in most quantiles).
@zbjornson ^
I think histograms and summaries behave the same as counters (which sum by default) within each individual sum/bucket. Tomorrow I can test to be sure.
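Roughly, what I mean by that (a sketch of the idea only, not the actual aggregator code; the series shape here is simplified):

// Sum-style aggregation across workers: every series with the same
// name and label set gets its values added together.
function aggregateBySum(perWorkerSeries) {
  // perWorkerSeries: one array of { name, labels, value } entries per worker.
  const totals = new Map()
  for (const series of perWorkerSeries) {
    for (const { name, labels, value } of series) {
      const key = name + JSON.stringify(labels)
      const entry = totals.get(key) || { name, labels, value: 0 }
      entry.value += value
      totals.set(key, entry)
    }
  }
  return [...totals.values()]
}

// Counter totals, histogram buckets and summary quantile series would all
// be treated the same way by this.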
Ah, I think you are correct. Here's a test script:
const cluster = require('cluster')
const Prometheus = require('prom-client')

if (cluster.isMaster) {
  for (let i = 0; i < 3; i++) {
    cluster.fork()
  }

  const metricsServer = require('express')()
  const aggregatorRegistry = new Prometheus.AggregatorRegistry()

  metricsServer.get('/metrics', function(req, res) {
    aggregatorRegistry.clusterMetrics((err, metrics) => {
      res.set('Content-Type', aggregatorRegistry.contentType)
      res.end(metrics)
    })
  })

  metricsServer.listen(8000)
} else {
  const summary = new Prometheus.Summary({
    name: 'test',
    help: 'test'
  })

  summary.observe(1)
}
And the output:
$ curl http://localhost:8000/metrics
# HELP test test
# TYPE test summary
test{quantile="0.01"} 3
test{quantile="0.05"} 3
test{quantile="0.5"} 3
test{quantile="0.9"} 3
test{quantile="0.95"} 3
test{quantile="0.99"} 3
test{quantile="0.999"} 3
test_sum 3
test_count 3
I think this is probably not the behavior people would expect with default aggregation: each of the three workers observed a single value of 1, so summing the per-worker quantiles reports 3 where the true quantile is 1. I believe that summaries and histograms use the t-digest primitive, which appears to be built to be combined. If so, it seems like we could send the t-digest data over when summaries need to be aggregated.
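If we went the digest-merging route, I imagine it would look roughly like this (a sketch on top of the tdigest package that the Summary appears to be built on; serializeDigest and mergeDigests are names I made up, not prom-client API):

const { TDigest } = require('tdigest')

// Each worker serializes its digest down to centroids...
function serializeDigest(digest) {
  digest.compress()
  return digest.toArray() // [{ mean, n }, ...]
}

// ...and the master folds every worker's centroids into one digest
// before computing the quantiles for the scrape response.
function mergeDigests(centroidLists) {
  const merged = new TDigest()
  for (const centroids of centroidLists) {
    for (const { mean, n } of centroids) {
      merged.push(mean, n)
    }
  }
  merged.compress()
  return merged
}

// Three workers that each observed 1 then report a merged median of 1,
// rather than the summed 3 shown above.
const workers = [new TDigest(), new TDigest(), new TDigest()]
workers.forEach(d => d.push(1))
console.log(mergeDigests(workers.map(serializeDigest)).percentile(0.5)) // 1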
Alternatively, might it be better for the library to do something under the covers, where the master process holds the "real" in-memory instrument and the workers, when observing measurements, merely pass those measurements over IPC to the master?
Unfortunately I know very little about how histogram and summary metrics work. Both of the options you suggested in your last two paragraphs sound reasonable, though. @brian-brazil, @SimenB or @siimon might know what the best approach is. Definitely sounds like something to fix.
Those quantiles aren't right, they should all be 1 given that input.
We're wondering what the best way to correct that is: sending the t-digest data for combining on the master, or sending the individual, pre-sum data for summing/combining on the master?
I think summing is the correct thing to do for histograms (since it's reporting absolute counts in each bin), so just summaries need to be corrected.
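As a quick sanity check with made-up numbers: if worker A's histogram has 5 observations in the le="1" bucket out of 7 total, and worker B has 3 out of 4, the summed series (8 in le="1", count 11) is exactly what a single process observing all 11 requests would have produced. Summing two workers' medians, by contrast, produces a number that isn't the median of anything.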
I think one problem is that the t-digest may be fairly large, so serializing/deserializing it on every Prometheus request is a bit rough. I think having prom-client perform under-the-covers IPC for observations for all metrics is probably better.
Any consensus on approach?
There is also a leaking behavior in the t-digest-based solution, like I described in #142.
I renamed the title to be a bit more on the mark and am bumping the thread! Any news?