prom-client
enhancement: Cluster worker scrape timeouts/errors should be recorded
If a worker is heavily loaded or otherwise behaving pathologically, the cluster master may time out when scraping that worker (https://github.com/siimon/prom-client/blob/master/lib/cluster.js#L51-L61). The cluster master then returns an incomplete set of data, since some subset of workers has returned data, but not necessarily all of them. I think it makes sense to record an error counter for clustered worker scrapes, so that the failure is exposed to downstream consumers.
If this feature makes sense, I am happy to implement it! (cc @dt-rush)
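For illustration, here is a minimal sketch of what recording such a counter in the master could look like. The counter name, label, and the `requestWorkerMetrics` helper are hypothetical stand-ins, not the actual cluster.js implementation:

```js
const client = require('prom-client');

// Hypothetical counter; the name and label are illustrative only.
const workerScrapeErrors = new client.Counter({
  name: 'cluster_worker_scrape_errors_total',
  help: 'Worker metric scrapes that timed out or errored in the cluster master',
  labelNames: ['reason'],
});

// Stand-in for however the master requests metrics from a worker over IPC.
function requestWorkerMetrics(worker) {
  return new Promise((resolve, reject) => {
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.send({ type: 'get-metrics' });
  });
}

// Sketch of the master-side aggregation step: failed or timed-out workers are
// dropped from the result, but the failure is recorded instead of being silent.
async function collectWorkerMetrics(workers, timeoutMs) {
  const results = await Promise.all(
    workers.map((worker) =>
      Promise.race([
        requestWorkerMetrics(worker),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('timeout')), timeoutMs)
        ),
      ]).catch((err) => {
        workerScrapeErrors.inc({ reason: err.message === 'timeout' ? 'timeout' : 'error' });
        return null;
      })
    )
  );
  return results.filter((r) => r !== null);
}
```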
Workaround for now, as @xginn8 devised:
`max(scrape_duration_seconds{job="vpn/cluster"}) < bool 5` will return `1` when a scrape takes less than 5s, and `0` when it takes longer. Combining that with the query we already have,
`sum(vpn_online_devices{instance=~"$instance"}) * (max(scrape_duration_seconds{job="vpn/cluster"}) < bool 5) > 0` will essentially drop any "incomplete" datapoints, which we can both a) visualize and b) ignore in the larger visualization, to avoid confusion.
Dropping the incomplete datapoints query-side is one thing, but dropping them entirely seems worth something in my view. The only other workable "solution" when a cluster worker times out would be to include some kind of interpolation logic client-side, to "guess" the missing cluster worker's value from its previous values or from its siblings.
If the number of cluster workers is known, and the number of cluster workers timing out is known (presently such metrics are not exported), such an averaged "guess" for the timed-out workers would be possible client-side with:
`metric_A * (promclient_n_cluster_workers{metric="metric_A"} / ((promclient_n_cluster_workers{metric="metric_A"} - promclient_cluster_workers_timed_out{metric="metric_A"}) > 0))`
If, say, 3 out of 5 cluster workers timed out, we'd have 2/5ths of the cluster workers reporting. Scaling that up by a factor of 5/(5 - 3) approximates what we might observe if all workers reported, under the assumption that workers are responsible for approximately equal portions of the aggregated total. That may not be a fair assumption, but if it holds, it allows an approximate picture despite worker timeouts when they occur.
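A quick worked example of that scaling, with made-up numbers:

```js
// Made-up numbers: 5 workers total, 3 timed out, and the 2 workers that did
// report returned a combined 80 online devices.
const totalWorkers = 5;
const timedOutWorkers = 3;
const reportedSum = 80;

// Scale the partial sum by total / reporting, assuming each worker contributes
// a roughly equal share of the aggregated total.
const estimate = reportedSum * (totalWorkers / (totalWorkers - timedOutWorkers));
console.log(estimate); // 80 * (5 / 2) = 200
```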
Additionally, graphs could be annotated at the points where these "approximations" occur by checking for `promclient_cluster_workers_timed_out > 0`, or the metric could simply be dropped whenever `promclient_cluster_workers_timed_out > 0`.
I'm not super keen on interpolation (rather have no data than fake data), but agree this could be handled better. I don't know how Prometheus handles the scrape if we were to switch to all-or-none aggregation -- would that work for you guys? How long of a timeout do you need to avoid missed metrics entirely? (I've only encountered them a handful of times personally.)
For the record, this also occurs when a worker has died, not just in the case of timeouts (this is what @xginn8 meant by "timeouts / errors")
I see two problems here:
- hard-coded timeout
- portion of aggregate metrics can be dropped without any way to detect it
@zbjornson I think discarding an entire scrape due to failed worker(s) is the wrong approach, as we'd be throwing away useful data that may not be aggregated at all (from the master process). My preference would simply be to increment an error counter with some information about the missing scrape, such that it's clear to any consumers that the data is potentially incomplete. In this manner, those consumers can act on that as they wish, but the library makes no assumptions about the utility of that "partial" scrape.
In the case of a worker dying, no timeout will be sufficient since the PID will never come back.
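As a rough sketch (reusing the hypothetical counter from the earlier example), the dead-worker case could be surfaced from Node's cluster `exit` event rather than from a timeout:

```js
const cluster = require('cluster');
const client = require('prom-client');

// Same hypothetical counter as in the earlier sketch.
const workerScrapeErrors = new client.Counter({
  name: 'cluster_worker_scrape_errors_total',
  help: 'Worker metric scrapes that timed out or errored in the cluster master',
  labelNames: ['reason'],
});

if (cluster.isMaster) {
  // A worker that exits will never answer a metrics request again, so record it
  // immediately instead of relying on scrape timeouts to surface the gap.
  cluster.on('exit', (worker, code, signal) => {
    workerScrapeErrors.inc({ reason: 'worker_died' });
  });
}
```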