prom-client

enhancement: Cluster worker scrape timeouts/errors should be recorded

Open xginn8 opened this issue 6 years ago • 6 comments

If a worker is heavily loaded or otherwise performing pathologically, the cluster master may time out while scraping that worker (https://github.com/siimon/prom-client/blob/master/lib/cluster.js#L51-L61). The master then returns an incomplete set of data, since some subset of workers has returned data, but not necessarily all of them. I think it makes sense to record an error counter for clustered worker scrapes, so that the failure is exposed to downstream consumers.

If this feature makes sense, I am happy to implement it! (cc @dt-rush)

xginn8 avatar Oct 04 '19 09:10 xginn8

Workaround for now, as @xginn8 devised:

`max(scrape_duration_seconds{job="vpn/cluster"}) < bool 5` will return 1 when a scrape takes less than 5s, and 0 when it takes longer.

Combining that with the query we already have, `sum(vpn_online_devices{instance=~"$instance"}) * (max(scrape_duration_seconds{job="vpn/cluster"}) < bool 5) > 0` will essentially drop any "incomplete" datapoints, which we can both a) visualize and b) ignore in the larger visualization, to avoid confusion.

dt-rush avatar Oct 04 '19 17:10 dt-rush

Dropping the incomplete datapoints query-side is one thing, but dropping them entirely seems worthwhile in my view. That is the only workable "solution" when a cluster worker times out, unless some kind of client-side interpolation logic were included to "guess" the missing cluster worker's value from its previous values, or from its siblings.

dt-rush avatar Oct 04 '19 17:10 dt-rush

If the number of cluster workers is known, and the number of cluster workers timing out is known (presently such metrics are not exported), such an averaged "guess" for the timed-out workers would be possible client-side with:

`metric_A * (promclient_n_cluster_workers{metric="metric_A"} / ((promclient_n_cluster_workers{metric="metric_A"} - promclient_cluster_workers_timed_out{metric="metric_A"}) > 0))`

If, say, 3 out of 5 cluster workers timed out, we'd have 2/5ths of the cluster workers reporting. Scaling that up by a factor of 5/(5 - 3) = 2.5 approximates what we would observe if all workers had reported (assuming each worker is responsible for an approximately equal portion of the aggregated total). That may not be a fair assumption, but when it holds, it allows an approximate picture despite worker timeouts.
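The correction factor described above can be sketched in plain JavaScript (all numbers and the function name are illustrative, not real exported metrics or library code):

```javascript
// Sketch of the proposed client-side correction, assuming workers
// contribute roughly equal shares of the aggregated metric.
function approximateTotal(reportedSum, totalWorkers, timedOutWorkers) {
  const reporting = totalWorkers - timedOutWorkers;
  if (reporting <= 0) return NaN; // mirrors the "> 0" guard in the PromQL above
  return reportedSum * (totalWorkers / reporting);
}

// 2 of 5 workers reported a combined value of 40;
// scaling by 5 / (5 - 3) = 2.5 estimates the full total.
console.log(approximateTotal(40, 5, 3)); // 100
```

The `> 0` guard matters: if every worker timed out, the PromQL division (and this sketch) yields no value rather than a divide-by-zero.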

Additionally, graphs could be annotated at the points where these "approximations" occur by checking when `promclient_cluster_workers_timed_out > 0`, or the metric could instead be dropped entirely whenever `promclient_cluster_workers_timed_out > 0`.

dt-rush avatar Oct 04 '19 17:10 dt-rush

I'm not super keen on interpolation (rather have no data than fake data), but agree this could be handled better. I don't know what Prometheus does during a scrape if we were to switch to all-or-none agglomeration -- would that work for you guys? How long of a timeout do you need to avoid missed metrics entirely? (I've only encountered them a handful of times personally.)

zbjornson avatar Oct 04 '19 18:10 zbjornson

For the record, this also occurs when a worker has died, not just in the case of timeouts (this is what @xginn8 meant by "timeouts / errors").

I see two problems here:

  • the timeout is hard-coded
  • a portion of the aggregate metrics can be dropped without any way to detect it
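On the first point, a minimal sketch of how the hard-coded timeout could become a caller-supplied option (the option name `requestTimeoutMs` is an invented example, not an agreed API):

```javascript
// Hypothetical: let callers override the currently hard-coded
// worker-response timeout, falling back to today's default.
const DEFAULT_REQUEST_TIMEOUT_MS = 5000; // illustrative default

function resolveTimeout(options = {}) {
  const t = options.requestTimeoutMs;
  // Reject non-numeric or non-positive values rather than silently hanging.
  return Number.isFinite(t) && t > 0 ? t : DEFAULT_REQUEST_TIMEOUT_MS;
}
```

Validating the override keeps a misconfigured value from disabling the timeout altogether.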

dt-rush avatar Oct 07 '19 16:10 dt-rush

@zbjornson I think discarding an entire scrape due to failed worker(s) is the wrong approach, as we'd be throwing away useful data that may not be aggregated at all (from the master process). My preference would simply be to increment an error counter with some information about the missing scrape, such that it's clear to any consumers that the data is potentially incomplete. In this manner, those consumers can act on that as they wish, but the library makes no assumptions about the utility of that "partial" scrape.
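A self-contained sketch of that behavior (no prom-client dependency; all names are illustrative): aggregate whatever the workers return before the deadline, but count the workers that failed so the partial result is labeled rather than silently incomplete.

```javascript
// Race each worker's response against a deadline. Resolving instead of
// rejecting on timeout lets Promise.all keep the partial results.
function withTimeout(p, timeoutMs) {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve({ ok: false }), timeoutMs);
    p.then((value) => {
      clearTimeout(timer);
      resolve({ ok: true, value });
    });
  });
}

// Return the partial metrics alongside a count of failed workers,
// which could back an error counter exposed to scrapers.
async function aggregate(workerPromises, timeoutMs) {
  const settled = await Promise.all(
    workerPromises.map((p) => withTimeout(p, timeoutMs))
  );
  const metrics = settled.filter((s) => s.ok).map((s) => s.value);
  return { metrics, timedOut: settled.length - metrics.length };
}
```

Consumers then see both the data that did arrive and an explicit signal that some of it is missing, leaving any interpretation of the "partial" scrape to them.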

In the case of a worker dying, no timeout will be sufficient since the PID will never come back.

xginn8 avatar Oct 07 '19 19:10 xginn8