
Backend metrics snapshotting is not compatible with semantics expected by Prometheus

Open vekterli opened this issue 6 years ago • 5 comments

Our current backend metric aggregation implementations are built around explicit snapshotting every N minutes, where each such snapshot effectively resets the tracked metric value internally. In other words, a counter, when observed externally, is not monotonically increasing over time. It will only be monotonically increasing within a particular snapshot period.

Although this simplifies tracking of minimum and maximum values within a snapshot period, it does not match the semantics of Prometheus metrics (aside from Gauge-style metrics, which are not expected to be monotonic anyway).

Ideally, we should introduce a new metric implementation in our backends that has support for the following:

  • Counters (monotonic)
  • Gauges
  • Histograms (monotonic per bucket). Possibly also Summaries for certain latency metrics; could use HdrHistogram implementation.
  • Dimensions ("labels" in the Prometheus data model). We already support this in our current implementations.
  • Prometheus exposition, at least in text format

To support legacy metric aggregators that expect pre-derived values, we should also support some form of snapshotting behind the scenes. Note that snapshotting of monotonically increasing values should be vastly simpler than what is currently done in the backend.
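To make the distinction concrete, here is a minimal sketch (class and method names are hypothetical, not the actual Vespa metrics API) of a labelled monotonic counter where Prometheus exposition reads the cumulative value directly, and the legacy snapshot is derived as a delta without resetting anything internally:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a monotonic counter keyed by rendered label sets.
// Exposition always reports the cumulative value (Prometheus semantics),
// while legacy aggregators get per-period deltas computed at snapshot time.
public final class MonotonicCounter {
    private final String name;
    private final Map<String, AtomicLong> cumulativeByLabels = new ConcurrentHashMap<>();
    private final Map<String, Long> lastSnapshotByLabels = new ConcurrentHashMap<>();

    public MonotonicCounter(String name) {
        this.name = name;
    }

    // labels in rendered form, e.g. "{documenttype=\"music\"}"
    public void add(String labels, long n) {
        cumulativeByLabels.computeIfAbsent(labels, k -> new AtomicLong()).addAndGet(n);
    }

    // Prometheus text exposition: always the cumulative, never-reset value.
    public String renderTextFormat() {
        StringBuilder sb = new StringBuilder("# TYPE " + name + " counter\n");
        cumulativeByLabels.forEach((labels, value) ->
                sb.append(name).append(labels).append(' ').append(value.get()).append('\n'));
        return sb.toString();
    }

    // Legacy snapshot: the per-period delta is just cumulative minus the previous
    // cumulative, so no internal reset of the tracked value is needed.
    public Map<String, Long> snapshotDeltas() {
        Map<String, Long> deltas = new ConcurrentHashMap<>();
        cumulativeByLabels.forEach((labels, value) -> {
            long current = value.get();
            long previous = lastSnapshotByLabels.getOrDefault(labels, 0L);
            lastSnapshotByLabels.put(labels, current);
            deltas.put(labels, current - previous);
        });
        return deltas;
    }
}
```

With this layout the legacy snapshot becomes a read-side derivation of the cumulative value, which is why it should be vastly simpler than the current reset-based approach.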

vekterli avatar Nov 15 '17 15:11 vekterli

This is work in progress.

bratseth avatar Nov 16 '17 14:11 bratseth

I am trying to put all your current metrics into Prometheus Gauges, just to create some simple graphs in Grafana.

I have seen that the metric names are a mix of _-separated, .-separated and camel-cased names. Some also contain '[' and ']'.

For Prometheus I have to convert all of these into pure lowercase, _-separated strings.
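Roughly the kind of conversion needed is shown below (a sketch; the example metric names are illustrative, not an exhaustive list of Vespa names):

```java
import java.util.regex.Pattern;

// Rough sketch of converting mixed-style metric names (dot-separated, camelCase,
// names containing '[' and ']') into Prometheus-compatible lowercase snake_case.
public final class MetricNameSanitizer {
    private static final Pattern CAMEL_BOUNDARY = Pattern.compile("([a-z0-9])([A-Z])");
    private static final Pattern INVALID_CHARS  = Pattern.compile("[^a-zA-Z0-9_]+");

    public static String sanitize(String name) {
        String s = CAMEL_BOUNDARY.matcher(name).replaceAll("$1_$2"); // queryLatency -> query_Latency
        s = INVALID_CHARS.matcher(s).replaceAll("_");                // dots, brackets etc. -> '_'
        s = s.replaceAll("_+", "_").replaceAll("^_|_$", "");         // collapse and trim underscores
        return s.toLowerCase();
    }

    public static void main(String[] args) {
        // Example inputs; actual Vespa metric names may differ.
        System.out.println(sanitize("content.proton.documentdb.documents.active")); // content_proton_documentdb_documents_active
        System.out.println(sanitize("queryLatency[95percentile]"));                 // query_latency_95percentile
    }
}
```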

zoyvind avatar Jan 15 '18 11:01 zoyvind

I also notice that some metrics disappear when the cluster is idle, e.g. the 95percentile of query_latency in the container metrics. This makes it harder to feed metrics into an external system.

zoyvind avatar Jan 15 '18 11:01 zoyvind

@Oracien do you know if this is still a problem, after the latest changes to vespa prometheus integration?

kkraune avatar Aug 10 '20 07:08 kkraune

@kkraune most of these issues have been resolved as far as I know. Counters do seem to be monotonically increasing over time, and the metric format does appear to be metric_name{service="something-this", ...}. However, the last issue is still present: if you are not running queries, relevant query metrics such as rate do not appear. This should probably be fixed; I would guess it is something as simple as a filter (element != 0).
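Purely to illustrate that guess (this is not actual Vespa code; the types and the filter are hypothetical), the difference would be something like:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: if the exporter drops zero-valued samples, an idle
// cluster loses metrics like query rate entirely instead of reporting 0.
record Sample(String name, double value) {}

class SampleFilter {
    // Guessed current behaviour: zero values are filtered out, so idle
    // periods produce no samples at all for those metrics.
    static List<Sample> dropZeroes(List<Sample> samples) {
        return samples.stream().filter(s -> s.value() != 0).collect(Collectors.toList());
    }

    // Suggested behaviour: keep zero-valued samples so the time series
    // stays continuous even when nothing is happening.
    static List<Sample> keepAll(List<Sample> samples) {
        return samples;
    }
}
```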

Oracien avatar Aug 10 '20 07:08 Oracien