zio
Support DataDog Distribution metrics
My use-case
As a DataDog user, I want to measure my app's response times across all hosts and see response time percentiles grouped by endpoint. In DataDog, the metric type capable of doing that is called Distribution: "it is used to represent a global statistical distribution of a set of values calculated across a distributed infrastructure." It uses a special internal representation called DDSketch. Users don't have to specify buckets or percentiles manually when reporting the metric; they just send raw data to the DataDog server, most commonly through the DataDog agent. As seen from the screenshot, I have to explicitly enable percentile queries (I assume because it increases indexing costs), and I can then select arbitrary percentiles for that metric.


DataDog agents running on host machines support an extension of the StatsD protocol called DogStatsD. The protocol supports sending values for the distribution metric type. The reporting application does not need to do any aggregation; that is done by the DataDog agent (and, in the case of the distribution metric, by DataDog itself). The application only sends raw values.
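To make the "only sends raw values" point concrete, here is a minimal sketch of what a distribution sample looks like on the wire. DogStatsD datagrams have the shape `<name>:<value>|<type>|#<tag>,<tag>`, with `d` as the type code for distributions. The object and method names below are hypothetical (they are not part of zio-metrics-connectors), and the host and port are assumptions: 8125 is the agent's default DogStatsD port.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import java.nio.charset.StandardCharsets

// Hypothetical helper, for illustration only.
object DogStatsDSketch {

  // Renders one datagram: <name>:<value>|d|#<key>:<value>,<key>:<value>
  // Tags are sorted by key so the output is deterministic.
  def distribution(name: String, value: Double, tags: Map[String, String]): String = {
    val base = s"$name:$value|d"
    if (tags.isEmpty) base
    else base + tags.toSeq.sortBy(_._1).map { case (k, v) => s"$k:$v" }.mkString("|#", ",", "")
  }

  // Fire-and-forget UDP send to the local agent.
  // Assumption: the agent listens on 127.0.0.1:8125.
  def send(datagram: String, host: String = "127.0.0.1", port: Int = 8125): Unit = {
    val socket = new DatagramSocket()
    try {
      val bytes = datagram.getBytes(StandardCharsets.UTF_8)
      socket.send(new DatagramPacket(bytes, bytes.length, InetAddress.getByName(host), port))
    } finally socket.close()
  }
}
```

Each observed response time becomes one such datagram, for example `DogStatsDSketch.send(DogStatsDSketch.distribution("response_time", 42.5, Map("endpoint" -> "/users")))`. No buckets or percentiles appear anywhere in the application.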
Is this supported in zio.metrics? No
Two of the zio.metrics metric types are capable of measuring statistical distributions of data: Histogram and Summary, which directly correspond to the Prometheus/OpenMetrics Histogram and Summary types. Of these two, Summary calculates percentiles on the client side, which makes it generally not aggregatable across labels/hosts. Histogram counts observations in buckets, which can be flexibly aggregated on the server side across various dimensions. However, unlike Prometheus, DataDog does not support this representation, even though its internal DDSketch algorithm seems to work in a similar way. As seen from the screenshot, I cannot see percentiles of my data distribution for a metric that is uploaded as a set of gauges with labels specifying bucket counts:
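A small worked example may help show why client-side percentiles do not aggregate while bucket counts do. The data and names below are made up for illustration; the percentile uses the simple nearest-rank definition, which is an assumption and not necessarily what any particular backend implements.

```scala
object PercentileAggregation {

  // Nearest-rank percentile: the smallest value that has at least
  // p * n observations at or below it.
  def percentile(values: Vector[Double], p: Double): Double = {
    val sorted = values.sorted
    val rank   = math.ceil(p * sorted.length).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // Made-up response times: a busy, fast host and a quiet, slow host.
  val hostA: Vector[Double] = Vector.fill(99)(10.0) :+ 500.0
  val hostB: Vector[Double] = Vector.fill(10)(1000.0)

  // Combining per-host medians gives a number unrelated to the real median.
  val averagedP50: Double = (percentile(hostA, 0.5) + percentile(hostB, 0.5)) / 2
  val trueP50: Double     = percentile(hostA ++ hostB, 0.5)

  // Cumulative bucket counts ("how many observations <= boundary"),
  // in the style of a Prometheus histogram.
  val boundaries: Vector[Double] = Vector(100.0, 1000.0)

  def bucketCounts(values: Vector[Double]): Vector[Int] =
    boundaries.map(b => values.count(_ <= b))

  // Summing per-host bucket counts equals bucketing the merged data.
  val mergedCounts: Vector[Int] =
    bucketCounts(hostA).zip(bucketCounts(hostB)).map { case (a, b) => a + b }
}
```

Here the averaged per-host median is 505.0 while the median of the combined data is 10.0, whereas the summed bucket counts are identical to the bucket counts of the combined data. This is the property that makes histograms (and raw values) aggregatable on the server side.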

Proposed solution
The solution I envision would add a new MetricEvent handler to zio-metrics-connectors that uses the DogStatsD protocol, making it able to report distribution metrics. The changes would be the following:
1. Notify metric listener every time a metric is updated
DogStatsD relies on the application to send raw values for each metric type. Instead of collecting them in a MetricState and periodically taking a snapshot of the metric states and processing them (as is currently done), we should extend MetricsRegistry to be able to notify listeners as soon as a metric is updated. I also see this mentioned in the ScalaDoc comments of zio.metrics.MetricClient, but the implementation seems to be missing.
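As a rough sketch of the shape such a notification hook could take: the trait, class, and method names below are hypothetical and are not the zio.metrics API. The sketch uses plain Scala rather than ZIO effects to stay self-contained, and it only covers histogram-style observations.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical listener interface: called once per observed value,
// not once per periodic snapshot.
trait UpdateListener {
  def onHistogramUpdate(name: String, tags: Map[String, String], value: Double): Unit
}

// Hypothetical registry that forwards every update to its listeners.
final class NotifyingRegistry {
  private val listeners = ListBuffer.empty[UpdateListener]

  def addListener(listener: UpdateListener): Unit = synchronized {
    listeners += listener
    ()
  }

  // Called on every metric update. A real registry would also update
  // its internal state here; this sketch only notifies.
  def observe(name: String, tags: Map[String, String], value: Double): Unit = {
    val snapshot = synchronized(listeners.toList)
    snapshot.foreach(_.onHistogramUpdate(name, tags, value))
  }
}
```

A DogStatsD-backed listener would turn each `onHistogramUpdate` call into one datagram to the agent, while snapshot-based backends such as Prometheus could keep using the existing periodic collection.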
2. Use zio.metrics.Histogram metric type for datadog distribution
I assume that we don't want any platform-specific metric types in zio.metrics; instead, we should have generic types that can work with any metrics reporting platform. Is this assumption true?
zio.metrics.Gauge and zio.metrics.Rate already have a direct correspondence with DataDog metric types, so I wouldn't change these.
zio.metrics.Summary is actually quite similar to what's called Histogram in DataDog. It is used for calculating a set of percentiles on the client side (calculating them in the app or in the DataDog agent both count as client-side).
Since the gauges produced by zio.metrics.Histogram are currently not interpreted correctly by DataDog, I believe we should use that type for DataDog's distribution metric. Furthermore, the use-case of this metric is the same as DataDog's distribution: it is used to represent a global statistical distribution of a set of values calculated across a distributed infrastructure.
To send Distributions, the dogstatsd client would ignore the details of MetricState.Histogram, as it would only send raw values to the DataDog agent. This means that DataDog users would be able to create a histogram like this and still use it properly:
// We send raw data to the datadog agent so we can ignore internal state
Metric.histogram("response_time", Boundaries(Chunk.empty))
Note that this can cause compatibility issues if a user tries to switch metric platforms without changing their code, for example when migrating from DataDog to Prometheus. However, as we have seen, there is no one-to-one mapping between Prometheus and DataDog metrics, so I can't think of a scenario where a user wouldn't have to change at least small parts of their code when migrating between metric reporting platforms.
I am happy to create PRs for this issue if you agree with the direction.
Yes, we need to bring back the MetricListener interface to support this.