smokescreen icon indicating copy to clipboard operation
smokescreen copied to clipboard

Prometheus stats naming and histogram cleanup

Open alexmv opened this issue 1 year ago • 0 comments

The default Prometheus metrics contain histograms for in and out byte counts. However, they use the default histogram buckets, which means that they measure up to 10 bytes, and down to 0.005 bytes:

# HELP cn_bytes_in
# TYPE cn_bytes_in histogram
cn_bytes_in_bucket{role="",le="0.005"} 0
cn_bytes_in_bucket{role="",le="0.01"} 0
cn_bytes_in_bucket{role="",le="0.025"} 0
cn_bytes_in_bucket{role="",le="0.05"} 0
cn_bytes_in_bucket{role="",le="0.1"} 0
cn_bytes_in_bucket{role="",le="0.25"} 0
cn_bytes_in_bucket{role="",le="0.5"} 0
cn_bytes_in_bucket{role="",le="1"} 0
cn_bytes_in_bucket{role="",le="2.5"} 0
cn_bytes_in_bucket{role="",le="5"} 0
cn_bytes_in_bucket{role="",le="10"} 0
cn_bytes_in_bucket{role="",le="+Inf"} 1
cn_bytes_in_sum{role=""} 22925
cn_bytes_in_count{role=""} 1
# HELP cn_bytes_out
# TYPE cn_bytes_out histogram
cn_bytes_out_bucket{role="",le="0.005"} 0
cn_bytes_out_bucket{role="",le="0.01"} 0
cn_bytes_out_bucket{role="",le="0.025"} 0
cn_bytes_out_bucket{role="",le="0.05"} 0
cn_bytes_out_bucket{role="",le="0.1"} 0
cn_bytes_out_bucket{role="",le="0.25"} 0
cn_bytes_out_bucket{role="",le="0.5"} 0
cn_bytes_out_bucket{role="",le="1"} 0
cn_bytes_out_bucket{role="",le="2.5"} 0
cn_bytes_out_bucket{role="",le="5"} 0
cn_bytes_out_bucket{role="",le="10"} 0
cn_bytes_out_bucket{role="",le="+Inf"} 1
cn_bytes_out_sum{role=""} 516
cn_bytes_out_count{role=""} 1

Per the prometheus best practices, bytes is the canonical unit to use here, which means that we should be adjusting the bucketcount. The same applies to a lesser degree with proxy_duration_ms_timer_bucket -- but in general Prometheus timing metrics use seconds as their unit, rather than shifting the buckets upwards. cn_duration_bucket, by contrast, is in seconds (and doesn't use TimingWithTags, possibly for that reason?)

Based on that same link, the names for the Prometheus metrics are also pretty non-standard -- cn_atpt_total would more canonically be smokescreen_connection_attempt_count, for instance. We could hardcode a mapping of statsd name to prometheus name in sanitisePrometheusMetricName, but I don't love that -- nor polluting the interface with a function per metric name.

alexmv avatar Jul 13 '23 17:07 alexmv