[knot-resolver 6] Prometheus histogram broken

Open · Jean-Daniel opened this issue 1 year ago · 1 comment

Prometheus requires histogram buckets to be cumulative: each bucket should report the count of all observations in that bucket plus the counts of all smaller buckets.

Instead of reporting:

resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.001"} 4.542029e+06
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.01"} 889205.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.05"} 127169.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.1"} 8488.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.25"} 12968.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.5"} 7277.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="1.0"} 4574.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="1.5"} 768.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="+Inf"} 953.0
resolver_response_latency_count{instance_id="kresd:kresd1"} 953.0

the reported metrics should be:

resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.001"} 4.542029e+06
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.01"} 5431234.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.05"} 5558403.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.1"} 5566891.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.25"} 5579859.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="0.5"} 5587136.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="1.0"} 5591710.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="1.5"} 5592478.0
resolver_response_latency_bucket{instance_id="kresd:kresd1",le="+Inf"} 5593431.0
resolver_response_latency_count{instance_id="kresd:kresd1"} 5593431.0
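For illustration, here is a minimal sketch of how the cumulative values can be derived from the per-bucket counts when rendering the exposition format (the names and the render_histogram helper are hypothetical, not knot-resolver's actual code); the _count series then simply equals the +Inf bucket:

# Minimal sketch (hypothetical names, not knot-resolver's code): convert
# per-bucket observation counts into the cumulative counts Prometheus expects.
from typing import Mapping

def render_histogram(name: str, labels: str, per_bucket: Mapping[str, int]) -> str:
    # per_bucket maps an upper bound ("0.001", ..., "+Inf") to the number of
    # observations that fell into that bucket alone (non-cumulative),
    # listed in ascending order of the bound.
    lines = []
    running = 0
    for le, count in per_bucket.items():
        running += count  # cumulative: this bucket plus all smaller ones
        lines.append(f'{name}_bucket{{{labels},le="{le}"}} {running}')
    lines.append(f'{name}_count{{{labels}}} {running}')  # _count equals the +Inf bucket
    return "\n".join(lines)

per_bucket = {
    "0.001": 4542029, "0.01": 889205, "0.05": 127169, "0.1": 8488,
    "0.25": 12968, "0.5": 7277, "1.0": 4574, "1.5": 768, "+Inf": 953,
}
print(render_histogram("resolver_response_latency", 'instance_id="kresd:kresd1"', per_bucket))

Run on the numbers above, this reproduces the expected output, with both the +Inf bucket and _count at 5593431.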

Jean-Daniel · Sep 24 '24 19:09

I hit the same issue. It looks like the values in some buckets are not cumulative.

This causes an error in our Prometheus exporter (OTel Collector):

2025-06-17T22:07:32.118Z	error	internal/queue_sender.go:57	Exporting failed. Dropping data.	{"otelcol.component.id": "googlecloud", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: timeSeries[22] (metric.type=\"workload.googleapis.com/resolver_response_latency\", metric.labels={\"instance_id\": \"kresd:kresd2\", \"cloud_platform\": \"aws_ec2\", \"cloud_region\": \"global\", \"host_name\": \"ip-172-31-79-91.us-west-2.compute.internal\", \"cloud_account_id\": \"186520995770\", \"cloud_availability_zone\": \"global\", \"host_image_id\": \"ami-0f72881cd8392994c\", \"service_instance_id\": \"localhost:8054\", \"host_id\": \"i-06c0e39715ac6d675\", \"host_type\": \"r6i.xlarge\", \"service_name\": \"knot\", \"cloud_provider\": \"aws\"}): Field points[0].distributionValue had an invalid value: Distribution bucket_counts(1) has a negative count.; timeSeries[20] (metric.type=\"workload.googleapis.com/resolver_response_latency\", metric.labels={\"cloud_platform\": \"aws_ec2\", \"cloud_account_id\": \"186520995770\", \"host_name\": \"ip-172-31-79-91.us-west-2.compute.internal\", \"cloud_provider\": \"aws\", \"instance_id\": \"kresd:kresd0\", \"cloud_availability_zone\": \"global\", \"host_id\": \"i-06c0e39715ac6d675\", \"service_instance_id\": \"localhost:8054\", \"service_name\": \"knot\", \"host_image_id\": \"ami-0f72881cd8392994c\", \"cloud_region\": \"global\", \"host_type\": \"r6i.xlarge\"}): Field points[0].distributionValue had an invalid value: Distribution bucket_counts(1) has a negative count.; timeSeries[21] 
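
The negative count presumably comes from the collector's Prometheus-to-Google-Cloud conversion, which derives per-bucket counts by subtracting adjacent buckets on the assumption that they are cumulative. With the numbers from the original report, bucket_counts(1) would be computed as the le="0.01" value minus the le="0.001" value, i.e. 889205 - 4542029, which is negative.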

tozh · Jun 17 '25 22:06

https://gitlab.nic.cz/knot/knot-resolver/-/merge_requests/1731

vcunat · Aug 19 '25 15:08