unbound_exporter

unbound_response_time_seconds missing cached responses

Open codl opened this issue 2 years ago • 2 comments

The help text for the unbound_response_time_seconds histogram says: "Query response time in seconds"

I took this to mean that it measures the time unbound takes to respond to every client query. However, it does not seem to include queries served from cache.

The munin plugin plots total cache hits along with the histogram, putting them under the lowest histogram bucket.

[image: munin chart]

I'm not sure it's possible in Prometheus to run a histogram quantile calculation over a histogram plus another stray series interpreted as an extra bucket. Perhaps unbound_response_time_seconds should include cache hits in the lowest bucket? At the very least, this should be documented.
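To make the "cache hits as an extra lowest bucket" idea concrete, here is a minimal Python sketch. It assumes cumulative buckets (as Prometheus stores them) and a linear interpolation that only roughly mirrors what histogram_quantile does. The function names and the sample counts are hypothetical, not taken from the exporter or from any real measurement.

```python
import math

def fold_cache_hits(buckets, cache_hits):
    """Add cache hits to the lowest bucket.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with +Inf. Because the counts are cumulative,
    adding hits to the lowest bucket raises every bucket by the same amount.
    """
    return [(le, count + cache_hits) for le, count in buckets]

def histogram_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside a bucket."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # Quantile falls in +Inf or in an empty bucket: report the
            # lower bound instead of interpolating.
            if math.isinf(le) or count == prev_count:
                return prev_le
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return prev_le

# Hypothetical counts: 20 recursive answers (cache misses), 80 cache hits.
misses_only = [(0.001, 0), (0.01, 10), (0.1, 20), (math.inf, 20)]
with_hits = fold_cache_hits(misses_only, 80)

median_misses = histogram_quantile(0.5, misses_only)
median_all = histogram_quantile(0.5, with_hits)
```

With these made-up numbers the median drops from the 10 ms bucket boundary into the lowest bucket once hits are folded in, which is the gap between "recursion time" and "what end-users see".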

codl avatar Feb 23 '23 07:02 codl

An interesting question! Cache hits and cache misses will have completely different distributions, so it's probably hard to represent them nicely in a single set of histogram buckets. We could add a label, cache="hit" vs. cache="miss", but the buckets would still be suboptimal for one situation or the other.

I can also see, though, why you would be interested in the question of "what is the performance my end-users see, covering both hits and misses."

> it does not seem to include queries served from cache

Can I ask what you're basing this on? I don't know one way or the other what the answer is.

jsha avatar Feb 23 '23 18:02 jsha

> I can also see, though, why you would be interested in the question of "what is the performance my end-users see, covering both hits and misses."

That's exactly it 🙂

> Can I ask what you're basing this on?

It was a guess based on some surprising results I was seeing on my dashboard, reinforced by checking out the munin setup, and then experimentation confirmed my guess.

I started a new unbound server and repeated the same query a few times, checking unbound-control stats_noreset after each query, and found that the first answer was counted in one of the buckets and subsequent answers were not. I also found through experimentation that background "prefetch" queries don't seem to be counted in the histogram either. I thought maybe the histogram measured outgoing recursion time, regardless of whether it is user-facing or not.
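A minimal sketch of that check in Python, assuming the key=value format that unbound-control stats_noreset prints. The helper names are hypothetical, and the sample text is made up to illustrate the behaviour described above; it is not captured output.

```python
def parse_stats(text):
    """Parse `key=value` lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            stats[key] = float(value)
    return stats

def histogram_total(stats):
    """Sum the counts of all histogram.* buckets."""
    return sum(v for k, v in stats.items() if k.startswith("histogram."))

# Illustrative sample only: one recursive answer followed by four
# repeats of the same query answered from cache.
SAMPLE = """\
total.num.queries=5
total.num.cachehits=4
total.num.cachemiss=1
total.num.prefetch=0
histogram.000000.000000.to.000000.000001=0
histogram.000000.032768.to.000000.065536=1
histogram.000000.065536.to.000000.131072=0
"""

stats = parse_stats(SAMPLE)
in_histogram = histogram_total(stats)

# If the histogram covered every client query, this would be zero.
uncounted = stats["total.num.queries"] - in_histogram
```

If the histogram total tracks total.num.cachemiss rather than total.num.queries, as in this sample, that is consistent with the histogram covering only answers that needed recursion.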

Caveat emptor: I didn't check local authority zones, forward zones, etc., so I can't say whether those are counted or not.

codl avatar Feb 24 '23 06:02 codl