histogram_quantile does not work correctly for higher quantiles (missing sorting of the buckets)?
The `histogram_quantile` function does not work correctly for higher quantiles.
The image below shows the edge case for quantile 1.0, but the same is true for high quantiles close to 1.0, e.g. 0.9999.

[image: graphs of the computed quantile and the per-bucket counts]

Above, you can see that buckets "0.256" and above all have a count of 4284. The 1.0-quantile should be 0.256, but on the top graph you can see that it is calculated to be 8.192.
The source code for the function that computes the quantile in Prometheus: https://github.com/prometheus/prometheus/blob/main/promql/quantile.go#L73
The same for m3: https://github.com/m3db/m3/blob/18430610ed4501dcef1b8da39291dd6cc527681c/src/query/functions/linear/histogram_quantile.go#L216
The `sort.Search` call is supposed to return the smallest matching bucket, but it only works on sorted input, and it seems that the m3 version does not sort the buckets first. Perhaps the m3 version assumes that the buckets are sorted somewhere outside the function?
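For context, here is a minimal sketch of the kind of bucket search both implementations perform (the `bucket` type, field names, and values are illustrative, not the actual m3/Prometheus types). `sort.Search` is a binary search, so it silently returns wrong answers if the slice is not sorted by `upperBound`:

```go
package main

import (
	"fmt"
	"sort"
)

// bucket is an illustrative stand-in for the histogram bucket types
// used by Prometheus and m3: count is cumulative for le=upperBound.
type bucket struct {
	upperBound float64
	count      float64
}

// quantileBucket returns the upper bound of the smallest bucket whose
// cumulative count reaches the target rank. This is only correct if
// buckets are sorted by upperBound ascending (which also makes count
// non-decreasing, the monotonicity sort.Search requires).
func quantileBucket(q float64, buckets []bucket) float64 {
	rank := q * buckets[len(buckets)-1].count
	i := sort.Search(len(buckets), func(i int) bool {
		return buckets[i].count >= rank
	})
	return buckets[i].upperBound
}

func main() {
	buckets := []bucket{
		{upperBound: 0.128, count: 4200},
		{upperBound: 0.256, count: 4284},
		{upperBound: 8.192, count: 4284}, // same cumulative count as 0.256
	}
	fmt.Println(quantileBucket(1.0, buckets)) // 0.256
}
```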
Is this the issue that https://github.com/m3db/m3/commit/0d01d62d082428993fd09f7f369d48eb719e58f9 was intended to fix?
@jodzga - per @markj-db's comment, have you looked at 0d01d62 to see if it fixes this issue? Also, if you have a fix in mind, we would be happy to review a PR.
Thanks @markj-db and @gibbscullen for following up.
Mark, you are right: lack of sorting is not the problem. The m3 implementation does sort the buckets by `upperBound` in the `sanitizeBuckets` function: https://github.com/m3db/m3/blob/18430610ed4501dcef1b8da39291dd6cc527681c/src/query/functions/linear/histogram_quantile.go#L196
I figured out what the problem is.
I was using the following query: `histogram_quantile(1.0, sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by (le))`.
The problem is that the `rate` function, while doing its magic, turns counts into fractions. Most fractions can't be represented exactly as floating-point numbers (IEEE 754), so the stored value is only an approximation that uses up all the bits of the mantissa. Addition of such numbers is not associative, so the result depends on the order of operations; see the example here: https://play.golang.org/p/5jGVFNhGZo8.
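The effect is easy to reproduce locally (in the spirit of the linked playground snippet; the exact values there may differ):

```go
package main

import "fmt"

func main() {
	// Runtime variables, not constants: Go evaluates constant
	// expressions at arbitrary precision, which would hide the effect.
	a, b, c := 0.1, 0.2, 0.3
	left := (a + b) + c  // 0.60000000000000009…
	right := a + (b + c) // 0.59999999999999998…
	fmt.Println(left == right) // false: same addends, different order
	fmt.Printf("%.17g\n%.17g\n", left, right)
}
```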
As a consequence, the sums of counts differ very slightly depending on the order in which the numbers were summed. Here is an example graph that shows this problem:
[image: graph of `sum(rate(foo[5m])) - sum(rate(foo[5m]))`, which is not always zero]
From my initial example:
[image: per-bucket counts from the initial example]
Buckets 0.256 and above seem to have the same count.
Here is the same data point but showing numbers with full precision:
[image: the same data point at full precision, where the counts differ in the last digits]
A quick and dirty workaround is to get rid of the "noise" by rounding off some of the mantissa to make the sum associative again. Unfortunately, AFAIK Prometheus does not have a function for rounding a number to a given precision, so as a workaround I can simply multiply the numbers by a very large number, round to an integer, and then divide by the same very large number.
For example, instead of `histogram_quantile(1.0, sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by (le))` I can use `histogram_quantile(1.0, sum(round(1000000000*rate(coredns_dns_request_duration_seconds_bucket{}[5m]))) by (le) / 1000000000)`.
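The same trick in plain Go (a sketch; the 1e9 scale factor is the same arbitrary choice as in the query above, and `roundToPrecision` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"math"
)

// roundToPrecision rounds x to the precision implied by scale,
// mirroring the round(1000000000 * x) / 1000000000 trick above.
func roundToPrecision(x, scale float64) float64 {
	return math.Round(x*scale) / scale
}

func main() {
	const scale = 1e9
	x, y, z := 0.1, 0.2, 0.3
	a := (x + y) + z // 0.6000000000000001
	b := x + (y + z) // 0.6
	fmt.Println(a == b)                                                   // false
	fmt.Println(roundToPrecision(a, scale) == roundToPrecision(b, scale)) // true
}
```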
I can think of a couple of ways this issue could be "mitigated" in Prometheus/m3. These are just brainstorming ideas, not fully thought through:
- Do not depend on the associativity of `sum` by making the order of operations deterministic, e.g. sort all numbers before summing them up (see the sketch after this list).
- Get back associativity by making `rate` and other functions return numbers that do not use the full mantissa (decrease precision).
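A sketch of what the first idea could look like (`deterministicSum` is a hypothetical helper, not an existing m3/Prometheus function):

```go
package main

import (
	"fmt"
	"sort"
)

// deterministicSum fixes the order of additions by sorting the addends
// first, so the same multiset of values always produces the same sum,
// regardless of the order in which the series arrived.
func deterministicSum(values []float64) float64 {
	sorted := append([]float64(nil), values...) // copy; don't mutate the input
	sort.Float64s(sorted)
	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	return sum
}

func main() {
	a := []float64{0.1, 0.2, 0.3}
	b := []float64{0.3, 0.1, 0.2}
	fmt.Println(deterministicSum(a) == deterministicSum(b)) // true
}
```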
@gibbscullen I believe this is a pretty serious problem. AFAICT `histogram_quantile` queries for high percentiles will return wrong results. I'd love to help out, but I am not a Go developer, and onboarding (getting to the point where I can go through the dev cycle: build, test, modify) could take me a significant amount of time. Would someone from the m3 maintainers team be interested in collaborating on fixing this issue?
@jodzga the example you gave of `sum(rate(foo[5m])) - sum(rate(foo[5m]))` is interesting because it implies the order of values being summed is non-deterministic. Perhaps it would be useful to look at how `sum()` is implemented. My naive read of the M3 code suggests that query execution is parallelized:
https://github.com/m3db/m3/blob/master/src/query/executor/state.go#L192
This could well be a source of non-determinism if not handled carefully -- I'm not sure how the query is distributed or how any reduction operations are done.
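To illustrate why parallelism matters here, a toy sketch (not the actual M3 code): partial sums computed concurrently are merged in whatever order they arrive, and because float64 addition is not associative, the last bits of the total can vary between runs:

```go
package main

import (
	"fmt"
	"sync"
)

// parallelSum computes partial sums concurrently and merges them in
// whatever order they land on the channel, so the merged total can
// differ from run to run.
func parallelSum(chunks [][]float64) float64 {
	results := make(chan float64, len(chunks))
	var wg sync.WaitGroup
	for _, chunk := range chunks {
		wg.Add(1)
		go func(vals []float64) {
			defer wg.Done()
			s := 0.0
			for _, v := range vals {
				s += v
			}
			results <- s // arrival order depends on goroutine scheduling
		}(chunk)
	}
	wg.Wait()
	close(results)
	total := 0.0
	for s := range results {
		total += s
	}
	return total
}

func main() {
	chunks := [][]float64{{0.1}, {0.2}, {0.3}}
	// Run this a few times: the low bits of the result can change
	// depending on which partial sum arrives first.
	fmt.Printf("%.17g\n", parallelSum(chunks))
}
```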
@jodzga @markj-db sounds like you may already be working on this together, but one idea would be to change the default here https://github.com/m3db/m3/blob/2e5700a97a914b723f61fa7773ebf5a1020626e7/src/cmd/services/m3query/config/config.go#L279-L280 to the Prometheus engine (`prometheus`).