fix(om2): histograms and negative observed values
OM1.0 required that the Sum of a Histogram not be exposed when the histogram has negative observations.
This PR removes that requirement in OM2.0, for the following reasons:
The requirement was never implemented by the Go and Java instrumentation libraries. Enforcing it now would be breaking.
The requirement makes it impossible to support users who want to measure the Sum anyway. For example, without the Sum you cannot calculate the average as Sum/Count.
~~The PromQL engine does not take the Sum into account when doing counter reset detection, thus it does not matter that it can decrease.~~
We already warned users in the documentation about the possibility of Sum decreasing and not being usable for rate() 10 years ago: PR.
Native histograms will not take the Sum into account when detecting counter resets during rate(), so this problem won't come up there.
Note 1: the Python reference implementation did follow the requirement.
Note 2: this PR does not make Sum mandatory; that is a different question.
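To illustrate the Sum/Count use case, here is a minimal sketch in plain Python. It uses no client library, and the `observations` values are made up for illustration. It shows that the Sum can go down when observations are negative, while the average is still well defined:

```python
# Hypothetical observations (for example signed deltas); not taken from a real system.
observations = [5.0, -8.0, 2.0, -3.0]

running_sum = 0.0
count = 0
sums = []  # value of Sum after each observation

for value in observations:
    running_sum += value
    count += 1
    sums.append(running_sum)

# With negative observations the Sum is not monotonic ...
sum_decreased = any(later < earlier for earlier, later in zip(sums, sums[1:]))

# ... but the average can still be computed as Sum / Count.
average = running_sum / count
```

If the Sum were omitted as OM1.0 required, `average` could not be derived from the exposed data at all.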
> The PromQL engine does not take the Sum into account when doing counter reset detection,
This is only true for native histograms, but not for classic histograms.
(FTR: I proposed to improve the counter reset handling for summaries and classic histograms at KubeCon Berlin in 2017. My proposal was ultimately rejected, so I guess we should not change course now and instead encourage native histograms including NHCB.)
I've reworded the PR description and I'll copy the final text into the commit message once we agree on it. Are you ok with making the change in the specification otherwise?
I think the only way of solving this problem properly (beyond getting rid of classic histograms and summaries altogether) is to require PromQL to detect a counter reset in the sum by different means (historically by looking at the count, but nowadays we could also look at the created timestamp, CT).
I don't know how to solve this given that the Prometheus community has decided to not do that. Maybe just leaving it as is in practice (which is arguably what this PR proposes) is the least bad way, but I don't feel I should make this call about OMv2.
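A rough sketch of the idea, in plain Python, may make the difference concrete. This is not the PromQL engine's code: both functions are hypothetical, ignore extrapolation and staleness, and only model the rule that a decrease between consecutive samples is treated as a counter reset.

```python
def increase_naive(samples):
    """Increase over a series where any decrease counts as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            total += cur  # assume the counter restarted from zero
        else:
            total += cur - prev
    return total


def increase_sum_using_count(sums, counts):
    """Increase of the sum, detecting resets from the count series instead."""
    total = 0.0
    for i in range(1, len(sums)):
        if counts[i] < counts[i - 1]:
            total += sums[i]  # count went down: a real reset
        else:
            total += sums[i] - sums[i - 1]  # may be negative, and that is fine
    return total


# Made-up samples: no reset happened, but a negative observation lowered the sum.
sums = [10.0, 4.0, 9.0]
counts = [2, 3, 4]

naive = increase_naive(sums)
count_based = increase_sum_using_count(sums, counts)
```

With these samples the naive rule reports an increase of 9.0 because it mistakes the drop from 10.0 to 4.0 for a reset, while the count-based variant reports the actual change of -1.0.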
I agree that the solution is native histograms and this PR does not want to actually solve the problem of negative values in Sum. This PR is just about getting rid of a requirement that's not implemented by anyone and just makes things more complicated.
cc @fstab @csmarchbanks
Also, a note on the above comment: the requirement to not expose `_sum` when there are negative observations is implemented today in client_python, which is the reference client for OpenMetrics. So I wouldn't say it is not implemented by anyone. That said, I don't think it needs to be a MUST, and the fact that I can no longer use averages with negative observations is a pretty big downside.
noted
Related issue about Sum allowing NaN or not: https://github.com/prometheus/client_golang/issues/1275#issuecomment-2827320887