client_golang
client_golang copied to clipboard
histograms: Support exemplars in pure Native Histograms
This depends on https://github.com/prometheus/client_model/issues/61 .
Exemplars can be added to the buckets of conventional histograms but not to native histograms. So far, the way of sending exemplars is by also rendering a conventional histogram. With the issue above fixed, we have a way of sending exemplars independently of conventional buckets. However, without conventional buckets, we don't have a heuristics which exemplars to pick. Native histograms can have so many buckets that adding an examplar to each is infeasible.
This is P2 as explained in the milestone's description.
Idea: keep a schema for the exemplars. At the beginning the exemplar schema would be the same as the buckets schema.
The exemplar schema "buckets" will cover the same range of values as the normal buckets, but at a much lower resolution. We could keep a maximum N exemplars per exemplar schema bucket (M). So max MxN. We'd automatically get selection of sampled exemplars per logarithmic range.
Note that exemplars directly added to native histograms MUST have a timestamp, see https://github.com/prometheus/client_model/issues/61 and https://github.com/prometheus/prometheus/pull/13021 for details.
Since https://github.com/prometheus/client_model/issues/61 is solved, I would love to take this issue. If ok, you can assign to me @beorn7 @carrieedwards
Thank you @fatsheep9146 . I have assigned this issue to you. Feel free to discuss selection heuristics on Slack, if you need inspiration beyond what has been discussed above.
Before I support exemplars in pure native histogram in client golang, should we release the client_model with a new tag v0.6.0 then update the dependency in client golang to use the new client model? @beorn7
True. I'll add the new tag ASAP (hopefully later today).
The dependency in prometheus server should also be updated, I will also do this after the new tag is ready. @beorn7
The tag is in place! https://github.com/prometheus/client_model/releases/tag/v0.6.0
Broad idea for the heuristics:
- Let the user configure a number of exemplars n to expose per histogram (default could be around 10).
- The first n
ObserveWithExemplarcalls create an exemplar that we keep for now. - Whenever an
ObserveWithExemplarcall happens after that, first add the new exemplar to the current set of exemplars, and then delete one according to the following priority list:- The oldest exemplar that is older than a certain threshold (maybe 5m, could also be configurable).
- Find the two exemplars that are closest to each other on an exponential scale. Of those, delete the older one.
- A reset of the histogram should also delete all exemplars (so that we never expose an exemplar that doesn't fall into any of the populated buckets).
This should lead to an even spread of exemplars, while not keeping any individual exemplar for overly long.
"Find the two exemplars that are closest to each other on an exponential scale." How should I understand this sentence? Does it mean that I should first take the logarithm of the values in all exemplars, and compare logarithm of newly added exemplar to all of existing exemplars, found the exemplar which has the closest logarithm with newly added exemplar?
For example, there exists two exemplars which has value 4, 6, the histogram schema is 0, the max count of exemplar is 2. When we add a new exemplar which value is 5, we should compare the diff of logarithm with existing exemplar log2(5) - log2(4) = 0.321 log2(6) - log2(5) = 0.263
So we should drop 6, keep 4, 5. Do I understand this correctly?
@beorn7
Yes, sounds about right.
A naive implementation of this could be quite expensive, though, especially if many exemplars are to be kept around. However, for one I assume the number of exemplars will be generally small (around 10?). Also, you can probably play many tricks, like the following:
- Keep the logarithm of each exemplar around.
- Maintain a list of exemplars sorted by value (so finding the two closest exemplars, it's O(n), not O(n²)).
- Maintain a list of exemplars sorted by time (so it's easy to find the oldest exemplar).