
On optimizing histogram findBounds()

Open jdmarshall opened this issue 6 months ago • 3 comments

Apologies if this has become word salad. This was the contents of my morning walk and some mental exercises last night.

There's a histogram problem that OTEL figured out a couple of years ago: the linear bucket scan in the observe() logic. As you increase the number of buckets per label set, you increase hosting costs but also collection time. And if you try to generalize your bucket sizes over different classes of traffic (e.g. KV store access, write timing, and overall response times sharing the same bucket set), you can make a real mess.

They solved the insertion problem with a binary search. I'm not convinced that's the best solution to the problem. It may be worth considering other options now, because each successive change to observe() yields diminishing returns if a better option comes along later.

The most obvious problem with a BST is that we aren't looking for an equality, we are looking for an inequality. We need to find the leftmost bucket whose bound is greater than or equal to our sample value. We are in effect looking for a 'between' rather than a match, so when you get close you need to look at a bucket and its successor. A binary search is fine for that, but other data structures, like sorted heaps, less so.
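For reference, that 'between' lookup maps onto a lower_bound-style binary search. A generic sketch (my own, not prom-client's actual code):

```javascript
// Find the index of the leftmost bucket whose upper bound is >= value,
// i.e. a lower_bound-style search over a sorted ascending `bounds` array.
// Returns bounds.length when the value exceeds every bound (the +Inf bucket).
function findBoundIndex(bounds, value) {
  let lo = 0;
  let hi = bounds.length; // half-open search window [lo, hi)
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (bounds[mid] >= value) {
      hi = mid;     // mid could be the answer; keep it in the window
    } else {
      lo = mid + 1; // every bound up to mid is too small
    }
  }
  return lo;
}
```

The key detail is that the loop converges on the boundary itself rather than on an exact match, which is what makes the inequality search work.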

Secondly, insertions are not uniformly distributed; they are roughly normally distributed. Most bucket sets I have encountered so far represent a geometric progression of values: most of the time every second bucket is twice the bound, and the bucket between them is the average of the two, giving an effective per-bucket factor of around 1.4. This geometric spacing of bucket bounds means a few buckets get most of the entries while the outliers get few, sometimes zero. Most of those extra buckets are about making sure the p95 and p99 statistics don't artifact so badly that they look like hallucinations.
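As a concrete illustration of that progression (my own sketch, not one of prom-client's bucket helpers), doubling every second bound and filling the midpoint in between works out to a per-step factor that averages close to sqrt(2) ≈ 1.41:

```javascript
// Build a bucket set where every second bound doubles and the bound in
// between is the arithmetic mean of its neighbors (x, 1.5x, 2x, 3x, 4x, ...).
function interleavedBuckets(start, doublings) {
  const bounds = [start];
  let x = start;
  for (let i = 0; i < doublings; i++) {
    bounds.push(x * 1.5); // midpoint between x and 2x
    bounds.push(x * 2);
    x *= 2;
  }
  return bounds;
}
```

The adjacent-bound ratios alternate between 1.5 and 4/3, whose geometric mean is sqrt(2), matching the "effective exponent of around 1.4" above.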

... except under heavy load most of these bets are off. Histogram cost is proportional to samples per second, so when server load goes up due to traffic rather than request cost, a bigger and bigger slice of CPU time is spent on telemetry instead of retiring requests. This is especially true in Node if you aren't using cluster mode, because every stat calculation blocks the event loop. Timers are by far the worst in this regard because they also occupy additional memory (Little's Law), but bucket lookups can also cost L1 cache.

We would be doing users a service if findBounds() required fewer lookups per sample, particularly for buckets slightly larger than the long-term mean: as system contention increases, p50 response times shift right, looking more like the typical p75 response times. Or p90. So ideally the most popular bucket should be quickest to find, and the nearest ±2 buckets nearly as efficient.

That suggests a priority-balanced treap, which would be simple enough to implement given that bucket.count is exactly the priority of each bucket. But I don't know how many users would configure enough buckets for that to win out over binary search.
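A minimal sketch of that idea, under the assumption that the treap is rebuilt periodically as counts drift (the build here is the naive recursive construction, not something tuned for production):

```javascript
// Hypothetical sketch: a static treap over bucket bounds, with each
// bucket's observation count as its priority. Keys (bounds) keep BST
// order; priorities (counts) keep heap order, so the hottest bucket
// becomes the root and the most frequent lookups terminate early.
function buildTreap(bounds, counts, lo = 0, hi = bounds.length) {
  if (lo >= hi) return null;
  // Root of this subtree = the bucket with the highest count in [lo, hi).
  let root = lo;
  for (let i = lo + 1; i < hi; i++) {
    if (counts[i] > counts[root]) root = i;
  }
  return {
    bound: bounds[root],
    left: buildTreap(bounds, counts, lo, root),
    right: buildTreap(bounds, counts, root + 1, hi),
  };
}

// 'Between' search: leftmost bound >= value, or null for the +Inf bucket.
function findBound(node, value) {
  let best = null;
  while (node) {
    if (value <= node.bound) {
      best = node.bound; // candidate; a smaller qualifying bound may sit left
      node = node.left;
    } else {
      node = node.right;
    }
  }
  return best;
}
```

A value that lands in the hottest bucket is resolved at or near the root, while cold outliers pay the longer path, which is exactly the skew the paragraph above is after.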

The simplest thing that could work would be to track the n/2 bucket and do a linear scan of the first or second half of the bucket set based on value <= middle.upperBound. Or the n/3 bucket, if the long-tail pattern tends to hold.
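That "simplest thing" fits in a few lines (a sketch, assuming the same leftmost-bound-≥-value semantics as above):

```javascript
// Remember the middle bound and linearly scan only the half of the
// bucket set the value falls into. Returns the index of the leftmost
// bound >= value, or bounds.length for the +Inf bucket.
function findBoundHalved(bounds, value) {
  const mid = bounds.length >> 1;
  const start = value <= bounds[mid] ? 0 : mid + 1;
  for (let i = start; i < bounds.length; i++) {
    if (value <= bounds[i]) return i;
  }
  return bounds.length; // +Inf bucket
}
```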

jdmarshall avatar Jun 26 '25 19:06 jdmarshall

Now that I'm more familiar with the benchmark scripts, I'm seeing that in 1:1 comparisons, gauge and histogram are roughly on a par. It's the labels that drop write performance by two orders of magnitude (~95x).

Unfortunately this gap seems to get bigger in Node 24 rather than narrower, and I'm unsure why.

jdmarshall avatar Jul 05 '25 06:07 jdmarshall

There's a consistent category error in the benchmarks for this code: every example I can find observes the exact same value in a loop. That isn't representative of real-world application behavior, and it puts all of the results for a run into a single bucket, which hides any regressions or improvements to data locality in the code.

   histogram.observe(labels, 1);
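A benchmark loop that spreads observations across buckets would surface those effects. A self-contained sketch (the tallying `observe` here is a stand-in for the real histogram, not prom-client code), drawing values from a log-normal distribution since latencies tend to look log-normal:

```javascript
// Stand-in histogram: Prometheus-style default bounds plus a +Inf slot.
const bounds = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];
const counts = new Array(bounds.length + 1).fill(0);

// Minimal stand-in for histogram.observe(): find the bucket and count it.
function observe(value) {
  let i = 0;
  while (i < bounds.length && value > bounds[i]) i++;
  counts[i]++;
}

// Log-normal sampler via the Box-Muller transform.
function logNormalSample(mu, sigma) {
  const u1 = 1 - Math.random(); // avoid log(0)
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.exp(mu + sigma * z);
}

// mu = -3 puts the median near exp(-3) ≈ 0.05, i.e. ~50ms "latencies",
// so samples land in many different buckets across the run.
for (let i = 0; i < 100000; i++) {
  observe(logNormalSample(-3, 1));
}
```

With a spread like this, a change to findBounds() exercises different code paths and cache lines on every iteration, instead of the single hot path that the constant-value loop measures.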

jdmarshall avatar Jul 07 '25 00:07 jdmarshall

A simpler improvement to histogram would be to optimize label handling so the labels aren't revalidated on every call.
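One hypothetical shape for that (my own sketch, not prom-client's internals; `validate` stands in for whatever validation the library actually does): memoize validation per label object, so repeated calls with the same object skip it.

```javascript
// Cache validated label sets keyed by object identity, so a caller that
// reuses the same labels object pays validation only once.
const validatedLabels = new WeakMap();

function getLabelKey(labels, validate) {
  let key = validatedLabels.get(labels);
  if (key === undefined) {
    validate(labels);             // throws on invalid labels (assumed)
    key = JSON.stringify(labels); // stable enough for a sketch
    validatedLabels.set(labels, key);
  }
  return key;
}
```

This only helps callers that hold on to a labels object across observe() calls, but that is the common pattern in hot paths anyway.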

jdmarshall avatar Jul 11 '25 14:07 jdmarshall