opentelemetry-specification
Stabilize exponential histograms
Exponential histograms have implementations in Java, .NET, Go, and Python.
There's still the idea of the zero threshold, which needs to be defined in the proto and likely needs to be configurable in the SDK in some fashion. However, this can be added in a backwards-compatible way.
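To illustrate the idea, here is a minimal sketch of where a configurable zero threshold would fit in the recording path; all names here are illustrative, not from the proto:

```java
// Hypothetical sketch of a configurable zero threshold; field and
// method names are illustrative only, not from the proto or any SDK.
final class ZeroThresholdSketch {
  private final double zeroThreshold;
  private long zeroCount;

  ZeroThresholdSketch(double zeroThreshold) {
    this.zeroThreshold = zeroThreshold;
  }

  void record(double value) {
    if (Math.abs(value) <= zeroThreshold) {
      zeroCount++; // counted in the zero bucket, no index computed
      return;
    }
    // ... otherwise compute a bucket index for |value| as usual ...
  }
}
```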
If there are no other issues, we should stabilize the exponential histogram data model and aggregation.
+1 we should push this forward.
There are some potential improvements / issues that we can discuss in the PR. https://cloud-native.slack.com/archives/C01NP3BV26R/p1663690523370849
I've been analyzing the exponential histogram Java implementation to determine whether the design has any inherent performance issues. Some takeaways:
- Computing exponential bucket indexes is not materially more expensive than computing explicit bucket indexes. The `log` function needed at scales > 10 is slower than bit shifting strategies, but only the most demanding applications will take issue with a single `log` call being on the hot path (see the indexing sketch after this list). We can add a MaxScale option to accommodate these users, but don't necessarily need it initially.
- Exponential histogram memory allocation is not materially different from that of explicit bucket histograms. This makes sense given that the circular array data structure used to track exponential bucket counts is not too different from the array used to track explicit bucket counts. The Java implementation was doing some naive things that caused excess allocation, which were addressed here and here.
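To make the two strategies concrete, here is a minimal sketch (my own, not the SDK's actual code) of a log-based indexer alongside the bit-shifting indexer usable at scale <= 0; it ignores subnormal and non-finite inputs:

```java
// Sketch of the two bucket-indexing strategies discussed above.
// Assumes normal, positive, finite inputs; not the actual SDK code.
final class BucketIndexer {

  // Log-based indexing, required at positive scales:
  // index = ceil(log2(value) * 2^scale) - 1, with upper-inclusive buckets.
  static long logIndex(double value, int scale) {
    double scaleFactor = Math.scalb(1.0 / Math.log(2), scale); // 2^scale / ln(2)
    return (long) Math.ceil(Math.log(value) * scaleFactor) - 1;
  }

  // Bit-shifting indexing, valid at scale <= 0: read the IEEE 754 exponent
  // directly instead of calling log.
  static long shiftIndex(double value, int scale) {
    long bits = Double.doubleToRawLongBits(value);
    long exponent = ((bits >> 52) & 0x7FF) - 1023;
    // An exact power of two sits on a bucket boundary and, with
    // upper-inclusive buckets, belongs to the bucket below.
    if ((bits & 0x000FFFFFFFFFFFFFL) == 0) {
      exponent--;
    }
    return exponent >> -scale; // arithmetic shift handles negative exponents
  }
}
```

At scale <= 0 the two strategies agree on bucket boundaries; the shift version just avoids the `log` call entirely.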
Here are some benchmarks to support these takeaways:
Computing bucket index
HistogramBenchmark (source): characterizes bucket indexing times while trying to control for work that is unlikely to occur on every recorded measurement. The exponential histograms in this benchmark use a max scale of 0 so that differing scales don't skew the results. Summary of the most relevant results:
| Benchmark | (aggregation) | (valueGen) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|---|
| HistogramBenchmark.aggregate_5Threads | EXPLICIT_DEFAULT_BUCKET | GAUSSIAN_LATENCY | avgt | 10 | 38317.681 | ± 333.509 | ns/op |
| HistogramBenchmark.aggregate_5Threads | EXPLICIT_SINGLE_BUCKET | GAUSSIAN_LATENCY | avgt | 10 | 27149.226 | ± 412.018 | ns/op |
| HistogramBenchmark.aggregate_5Threads | EXPONENTIAL_SMALL_CIRCULAR_BUFFER | GAUSSIAN_LATENCY | avgt | 10 | 27005.325 | ± 1958.334 | ns/op |
| HistogramBenchmark.aggregate_5Threads | EXPONENTIAL_CIRCULAR_BUFFER | GAUSSIAN_LATENCY | avgt | 10 | 26878.814 | ± 306.520 | ns/op |
ExponentialHistogramIndexerBenchmark (source): characterizes the difference in time to compute exponential bucket indexes when the scale is positive, zero, or negative. Real applications perform much more work when recording a measurement (e.g. building attributes, finding the relevant timeseries for the attributes, aggregating other parts of the histogram like sum, min, and max), so the difference will be much less pronounced in practice. This benchmark zooms in on the difference between computing the bucket index with a logarithm vs. with bit shifting. Summary of the most relevant results:
| Benchmark | (scale) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| ExponentialHistogramIndexerBenchmark.computeIndex | 1 | avgt | 5 | 137272.227 | ± 12231.299 | ns/op |
| ExponentialHistogramIndexerBenchmark.computeIndex | 0 | avgt | 5 | 15225.018 | ± 448.921 | ns/op |
| ExponentialHistogramIndexerBenchmark.computeIndex | -1 | avgt | 5 | 15376.533 | ± 527.138 | ns/op |
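For readers unfamiliar with the benchmark's shape, here is a rough JMH sketch (mine, not the actual source; the batch size of 2000 is taken from the discussion below) showing why each score covers a whole batch of indexed values rather than a single computation. It reuses the `BucketIndexer` sketched earlier:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Rough sketch of the indexer benchmark's shape; not the actual source.
@State(Scope.Benchmark)
public class IndexerBenchmarkSketch {
  @Param({"1", "0", "-1"})
  int scale;

  // Batch of pre-generated values; 2000 per op, so score / 2000 gives
  // the approximate time per index computation.
  double[] values = new double[2000];

  @Setup
  public void setup() {
    for (int i = 0; i < values.length; i++) {
      values[i] = ThreadLocalRandom.current().nextDouble(1.0, 10_000.0);
    }
  }

  @Benchmark
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public long computeIndex() {
    long acc = 0;
    for (double v : values) {
      // Log path at positive scales, bit-shift path at scale <= 0.
      acc += scale > 0
          ? BucketIndexer.logIndex(v, scale)
          : BucketIndexer.shiftIndex(v, scale);
    }
    return acc; // returned so JMH can't dead-code-eliminate the loop
  }
}
```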
Memory allocation
HistogramCollectBenchmark (source): characterizes the memory allocation associated with collecting from different histogram aggregations, showing the differences between exponential and explicit buckets, and between cumulative and delta temporality. Summary of the most relevant results:
| Benchmark | (aggregationGenerator) | (aggregationTemporality) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|---|
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | EXPLICIT_BUCKET_HISTOGRAM | DELTA | ss | 5 | 26925732.800 | ± 433897.702 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | EXPLICIT_BUCKET_HISTOGRAM | CUMULATIVE | ss | 5 | 31928064.000 | ± 214185.867 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM | DELTA | ss | 5 | 30989452.800 | ± 784163.837 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM | CUMULATIVE | ss | 5 | 30526232.000 | ± 847271.704 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM | DELTA | ss | 5 | 20794800.000 | ± 450113.512 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM | CUMULATIVE | ss | 5 | 20323672.000 | ± 515137.009 | B/op |
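For context on why the allocation profiles are so close, here is a minimal sketch (assumptions mine, not the SDK's actual class) of the circular bucket-count array mentioned in the takeaways: after construction, recording touches a fixed `long[]` and allocates nothing, just like an explicit-bucket count array.

```java
// Minimal sketch of a circular bucket-count array; not the SDK's actual class.
// Bucket indexes map into a fixed long[] via floorMod, so the window of
// populated indexes can sit anywhere on the number line as long as its
// span stays within capacity. Recording allocates nothing.
final class CircularBucketCounts {
  private final long[] counts;
  private int startIndex = Integer.MAX_VALUE; // lowest populated bucket index
  private int endIndex = Integer.MIN_VALUE;   // highest populated bucket index

  CircularBucketCounts(int maxBuckets) {
    counts = new long[maxBuckets];
  }

  /** Returns false when the window would exceed capacity (caller downscales). */
  boolean increment(int bucketIndex) {
    int newStart = Math.min(startIndex, bucketIndex);
    int newEnd = Math.max(endIndex, bucketIndex);
    if (newEnd - newStart + 1 > counts.length) {
      return false; // caller halves the scale, which halves the index range
    }
    startIndex = newStart;
    endIndex = newEnd;
    counts[Math.floorMod(bucketIndex, counts.length)]++;
    return true;
  }
}
```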
In summary, I can't find any reason to recommend explicit bucket histograms over exponential histograms on the basis of performance in the Java implementation. I don't think there are any performance reasons to hold off on stabilizing exponential histograms.
I support the call to stabilize the exponential histogram as we have it. I agree there are probably useful additions that we can find, but we should be able to make them from where we are. @gouthamve any comment?
@jack-berg
> Real applications perform much more work when recording a measurement (e.g. building attributes, finding the relevant timeseries for the attributes, aggregating other parts of the histogram like sum, min, and max), so the difference will be much less pronounced in practice.
Your results show a computation time of 7-8 ns (score divided by the number of repetitions, 2000) for the indexers that are not based on the logarithm (scale = 0, -1). Our benchmarks (see https://github.com/open-telemetry/opentelemetry-specification/issues/2987#issuecomment-1331725589) showed that fast histogram implementations like DDSketch or DynaHist achieve insertion times below 10 ns, including updating min/max, checking for NaNs, and resizing the count arrays. You got roughly 68 ns for just the log-based index computation (scale = 1). I am not sure this overhead can really be neglected in practice.
I would add a similar comment. The benchmarks I did in Go show the logarithm costing in the 10-20 ns range, while a table lookup could save about half of that cost. To me, the other costs associated with the histogram data structure outweigh the benefit of the table lookup.
@oertl are you suggesting the design of exponential histograms be adjusted in some way before stabilizing?
Here's how I see it:
- The vast majority of applications won't care about the performance of the log approach. As slow as the log approach is relative to the bit shifting and table lookup approaches, it's still just one `log` call and ~68 ns.
- SDKs can choose to cater to the small slice of users that do care by offering a better-performing indexing implementation when the scale allows it: the log method when scale > 10, the table lookup method when 10 >= scale > 0, and the bit shifting method when scale <= 0 (see the sketch after this list).
- Users that care about performance can use the proposed MaxScale to force an optimal indexing implementation to be used, if available in the SDK.
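A rough sketch of that selection logic, reusing the `logIndex`/`shiftIndex` methods sketched earlier in the thread. The table lookup here uses a binary search over precomputed sub-bucket boundaries and ignores subnormals and last-ulp rounding of `Math.pow` at boundaries, which a production implementation would handle exactly:

```java
// Hypothetical scale-based indexer selection; names and structure are mine.
interface Indexer {
  long index(double value);
}

final class Indexers {
  static Indexer forScale(int scale) {
    if (scale > 10) {
      return v -> BucketIndexer.logIndex(v, scale); // table would be too large
    } else if (scale > 0) {
      return new LookupTableIndexer(scale); // 2^scale boundaries fit in cache
    } else {
      return v -> BucketIndexer.shiftIndex(v, scale); // pure exponent arithmetic
    }
  }
}

// Finds the sub-bucket within each power of two by binary search over
// precomputed boundaries 2^(j / 2^scale). Ignores subnormals and the
// last-ulp rounding of Math.pow at bucket boundaries.
final class LookupTableIndexer implements Indexer {
  private final double[] boundaries;

  LookupTableIndexer(int scale) {
    int n = 1 << scale;
    boundaries = new double[n];
    for (int j = 0; j < n; j++) {
      boundaries[j] = Math.pow(2, (double) j / n);
    }
  }

  @Override
  public long index(double value) {
    long bits = Double.doubleToRawLongBits(value);
    int exponent = (int) (((bits >> 52) & 0x7FF) - 1023);
    double mantissa = Math.scalb(value, -exponent); // in [1, 2)
    // Largest j with boundaries[j] < mantissa; j = -1 means the value sits
    // on the lower power-of-two boundary and belongs to the octave below.
    int lo = 0, hi = boundaries.length - 1, j = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (boundaries[mid] < mantissa) {
        j = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return (long) exponent * boundaries.length + j;
  }
}
```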
What more can we do?
This feature is incredibly useful and continuing to hold off on stabilizing it is a disservice to users. Unless there's a change that we can't make incrementally, we should stabilize.
@jack-berg please don't let this stop the stabilization. I just realized that your benchmark is testing an inefficient implementation. If the difference in indexing were really 7 ns vs. 68 ns, that could be problematic from a performance standpoint, but the difference is probably smaller once the implementation is optimized. In my opinion, the benchmarks presented are not suitable for assessing the runtime contribution of indexing to overall histogram insertion.
> In my opinion, the benchmarks presented are not suitable for assessing the runtime contribution of indexing to overall histogram insertion.
Ok, I understand your comment now and agree. Thanks for clarifying!