opentelemetry-specification
Stabilize exponential histograms
Exponential histograms have implementations in Java, .NET, Go, and Python.
There's still the idea of the zero threshold, which needs to be defined in the proto and likely needs to be configurable in the SDK in some fashion. However, this can be added in a backwards-compatible way.
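To illustrate the idea, here is a minimal sketch of where a configurable zero threshold would fit in the recording path; all names here are illustrative, not from the proto:

```java
// Hypothetical sketch of a configurable zero threshold; field and
// method names are illustrative only, not from the proto or any SDK.
final class ZeroThresholdSketch {
  private final double zeroThreshold;
  private long zeroCount;

  ZeroThresholdSketch(double zeroThreshold) {
    this.zeroThreshold = zeroThreshold;
  }

  void record(double value) {
    if (Math.abs(value) <= zeroThreshold) {
      zeroCount++; // counted in the zero bucket, no index computed
      return;
    }
    // ... otherwise compute a bucket index for |value| as usual ...
  }
}
```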
If there are no other issues, we should stabilize the exponential histogram data model and aggregation.
+1 we should push this forward.
There are some potential improvements / issues that we can discuss in the PR. https://cloud-native.slack.com/archives/C01NP3BV26R/p1663690523370849
I've been analyzing the exponential histogram Java implementation to determine whether the design has any inherent performance issues. Some takeaways:
- Computing exponential bucket indexes is not materially more expensive than computing explicit bucket indexes. The `log` function needed at scales > 10 is slower than bit shifting strategies, but only the most demanding applications will take issue with a single `log` call being on the hot path (see the indexing sketch after this list). We can add a MaxScale option to accommodate these users, but don't necessarily need it initially.
- Exponential histogram memory allocation is not materially different from that of explicit bucket histograms. This makes sense given that the circular array data structure used to track exponential bucket counts is not too different from the array used to track explicit bucket counts. The Java implementation was doing some naive things that caused excess allocation, which were addressed here and here.
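To make the two strategies concrete, here is a minimal sketch (my own, not the SDK's actual code) of a log-based indexer alongside the bit-shifting indexer usable at scale <= 0; it ignores subnormal and non-finite inputs:

```java
// Sketch of the two bucket-indexing strategies discussed above.
// Assumes normal, positive, finite inputs; not the actual SDK code.
final class BucketIndexer {

  // Log-based indexing, required at positive scales:
  // index = ceil(log2(value) * 2^scale) - 1, with upper-inclusive buckets.
  static long logIndex(double value, int scale) {
    double scaleFactor = Math.scalb(1.0 / Math.log(2), scale); // 2^scale / ln(2)
    return (long) Math.ceil(Math.log(value) * scaleFactor) - 1;
  }

  // Bit-shifting indexing, valid at scale <= 0: read the IEEE 754 exponent
  // directly instead of calling log.
  static long shiftIndex(double value, int scale) {
    long bits = Double.doubleToRawLongBits(value);
    long exponent = ((bits >> 52) & 0x7FF) - 1023;
    // An exact power of two sits on a bucket boundary and, with
    // upper-inclusive buckets, belongs to the bucket below.
    if ((bits & 0x000FFFFFFFFFFFFFL) == 0) {
      exponent--;
    }
    return exponent >> -scale; // arithmetic shift handles negative exponents
  }
}
```

At scale <= 0 the two strategies agree on bucket boundaries; the shift version just avoids the `log` call entirely.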
Here are some benchmarks to support these takeaways:
Computing bucket index
HistogramBenchmark (source): characterizes bucket indexing times while trying to control for work that is unlikely to occur on every recorded measurement. The exponential histograms in this benchmark use a max scale of 0 so that differing scales don't skew the results. Summary of the most relevant results:
| Benchmark | (aggregation) | (valueGen) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|---|
| HistogramBenchmark.aggregate_5Threads | EXPLICIT_DEFAULT_BUCKET | GAUSSIAN_LATENCY | avgt | 10 | 38317.681 | ± 333.509 | ns/op |
| HistogramBenchmark.aggregate_5Threads | EXPLICIT_SINGLE_BUCKET | GAUSSIAN_LATENCY | avgt | 10 | 27149.226 | ± 412.018 | ns/op |
| HistogramBenchmark.aggregate_5Threads | EXPONENTIAL_SMALL_CIRCULAR_BUFFER | GAUSSIAN_LATENCY | avgt | 10 | 27005.325 | ± 1958.334 | ns/op |
| HistogramBenchmark.aggregate_5Threads | EXPONENTIAL_CIRCULAR_BUFFER | GAUSSIAN_LATENCY | avgt | 10 | 26878.814 | ± 306.520 | ns/op |
ExponentialHistogramIndexerBenchmark (source): characterizes the difference in time to compute exponential bucket indexes when the scale is positive, zero, or negative. Real applications perform much more work when recording a measurement (e.g. building attributes, finding the relevant timeseries for the attributes, aggregating other parts of the histogram like sum, min, and max), so the difference will be much less pronounced in practice. This benchmark zooms in on the difference between computing the bucket index with a logarithm vs. with bit shifting. Summary of the most relevant results:
| Benchmark | (scale) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| ExponentialHistogramIndexerBenchmark.computeIndex | 1 | avgt | 5 | 137272.227 | ± 12231.299 | ns/op |
| ExponentialHistogramIndexerBenchmark.computeIndex | 0 | avgt | 5 | 15225.018 | ± 448.921 | ns/op |
| ExponentialHistogramIndexerBenchmark.computeIndex | -1 | avgt | 5 | 15376.533 | ± 527.138 | ns/op |
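For readers unfamiliar with the benchmark's shape, here is a rough JMH sketch (mine, not the actual source; the batch size of 2000 is taken from the discussion below) showing why each score covers a whole batch of indexed values rather than a single computation. It reuses the `BucketIndexer` sketched earlier:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Rough sketch of the indexer benchmark's shape; not the actual source.
@State(Scope.Benchmark)
public class IndexerBenchmarkSketch {
  @Param({"1", "0", "-1"})
  int scale;

  // Batch of pre-generated values; 2000 per op, so score / 2000 gives
  // the approximate time per index computation.
  double[] values = new double[2000];

  @Setup
  public void setup() {
    for (int i = 0; i < values.length; i++) {
      values[i] = ThreadLocalRandom.current().nextDouble(1.0, 10_000.0);
    }
  }

  @Benchmark
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public long computeIndex() {
    long acc = 0;
    for (double v : values) {
      // Log path at positive scales, bit-shift path at scale <= 0.
      acc += scale > 0
          ? BucketIndexer.logIndex(v, scale)
          : BucketIndexer.shiftIndex(v, scale);
    }
    return acc; // returned so JMH can't dead-code-eliminate the loop
  }
}
```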
Memory allocation
HistogramCollectBenchmark (source): characterizes the memory allocation associated with collecting from different histogram aggregations, showing the differences between exponential and explicit buckets, and between cumulative and delta temporality. Summary of the most relevant results:
| Benchmark | (aggregationGenerator) | (aggregationTemporality) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|---|
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | EXPLICIT_BUCKET_HISTOGRAM | DELTA | ss | 5 | 26925732.800 | ± 433897.702 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | EXPLICIT_BUCKET_HISTOGRAM | CUMULATIVE | ss | 5 | 31928064.000 | ± 214185.867 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM | DELTA | ss | 5 | 30989452.800 | ± 784163.837 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM | CUMULATIVE | ss | 5 | 30526232.000 | ± 847271.704 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM | DELTA | ss | 5 | 20794800.000 | ± 450113.512 | B/op |
| HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm | ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM | CUMULATIVE | ss | 5 | 20323672.000 | ± 515137.009 | B/op |
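For context on why the allocation profiles are so close, here is a minimal sketch (assumptions mine, not the SDK's actual class) of the circular bucket-count array mentioned in the takeaways: after construction, recording touches a fixed `long[]` and allocates nothing, just like an explicit-bucket count array.

```java
// Minimal sketch of a circular bucket-count array; not the SDK's actual class.
// Bucket indexes map into a fixed long[] via floorMod, so the window of
// populated indexes can sit anywhere on the number line as long as its
// span stays within capacity. Recording allocates nothing.
final class CircularBucketCounts {
  private final long[] counts;
  private int startIndex = Integer.MAX_VALUE; // lowest populated bucket index
  private int endIndex = Integer.MIN_VALUE;   // highest populated bucket index

  CircularBucketCounts(int maxBuckets) {
    counts = new long[maxBuckets];
  }

  /** Returns false when the window would exceed capacity (caller downscales). */
  boolean increment(int bucketIndex) {
    int newStart = Math.min(startIndex, bucketIndex);
    int newEnd = Math.max(endIndex, bucketIndex);
    if (newEnd - newStart + 1 > counts.length) {
      return false; // caller halves the scale, which halves the index range
    }
    startIndex = newStart;
    endIndex = newEnd;
    counts[Math.floorMod(bucketIndex, counts.length)]++;
    return true;
  }
}
```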
In summary, I can't find any reason to recommend explicit bucket histograms over exponential histograms on the basis of performance in the Java implementation. I don't think there are any performance reasons to hold off on stabilizing exponential histograms.
I support the call to stabilize the exponential histogram as we have it. I agree there are probably useful additions that we can find, but we should be able to make them from where we are. @gouthamve any comment?
@jack-berg
> Real applications perform much more work when recording a measurement (e.g. building attributes, finding the relevant timeseries for the attributes, aggregating other parts of the histogram like sum, min, and max), so the difference will be much less pronounced in practice.
Your results show a computation time of 7-8 ns (score divided by the number of repetitions, 2000) for the indexers that are not based on the logarithm (scale = 0, -1). Our benchmarks (see https://github.com/open-telemetry/opentelemetry-specification/issues/2987#issuecomment-1331725589) showed that fast histogram implementations like DDSketch or DynaHist achieve insertion times below 10 ns, including updating min/max, checking for NaNs, and resizing the count arrays. You got roughly 68 ns for just the log-based index computation (scale = 1). I am not sure this overhead can really be neglected in practice.
I would add a similar comment. The benchmarks I did in Go show the logarithm costing in the 10-20 ns range, while a table lookup could save about half of that cost. To me, the other costs associated with the histogram data structure outweigh the benefit of the table lookup.
@oertl are you suggesting the design of exponential histograms be adjusted in some way before stabilizing?
Here's how I see it:
- The vast majority of applications won't care about the performance of the log approach. As slow as the log approach is relative to the bit shifting and table lookup approaches, it's still just one `log` call and ~68 ns.
- SDKs can choose to cater to the small slice of users that do care by offering a better-performing indexing implementation when the scale allows it: the log method when scale > 10, the table lookup method when 10 >= scale > 0, and the bit shifting method when scale <= 0 (see the sketch after this list).
- Users that care about performance can use the proposed MaxScale to force an optimal indexing implementation to be used, if available in the SDK.
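A rough sketch of that selection logic, reusing the `logIndex`/`shiftIndex` methods sketched earlier in the thread. The table lookup here uses a binary search over precomputed sub-bucket boundaries and ignores subnormals and last-ulp rounding of `Math.pow` at boundaries, which a production implementation would handle exactly:

```java
// Hypothetical scale-based indexer selection; names and structure are mine.
interface Indexer {
  long index(double value);
}

final class Indexers {
  static Indexer forScale(int scale) {
    if (scale > 10) {
      return v -> BucketIndexer.logIndex(v, scale); // table would be too large
    } else if (scale > 0) {
      return new LookupTableIndexer(scale); // 2^scale boundaries fit in cache
    } else {
      return v -> BucketIndexer.shiftIndex(v, scale); // pure exponent arithmetic
    }
  }
}

// Finds the sub-bucket within each power of two by binary search over
// precomputed boundaries 2^(j / 2^scale). Ignores subnormals and the
// last-ulp rounding of Math.pow at bucket boundaries.
final class LookupTableIndexer implements Indexer {
  private final double[] boundaries;

  LookupTableIndexer(int scale) {
    int n = 1 << scale;
    boundaries = new double[n];
    for (int j = 0; j < n; j++) {
      boundaries[j] = Math.pow(2, (double) j / n);
    }
  }

  @Override
  public long index(double value) {
    long bits = Double.doubleToRawLongBits(value);
    int exponent = (int) (((bits >> 52) & 0x7FF) - 1023);
    double mantissa = Math.scalb(value, -exponent); // in [1, 2)
    // Largest j with boundaries[j] < mantissa; j = -1 means the value sits
    // on the lower power-of-two boundary and belongs to the octave below.
    int lo = 0, hi = boundaries.length - 1, j = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (boundaries[mid] < mantissa) {
        j = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return (long) exponent * boundaries.length + j;
  }
}
```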
What more can we do?
This feature is incredibly useful and continuing to hold off on stabilizing it is a disservice to users. Unless there's a change that we can't make incrementally, we should stabilize.
@jack-berg please don't let this stop the stabilization. I just realized that your benchmark is testing an inefficient implementation. If the difference in indexing were really 7 ns vs. 68 ns, that could be problematic from a performance standpoint, but the difference is probably smaller once the implementation is optimized. In my opinion, the benchmarks presented are not suitable for assessing the runtime contribution of indexing to overall histogram insertion.
> In my opinion, the benchmarks presented are not suitable for assessing the runtime contribution of indexing to overall histogram insertion.
Ok, I understand your comment now and agree. Thanks for clarifying!