
Adding two-level hashing in metrics hashmap

Open lalitb opened this issue 1 year ago • 13 comments

Changes

This PR is opened just for discussion of one possible change to improve the performance of concurrent recording/updating of measurements on the hot path. In today's (Feb 20th, 2024) community meeting, we discussed multiple approaches:

  • Using dashmap - a sharded hashmap.
  • Using flurry - uses fine-grained atomic locks within its hashmap.
  • Using sharded-slab - building a hashmap from the concurrent data structures provided by this crate.
  • Using thread-local hashmaps in the hot path to record measurements and then merging them during collection (a minimal sketch of the hot-path side follows this list). The merge makes collection a performance-heavy process, and the SDK has no control over the life cycle of threads created by the application, so a thread exiting would also discard all of its aggregated data.
  • Cijo described the lock-free approach used on the hot path in otel-dotnet, where the first-level hashmap stores only an index into the aggregated data.
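
To make the thread-local option concrete, here is a minimal sketch of its hot-path side, assuming `String` keys in place of `AttributeSet` and a plain `u64` counter in place of the aggregation (the hard part, merging every thread's map at collection time, is omitted):

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // Each thread records into its own map, so the hot path takes no locks.
    static LOCAL_VALUES: RefCell<HashMap<String, u64>> = RefCell::new(HashMap::new());
}

fn record(key: &str, delta: u64) {
    LOCAL_VALUES.with(|map| {
        *map.borrow_mut().entry(key.to_owned()).or_insert(0) += delta;
    });
}
```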

The approach in this PR modifies ValueMap to store values with two-level hashing to minimize lock contention. It is essentially a simpler form of the sharding that dashmap provides.

Existing:

```rust
struct ValueMap<T: Number<T>> {
    values: Mutex<HashMap<AttributeSet, T>>,
    has_no_value_attribute_value: AtomicBool,
    no_attribute_value: T::AtomicTracker,
}
```

PR:

```rust
struct ValueMap<T: Number<T>> {
    buckets: Arc<HashMap<u8, Mutex<HashMap<AttributeSet, T>>>>, // ? or RwLock
    has_no_value_attribute_value: AtomicBool,
    no_attribute_value: T::AtomicTracker,
    total_count: AtomicUsize,
}
```

The first level of hashing distributes values across a fixed set of 256 buckets. Each bucket is guarded by its own mutex and contains a second-level hashmap storing the AttributeSet-to-aggregation mapping.
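
A minimal sketch of the hot path under this scheme - not the PR's exact code: `String` stands in for `AttributeSet`, a `u64` counter stands in for the aggregation, and a `Vec` of mutexes stands in for the `HashMap<u8, ...>` bucket map:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const BUCKET_COUNT: usize = 256;

struct ShardedValues {
    // One mutex per bucket: threads touching different buckets never contend.
    buckets: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedValues {
    fn new() -> Self {
        Self {
            buckets: (0..BUCKET_COUNT)
                .map(|_| Mutex::new(HashMap::new()))
                .collect(),
        }
    }

    // First-level hash: map the attribute set to one of the 256 buckets.
    fn bucket_index(key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % BUCKET_COUNT
    }

    // Hot path: lock only the bucket that owns this attribute set.
    fn add(&self, key: &str, delta: u64) {
        let mut bucket = self.buckets[Self::bucket_index(key)].lock().unwrap();
        *bucket.entry(key.to_owned()).or_insert(0) += delta;
    }
}
```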

Please note:

  • This approach won't show any improvement in single-threaded applications, since we still take a lock; in fact, the extra indirection added by sharding may slightly hurt performance. I saw no perf improvement or degradation in the benchmark tests, as they are single-threaded.
  • If all attributes hash to the same bucket (a rare scenario), there is no perf improvement, as all threads contend for the same lock.
  • This approach may hurt locality of reference during the collection cycle: the OS can allocate the buckets in different segments of memory, so they may not be cached together.

All of the above are limitations of any sharded hashmap (including dashmap).

These are the results of our metrics stress test on my machine. However, I don't think the results are really consistent across machines, so it would be good if someone else could test too:

| Threads | Main Throughput (millions/sec) | PR Throughput (millions/sec) | PR Throughput with hashbrown (millions/sec) |
|---------|--------------------------------|------------------------------|---------------------------------------------|
| 3       | 6.6                            | 6.6                          | 6.8                                         |
| 5       | 10.1                           | 12.5                         | 13.2                                        |
| 8       | 5.6                            | 18.9                         | 20.0                                        |
| 10      | 4.5                            | 22.0                         | 24.0                                        |

Based on the above results, the PR's throughput scales significantly with the number of threads. On the main branch, performance increases with threads up to a threshold and then starts degrading.

Benchmark results. This PR won't improve performance here, as these are single-threaded tests. There is a slight regression from the two-level indirection, which is compensated (in fact improved upon) by hashbrown/ahash:

| Test              | Time (main branch) | Time (PR branch) | Time (PR with hashbrown) |
|-------------------|--------------------|------------------|--------------------------|
| AddNoAttrs        | 30.994 ns          | 31.286 ns        | 31.544 ns                |
| AddNoAttrsDelta   | 31.009 ns          | 31.009 ns        | 31.455 ns                |
| AddOneAttr        | 169.76 ns          | 184.00 ns        | 154.39 ns                |
| AddOneAttrDelta   | 172.54 ns          | 185.37 ns        | 157.08 ns                |
| AddThreeAttr      | 396.93 ns          | 393.36 ns        | 368.69 ns                |
| AddThreeAttrDelta | 396.41 ns          | 393.68 ns        | 368.60 ns                |
| AddFiveAttr       | 706.21 ns          | 718.80 ns        | 672.07 ns                |

hashbrown/ahash are widely adopted crates in terms of download counts, and ahash is designed to resist external DoS attacks - https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks
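
For reference, a sketch of how the faster hasher might be plugged in, assuming the `ahash` crate (`ahash::RandomState` implements `BuildHasher`, and hashbrown uses ahash by default):

```rust
use ahash::RandomState;
use std::collections::HashMap;

// Swap the standard HashMap's default SipHash hasher for ahash.
type FastMap<K, V> = HashMap<K, V, RandomState>;

fn new_fast_map<K, V>() -> FastMap<K, V> {
    HashMap::with_hasher(RandomState::new())
}
```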

Merge requirement checklist

  • [ ] CONTRIBUTING guidelines followed
  • [ ] Unit tests added/updated (if applicable)
  • [ ] Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • [ ] Changes in public API reviewed (if applicable)

lalitb avatar Feb 21 '24 08:02 lalitb

Codecov Report

Attention: Patch coverage is 90.93333% with 34 lines in your changes missing coverage. Please review.

Project coverage is 69.6%. Comparing base (f203b03) to head (d973c4d). Report is 9 commits behind head on main.

| Files                                         | Patch % | Lines               |
|-----------------------------------------------|---------|---------------------|
| opentelemetry-sdk/src/metrics/internal/sum.rs | 83.0%   | 34 Missing :warning: |
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #1564     +/-   ##
=======================================
+ Coverage   69.3%   69.6%   +0.3%     
=======================================
  Files        136     136             
  Lines      19637   19946    +309     
=======================================
+ Hits       13610   13894    +284     
- Misses      6027    6052     +25     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 21 '24 08:02 codecov[bot]

Thanks! This is amazing! As discussed in the SIG call, a key bottleneck today is the contention among all Update() calls (irrespective of whether they update the same metric point (timeseries) or not). This eases that a lot, leading to much better scaling, as seen in the throughput increases at higher thread counts!

Let's sit on this for 1-2 days to see if any concerns come up, before proceeding to do the same for the other instrument types. Also, let's do this after the pending release.

cijothomas avatar Feb 21 '24 14:02 cijothomas

@lalitb love the performance gains! Were you able to test this implementation against some of the single-level hashmap alternatives?

Let's sit on this for 1-2 days to see if any concerns come up, before proceeding to do the same for the other instrument types. Also, let's do this after the pending release.

@cijothomas If it's truly for discussion, a week seems like a more reasonable timeframe.

hdost avatar Feb 21 '24 23:02 hdost

Were you able to test this implementation against some of the single-level hashmap alternatives?

@hdost Do you have any alternative implementations in mind? I can test them.

I tested dashmap (which is also a sharded hashmap), but the performance was worse than our existing implementation - https://github.com/open-telemetry/opentelemetry-rust/compare/main...lalitb:metrics-dashmap?expand=1

main branch (using hashmap): Number of threads: 8; Throughput: 6,640,600 iterations/sec, 6,203,800 iterations/sec

using dashmap: Number of threads: 8; Throughput: 1,096,800 iterations/sec, 1,184,600 iterations/sec

As tested separately, dashmap was much faster for concurrent read-only operations, but our scenario is concurrent updates. Also, we used dashmap in the old metrics implementation, but it was removed in the rewrite (I didn't find the reason in the history). In general, I am reluctant to take a dependency on any external crate unless it is widely used :)
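
For comparison, a minimal sketch of what the dashmap variant looks like, again with `String` standing in for `AttributeSet` and a plain counter for the aggregation:

```rust
use dashmap::DashMap;

struct DashValueMap {
    // DashMap shards internally, so no explicit Mutex is needed here.
    values: DashMap<String, u64>,
}

impl DashValueMap {
    fn add(&self, key: &str, delta: u64) {
        // entry() write-locks only the internal shard that owns this key,
        // but every update still pays that shard's RwLock write lock.
        *self.values.entry(key.to_owned()).or_insert(0) += delta;
    }
}
```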

lalitb avatar Feb 22 '24 00:02 lalitb

From what I can see, this has no effect on the public interface, so let's take the win for now and see if we can't make further improvements later 👍

hdost avatar Feb 22 '24 03:02 hdost

I see you've got some further improvements; how's the performance looking for you?

hdost avatar Feb 26 '24 18:02 hdost

I see you've got some further improvements; how's the performance looking for you?

The additional improvements come from (optionally) using fast hashing with hashbrown/ahash. I will update the numbers later from the same machine I used as the baseline earlier. Right now, I am replicating the same changes for the other aggregations.

lalitb avatar Feb 27 '24 20:02 lalitb

I see you've got some further improvements; how's the performance looking for you?

The additional improvements come from (optionally) using fast hashing with hashbrown/ahash. I will update the numbers later from the same machine I used as the baseline earlier. Right now, I am replicating the same changes for the other aggregations.

Updated the latest throughput numbers in the PR description. The perf boost from hashbrown/ahash is marginal.

lalitb avatar Feb 28 '24 07:02 lalitb

Also added benchmark results. They are single-threaded, so the changes won't improve them, though we do see an improvement with the use of hashbrown/ahash.

lalitb avatar Feb 28 '24 20:02 lalitb

As discussed during the community meeting, I have added tests for overflow and concurrent measurement recordings.

lalitb avatar Mar 07 '24 05:03 lalitb

Just an update: I am still working on this, particularly on mitigating the challenges around concurrency and atomicity. The main issue is making sure that checking whether we're under the cardinality limit and then acting on it (like inserting a new entry) happen atomically, without race conditions or inconsistencies under concurrent updates. This turned out to be more complex than initially anticipated, especially when trying to do it without compromising performance. Still working on it; hoping to have a solution this week :)
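
One lock-free way to make that check-then-act atomic is to fold it into a single compare-and-swap loop on the total count. A sketch, where `CARDINALITY_LIMIT` and the reserve flow are illustrative assumptions rather than the PR's final design:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const CARDINALITY_LIMIT: usize = 2000; // hypothetical limit

/// Atomically check-and-increment the series count, so "are we under the
/// limit?" and "claim a slot" cannot race between threads.
fn try_reserve_slot(total_count: &AtomicUsize) -> bool {
    total_count
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |count| {
            if count < CARDINALITY_LIMIT {
                Some(count + 1) // claim a slot for the new attribute set
            } else {
                None // at the limit: record under the overflow series instead
            }
        })
        .is_ok()
}
```

The remaining subtlety is that two threads may both reserve a slot for the same new attribute set, so a reservation still has to be reconciled with the per-bucket map (and released on a duplicate insert), which is where the complexity lies.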

lalitb avatar Mar 19 '24 06:03 lalitb

Just an update: I am still working on this, particularly on mitigating the challenges around concurrency and atomicity. The main issue is making sure that checking whether we're under the cardinality limit and then acting on it (like inserting a new entry) happen atomically, without race conditions or inconsistencies under concurrent updates. This turned out to be more complex than initially anticipated, especially when trying to do it without compromising performance. Still working on it; hoping to have a solution this week :)

Let us know if this is ready for another review. If not, we can mark it as a draft again.

@hdost I am waiting for confirmation that this is ready for review, as there are open conversations that have not been marked resolved.

cijothomas avatar Apr 21 '24 19:04 cijothomas

Let us know if this is ready for another review. If not, we can mark it as a draft again.

Sorry for the delay. Will revisit it now. If it takes more time, I will move it to draft.

lalitb avatar Apr 30 '24 15:04 lalitb

There have been substantial perf improvements in metrics instrumentation with #1833. Will revisit this PR after the logs beta release. Closing it for now to keep the PR list minimal.

lalitb avatar Jul 02 '24 20:07 lalitb