opentelemetry-rust
Adding two level hashing in metrics hashmap
Changes
This PR is opened just for discussion of one possible change to improve the performance of concurrent recording/updating of measurements on the hot path. In today's (Feb 20th, 2024) community meeting, we discussed multiple approaches:
- Using `dashmap` - a sharded hashmap.
- Using `flurry` - uses fine-grained atomic locks in the hashmap.
- Using `sharded-slab` - building a hashmap from the concurrent data structures provided by this crate.
- Using thread-local hashmaps on the hot path to record measurements and then merging them during collection. Collection would be a performance-heavy process, and the SDK has no control over the life cycle of threads created by the application, so a thread exiting will also remove all of its aggregated data (see the sketch after this list).
- Cijo spoke about the lock-free approach for the hot path used in otel-dotnet, where the first level of the hashmap only stores the index to the aggregated data.
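For context, the thread-local alternative above would look roughly like the sketch below (illustrative only; a plain `Vec<(String, String)>` stands in for the SDK's `AttributeSet`). It avoids locking entirely on the hot path, but pushes all the merge cost and the thread-lifecycle problem onto collection:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Sketch of the thread-local alternative (not what this PR does): each
// thread records into its own map with no locking on the hot path.
thread_local! {
    static LOCAL_SUMS: RefCell<HashMap<Vec<(String, String)>, u64>> =
        RefCell::new(HashMap::new());
}

fn record(attrs: Vec<(String, String)>, value: u64) {
    LOCAL_SUMS.with(|m| *m.borrow_mut().entry(attrs).or_insert(0) += value);
}

// Collection would have to visit and merge every thread's map, and a map
// owned by a thread that has already exited is gone, which is the drawback
// called out above.
```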
The approach in this PR modifies `ValueMap` to store values using two-level hashing to minimize lock contention. It's basically a simpler form of the sharding provided by dashmap.
Existing:
```rust
struct ValueMap<T: Number<T>> {
    values: Mutex<HashMap<AttributeSet, T>>,
    has_no_value_attribute_value: AtomicBool,
    no_attribute_value: T::AtomicTracker,
}
```
PR:
```rust
struct ValueMap<T: Number<T>> {
    buckets: Arc<HashMap<u8, Mutex<HashMap<AttributeSet, T>>>>, // ? or RwLock
    has_no_value_attribute_value: AtomicBool,
    no_attribute_value: T::AtomicTracker,
    total_count: AtomicUsize,
}
```
The first level of hashing distributes the values across a fixed set of 256 buckets. Each bucket is guarded by its own mutex and contains a second-level hashmap storing the AttributeSet-to-aggregation mapping.
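As a rough illustration of the record path under this scheme (a minimal sketch only, not the PR's code; `Buckets`, `bucket_for`, and the `Vec<(String, String)>` stand-in for `AttributeSet` are made up for the example):

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the SDK's AttributeSet; assumed to be Hash + Eq.
type AttributeSet = Vec<(String, String)>;

struct Buckets<T> {
    // First level: a fixed, read-only map of 256 buckets shared via Arc.
    // Each bucket owns its own Mutex, so threads that hash to different
    // buckets never contend on the same lock.
    buckets: Arc<HashMap<u8, Mutex<HashMap<AttributeSet, T>>>>,
}

impl<T: std::ops::AddAssign + Default> Buckets<T> {
    fn new() -> Self {
        let buckets = (0..=u8::MAX)
            .map(|i| (i, Mutex::new(HashMap::new())))
            .collect();
        Self { buckets: Arc::new(buckets) }
    }

    // Hash the attributes once to choose one of the 256 buckets.
    fn bucket_for(attrs: &AttributeSet) -> u8 {
        let mut hasher = DefaultHasher::new();
        attrs.hash(&mut hasher);
        (hasher.finish() % 256) as u8
    }

    // Second level: lock only the selected bucket, then update the
    // per-attribute aggregation inside it.
    fn record(&self, attrs: AttributeSet, value: T) {
        let bucket = Self::bucket_for(&attrs);
        let mut map = self.buckets[&bucket].lock().unwrap();
        *map.entry(attrs).or_default() += value;
    }
}
```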
Please note:
- This approach won't show any improvement in single-threaded applications, as we still take a lock. In fact, the complexity added by sharding may slightly impact performance. I didn't see any perf improvement or degradation in the benchmark tests, as they are single-threaded.
- If all the attributes hash to the same bucket (a rare scenario), there won't be any perf improvement, as all the threads will contend for the same lock.
- This approach may impact locality of reference during the collection cycle, as the OS can allocate these buckets in different segments of memory, so they may not be cached together.

All of the above are limitations of any sharded hashmap (including dashmap).
These are the results of our metrics stress test on my machine. The results aren't really consistent across machines, so it would be good if someone else could test it too:
| Threads | Main Throughput (millions/sec) | PR Throughput (millions/sec) | PR Throughput (millions/sec) with hashbrown |
|---|---|---|---|
| 3 | 6.6 | 6.6 | 6.8 |
| 5 | 10.1 | 12.5 | 13.2 |
| 8 | 5.6 | 18.9 | 20.0 |
| 10 | 4.5 | 22.0 | 24.0 |
Based on the above results, with this PR the performance scales significantly with the number of threads. On the main branch, the performance increases with the number of threads up to a threshold and then starts degrading.
Benchmark results. This PR won't improve performance there, as these are single-threaded tests. There is a slight regression from the two-level indirection, which is compensated for (in fact improved upon) with hashbrown/ahash:
| Test | Time (main branch) | Time (PR branch) | Time (PR with hashbrown) |
|---|---|---|---|
| AddNoAttrs | 30.994 ns | 31.286 ns | 31.544 ns |
| AddNoAttrsDelta | 31.009 ns | 31.009 ns | 31.455 ns |
| AddOneAttr | 169.76 ns | 184.00 ns | 154.39 ns |
| AddOneAttrDelta | 172.54 ns | 185.37 ns | 157.08 ns |
| AddThreeAttr | 396.93 ns | 393.36 ns | 368.69 ns |
| AddThreeAttrDelta | 396.41 ns | 393.68 ns | 368.60 ns |
| AddFiveAttr | 706.21 ns | 718.80 ns | 672.07 ns |
Hashbrown and ahash seem to be widely adopted crates in terms of download counts, and also secure enough against external DoS attacks - https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks
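A sketch of how the second-level map could be switched to hashbrown behind a cargo feature (the feature name `use_hashbrown` is hypothetical; at the time of this PR, hashbrown's default hasher was aHash):

```rust
// Cargo.toml (hypothetical): hashbrown = { version = "0.14", optional = true }
#[cfg(feature = "use_hashbrown")]
use hashbrown::HashMap; // default hasher was aHash at the time of this PR
#[cfg(not(feature = "use_hashbrown"))]
use std::collections::HashMap; // std default hasher is SipHash 1-3

// Only the second-level map type changes; the two-level bucketing
// logic stays the same either way.
type BucketMap<K, V> = HashMap<K, V>;
```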
Merge requirement checklist
- [ ] CONTRIBUTING guidelines followed
- [ ] Unit tests added/updated (if applicable)
- [ ] Appropriate `CHANGELOG.md` files updated for non-trivial, user-facing changes
- [ ] Changes in public API reviewed (if applicable)
Codecov Report
Attention: Patch coverage is 90.93333% with 34 lines in your changes missing coverage. Please review.
Project coverage is 69.6%. Comparing base (f203b03) to head (d973c4d). Report is 9 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| opentelemetry-sdk/src/metrics/internal/sum.rs | 83.0% | 34 Missing :warning: |
Additional details and impacted files
```diff
@@           Coverage Diff           @@
##            main    #1564    +/-   ##
=======================================
+ Coverage    69.3%    69.6%   +0.3%
=======================================
  Files         136      136
  Lines       19637    19946    +309
=======================================
+ Hits        13610    13894    +284
- Misses       6027     6052     +25
```
Thanks! This is amazing! As discussed in the SIG call, a key bottleneck today is the contention among all Update() calls (irrespective of whether they are updating the same metric point (timeseries) or not). This eases that a lot, leading to much better scaling, as seen in the throughput increases at higher thread counts!
Let's sit on this for 1-2 days to see if there are any concerns, before proceeding to do the same for other instrument types. Also, let's do this after the pending release.
@lalitb love the performance gains! Were you able to test this impl against some of the single level hashmap alternative implementations?
> Let's sit on this for 1-2 days to see if there are any concerns, before proceeding to do the same for other instrument types. Also, let's do this after the pending release.
@cijothomas If it's truly for discussion, a week seems like a more reasonable timeframe.
> Were you able to test this impl against some of the single level hashmap alternative implementations?
@hdost Do you have any alternative implementations in mind? I can test them.
I tested dashmap (which is also a sharded hashmap) but the performance was worse than our existing implementation - https://github.com/open-telemetry/opentelemetry-rust/compare/main...lalitb:metrics-dashmap?expand=1
```text
main branch (using hashmap):
Number of threads: 8
Throughput: 6,640,600 iterations/sec
Throughput: 6,203,800 iterations/sec

using dashmap:
Number of threads: 8
Throughput: 1,096,800 iterations/sec
Throughput: 1,184,600 iterations/sec
```
As tested separately, dashmap was much faster for concurrent read-only operations, but our scenario is concurrent updates. Also, we were using dashmap in the old metrics implementation, but it was removed in the new implementation (I didn't find the reason in the history). In general, we are reluctant to take a dependency on any external crate unless it is widely used :)
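For reference, the dashmap update path in that comparison branch is roughly of this shape (a sketch only, not the actual branch code; `record` and the `Vec<(String, String)>` key type are illustrative):

```rust
use dashmap::DashMap;

// Hypothetical stand-in for the SDK's AttributeSet.
type AttributeSet = Vec<(String, String)>;

// dashmap shards internally, but every update still goes through its
// entry API, which takes a write lock on the shard owning the key.
fn record(values: &DashMap<AttributeSet, u64>, attrs: AttributeSet, value: u64) {
    *values.entry(attrs).or_insert(0) += value;
}
```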
From what I can see, there's no public interface impact from this, so let's take the win for now and see if we can't make further improvements later 👍
I see you've got some further improvements, how's the performance improvement for you?
> I see you've got some further improvements, how's the performance improvement for you?
The additional improvements come from (optionally) using the faster hashing from hashbrown/ahash. I will update the numbers later from the same machine I used as the baseline earlier. As of now, I am replicating the same changes for the other aggregations.
Updated the latest throughputs in the PR description. The perf boost from using hashbrown/ahash is marginal.
Also added benchmark results; they are single-threaded, so the changes won't improve those, though we do see an improvement with the use of hashbrown/ahash.
As discussed during the community meeting, have added the tests for overflow and concurrent measurement recordings.
Just an update: I am still working on this, particularly on mitigating the challenges around concurrency and atomicity. The main issue is making sure that checking whether we're under the cardinality limit and then acting on it (like inserting a new entry) happen together without any race conditions or inconsistencies due to concurrent updates. This turned out to be more complex than initially anticipated, especially when trying to do it without compromising on performance. Still working on it, hoping to have a solution during this week :)
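To make the atomicity problem concrete, one possible shape (not necessarily what this PR will end up doing; the names, the `String` key, and the single global counter are simplifications) is to re-check under the bucket's lock and reserve a slot on the shared counter before inserting:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

const CARDINALITY_LIMIT: usize = 2000; // hypothetical limit, for illustration

// Sketch: the shared counter alone is not enough, because "check the count"
// and "insert the new entry" are two separate steps. One way to keep them
// consistent is to decide about insertion while holding the bucket's lock,
// and to reserve a slot on the counter (rolling back if over the limit).
fn try_insert(
    bucket: &Mutex<HashMap<String, u64>>,
    total_count: &AtomicUsize,
    key: String,
    value: u64,
) -> bool {
    let mut map = bucket.lock().unwrap();
    if let Some(existing) = map.get_mut(&key) {
        *existing += value; // existing timeseries: no cardinality check needed
        return true;
    }
    // New timeseries: atomically reserve a slot, and roll back if over limit.
    if total_count.fetch_add(1, Ordering::AcqRel) >= CARDINALITY_LIMIT {
        total_count.fetch_sub(1, Ordering::AcqRel);
        return false; // caller would fold this into the overflow attribute set
    }
    map.insert(key, value);
    true
}
```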
Let us know if this is ready for another review. If not, we can mark as draft again.
@hdost I am waiting for confirmation that this is ready for review, as we have open conversations that are not marked resolved.
> Let us know if this is ready for another review. If not, we can mark as draft again.
Sorry for the delay. Will revisit it now. If it takes more time, will move it to draft.
There have been substantial perf improvements in metrics instrumentation with #1833. Will revisit this PR after logs beta release. Closing this for now to keep the PR list minimal.