flurry
Sharded counter optimisation
This is picking up from the great work from @jimvdl in this PR: https://github.com/jonhoo/flurry/pull/101
I've added the crate here: https://github.com/JackThomson2/fast-counter which uses a basic sharded concurrent counter whose shard is identified using a thread_local. It seems that in the worst case it's slightly slower, but with more cores it starts to pull ahead.
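To make the approach concrete, here is a minimal sketch of a sharded counter keyed by a thread-local id. The names and layout are illustrative only and do not match fast-counter's actual internals:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicIsize, AtomicUsize, Ordering};

const SHARDS: usize = 16; // illustrative; ideally sized to the core count

/// Hypothetical sketch of a sharded counter, not fast-counter's real API.
pub struct ShardedCounter {
    shards: [AtomicIsize; SHARDS],
}

// Global source of distinct per-thread ids.
static NEXT_THREAD_ID: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Each thread lazily grabs an id on first use; a Cell suffices since
    // the value is only ever touched from its own thread.
    static THREAD_ID: Cell<usize> = Cell::new(usize::MAX);
}

impl ShardedCounter {
    pub fn new() -> Self {
        Self { shards: std::array::from_fn(|_| AtomicIsize::new(0)) }
    }

    fn shard(&self) -> &AtomicIsize {
        let id = THREAD_ID.with(|c| {
            let mut id = c.get();
            if id == usize::MAX {
                id = NEXT_THREAD_ID.fetch_add(1, Ordering::Relaxed);
                c.set(id);
            }
            id
        });
        &self.shards[id % SHARDS]
    }

    /// Threads mostly hit distinct shards (and cache lines), cutting contention.
    pub fn add(&self, delta: isize) {
        self.shard().fetch_add(delta, Ordering::Relaxed);
    }

    /// Fold all shards into one total.
    pub fn sum(&self) -> isize {
        self.shards.iter().map(|s| s.load(Ordering::Relaxed)).sum()
    }
}

fn main() {
    let c = ShardedCounter::new();
    c.add(5);
    c.add(-2);
    assert_eq!(c.sum(), 3);
}
```

The win over a single `AtomicUsize` only shows up under contention, which matches the observation that it pulls ahead as core counts grow.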
These are the results using the conc-bench repo on my 16 core machine:

Codecov Report
Merging #109 (c92dbf2) into master (85ac469) will decrease coverage by 6.39%. The diff coverage is 90.00%.
Additional details and impacted files

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/map.rs | 75.72% <88.88%> (-4.86%) | :arrow_down: |
| src/node.rs | 73.84% <100.00%> (-4.69%) | :arrow_down: |
| src/serde_impls.rs | 83.33% <0.00%> (-7.98%) | :arrow_down: |
| src/rayon_impls.rs | 91.66% <0.00%> (-7.00%) | :arrow_down: |
| src/iter/traverser.rs | 82.89% <0.00%> (-6.95%) | :arrow_down: |
| src/iter/mod.rs | 100.00% <0.00%> (ø) | |
| src/lib.rs | 16.12% <0.00%> (ø) | |
| src/raw/mod.rs | 80.00% <0.00%> (+0.27%) | :arrow_up: |
Nice, thanks for pushing this forward!
I took a look at the code for `fast-counter`, and had a couple of thoughts:

- I think `THREAD_ID` can be a `Cell<usize>` instead of an `UnsafeCell`. Should let you get rid of the `unsafe` block you have in there.
- On nightly you can use `ThreadId::as_u64` rather than keeping `THREAD_ID` yourself.
- It's not entirely clear to me why you have stable and nightly separated? The use of `#[thread_local]` doesn't feel like it's very important, and in terms of performance the difference seems fairly minor.
Thanks for the feedback, all great points; I'll try to address these soon.
The nightly / stable split just happened as an effect of me starting with the nightly version first, and originally the performance difference was greater. I left it as an `isize` because the previous PR used this approach, but I'm happy to update it to a `usize` 👍
Hey @jonhoo, I've been looking into this; a few thoughts and findings:

- I tested using `ThreadId::as_u64`; this proved to be significantly slower. A quick look in Godbolt suggests we have to deal with `Arc`s etc. The result was consistently 2-3x slower.
- I tried using `Cell` over `UnsafeCell`. This again made it slower; I'm not sure if the compiler can't optimise it as well or what, but my benchmark showed the counter being slower than a regular `AtomicUsize` on every core count.
- I will look at removing the nightly option; the results look very close now, so it's probably not needed.
- I looked at holding a `usize`; the issue I found was that it is possible for a different cell to be decremented, i.e. going into negatives. Would you want us to return a `usize` and take a `usize`, but use an `isize` internally?
Thanks again for reviewing this
Sorry for the super slow reply!
> I tested using `ThreadId::as_u64`; this proved to be significantly slower. A quick look in Godbolt suggests we have to deal with `Arc`s etc. The result was consistently 2-3x slower.
Looking at the std source, it seems like the `Arc` is in the `Thread` type, whereas the `ThreadId` type is just a plain integer. I'm guessing this ends up slow because you're either calling `thread::current()` on every access (which is slow) or keeping the `Thread` around, which means going through the `Arc` each time to get to the numeric ID. I think the way to fix this is to store the `ThreadId` in a thread-local. That way you should have super cheap access to it (`ThreadId::as_u64` is just a field access) without having to keep an ID counter yourself.
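The caching idea can be sketched like this. Since `ThreadId::as_u64` is nightly-only, this stable sketch hashes the cached `ThreadId` to pick a shard instead; `shard_index` and `TID` are illustrative names, not anything from fast-counter:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread::{self, ThreadId};

thread_local! {
    // Cache the ThreadId once per thread; it's the repeated
    // thread::current() calls (and the Arc inside Thread) that are
    // expensive, not the id itself.
    static TID: ThreadId = thread::current().id();
}

// On nightly, ThreadId::as_u64 would give the integer directly; on
// stable we can hash the cached ThreadId to derive a shard index.
fn shard_index(nshards: usize) -> usize {
    TID.with(|id| {
        let mut h = DefaultHasher::new();
        id.hash(&mut h);
        h.finish() as usize % nshards
    })
}

fn main() {
    let i = shard_index(16);
    assert!(i < 16);
    // The index is stable for the lifetime of the thread.
    assert_eq!(i, shard_index(16));
}
```

The hashing costs a little over a plain field access, but the expensive `thread::current()` call happens only once per thread.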
> I tried using `Cell` over `UnsafeCell`. This again made it slower; I'm not sure if the compiler can't optimise it as well or what, but my benchmark showed the counter being slower than a regular `AtomicUsize` on every core count.
That's super weird! The code for `Cell` is really just an `UnsafeCell` with `#[repr(transparent)]`, and `Cell::set` just does a `mem::replace` through `self.0.get()`, which should cost exactly as much as just setting the value, and certainly not as much as running atomic operations. Could you share the raw numbers you observed so we can figure out what's going on?
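A tiny demonstration of the point being made here, that `Cell` adds no overhead over `UnsafeCell`:

```rust
use std::cell::{Cell, UnsafeCell};
use std::mem::size_of;

fn main() {
    // Cell<T> is a #[repr(transparent)] wrapper around UnsafeCell<T>,
    // so the two have identical size and layout.
    assert_eq!(size_of::<Cell<usize>>(), size_of::<UnsafeCell<usize>>());

    // Cell::set is a plain non-atomic store (plus dropping the old
    // value, a no-op for usize); no atomic instructions are involved.
    let c = Cell::new(1usize);
    c.set(2);
    assert_eq!(c.get(), 2);
}
```

So any measured slowdown would have to come from somewhere else, which is why the raw numbers would be interesting.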
I will look at removing the nightly option it does look like the results are very close now so probably not needed.
Nice!
> I looked at holding a `usize`; the issue I found was that it is possible for a different cell to be decremented, i.e. going into negatives. Would you want us to return a `usize` and take a `usize`, but use an `isize` internally?
Ah, yes, that's a great point, and a totally valid reason for using an `isize` for the sharded counter. I think I'd still prefer for the API to return a `usize` because the sum should never be negative, which I think we can simply assert with a `debug_assert!` (so it doesn't affect release build performance). It's fine to make `add` take an `isize` though, I think. It's a little weird, but not the end of the world.
Thanks again for reviewing this
Thanks for digging so deep!
Actually, on second thought, in a concurrent setting with lazily updated counters, it's entirely possible for the sum of the counters to be negative for some short period of time. So I think we should actually leave the return value as an `isize` :+1:
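A tiny illustration of why the sum can be transiently negative, using plain atomics rather than the real counter:

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

fn main() {
    // Two shards of a hypothetical counter: a thread recorded a removal
    // on shard 1 before the matching insert on shard 0 became visible.
    let shards = [AtomicIsize::new(0), AtomicIsize::new(0)];
    shards[1].fetch_add(-1, Ordering::Relaxed);

    // The transient total is negative, so the public sum must be isize.
    let sum: isize = shards.iter().map(|s| s.load(Ordering::Relaxed)).sum();
    assert_eq!(sum, -1);
}
```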
Looks like there are some CI failures now?
I've been following along somewhat closely and I'm happy to see that you guys made good progress. 🎉 If either of you ever need assistance (with anything, really) you can always @ me :) (although I'm fairly certain you both got this in the bag). Well done Jack & Jon!
> Looks like there are some CI failures now?
Saw these, I'll take a look at resolving them when I get a chance! Thanks for running the workflow