flurry
Sharded counter optimisation
This is picking up from the great work from @jimvdl in this PR: https://github.com/jonhoo/flurry/pull/101
I've added the crate here: https://github.com/JackThomson2/fast-counter which uses a basic sharded concurrent counter whose shard is identified using a thread_local. It seems that in the worst case it's slightly slower, but with more cores it starts to pull ahead.
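To make the approach concrete, here is a minimal sketch of a sharded counter keyed by a thread-local id. The names and layout are illustrative only and do not match fast-counter's actual internals:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicIsize, AtomicUsize, Ordering};

const SHARDS: usize = 16; // illustrative; ideally sized to the core count

/// Hypothetical sketch of a sharded counter, not fast-counter's real API.
pub struct ShardedCounter {
    shards: [AtomicIsize; SHARDS],
}

// Global source of distinct per-thread ids.
static NEXT_THREAD_ID: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Each thread lazily grabs an id on first use; a Cell suffices since
    // the value is only ever touched from its own thread.
    static THREAD_ID: Cell<usize> = Cell::new(usize::MAX);
}

impl ShardedCounter {
    pub fn new() -> Self {
        Self { shards: std::array::from_fn(|_| AtomicIsize::new(0)) }
    }

    fn shard(&self) -> &AtomicIsize {
        let id = THREAD_ID.with(|c| {
            let mut id = c.get();
            if id == usize::MAX {
                id = NEXT_THREAD_ID.fetch_add(1, Ordering::Relaxed);
                c.set(id);
            }
            id
        });
        &self.shards[id % SHARDS]
    }

    /// Threads mostly hit distinct shards (and cache lines), cutting contention.
    pub fn add(&self, delta: isize) {
        self.shard().fetch_add(delta, Ordering::Relaxed);
    }

    /// Fold all shards into one total.
    pub fn sum(&self) -> isize {
        self.shards.iter().map(|s| s.load(Ordering::Relaxed)).sum()
    }
}

fn main() {
    let c = ShardedCounter::new();
    c.add(5);
    c.add(-2);
    assert_eq!(c.sum(), 3);
}
```

The win over a single `AtomicUsize` only shows up under contention, which matches the observation that it pulls ahead as core counts grow.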
These are the results using the conc-bench repo on my 16 core machine:

Codecov Report
Merging #109 (c92dbf2) into master (85ac469) will decrease coverage by 6.39%. The diff coverage is 90.00%.
Additional details and impacted files

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/map.rs | 75.72% <88.88%> (-4.86%) | :arrow_down: |
| src/node.rs | 73.84% <100.00%> (-4.69%) | :arrow_down: |
| src/serde_impls.rs | 83.33% <0.00%> (-7.98%) | :arrow_down: |
| src/rayon_impls.rs | 91.66% <0.00%> (-7.00%) | :arrow_down: |
| src/iter/traverser.rs | 82.89% <0.00%> (-6.95%) | :arrow_down: |
| src/iter/mod.rs | 100.00% <0.00%> (ø) | |
| src/lib.rs | 16.12% <0.00%> (ø) | |
| src/raw/mod.rs | 80.00% <0.00%> (+0.27%) | :arrow_up: |
Nice, thanks for pushing this forward!
I took a look at the code for `fast-counter`, and had a couple of thoughts:

- I think `THREAD_ID` can be a `Cell<usize>` instead of an `UnsafeCell`. Should let you get rid of the `unsafe` block you have in there.
- On nightly you can use `ThreadId::as_u64` rather than keeping `THREAD_ID` yourself.
- It's not entirely clear to me why you have stable and nightly separated? The use of `#[thread_local]` doesn't feel like it's very important, and in terms of performance the difference seems fairly minor.
Thanks for the feedback, all great points; I'll try to address these soon.
The nightly / stable split just happened as an effect of me starting with the nightly version first, and originally the performance difference was greater. I left it as an `isize` because the previous PR used this approach, but I'm happy to update it to a `usize` 👍
Hey @jonhoo, I've been looking into this; a few thoughts and findings:

- I tested using `ThreadId::as_u64`; this proved to be significantly slower. A quick look in Godbolt suggests we have to deal with `Arc`s etc. The result was consistently 2-3x slower.
- I tried using `Cell` over `UnsafeCell`. This again made it slower; I'm not sure if the compiler can't optimise it as well or what, but my benchmark showed the counter being slower than a regular `AtomicUsize` on every core count.
- I will look at removing the nightly option; the results look very close now, so it's probably not needed.
- I looked at holding a `usize`; the issue I found was that it is possible for a different cell to be decremented, i.e. going into negatives. Would you want us to return a `usize` and take a `usize`, but use an `isize` internally?
Thanks again for reviewing this
Sorry for the super slow reply!
> I tested using `ThreadId::as_u64`; this proved to be significantly slower. A quick look in Godbolt suggests we have to deal with `Arc`s etc. The result was consistently 2-3x slower.
Looking at the std source, it seems like the `Arc` is in the `Thread` type, whereas the `ThreadId` type is just a plain integer. I'm guessing this ends up slow because you're either calling `thread::current()` on every access (which is slow) or keeping the `Thread` around, which means going through the `Arc` each time to get to the numeric ID. I think the way to fix this is to store the `ThreadId` in a thread-local. That way you should have super cheap access to it (`ThreadId::as_u64` is just a field access) without having to keep an ID counter yourself.
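The caching idea can be sketched like this. Since `ThreadId::as_u64` is nightly-only, this stable sketch hashes the cached `ThreadId` to pick a shard instead; `shard_index` and `TID` are illustrative names, not anything from fast-counter:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread::{self, ThreadId};

thread_local! {
    // Cache the ThreadId once per thread; it's the repeated
    // thread::current() calls (and the Arc inside Thread) that are
    // expensive, not the id itself.
    static TID: ThreadId = thread::current().id();
}

// On nightly, ThreadId::as_u64 would give the integer directly; on
// stable we can hash the cached ThreadId to derive a shard index.
fn shard_index(nshards: usize) -> usize {
    TID.with(|id| {
        let mut h = DefaultHasher::new();
        id.hash(&mut h);
        h.finish() as usize % nshards
    })
}

fn main() {
    let i = shard_index(16);
    assert!(i < 16);
    // The index is stable for the lifetime of the thread.
    assert_eq!(i, shard_index(16));
}
```

The hashing costs a little over a plain field access, but the expensive `thread::current()` call happens only once per thread.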
> I tried using `Cell` over `UnsafeCell`. This again made it slower; I'm not sure if the compiler can't optimise it as well or what, but my benchmark showed the counter being slower than a regular `AtomicUsize` on every core count.
That's super weird! The code for `Cell` is really just an `UnsafeCell` with `#[repr(transparent)]`, and `Cell::set` just does a `mem::replace` through `self.0.get()`, which should cost exactly as much as just setting the value, and certainly not as much as running atomic operations. Could you share the raw numbers you observed so we can figure out what's going on?
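A tiny demonstration of the point being made here, that `Cell` adds no overhead over `UnsafeCell`:

```rust
use std::cell::{Cell, UnsafeCell};
use std::mem::size_of;

fn main() {
    // Cell<T> is a #[repr(transparent)] wrapper around UnsafeCell<T>,
    // so the two have identical size and layout.
    assert_eq!(size_of::<Cell<usize>>(), size_of::<UnsafeCell<usize>>());

    // Cell::set is a plain non-atomic store (plus dropping the old
    // value, a no-op for usize); no atomic instructions are involved.
    let c = Cell::new(1usize);
    c.set(2);
    assert_eq!(c.get(), 2);
}
```

So any measured slowdown would have to come from somewhere else, which is why the raw numbers would be interesting.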
I will look at removing the nightly option it does look like the results are very close now so probably not needed.
Nice!
> I looked at holding a `usize`; the issue I found was that it is possible for a different cell to be decremented, i.e. going into negatives. Would you want us to return a `usize` and take a `usize`, but use an `isize` internally?
Ah, yes, that's a great point, and a totally valid reason for using an `isize` for the sharded counter. I think I'd still prefer for the API to return a `usize` because the sum should never be negative, which I think we can simply assert with a `debug_assert!` (so it doesn't affect release build performance). It's fine to make `add` take an `isize` though, I think. It's a little weird, but not the end of the world.
Thanks again for reviewing this
Thanks for digging so deep!
Actually, on second thought, in a concurrent setting with lazily updated counters, it's entirely possible for the sum of the counters to be negative for some short period of time. So I think we should actually leave the return value as an `isize` :+1:
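A tiny illustration of why the sum can be transiently negative, using plain atomics rather than the real counter:

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

fn main() {
    // Two shards of a hypothetical counter: a thread recorded a removal
    // on shard 1 before the matching insert on shard 0 became visible.
    let shards = [AtomicIsize::new(0), AtomicIsize::new(0)];
    shards[1].fetch_add(-1, Ordering::Relaxed);

    // The transient total is negative, so the public sum must be isize.
    let sum: isize = shards.iter().map(|s| s.load(Ordering::Relaxed)).sum();
    assert_eq!(sum, -1);
}
```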
Looks like there are some CI failures now?
I've been following along somewhat closely and I'm happy to see that you guys made good progress. 🎉 If either of you ever need assistance (with anything, really) you can always @ me :) (although I'm fairly certain you both got this in the bag). Well done Jack & Jon!
> Looks like there are some CI failures now?
Saw these, I'll take a look at resolving them when I get a chance! Thanks for running the workflow