Improvement on cache invalidation
I've recently been fiddling with a multi-threaded application using crossbeam channels. On a Threadripper I noticed that performance would degrade rapidly when the sender and receiver were on different CCXs (in other words, when cache wasn't shared between sender and receiver).
With a bit of digging I found that the array implementation of channels suffers from the buffer not being cache-aligned.
I wrapped the buffer in a CachePadded and it improved significantly in my tests, over 2x in some cases. That said, the tests obviously only capture a small slice of real usage, and sending a single 64-bit value in them is exactly the extreme case that triggers this problem. Still, it looks like a nice improvement.
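For context, here is a rough sketch of what the change amounts to. The Slot type below is a simplified stand-in, not the actual crossbeam-channel internals; the point is only that each slot in the array flavor's buffer gets padded so neighbouring slots no longer share a cache line.

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

use crossbeam_utils::CachePadded;

// Simplified stand-in for a slot in the bounded (array) channel flavor:
// a stamp used for synchronisation plus storage for the message.
struct Slot<T> {
    stamp: AtomicUsize,
    msg: UnsafeCell<MaybeUninit<T>>,
}

// Before: slots are packed back to back, so for small T several slots
// share one cache line, and a sender and receiver working on adjacent
// slots from different CCXs keep invalidating each other's lines.
type Buffer<T> = Box<[Slot<T>]>;

// After: each slot is padded out to its own cache line, so cross-core
// traffic on one slot no longer evicts its neighbours.
type PaddedBuffer<T> = Box<[CachePadded<Slot<T>>]>;
```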
I will keep this as a draft for now: while the benchmarks look nice, the real-world impact I measured is not as big as I hoped :tm:, so I think I have a bit more digging to do.
this branch:
running 24 tests
test bounded_0::create ... bench: 45 ns/iter (+/- 0)
test bounded_0::mpmc ... bench: 48,288,434 ns/iter (+/- 7,449,535)
test bounded_0::mpsc ... bench: 79,192,749 ns/iter (+/- 2,066,971)
test bounded_0::spmc ... bench: 86,323,323 ns/iter (+/- 2,397,011)
test bounded_0::spsc ... bench: 29,002,652 ns/iter (+/- 1,547,061)
test bounded_1::create ... bench: 195 ns/iter (+/- 3)
test bounded_1::mpmc ... bench: 20,014,231 ns/iter (+/- 436,181)
test bounded_1::mpsc ... bench: 109,318,843 ns/iter (+/- 4,693,295)
test bounded_1::oneshot ... bench: 180 ns/iter (+/- 1)
test bounded_1::spmc ... bench: 97,771,406 ns/iter (+/- 3,625,496)
test bounded_1::spsc ... bench: 18,986,039 ns/iter (+/- 238,972)
test bounded_n::mpmc ... bench: 5,376,086 ns/iter (+/- 422,042)
test bounded_n::mpsc ... bench: 11,749,680 ns/iter (+/- 560,767)
test bounded_n::par_inout ... bench: 13,453,292 ns/iter (+/- 966,845)
test bounded_n::spmc ... bench: 89,016,467 ns/iter (+/- 3,262,106)
test bounded_n::spsc ... bench: 4,137,098 ns/iter (+/- 375,743)
test unbounded::create ... bench: 109 ns/iter (+/- 1)
test unbounded::inout ... bench: 39 ns/iter (+/- 0)
test unbounded::mpmc ... bench: 3,024,718 ns/iter (+/- 179,688)
test unbounded::mpsc ... bench: 5,306,185 ns/iter (+/- 362,481)
test unbounded::oneshot ... bench: 175 ns/iter (+/- 2)
test unbounded::par_inout ... bench: 10,732,447 ns/iter (+/- 542,191)
test unbounded::spmc ... bench: 92,086,599 ns/iter (+/- 1,790,785)
test unbounded::spsc ... bench: 1,303,073 ns/iter (+/- 16,593)
master:
running 24 tests
test bounded_0::create ... bench: 45 ns/iter (+/- 0)
test bounded_0::mpmc ... bench: 47,513,539 ns/iter (+/- 7,685,319)
test bounded_0::mpsc ... bench: 79,297,255 ns/iter (+/- 1,721,529)
test bounded_0::spmc ... bench: 86,583,535 ns/iter (+/- 2,025,047)
test bounded_0::spsc ... bench: 29,433,918 ns/iter (+/- 3,792,133)
test bounded_1::create ... bench: 120 ns/iter (+/- 5)
test bounded_1::mpmc ... bench: 19,896,780 ns/iter (+/- 523,015)
test bounded_1::mpsc ... bench: 106,761,448 ns/iter (+/- 4,330,258)
test bounded_1::oneshot ... bench: 138 ns/iter (+/- 3)
test bounded_1::spmc ... bench: 100,886,592 ns/iter (+/- 2,866,250)
test bounded_1::spsc ... bench: 28,713,988 ns/iter (+/- 1,218,632)
test bounded_n::mpmc ... bench: 6,456,962 ns/iter (+/- 516,168)
test bounded_n::mpsc ... bench: 13,604,237 ns/iter (+/- 338,683)
test bounded_n::par_inout ... bench: 12,855,325 ns/iter (+/- 1,735,288)
test bounded_n::spmc ... bench: 97,568,112 ns/iter (+/- 3,793,568)
test bounded_n::spsc ... bench: 2,035,692 ns/iter (+/- 753,005)
test unbounded::create ... bench: 112 ns/iter (+/- 2)
test unbounded::inout ... bench: 39 ns/iter (+/- 0)
test unbounded::mpmc ... bench: 3,014,406 ns/iter (+/- 308,277)
test unbounded::mpsc ... bench: 5,213,754 ns/iter (+/- 159,838)
test unbounded::oneshot ... bench: 165 ns/iter (+/- 1)
test unbounded::par_inout ... bench: 10,640,906 ns/iter (+/- 743,346)
test unbounded::spmc ... bench: 91,300,215 ns/iter (+/- 2,178,182)
test unbounded::spsc ... bench: 1,480,523 ns/iter (+/- 47,803)
Note that this will significantly increase the memory usage of channels, which is not really desirable (since with this change a slot value's size will be aligned to a multiple of 128 bytes, at least on x86-64).
That's a good point: especially for small values the memory growth would be considerable. On the other hand, it's exactly those values where the performance difference is most significant too.
I'm not sure what the right trade-off is; perhaps this would be better suited as its own flavor.
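To make the memory trade-off concrete, here is a tiny sketch of the per-slot overhead, assuming crossbeam-utils' CachePadded (which pads to 128 bytes on x86-64); the channel capacity used in the comment is just an illustrative number.

```rust
use std::mem::size_of;

use crossbeam_utils::CachePadded;

fn main() {
    // Padding a small message type out to a full cache line inflates the
    // per-slot footprint considerably; the effect is worst for tiny values.
    println!("u64:              {} bytes", size_of::<u64>()); // 8
    println!("CachePadded<u64>: {} bytes", size_of::<CachePadded<u64>>()); // 128 on x86-64
    // Back-of-the-envelope: a bounded(1024) channel of u64-sized slots would
    // need on the order of 1024 * 128 B = 128 KiB of buffer with padding,
    // versus a few tens of KiB without it.
}
```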