crossbeam [wip] Cumulative micro opts (1-2% perf. gain)

Well, I REALLY want my other PRs to be merged (or explicitly rejected, so that I can decide whether to fork crossbeam for my work...). All of them have been stale for a long time for some reason.

So, I'm wondering whether actively contributing to crossbeam could help, as a way to show my willingness to spend my time in exchange for getting my PRs reviewed. :)

Here it is: an assortment of micro-optimizations I came up with. Note that I haven't evaluated each one in depth. This PR is meant to get them triaged by more knowledgeable people, so that I can create more serious PRs for each.

The general theme of them is to reduce the number of instructions.

(note to self):

inject to-be-dropped buffer from receiver to sender
new with at-the-end of index so that initial alloc can happen even with non-option buffer

Mar 08 '24 07:03 ryoqun

So, I'm wondering whether actively contributing to crossbeam could help, as a way to show my willingness to spend my time in exchange for getting my PRs reviewed. :)

I appreciate the contribution, but a patch that mixes various different-purpose changes without any particular explanation of each change is also the kind of patch that takes time to review...

As for the changes:

Subword (8-bit and 16-bit) atomics can be very inefficient on platforms like riscv, and benchmarks that actually use them in the unbounded queue have shown that they don't perform that well even on x86_64 (https://github.com/smol-rs/concurrent-queue/pull/13). I think an atomic of the same size as c_int would probably be ok, but it needs a separate benchmark anyway.
Separating slots into states and msgs may increase the chance of encountering false sharing (https://github.com/crossbeam-rs/crossbeam/pull/462). I think it is possible that it could be handled well in the way described in https://arxiv.org/pdf/1908.04511, but it needs a separate benchmark anyway.
At this point I realize that there is no information on how the 1-2% gain was actually measured. For example, if the result of all these changes is as little as a 1% improvement in a single microbenchmark, the worth of the improvement relative to the complexity of the change is questionable.
Replacing assume_init_drop with ptr::read + drop is expected to degrade performance when messages are relatively large, because of the extra moves.
Most of the other changes seem to be code shuffling, but it was not clear from what little I saw what exactly you were trying to improve.
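
To make the assume_init_drop point above concrete, here is a minimal sketch of the two ways of dropping a message stored in a MaybeUninit slot. `Msg`, `Slot`, `DROPS` and the helper names are hypothetical and chosen only for illustration; they are not crossbeam's actual types, and the 4096-byte payload is an arbitrary stand-in for a "relatively large" message. Whether the optimizer elides the copy in the second variant depends on the build, so this shows the semantics only, not a measurement.

```rust
use std::mem::MaybeUninit;
use std::ptr;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts dropped `Msg` values (illustration only, not part of any real API).
static DROPS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical, deliberately large message type.
pub struct Msg {
    pub payload: [u8; 4096],
}

impl Drop for Msg {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::Relaxed);
    }
}

/// Hypothetical slot holding a possibly uninitialized message.
pub struct Slot {
    msg: MaybeUninit<Msg>,
}

impl Slot {
    /// Creates a slot that already contains an initialized message.
    pub fn filled(byte: u8) -> Self {
        Slot {
            msg: MaybeUninit::new(Msg { payload: [byte; 4096] }),
        }
    }

    /// Variant A: drop the message where it lives, without moving it.
    ///
    /// # Safety
    /// The slot must hold an initialized message that was not dropped yet.
    pub unsafe fn drop_in_place(&mut self) {
        unsafe { self.msg.assume_init_drop() }
    }

    /// Variant B: move the message out into a local, then drop the local.
    /// Semantically this copies the whole payload before dropping it.
    ///
    /// # Safety
    /// The slot must hold an initialized message that was not dropped yet.
    pub unsafe fn read_then_drop(&mut self) {
        let msg = unsafe { ptr::read(self.msg.as_ptr()) };
        drop(msg);
    }
}

/// Number of `Msg` values dropped so far.
pub fn drops_so_far() -> usize {
    DROPS.load(Ordering::Relaxed)
}
```

Both variants run the destructor exactly once. The difference is that variant B first moves `size_of::<Msg>()` bytes out of the slot, which is the extra move mentioned above, so a benchmark with large message types would be needed to see its real cost.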

If this PR's purpose is to help merge your other PRs, I honestly think it would be better to add tests to complete them. It is not easy to review PRs that add new APIs but have no tests (and there is no possibility of such PRs being merged as is).

May 02 '24 19:05 taiki-e

(Thanks for the feedback; I appreciate it very much. I'll reply to these later; focusing on #1040 for now.)

May 04 '24 13:05 ryoqun