
Replace mix of atomics and rwmutex with mutex around key material

Open raggi opened this issue 3 years ago • 4 comments

This reduces the overhead of various code paths, as there are fewer pipeline delays and more code can be inlined. The Mutex fast path is a single atomic operation, and this mutex is never held for very long.

Microbenchmark results in each commit show no micro-scale difference; macro-scale tests demonstrate a slight speedup, though well within the noise of other external effects: a small improvement in the arm64 test on a t4g.nano from 642 Mbps to 646 Mbps, and in the x64 test on an m6i.xlarge from 975 Mbps to 1.02 Gbps.

raggi avatar Sep 13 '22 02:09 raggi

Why do you suppose the micro tests show no performance improvement but the macro tests do?

zx2c4 avatar Feb 07 '23 23:02 zx2c4

> Why do you suppose the micro tests show no performance improvement but the macro tests do?

The shortest summary is that my guess is that the micro-benchmarks speculate and pipeline better. RWMutex is considerably more expensive than Mutex even on the happy path: not only does it do more work (several loads, one from thread-local storage, and at least two atomics of its own), it also has more branch paths and instruction pressure. The fast path for Mutex is a single CAS operation that can probably be inlined.

Personally I think the simplification is probably of more value. I did the performance testing primarily to be sure it didn't get slower, but found that it got materially faster for some system topologies. RWLocks are, as a general rule, probably not great to put in the per-packet path, at least not general-purpose implementations of them.

raggi avatar Feb 09 '23 23:02 raggi

I remain sort of skeptical about this and it'd be nice to have some more real data on it. If RWLocks are a bottleneck here, wouldn't that mostly already be hidden by the much heavier usage of an RWLock in IndexTable's Lookup function?

zx2c4 avatar Mar 03 '23 13:03 zx2c4

To highlight again:

> Personally I think the simplification is probably of more value, I did testing on the performance primarily to be sure that it didn't get slower, but found that it did get materially faster for some system topologies.

What are you skeptical about, performance concerns? As stated, they're not all that significant.

What I experienced, and the reason I split this PR out early in our work, is that I ran into this code while chasing performance concerns: it has intricate behaviors involved in both performance and correctness, and it is in a form that is harder to study than necessary. This approach is far simpler, and benchmarks slightly faster. The primary motivation for landing this PR should be that it is simpler; it does not regress performance, it slightly improves it.

Could you expand on some specific concerns that can be addressed?

raggi avatar Mar 15 '23 22:03 raggi