aerospike-common
Replace the yield with an isb on Arm.
The yield instruction is treated as a nop on Arm processors, which is very different from the x86 pause instruction, which stalls execution for ~40 cycles.
An ISB serializes the pipeline and has been shown to be roughly analogous to the pause delay. It is used in other databases for spin loops and adaptive spin loops, where not hammering the cache line is important.
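As a rough illustration of the change described above, here is a minimal sketch of a per-architecture spin-wait hint used inside a test-and-test-and-set spinlock. The names `cpu_relax`, `demo_spinlock`, `demo_lock` and `demo_unlock` are hypothetical and are not taken from aerospike-common; the sketch assumes a GCC- or Clang-compatible compiler with C11 atomics.

```c
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical helper (not the aerospike-common macro): one spin-wait hint.
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    // x86: pause stalls the core briefly between polls.
    __asm__ __volatile__("pause" ::: "memory");
#elif defined(__aarch64__)
    // Arm: isb instead of yield, since yield is treated as a nop.
    __asm__ __volatile__("isb" ::: "memory");
#else
    // Other targets: compiler barrier only, no hardware delay.
    __asm__ __volatile__("" ::: "memory");
#endif
}

// Hypothetical test-and-test-and-set spinlock used only for illustration.
typedef struct {
    atomic_bool locked;
} demo_spinlock;

static inline void demo_lock(demo_spinlock *l)
{
    for (;;) {
        // Try to take the lock; exchange returns the previous value.
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire)) {
            return;
        }
        // Poll with plain loads, pausing between polls so the cache line
        // is not hammered while another thread holds the lock.
        while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
            cpu_relax();
        }
    }
}

static inline void demo_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

With this shape, switching the Arm branch between `yield` and `isb` changes only how long each poll iteration takes, not the locking logic itself.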
@AGSaidi We are evaluating this pull request. I ran a multi-threaded spinlock test on a macOS M1 machine, and both the "yield" and "isb" instructions resulted in nearly 100% CPU usage on each of the blocked threads. I realize CPU usage and power consumption are not exactly correlated, so this test might be misleading, or Apple silicon might have a "pause-like" implementation of "yield".
Can you point to a spinlock test and platform that demonstrates the advantage of "isb" on power consumption?
@BrianNichols you'll still see 100% CPU utilization, since the application is still using the core completely, but the key point here is that it's going to iterate around the spin loop fewer times. Fewer iterations mean fewer loads of the memory location being spun on going into the memory system, and that generally saves power and improves performance. The performance improvement comes from two angles. 1. If there are any adaptive spin loops that have been tuned for 'pause' on Intel, this will make the same tuning apply on Arm, as opposed to the spin phase ending early. 2. The fact that the memory system isn't saturated with loads from the fast loop means that the unlock is observed more quickly.
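To make the first point concrete, here is a minimal sketch of an adaptive spin loop: it polls a flag for a fixed number of iterations and then falls back to the scheduler. The names `spin_pause` and `adaptive_acquire` and the `max_spins` parameter are hypothetical, and `sched_yield` assumes a POSIX system. Because the budget is counted in iterations, its wall-clock length depends on how long each pause hint takes: a budget tuned against x86 pause expires much sooner if the Arm hint is effectively a nop.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical spin-wait hint: pause on x86, isb on Arm, barrier elsewhere.
static inline void spin_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause" ::: "memory");
#elif defined(__aarch64__)
    __asm__ __volatile__("isb" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

// Hypothetical adaptive acquire: spin for up to max_spins polls, then yield
// to the scheduler and start a new spin phase. Returns true if the flag was
// acquired during the first spin phase, false if at least one yield happened.
static inline bool adaptive_acquire(atomic_bool *locked, unsigned max_spins)
{
    bool yielded = false;

    for (;;) {
        for (unsigned i = 0; i < max_spins; i++) {
            // Cheap relaxed load first; only attempt the exchange when the
            // flag looks free, to limit cache line traffic.
            if (!atomic_load_explicit(locked, memory_order_relaxed) &&
                !atomic_exchange_explicit(locked, true, memory_order_acquire)) {
                return !yielded;
            }
            spin_pause();
        }
        // Spin budget exhausted: give up the core before trying again.
        sched_yield();
        yielded = true;
    }
}
```

A caller would pair this with a release store of `false` to unlock. The spin budget itself would need to be tuned per workload; no particular value is implied by the discussion above.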
@AGSaidi Sounds reasonable. We should have a decision by next week.
Great. Please run your performance tests on Graviton and see if there are tests that improve. We've seen substantial improvements for lock contention workloads in other databases.
The pull request has been accepted.