Investigate use of `std::atomic_flag` instead of `std::binary_semaphore`
`std::atomic_flag` is the only atomic primitive guaranteed to be lock-free. It would be interesting to see whether it has any positive impact on performance over `std::binary_semaphore`.
I ran a quick benchmark on quick-bench, and `std::binary_semaphore` seems to perform best at ping/pong, which I think matches how we use it in the thread pool (as a signaling mechanism).
https://quick-bench.com/q/JkZjpTgsjQkSiyI20IRcZtEFNso

I wrote a quick benchmark and ran it locally on Windows, and I'm getting inconsistent results. I think I'll need to run a more careful test on a Linux machine set up with pyperf to get better numbers.

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 451.75 | 2.21 | 1.8% | 53.95 | `std::atomic_flag` |
| 99.2% | 455.40 | 2.20 | 2.0% | 54.42 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 444.76 | 2.25 | 2.1% | 53.20 | `std::atomic_flag` |
| 98.4% | 452.16 | 2.21 | 3.0% | 53.78 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 485.05 | 2.06 | 0.3% | 57.51 | `std::atomic_flag` |
| 103.0% | 470.99 | 2.12 | 0.8% | 57.44 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 457.13 | 2.19 | 1.9% | 55.43 | `std::atomic_flag` |
| 96.9% | 471.90 | 2.12 | 3.8% | 56.81 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 481.77 | 2.08 | 2.8% | 56.86 | `std::atomic_flag` |
| 105.2% | 457.91 | 2.18 | 4.1% | 54.98 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 465.74 | 2.15 | 0.4% | 55.85 | `std::atomic_flag` |
| 101.3% | 459.84 | 2.17 | 0.6% | 55.00 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 453.31 | 2.21 | 0.9% | 54.72 | `std::atomic_flag` |
| 95.0% | 477.07 | 2.10 | 2.4% | 56.92 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 477.71 | 2.09 | 0.8% | 57.59 | `std::atomic_flag` |
| 101.6% | 470.13 | 2.13 | 0.6% | 56.57 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 477.80 | 2.09 | 1.1% | 57.12 | `std::atomic_flag` |
| 100.4% | 475.79 | 2.10 | 1.8% | 56.92 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 479.51 | 2.09 | 1.2% | 58.10 | `std::atomic_flag` |
| 102.9% | 465.78 | 2.15 | 4.3% | 55.52 | `std::binary_semaphore` |

Here are 10 runs on a Linux system with `pyperf system tune` set up:

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 406.13 | 2.46 | 1.1% | 48.78 | `std::atomic_flag` |
| 93.8% | 432.78 | 2.31 | 1.9% | 52.17 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 409.78 | 2.44 | 0.7% | 49.23 | `std::atomic_flag` |
| 106.2% | 385.80 | 2.59 | 0.7% | 46.45 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 402.90 | 2.48 | 0.7% | 48.30 | `std::atomic_flag` |
| 101.1% | 398.52 | 2.51 | 0.8% | 47.77 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 397.33 | 2.52 | 0.4% | 47.64 | `std::atomic_flag` |
| 104.8% | 379.24 | 2.64 | 0.7% | 45.69 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 372.06 | 2.69 | 0.9% | 44.75 | `std::atomic_flag` |
| 88.7% | 419.39 | 2.38 | 2.2% | 50.05 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 420.61 | 2.38 | 0.9% | 50.24 | `std::atomic_flag` |
| 112.4% | 374.31 | 2.67 | 0.8% | 45.01 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 394.11 | 2.54 | 0.9% | 47.54 | `std::atomic_flag` |
| 97.8% | 403.07 | 2.48 | 0.6% | 48.64 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 406.72 | 2.46 | 0.7% | 48.58 | `std::atomic_flag` |
| 105.2% | 386.67 | 2.59 | 1.1% | 46.27 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 409.09 | 2.44 | 0.3% | 49.27 | `std::atomic_flag` |
| 107.2% | 381.71 | 2.62 | 1.0% | 45.74 | `std::binary_semaphore` |

| relative | ms/op | op/s | err% | total | Thread signaling |
|---------:|------:|-----:|-----:|------:|:-----------------|
| 100.0% | 394.07 | 2.54 | 0.7% | 47.26 | `std::atomic_flag` |
| 90.8% | 434.10 | 2.30 | 0.6% | 52.22 | `std::binary_semaphore` |

@jtd-formlabs Thanks for doing that. It seems like they're essentially the same. At most we'd be saving tens of milliseconds, so it seems like it's not worth it to me.
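
As a rough sanity check on "essentially the same", the ms/op values from the ten Linux runs above can be averaged per primitive. This is only a sketch: the `mean` helper and the two vector names are made up for illustration, and a plain mean ignores the per-run err% column.

```cpp
#include <numeric>
#include <vector>

// ms/op values copied from the ten Linux runs posted in this thread.
const std::vector<double> atomic_flag_ms{
    406.13, 409.78, 402.90, 397.33, 372.06,
    420.61, 394.11, 406.72, 409.09, 394.07};
const std::vector<double> binary_semaphore_ms{
    432.78, 385.80, 398.52, 379.24, 419.39,
    374.31, 403.07, 386.67, 381.71, 434.10};

// Arithmetic mean of a non-empty sample (hypothetical helper).
double mean(const std::vector<double>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0) /
           static_cast<double>(xs.size());
}
```

`mean(atomic_flag_ms) - mean(binary_semaphore_ms)` comes out to under 2 ms/op, which is small next to the run-to-run spread of either primitive.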