saturn
saturn copied to clipboard
Optimize work-stealing deque
This PR optimizes the work-stealing deque.
The graphs on current-bench show the progress of optimizing the work-stealing deque quite nicely:
Here is a run of the benchmarks on my M3 Max before the optimizations:
➜ saturn git:(rewrite-bench) ✗ dune exec --release -- ./bench/main.exe -budget 1 'Work' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[
{
"name": "spawns over time/1 worker",
"value": 49.86378542204304
},
{
"name": "spawns over time/2 workers",
"value": 44.30827252087946
},
{
"name": "spawns over time/4 workers",
"value": 84.66239655969726
},
{
"name": "spawns over time/8 workers",
"value": 177.69250680679536
}
]
Here is a run after the optimizations:
➜ saturn git:(optimize-ws-deque) ✗ dune exec --release -- ./bench/main.exe -budget 1 'Work' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[
{
"name": "spawns over time/1 worker",
"value": 61.20210013480126
},
{
"name": "spawns over time/2 workers",
"value": 121.42420051856216
},
{
"name": "spawns over time/4 workers",
"value": 235.72974008156237
},
{
"name": "spawns over time/8 workers",
"value": 428.2572470382373
}
]
General approach:
- Add benchmark(s). In this case the benchmark simulates a scheduler running parallel fibonacci.
- Avoid false sharing. With false sharing there will be too much noise to get any other useful optimizations done. In this case the top and bottom indices and the root record had to be padded to avoid false sharing. This already gave a big performance improvement.
- Avoid other forms of contention. Contention tends to mask any benefits from further optimizations. In this case there wasn't much of that except for avoiding some unnecessary loads and stores. With parallel data structures, any reads and writes of shared locations (atomic or non-atomic) should be viewed with suspicion.
- Usual micro-optimizations such as avoiding float array pessimization, indirections, costly operations, avoiding dependency chains, avoiding unnecessary work, avoiding unnecessary fences, making the common case fast (at the expense of less common case), ... Unless main contention issues are addressed micro-optimizations are difficult to make as noise and stalls from contention mask the improvements.
I created a PR adding new benchmarks for the work-stealing deque #130 and I added a commit to this PR to optimize the work-stealing deque to avoid performance pitfalls on those benchmarks.
Here is a run from before adding the commit:
➜ saturn git:(optimize-ws-deque) ✗ dune exec --release -- ./bench/main.exe -budget 1 -diff bench-high.json deque
Saturn_lockfree Work_stealing_deque:
time per spawn/1 worker:
16.60 ns = 0.85 x 19.52 ns
spawns over time/1 worker:
60.23 M/s = 1.18 x 51.23 M/s
time per spawn/2 workers:
16.73 ns = 0.37 x 45.64 ns
spawns over time/2 workers:
119.58 M/s = 2.73 x 43.82 M/s
time per spawn/4 workers:
17.38 ns = 0.51 x 34.40 ns
spawns over time/4 workers:
230.12 M/s = 1.98 x 116.27 M/s
time per spawn/8 workers:
18.65 ns = 0.28 x 65.65 ns
spawns over time/8 workers:
429.00 M/s = 3.52 x 121.86 M/s
time per message/1 adder, 1 taker:
110.91 ns = 1.07 x 103.71 ns
messages over time/1 adder, 1 taker:
18.03 M/s = 0.94 x 19.28 M/s
time per message/1 adder, 2 takers:
167.84 ns = 0.90 x 187.33 ns
messages over time/1 adder, 2 takers:
17.87 M/s = 1.12 x 16.01 M/s
time per message/1 adder, 4 takers:
261.86 ns = 0.94 x 278.85 ns
messages over time/1 adder, 4 takers:
19.09 M/s = 1.06 x 17.93 M/s
time per message/one domain (FIFO):
16.34 ns = 0.91 x 17.86 ns
messages over time/one domain (FIFO):
61.18 M/s = 1.09 x 55.98 M/s
time per message/one domain (LIFO):
18.95 ns = 1.00 x 18.89 ns
messages over time/one domain (LIFO):
52.78 M/s = 1.00 x 52.93 M/s
Here is a run after the commit:
➜ saturn git:(optimize-ws-deque) ✗ dune exec --release -- ./bench/main.exe -budget 1 -diff bench-high.json deque
Saturn_lockfree Work_stealing_deque:
time per spawn/1 worker:
16.60 ns = 0.85 x 19.52 ns
spawns over time/1 worker:
60.26 M/s = 1.18 x 51.23 M/s
time per spawn/2 workers:
16.66 ns = 0.37 x 45.64 ns
spawns over time/2 workers:
120.04 M/s = 2.74 x 43.82 M/s
time per spawn/4 workers:
17.29 ns = 0.50 x 34.40 ns
spawns over time/4 workers:
231.33 M/s = 1.99 x 116.27 M/s
time per spawn/8 workers:
18.60 ns = 0.28 x 65.65 ns
spawns over time/8 workers:
430.18 M/s = 3.53 x 121.86 M/s
time per message/1 adder, 1 taker:
35.15 ns = 0.34 x 103.71 ns
messages over time/1 adder, 1 taker:
56.89 M/s = 2.95 x 19.28 M/s
time per message/1 adder, 2 takers:
47.34 ns = 0.25 x 187.33 ns
messages over time/1 adder, 2 takers:
63.37 M/s = 3.96 x 16.01 M/s
time per message/1 adder, 4 takers:
85.12 ns = 0.31 x 278.85 ns
messages over time/1 adder, 4 takers:
58.74 M/s = 3.28 x 17.93 M/s
time per message/one domain (FIFO):
16.61 ns = 0.93 x 17.86 ns
messages over time/one domain (FIFO):
60.21 M/s = 1.08 x 55.98 M/s
time per message/one domain (LIFO):
15.84 ns = 0.84 x 18.89 ns
messages over time/one domain (LIFO):
63.12 M/s = 1.19 x 52.93 M/s
Performance on the SPMC style benchmarks improved significantly.
The optimization in the last commit (caching the index used by thieves) is mentioned in the original paper that introduces the work-stealing deque (section 2.3 Avoid top accesses in pushBottom).