MPMC 2-stack queue
This PR implements an MPMC queue using two stacks. This is a new lock-free queue algorithm / data structure. It uses one stack for the tail of the queue and another for the head. Operations on the tail (pushes) and on the head (pops) work like those of a Treiber stack. A simple lock-free algorithm is used to transfer elements from the tail to the head when a pop finds the head empty (and the tail non-empty).
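To illustrate the underlying idea, here is a hypothetical, purely sequential sketch of a two-stack queue; the names and representation are made up for illustration, and this is not the lock-free algorithm of this PR:

type 'a t = { mutable head : 'a list; mutable tail : 'a list }

let create () = { head = []; tail = [] }

(* Push Treiber-style onto the tail stack. *)
let push t x = t.tail <- x :: t.tail

(* Pop from the head stack; if it is empty, reverse the tail stack and
   install it as the new head.  The hard part this PR addresses is doing
   this transfer lock-free under concurrent pushes and pops. *)
let pop_opt t =
  match t.head with
  | x :: head ->
      t.head <- head;
      Some x
  | [] -> (
      match List.rev t.tail with
      | [] -> None
      | x :: head ->
          t.tail <- [];
          t.head <- head;
          Some x)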
The interesting feature of this queue is that it seems to outperform an optimized Michael-Scott queue (see #122) on many machines. Here are results from a benchmark run on my M3 Max:
➜ saturn git:(mpmc-two-stack-queue) ✗ dune exec --release -- ./bench/main.exe -budget 1 'Two' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[
  {
    "name": "messages over time/one domain",
    "value": 57.04506560182545
  },
  {
    "name": "messages over time/1 nb adder, 1 nb taker",
    "value": 141.66814237648308
  },
  {
    "name": "messages over time/1 nb adder, 2 nb takers",
    "value": 118.94340356581118
  },
  {
    "name": "messages over time/2 nb adders, 1 nb taker",
    "value": 108.01087885571833
  },
  {
    "name": "messages over time/2 nb adders, 2 nb takers",
    "value": 103.93427612115202
  }
]
As one can see, the (median) throughput is substantially higher than that of the optimized Michael-Scott queue (see #122), and that seems to be the case on most machines I've tested this on. Most interestingly, however, on the "fermat" machine we use for benchmarking, the Michael-Scott queue seems to perform better. See my comment below for a possible explanation.
On the poor performance of this queue on Opteron (i.e. fermat)
On the "fermat" machine we use for benchmarking, the Michael-Scott queue seems to perform better than this queue. The fermat machine has an Opteron (piledriver) CPU from 2013, which probably isn't the best target machine for optimizing multicore algorithms. The paper Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask appears to have a particularly good explanation of the special characteristics of Opteron CPUs and a workaround suggestion:
A store to a shared or owned cache line on the Opteron induces an unnecessary broadcast of invalidations, even if all the involved cores reside on the same node (see Table 2). This results in a 3-fold increase of the latency of the store operation. In fact, to avoid this issue, we propose to explicitly maintain the cache line to the modified state. This can be easily achieved by calling the prefetchw x86 instruction before any load reference to that line. Of course, this optimization should be used with care because it disallows two cores to simultaneously hold a copy of the line.
I believe this might explain why this queue does not outperform the MS queue on fermat. In this queue design the tail is just a single atomic location updated with a typical lock-free get and compare-and-set loop. To illustrate:
let rec lock_free () =
  let before = Atomic.get atomic in (* potentially puts cache line to shared/owned state *)
  (* ... *)
  if not (Atomic.compare_and_set atomic before after) then (* write is now 3x expensive *)
    lock_free ()
What happens on Opteron in case of contention (i.e. when multiple threads access a likely empty queue) is that the get puts the cache line corresponding to the tail into the shared mode. This then means that the compare-and-set targeting that location becomes very expensive. The paper suggests using prefetchw on Opteron to avoid this problem. Imagining that OCaml had a way to generate a prefetchw instruction, the loop would become:
let rec lock_free () =
  Atomic.prefetch_data_into_caches_in_anticipation_of_a_write atomic; (* prefetchw *)
  let before = Atomic.get atomic in (* cache line remains in modified/exclusive state *)
  (* ... *)
  if not (Atomic.compare_and_set atomic before after) then (* write is now faster *)
    lock_free ()
This could reduce the cost of the compare_and_set by up to a factor of three. Currently, if the queue is close to empty, each message transmitted potentially incurs the "3-fold" write cost twice (once for push and once for pop).
You might ask why this doesn't make the Michael-Scott queue perform just as badly on fermat. Well, in the MS queue adding a node involves updating two locations: first the next field of the previous tail is updated, and then the tail field of the queue is updated. The first update always targets a freshly allocated location, so it is less likely to suffer from being put into shared mode. Furthermore, the first update is already the point at which the node becomes visible to poppers, and poppers do not access the tail field of the queue.
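To make the two updates concrete, here is a hedged sketch of a Michael-Scott style push; the node representation, the create function, and the helping step are illustrative assumptions and do not match Saturn's actual implementation:

type 'a node = { value : 'a option; next : 'a node option Atomic.t }
type 'a t = { head : 'a node Atomic.t; tail : 'a node Atomic.t }

let create () =
  let dummy = { value = None; next = Atomic.make None } in
  { head = Atomic.make dummy; tail = Atomic.make dummy }

let push t value =
  let node = { value = Some value; next = Atomic.make None } in
  let rec loop () =
    let last = Atomic.get t.tail in
    match Atomic.get last.next with
    | Some next ->
        (* The tail pointer is lagging: help advance it and retry. *)
        ignore (Atomic.compare_and_set t.tail last next);
        loop ()
    | None ->
        (* First update: link the new node after the current last node.  This
           targets the next field of the most recently allocated node. *)
        if Atomic.compare_and_set last.next None (Some node) then
          (* Second update: swing the queue's tail pointer.  Poppers never
             read this field. *)
          ignore (Atomic.compare_and_set t.tail last node)
        else loop ()
  in
  loop ()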
Out of curiosity, I tried to test whether one could improve the performance of this queue on Opteron by writing to the cache line (non-atomically) before reading from it, to avoid the cache line being put into the shared/owned mode (assuming that is the problem). I made the following changes to the code:
modified src_lockfree/two_stack_queue.ml
@@ -12,7 +12,17 @@
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE. *)
-module Atomic = Transparent_atomic
+module Atomic = struct
+ include Transparent_atomic
+
+ let[@inline] get_for_set (x : _ t) =
+ Array.unsafe_set (Obj.magic x : int array) 6 0; (* NOTE: A padded atomic! *)
+ get x
+
+ let[@inline] fenceless_get_for_set (x : _ t) =
+ Array.unsafe_set (Obj.magic x : int array) 6 0; (* NOTE: A padded atomic! *)
+ fenceless_get x
+end
type 'a t = { head : 'a head Atomic.t; tail : 'a tail Atomic.t }
@@ -67,7 +77,7 @@ let rec push t value backoff = function
if move != Obj.magic () then begin
let (Snoc move_r) = move in
begin
- match Atomic.get t.head with
+ match Atomic.get_for_set t.head with
| H (Head head_r as head) when head_r.counter < move_r.counter ->
let after = rev move in
if
@@ -88,9 +98,10 @@ and push_with t value backoff counter prefix =
if new_tail != prefix then push t value backoff new_tail
else if not (Atomic.compare_and_set t.tail prefix (T after)) then
let backoff = Backoff.once backoff in
- push t value backoff (Atomic.fenceless_get t.tail)
+ push t value backoff (Atomic.fenceless_get_for_set t.tail)
-let push t value = push t value Backoff.default (Atomic.fenceless_get t.tail)
+let push t value =
+ push t value Backoff.default (Atomic.fenceless_get_for_set t.tail)
exception Empty
@@ -103,7 +114,7 @@ let rec pop_as : type a r. a t -> _ -> (a, r) poly -> a head -> r =
match poly with Value -> cons_r.value | Option -> Some cons_r.value
else
let backoff = Backoff.once backoff in
- pop_as t backoff poly (Atomic.fenceless_get t.head)
+ pop_as t backoff poly (Atomic.fenceless_get_for_set t.head)
| H (Head head_r as head) -> begin
match Atomic.fenceless_get t.tail with
| T (Snoc snoc_r as move) ->
@@ -112,14 +123,14 @@ let rec pop_as : type a r. a t -> _ -> (a, r) poly -> a head -> r =
match poly with
| Value -> snoc_r.value
| Option -> Some snoc_r.value
- else pop_as t backoff poly (Atomic.fenceless_get t.head)
+ else pop_as t backoff poly (Atomic.fenceless_get_for_set t.head)
else
let tail = Tail { counter = snoc_r.counter; move } in
let new_head = Atomic.get t.head in
if new_head != H head then pop_as t backoff poly new_head
else if Atomic.compare_and_set t.tail (T move) (T tail) then
pop_moving_as t backoff poly head move tail
- else pop_as t backoff poly (Atomic.fenceless_get t.head)
+ else pop_as t backoff poly (Atomic.fenceless_get_for_set t.head)
| T (Tail tail_r as tail) ->
let move = tail_r.move in
if move == Obj.magic () then pop_emptyish_as t backoff poly head
@@ -148,7 +159,7 @@ and pop_moving_as :
end
else
let backoff = Backoff.once backoff in
- pop_as t backoff poly (Atomic.fenceless_get t.head)
+ pop_as t backoff poly (Atomic.fenceless_get_for_set t.head)
else pop_emptyish_as t backoff poly head
and pop_emptyish_as : type a r. a t -> _ -> (a, r) poly -> (a, _) tdt -> r =
@@ -158,8 +169,10 @@ and pop_emptyish_as : type a r. a t -> _ -> (a, r) poly -> (a, _) tdt -> r =
match poly with Value -> raise_notrace Empty | Option -> None
else pop_as t backoff poly new_head
-let pop t = pop_as t Backoff.default Value (Atomic.fenceless_get t.head)
-let pop_opt t = pop_as t Backoff.default Option (Atomic.fenceless_get t.head)
+let pop t = pop_as t Backoff.default Value (Atomic.fenceless_get_for_set t.head)
+
+let pop_opt t =
+ pop_as t Backoff.default Option (Atomic.fenceless_get_for_set t.head)
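Distilled out of the diff, the experiment applied to the generic loop from earlier looks roughly like this; it is only a sketch, and it assumes the atomic is padded so that word 6 is unused padding, which is what makes the Obj.magic write tolerable here and which does not hold for ordinary atomics:

let rec lock_free atomic compute =
  (* Non-atomic write into the padding of the (padded) atomic: pulls the
     cache line into modified state before the read below. *)
  Array.unsafe_set (Obj.magic atomic : int array) 6 0;
  let before = Atomic.get atomic in
  let after = compute before in
  if not (Atomic.compare_and_set atomic before after) then lock_free atomic compute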
The results are not entirely conclusive:
The 1st result from the right is after dropping the experiment and the 6th result from the right is the first result with some of the extra writes. Using eyeball statistics, it would seem there is potentially some improvement. Of course, performing an actual write to memory is quite different from prefetching a cache line in anticipation of a write.
The fermat machine has an Opteron (piledriver) CPU from 2013, which probably isn't the best target machine for optimizing multicore algorithms.
I agree; that CPU is not supported by its vendor anymore. See https://www.amd.com/en/support/download/drivers.html and https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/amd-ucode/README?showmsg=1, and notice the lack of microcode updates since 2018: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/amd-ucode/microcode_amd_fam15h.bin.asc That doesn't mean such machines are not useful (I still have some tests that I run on Opterons), but they shouldn't be the main optimization target.
I'd suggest using at least a Zen1 CPU for AMD, and Skylake for Intel.
Here are some results from an AMD Ryzen 9 7950X using OCaml 5.2.0-rc+fp on Fedora 39/40 (using -diff base.json, where base.json was obtained on the latest master):
Saturn_lockfree Queue:
  time per message/one domain:
    110.33 ns = 1.00 x 110.30 ns
  messages over time/one domain:
    9.06 M/s = 1.00 x 9.07 M/s
  time per message/1 nb adder, 1 nb taker:
    321.40 ns = 1.00 x 322.87 ns
  messages over time/1 nb adder, 1 nb taker:
    6.22 M/s = 1.00 x 6.19 M/s
  time per message/1 nb adder, 2 nb takers:
    364.39 ns = 1.14 x 319.16 ns
  messages over time/1 nb adder, 2 nb takers:
    8.23 M/s = 0.88 x 9.40 M/s
  time per message/2 nb adders, 1 nb taker:
    241.21 ns = 0.78 x 308.96 ns
  messages over time/2 nb adders, 1 nb taker:
    12.44 M/s = 1.28 x 9.71 M/s
  time per message/2 nb adders, 2 nb takers:
    372.38 ns = 1.01 x 368.26 ns
  messages over time/2 nb adders, 2 nb takers:
    10.74 M/s = 0.99 x 10.86 M/s
The short comparison here doesn't show confidence intervals, but rerunning it shows similar variation on master:
Saturn_lockfree Queue:
  time per message/one domain:
    110.83 ns = 1.00 x 110.30 ns
  messages over time/one domain:
    9.02 M/s = 1.00 x 9.07 M/s
  time per message/1 nb adder, 1 nb taker:
    317.72 ns = 0.98 x 322.87 ns
  messages over time/1 nb adder, 1 nb taker:
    6.29 M/s = 1.02 x 6.19 M/s
  time per message/1 nb adder, 2 nb takers:
    346.13 ns = 1.08 x 319.16 ns
  messages over time/1 nb adder, 2 nb takers:
    8.67 M/s = 0.92 x 9.40 M/s
  time per message/2 nb adders, 1 nb taker:
    261.80 ns = 0.85 x 308.96 ns
  messages over time/2 nb adders, 1 nb taker:
    11.46 M/s = 1.18 x 9.71 M/s
  time per message/2 nb adders, 2 nb takers:
    340.20 ns = 0.92 x 368.26 ns
  messages over time/2 nb adders, 2 nb takers:
    11.76 M/s = 1.08 x 10.86 M/s
Note that the Saturn_lockfree Queue refers to the Michael-Scott-based queue. The code in this branch adds a benchmark for Saturn_lockfree Two_stack_queue. If you run dune exec --release -- ./bench/main.exe -budget=1 -brief Queue Two, you can see results for both on this branch. The M-S queue in this branch is the same as in main and suffers from a few known performance issues. There is another PR, #122, that optimizes the M-S queue.
doesn't show confidence intervals
Yes, the "statistics" in multicore-bench are very rudimentary. The idea has been to move that to the benchmarking service/frontend and just have the benchmarking DB store the raw results. This way it should then be possible to view/analyze the data in multiple ways. I'm not sure at what point we might get around to doing that, however.