ocaml Atomic.modify, Atomic.modify

Many uses of Atomic.compare_and_set are in retry loops that can be factored out by these new functions. Consider:

https://github.com/ocaml-multicore/ocaml-multicore/blob/4469f89/stdlib/printexc.ml#L268-L272

(* old version *)
let rec register_printer fn =
  let old_printers = Atomic.get printers in
  let new_printers = fn :: old_printers in
  let success = Atomic.compare_and_set printers old_printers new_printers in
  if not success then register_printer fn

(* new version *)
let register_printer fn = Atomic.modify (List.cons fn) printers

https://github.com/ocaml-multicore/ocaml-multicore/blob/4469f89/testsuite/tests/parallel/mctest.ml#L222-L226

(* old version *)
  let rec finish () =
    let v = Atomic.get counter in
    if not (Atomic.compare_and_set counter v (v+1)) then finish ();
    if v + 1 = procs then exit 0

(* new version *)
  let finish () =
    (* this could also be done with the more specialized Atomic.fetch_and_add *)
    let next = Atomic.modify_get (fun v -> v+1, v+1) in
    if next = procs then exit 0

https://github.com/ocaml-bench/sandmark/blob/642d16edeb1ad68e5fbe5019035f6fc4f8223dea/benchmarks/multicore-structures/treiber_stack.ml

(* old version *)
let rec add_node s n =
  let old_top = Atomic.get s.top in
  Atomic.set n.next old_top ;
  if Atomic.compare_and_set s.top old_top (Some n) then () else add_node s n

(* new version *)
let add_node s n =
  Atomic.modify (fun old_top ->
    Atomic.set n.next old_top ; Some n
  ) s.top

https://github.com/dune-universe/dune-universe/blob/eef415c39aa4533cf5fad448e2b9a31e15ee03b8/packages/uring.0.1/lib/uring/uring.ml

(* old *)
let rec update_gc_roots fn =
  let old_set = Atomic.get gc_roots in
  let new_set = fn old_set in
  if not (Atomic.compare_and_set gc_roots old_set new_set) then
    update_gc_roots fn

(* new *)
let update_gc_roots fn = Atomic.modify fn gc_roots

Remark:

Some more elaborate uses of compare_and_set are in more complex retry loops (see for example the Domainslib.Chan implementation)
The current definitions do not call cpu_relax () or any other backoff logic. Some uses in the wild would be expressible if we parametrized over the backoff function. I don't know if this is a choice that users should be forced to make.

Nov 29 '21 23:11 gasche

This looks useful to me, but I suggest a change to the documentation. The current text:

    [modify f r] computes a new value for [r] by applying [f] to its
    current value, sets this new value or retries (calling [f] again)
    if [r] was concurrently changed to a physically different value.

sounds like it includes a solution to the ABA problem. The text should be changed to make it clear it does not. (Maybe "if [r] now holds a physically different value" rather than "if [r] was concurrently changed to a physically different value"? Or maybe an explicit disclaimer. Or, alternatively, implement a modify that does solve ABA, although the types will have to be slightly different)

Nov 30 '21 16:11 stedolan

implement a modify that does solve ABA

This sounds wonderful! (I hate ABA problems with a passion...) How would you go about that?

Nov 30 '21 17:11 xavierleroy

The current definitions do not call cpu_relax () or any other backoff logic. Some uses in the wild would be expressible if we parametrized over the backoff function. I don't know if this is a choice that users should be forced to make.

What about an optional argument? Then we can argue about the default backoff function...

Nov 30 '21 17:11 xavierleroy

What about an optional argument? Then we can argue about the default backoff function...

Why not indeed. I'm waiting for someone to jump in and claim that it is almost always the wrong default to not backoff, and that we should force users to think about this by making it non-optional. But I don't have enough experience to make that claim. (The code I found in the wild often does not backoff, which may be an argument one way or another.)

Nov 30 '21 18:11 gasche

I might be very naive here, but is ABA a problem in OCaml here? Quoting from the wikipedia page:

In multithreaded computing, the ABA problem occurs during synchronization, when a location is read twice, has the same value for both reads, and "value is the same" is used to indicate "nothing has changed". […] A common case of the ABA problem is encountered when implementing a lock-free data structure. If an item is removed from the list, deleted, and then a new item is allocated and added to the list, it is common for the allocated object to be at the same location as the deleted object due to MRU memory allocation. A pointer to the new item is thus often equal to a pointer to the old item, causing an ABA problem.

However, OCaml values are a bit special (assuming no mutation, which one wouldn't mix with atomics anyway). Since modify keeps a handle on the old value, it cannot be GC'd, so any new value will be physically distinct. Hence either the value is an unboxed value, in which case physical equality is enough; or it's heap allocated and then you never go back to "A" unless it's the old object. For example, with the stack, you can't get back the same cons cell with a different content in another thread, since the old cons cell is still alive.

Am I missing something?

Nov 30 '21 18:11 c-cube

This sounds wonderful! (I hate ABA problems with a passion...) How would you go about that?

When you have a GC lying around, you can ensure freshness by making a new allocation each time. Below I'm going to use a "mutable" field, even though I won't mutate it, to prevent the optimiser getting any ideas. (Or, more formally, because physical equality is only determinstically defined for allocations containing mutable fields):

type 'a box = { mutable thing : 'a }
type 'a aba_atomic = 'a box Atomic.t

let read (a : 'a aba_atomic) =
  (Atomic.get a).thing

let rec modify f (a : 'a aba_atomic) =
  let v_old = Atomic.get r in
  let v_new = { thing = f v_old.thing } in
  if Atomic.compare_and_set r v_old v_new
  then ()
  else modify f r

If ABA occurs, then the contents of the ref will not be physically equal to v_old, because there's a fresh allocation on each modify.

@c-cube you're right, as long as you're consing something on each modify. For that reason, you don't need the pattern above for such cases, although the pattern above works in all cases by forcing a cons each time.

Nov 30 '21 18:11 stedolan

Thank you @stedolan, that makes sense to me indeed. I think this particular pattern of Atomic.modify should be labelled clearly as something that works well with immutable values. If you have internal mutation, a mutex is probably best.

People have long claimed that one of the benefits of FP is that it's easier to parallelize, and this might be a concrete case where this is true! :grin:

Nov 30 '21 18:11 c-cube

Fun story time: today I was writing some Atomic code, and I wanted to atomically cons a value to a list, so I thought about submitting an Atomic.cons PR. Then I realized that Atomic.modify was more expressive, and:

$ git checkout -b Atomic.modify
fatal: a branch named 'Atomic.modify' already exists

🤦

My intuition is that at least Atomic.cons would be very nice to have even in 5.0, and Atomic.modify is still a nice generalization. Last time we got lost in a discussion of the ABA problem, but could we maybe try again and see if we can come to a decision this time?

Re. ABA: I propose to just reformulate the documentation to make it clear that we don't solve it. Re. cpu_relax: I propose to do without it for now, and wait to see what happens. (I dislike adding optional arguments after the fact, but I also dislike adding optional arguments in the first place, oh well...)

Oct 26 '22 08:10 gasche

It's too late to add features to 5.0. This PR is going to compete with the many that could go into 5.1 if only we had more reviewing power.

Oct 26 '22 08:10 xavierleroy

This looks like a good addition to the stdlib Atomic module. I copied the modify function defined here and it seems to work quite well. Overall useful set of functions to have.

Jan 25 '23 14:01 bikallem

I agree with @c-cube here regarding the ABA problem. @stedolan's example seems to prevent the case v_old == v_new but this is not a concern here (if f ≝ fun x -> x then the original works as expected). The issue where the v_old as a physical location is modified to denote a different value (typically by doing free followed by malloc in non-GC-ed languages) is avoided here.

There can be more higher-level incarnations of the ABA problem (involving mutable OCaml values), but as far as the ABA problem of CAS on Atomic.t is concerned, I believe that this implementation or any other naive one avoids it.

Jan 25 '23 17:01 gadmm

Shouldn't your Backoff logic actually usleep once the backoff time gets large enough, to also improve fairness on contended systems? This is what the runtime backoff code does, in a somewhat obscure way: the first Max_spins = 1000 iterations are just a cpu_relax, and then we usleep.

https://github.com/ocaml/ocaml/blob/1c5e7488430196341cae155e119719d0fa69d800/runtime/caml/platform.h#L86-L93

https://github.com/ocaml/ocaml/blob/1c5e7488430196341cae155e119719d0fa69d800/runtime/platform.c#L223-L240

Feb 15 '23 15:02 gasche

@polytypic This is a blog post about contention with DynamoDB transactions and I am not convinced that it is relevant for typical use-cases of CAS loops over Atomic.t.

Feb 16 '23 08:02 gadmm

That article and that blog post discuss the implementation of spin locks. You are wrong in thinking that they should be relevant to the present PR. It would better help you making your points if you avoid having us read irrelevant papers.

Feb 17 '23 17:02 gadmm

Your last message does support the idea that low-level concurrent stacks (and probably other CAS-based shared structures) need a backoff strategy for scalability in high-contention scenarios. I think that @gadmm made a technically correct point that busy-wait loops are very different beasts that obviously benefit from contention-avoidance -- in an unpleasant way. I was also interested in the DynamoDB post on the importance of backoff-time randomization. One thing that was a bit worrying in the DynamoDB is the cost of the backoff strategy on throughput. In the paper you cite on the exponential-backoff stack, there is no particular throughput cost to backoff on their benchmarks, so this is more reassuring. This difference does support the idea that the difference in scale between DynamoDB and typically lockfree data structures means that we cannot fully transfer performance intuitions from one to another.

I was already rather convinced that having backoff by default is a reasonable approach. I wonder if we can find a backoff strategy that is "good enough" to be unconditionally enabled by a standard Atomic.modify-style function. (It doesn't need to consistently outperform all approaches, but it should be unsurprising and decent for broad usage.) I would rather not make the API more complex by making the backoff scheme configurable, if we can avoid it.

@polytypic thanks for your pointer on your update function. Just like fetch_add vs. add_fetch, we could have a get_modify variant of modify_get that returns the previous value. (I haven't looked at this in full details yet so I can't comment on the rest of your API choices.)

Unrelatedly: I'm happy to learn that the ABA issues appear to have magically solved themselves since I opened the PR thanks to @c-cube. ABA is hard to think about for a non-expert like me and I'm enthusiastic about not thinking about it at all.

Feb 18 '23 14:02 gasche

This last paper was the first one that showed up for me on duckduckgo for "lock-free data structure exponential backoff". To me it has been an example where it is better to spread accesses in space rather than time (here with an "elimination array"). I still think it makes my point, even if it displays a Treiber stack with exponential backoff as a control. (Besides, the paper does not really try to justify the control with backoff in time. I am worried about fairness but the paper only shows average latency—where should I start if I cannot even read a proper paper that presents the claims of people who defend the backoff in time?)

But I think this is missing the point anyway. The question is not whether in a given highly-contended situation adding a backoff (in time) can reduce cache coherence traffic and improve throughput (by giving priority to the winning thread). It is also not about helping programmers with specialised data structures (who can trivially write their own specialised loop). Most of the time you write basic CAS loops with uncontended situations in mind; otherwise you first try to rephrase your program to avoid contention or you think about using a specialised data structure to avoid reimplementing your own. Note that it is both possible that @polytypic has encountered specific scenarios where a backoff in time was a solution, and at the same time that it is not desirable to have a backoff parameter for Atomic.modify.

As a concrete example of avoiding contention, instead of incrementing atomically the next object id, the runtime allocates object ids in chunks of 1024 per domain. (Trying, instead, to reduce cache-coherence traffic by giving way to other threads is pretty low on the list of good ideas.) As another illustration of contention-avoidance strategies, here you have benchmarks for adaptive multi-counters (spreading a counter in space), which is pretty enlightening for its performance measurements. The best thing to do is probably very dependent on context and we have not yet seen evidence of backoff (in time) being a general solution (evidence which I have tried to find, independently of the various papers and blog posts @polytypic gave us which did not convince me).

Now trying to find a "default" backoff strategy sounds even harder given the lack of source material, but consider also that you would want it to be parametric in the user's scheduler (cf. similar discussions about a thread-safe Lazy).

As it stands though I have the impression that this whole discussion on backoff strategy stems from initially putting lock-free CAS loops and busy-waiting loops in the same bag. See e.g. the fact that the PR description mentions Domainslib.Chan, and cpu_relax as a backoff strategy (which is only advocated in Intel's optimization manual in relationship with spin locks). I am even more convinced than before that there should be no backoff parameter to avoid disseminating this confusion.

Feb 20 '23 00:02 gadmm

Coming from Clojure which has atoms ("shared, synchronous, independent state"), modify/modify_get would be very helpful in writing thread-safe code.

A couple of thoughts on the discussion based on my experience with Atoms in Clojure:

I agree with @gadmm that there's no need for a backoff parameter. It will complicate the function implementation and confuse users as to the intended purpose of the function.
I think the function parameter order should be [modify* r f]. I know that they apply f to r so [f r] mirrors that, but 1) every other function in Atomics has the atomic first and keeping that consistent makes learning and usage easier, and 2) if you're writing an anonymous function, it will probably span multiple lines which obscures the targeted atomic. The at_exit change in this PR is an example of why the atomic should go first.
I think that modify_get's signature should be 'a -> ('a -> 'a) -> 'a which is to say, the return value of f is the new value of the atomic. First, returning a new value means that the "get" of modify_get has different semantics than the "get" of get, which returns the value of the atomic. That's confusing and inconsistent. Second (and more importantly), forcing f to return two objects, one of which will be returned to the caller, and the other is set as the new Atomic value, complicates all modify functions for the rare instance where you want a different return value than new state of the atomic. Requiring that f returns a pair means that you can't use most built-in functions anymore.

For example, if you wanted to atomically build a list of objects and check the length after, you couldn't write

let new_list = modify_get r (List.cons "Hello world") in
if 10 <= List.count new_list
  then func_1 new_list
  else func_2 new_list

You'd have to write

let new_list = modify_get r (fun l -> let new_list = List.cons "Hello world" l in new_list, new_list) in
if 10 <= List.count new_list
  then func_1 new_list
  else func_2 new_list

This is especially evident if you think of fetch_and_add as sugar over modify_get:

let fetch_and_add r i = modify_get r ((+) i)

Compare this with Clojure's swap! which returns the new-val:

(let [new-list (swap! r cons "hello world")] ; prepend the string "hello world" to an atomic list
  (if (<= 10 (count new-list)) ; the length of the list is now 10 or more
    (func-1 new-list)
    (func-2 new-list)))

If you want a specialized return value that depends on the value of the old value while still modifying the new state (as in the docstring example), that could be a third function, something like modify_pluck.

Aug 21 '24 18:08 NoahTheDuke

ocaml
ocaml copied to clipboard

Atomic.modify, Atomic.modify_get

ocaml ocaml copied to clipboard

Atomic.modify, Atomic.modify_get

ocaml
ocaml copied to clipboard