
The implementation is very unsafe

Open gasche opened this issue 3 years ago • 11 comments

I had a quick skim through Idle_domains.ml, and I am surprised by the number of low-level, unsafe or possibly-unsafe features used everywhere. My understanding is that idle-domains is supposed to orchestrate somewhat compute-intensive "scheduler" functions. Why not optimize for simplicity and correctness instead of performance in that scenario?

I'm by no means a parallelism expert. It may be the case that the distribution of "work durations" is similar to the distribution of data in typical functional programs: some are very large, but the vast majority are in fact quite small. Is that the idea? (Do we actually know this?)

What it feels like when reading the code is that you have become so used to using advanced, low-level, unsafe programming patterns that you are using them liberally without thinking twice about it. (But maybe there are good reasons!) More precisely:

  • The library opens with let null () = Obj.magic (), a function of type unit -> 'a that is completely unsafe. That sounds very dangerous. If you want to make up fake values at some types, then restrict your nulls to those types. But why not just use 'a option instead; are those nulls performance-critical? (See the sketch after this list.)

  • Then the second occurrence of Obj.magic, to coerce from Domain.id to private Domain.id, is unnecessary. (You don't seem to know how to use private, which is okay; I'll show you at the end of this post.)

  • You use Multicore_magic all over the place. This may be necessary, but is it? (How would array-of-references fare in practice instead of dealing with padding unsafely by hand?)

  • You constantly use Array.unsafe_set and Array.unsafe_get, including in places where it is clearly unsafe (it would be easy to write a well-typed client of your library that breaks memory safety completely). Have you considered tolerating the small costs of bounds-checking instead?
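
A minimal sketch of the 'a option alternative from the first bullet; the names slots, set_slot and run_slot are made up for illustration and are not part of idle-domains:

(* Hypothetical: represent "no value yet" explicitly with 'a option
   instead of conjuring fake values with Obj.magic. *)
let slots : (unit -> unit) option array =
  Array.make (Domain.recommended_domain_count ()) None

let set_slot i work = slots.(i) <- Some work

let run_slot i =
  match slots.(i) with
  | None -> ()            (* empty slot: no fake value is ever observed *)
  | Some work -> work ()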

A magic-free use of private:

module Managed_id : sig
  type t = private Domain.id
  val self : unit -> t
end = struct
  type t = Domain.id
  let self = Domain.self
end
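
A hypothetical usage note: with this signature, only Managed_id.self can produce values of the type, and callers can still coerce back to Domain.id without any Obj.magic:

let me : Managed_id.t = Managed_id.self ()
let id : Domain.id = (me :> Domain.id)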

An unsafe use of unsafe_get:

let next (id : managed_id) = Array.unsafe_get !next_sibling (id :> int)

(* unsafe consumer code: *)
let weird = Array.init 1024 (fun _ ->
  Domain.spawn (fun () ->
    Idle_domains.(self () |> next)))

(For the record: this is wildly unsafe because next performs an unchecked Array.unsafe_get on an array of size recommended_domain_count, whereas the domain identifiers used here are going to be much higher than that on my machine.)

I hope you don't mind the frank feedback, but this festival of unsafe code would make me shy away from the library if I were considering using it in production. Sure, there are performance constraints here, so it may be that we need some unsafe code. But the current implementation is basically all done within a big unsafe { ... } block.

gasche avatar Jan 17 '23 14:01 gasche

Thanks for the feedback!

The current implementation is still mostly for my experimental uses, meaning that I want to understand the minimal possible overhead imposed by a library of this kind. I consider that vital for a framework that is supposed to let one write high-performance stuff on top of it. If a framework like this imposes significant overheads, it will just not be used (or that would be my assumption).

Some more comments here (and in later replies as I have time):

My understanding is that idle-domains is supposed to orchestrate somewhat compute-intensive "scheduler" functions.

Hmm... Not exactly. I don't want to assume that schedulers are expensive. On the contrary, a scheduler would ideally be quickly started and stopped depending on demand. And that is kind of the assumption that the current design of idle-domains makes: schedulers can be quickly started and stopped.

On the other hand, a scheduler might orchestrate compute-intensive work and benefit from being able to spawn new schedulers on idle domains to share the load. IOW, schedulers should be relatively cheap and the work that they perform might not be cheap, but ideally that would not be an assumption either. A scheduler for asynchronous computation does not necessarily perform a lot of compute-intensive work, but it might ideally want to be able to suspend and resume with minimal overheads. I'd like those kinds of schedulers to also be profitably written on top of something like idle-domains. During the periods when such an async scheduler has no work, I'd like it to be possible to use the domain for running other schedulers.

Consider libraries like domainslib or Eio, for example. When I say "libraries" in plural, I really mean to think of multiple independently written libraries like that. (BTW, my goal here isn't to pick on domainslib, for example, but rather to point out an issue with it and propose a potential path towards a solution.) Currently the design of domainslib uses a pool of worker domains. I believe such a design will not work well with multicore OCaml. As mentioned in the README, the problem is that domains are just too expensive and non-compositional. Libraries creating their own pools of workers are in direct competition and incompatible due to the costs of domains in multicore OCaml. Eio does a similar thing by essentially reserving a domain (or domains) for running fibers. Either way, you don't want to have more than one library doing that (pools or reserved domains).

Now, for another exercise, imagine writing domainslib on top of Eio or vice versa. My experience is that it is not going to give good results. They use very different algorithms for scheduling and make very specific assumptions about their intended usage.

So, the idea here is to try to come up with something different or lower level that allows one to write libraries like domainslib or Eio and have them profitably share domains.

I currently think that the design of idle-domains might not yet properly support libraries like Eio where fibers are pinned to domains.

polytypic avatar Jan 17 '23 17:01 polytypic

You use Multicore_magic all over the place. This may be necessary, but is it? (How would array-of-references fare in practice instead of dealing with padding unsafely by hand?)

The difference is qualitative in nature.

The problem with an "array-of-references" style approach is that it actually does not properly address false sharing. It makes the mutable locations slightly sparser, at the expense of using more memory and more indirection. The extra sparseness somewhat reduces the probability of accidental false sharing (50% perhaps).

The explicit padding approach, OTOH, directly addresses accidental false sharing. It makes sure that consecutively allocated padded objects have no accidental false sharing between them. Unfortunately, it cannot prevent unpadded objects from being allocated just in front of padded objects. However, if you use padding for all long-lived, frequently accessed objects, you are likely to essentially eliminate false sharing (99%). It is not just a matter of luck or somewhat reduced probability.

To continue, I might, for example, want to use an array of padded (or otherwise guaranteed not to be falsely shared) references or atomics in some multicore algorithms. This could give a data structure where e.g. individual "slots" could be allocated for communication and would be guaranteed (99%) to be free of false sharing. I would not use an array of unpadded references in the hope that it would reduce false sharing (50%).
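
A minimal sketch of that style, assuming Multicore_magic.copy_as_padded behaves as in the multicore-magic library (it returns a copy of a boxed value padded to roughly a cache line); the function names are made up for illustration:

(* One padded atomic "slot" per domain: consecutively allocated slots do
   not end up on the same cache line. *)
let make_padded_slots () : int Atomic.t array =
  Array.init (Domain.recommended_domain_count ()) (fun _ ->
    Multicore_magic.copy_as_padded (Atomic.make 0))

(* For comparison, the unpadded variant only adds indirection; the atomics
   themselves may still be packed next to each other in memory. *)
let make_unpadded_slots () : int Atomic.t array =
  Array.init (Domain.recommended_domain_count ()) (fun _ -> Atomic.make 0)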

polytypic avatar Jan 17 '23 17:01 polytypic

The extra sparseness somewhat reduces the probability of accidental false sharing (50% perhaps).

Is this just a guess, or have you measured this?

I'm no expert, but I would be willing to believe that there are usage modes of arrays of references that actually reliably prevent false sharing, though they depend on assumptions about the OCaml runtime. Ideally, each reference should be allocated by its own domain. (But other approaches may work.) Have you been able to measure false sharing using this approach?

gasche avatar Jan 17 '23 17:01 gasche

Is this just a guess, or have you measured this?

This is based on my understanding of computer architecture and the OCaml memory model and runtime. It is not a guess and it is not based on measurements. IOW, I'm not looking for a statistical solution. I'm looking for a deterministic solution. Padding alone is not quite the ideal solution, but it is probably the best that can be done at the moment in OCaml. (Ideally one would be able to control both alignment and padding/size of objects.)

(On a 64-bit arch.) A reference or atomic object occupies 2 consecutive aligned words, or 16 bytes. The first word of an object is the header word. A single word or entry in an array is 8 aligned bytes. False sharing happens when one mutable location and another location happen to be in the same cache line. Cache lines are typically 64 or 128 bytes and start at aligned 64 or 128 byte boundaries. The 50% refers to the probability of false sharing considering a densely packed array/record of 8 byte words vs a memory region densely packed with 16 byte references or atomics (with 64-byte lines, for example, a line holds 8 packed one-word entries vs 4 packed two-word ref/atomic objects, so the chance of two given locations colliding on the same line is roughly halved, not eliminated).

Ideally, each reference should be allocated by its own domain.

And then the GC promotes those to the major heap, or you otherwise publish the references to other domains, and you have false sharing again.

Note that false sharing is most problematic with long lived objects, including constant data that happens to end up next to a frequently mutated object.

Perhaps in some cases one might be able to design algorithms that use short lived objects instead of long lived objects. Short lived objects introduce problems of their own, and that is also not really a solution that eliminates false sharing per se. Basically, as soon as you publish those objects to other domains, you have the possibility of false sharing.

As I commented elsewhere, before I started using padding to eliminate accidental false sharing, benchmark results tended to vary wildly. That might even be the biggest advantage of using something like padding to essentially eliminate false sharing: it removes a huge amount of noise from any performance measurements.

polytypic avatar Jan 17 '23 18:01 polytypic

And then the GC promotes those to the major heap, or you otherwise publish the references to other domains, and you have false sharing again.

As far as I know, the major heap uses per-domain page-aligned pools, so I'm not sure we can have false sharing between values promoted from the minor heaps of different domains.

gasche avatar Jan 17 '23 19:01 gasche

Regarding unsafe code, some general OCaml recommendations would be:

  • never use Obj.magic
  • only use Array.unsafe_{get,set} when you can prove that the index is within bounds (and then why not include the proof argument in a comment; see the sketch after this list)
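
A small sketch of the second recommendation, with a made-up function, keeping the bounds argument right next to the unchecked access:

let sum_first_n (a : int array) n =
  assert (0 <= n && n <= Array.length a);
  let total = ref 0 in
  for i = 0 to n - 1 do
    (* In bounds: 0 <= i < n <= Array.length a, by the assert above and
       the loop bounds, so the unchecked read cannot go out of range. *)
    total := !total + Array.unsafe_get a i
  done;
  !total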

gasche avatar Jan 17 '23 20:01 gasche

As far as I know, the major heap uses per-domain page-aligned pools, so I'm not sure we can have false sharing between values promoted from the minor heaps of different domains.

That is interesting. But it doesn't actually change potential false sharing.

False sharing is used to describe the performance pitfall of having two or more CPU cores access logically unrelated locations in memory that happen to map to a single cache-line-aligned region, where at least one of the CPU cores is writing.

It doesn't matter who (which CPU core or domain) allocated the locations that are involved in false sharing. What matters is that two or more cores eventually end up accessing the falsely shared locations and at least one of them is writing.

However, false sharing is really just a handy name for a specific performance pitfall. What I'd recommend is to get a basic understanding of the MSI protocol. Modern systems use somewhat more complex protocols, but a basic understanding of MSI should be enough to understand the most important phenomena.

Below is an even further simplified model of shared memory based on the MSI protocol. While the model is not precise, it should actually help to explain many phenomena, including the false-sharing performance pitfall.

Simplified MSI model of shared memory

The memory is divided into locations called cache lines. Each core has its own copy of the whole memory (all cache lines) and additionally considers each of its own cache lines to be in one of the following three states:

  • Modified
  • Shared
  • Invalid

You could initially assume that the whole memory of each core (all cache lines) is in the Shared state. This means that all the cores have the same value stored in corresponding cache lines. State changes do not happen spontaneously. Some core needs to perform a read from or a write to a cache line.

When a core reads a cache line in Shared state, it can do that for essentially free. Writing to a cache line in the Shared state is a bit more expensive. It means the state of the cache line in the core that performs the write is changed to the Modified state. Additionally, the state of the corresponding cache line in all the other cores is changed to the Invalid state.

When a cache line of a core is in the Modified state, the core can both read and write the cache line for free. The state of the cache line needs to be Invalid in all the other cores and neither read nor write require changing the state of the cache line in those other cores.

When a core wants to read a cache line in the Invalid state, it needs another core that has that cache line in either Modified or Shared state to send a copy. If the sender was in Modified state, then the sender must change state to Shared. The reader also marks the state Shared. Writing to an Invalid cache line also requires another core that has that cache line in Modified or Shared state to send a copy of the cache line and then all the cores that have the cache line in Modified or Shared state must change their state to Invalid. The writer sets the state to Modified.

The various state changes and transmissions of cache line contents have potentially different costs. Transitions from the Invalid state to other states tend to be expensive.
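
As a rough illustration, the transition rules above could be encoded like the toy model below. This is only a sketch of the simplified states; it ignores costs and real hardware details:

type state = Modified | Shared | Invalid

(* One core reads its copy of a line: a Modified or Shared copy is read
   as-is; an Invalid copy must be supplied by another core, and a Modified
   owner downgrades to Shared. *)
let on_read ~mine ~others =
  match mine with
  | Modified | Shared -> (mine, others)
  | Invalid -> (Shared, List.map (function Modified -> Shared | s -> s) others)

(* One core writes: the writer ends up Modified and every other core's
   copy of the line becomes Invalid. *)
let on_write ~others = (Modified, List.map (fun _ -> Invalid) others)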

While drastically simplified, this model should help to understand the performance of many multicore algorithms.

For example, I believe you referred to the use of references in the work-stealing deque of domainslib. You might want to think about how many and which kind of cache line state changes and cache line transfers various operations on the deque introduce. The total cost is not just about "false sharing". As a starting point, writing to a reference means that the corresponding cache line becomes Modified in the core that performed the write. All other cores change that cache line to the Invalid state.
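
For a concrete feel of those costs, here is a rough micro-benchmark sketch (an illustration only, not code from idle-domains or par-ml): two domains write in tight loops to atomics that are either allocated back to back or padded apart. Multicore_magic.copy_as_padded is assumed to behave as in the multicore-magic library, and the unix library is used for timing. On typical hardware the padded pair should run noticeably faster.

let time_two_writers a b =
  let writer r () = for _ = 1 to 10_000_000 do Atomic.incr r done in
  let t0 = Unix.gettimeofday () in
  let d1 = Domain.spawn (writer a) and d2 = Domain.spawn (writer b) in
  Domain.join d1;
  Domain.join d2;
  Unix.gettimeofday () -. t0

let () =
  (* Allocated back to back: likely on the same cache line, so each write
     invalidates the other core's copy of that line. *)
  let a = Atomic.make 0 and b = Atomic.make 0 in
  Printf.printf "adjacent: %.3fs\n" (time_two_writers a b);
  (* Padded: each atomic sits on its own cache line, so each writer keeps
     its line in the Modified state. *)
  let a = Multicore_magic.copy_as_padded (Atomic.make 0)
  and b = Multicore_magic.copy_as_padded (Atomic.make 0) in
  Printf.printf "padded:   %.3fs\n" (time_two_writers a b)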

polytypic avatar Jan 17 '23 21:01 polytypic

let next (id : managed_id) = Array.unsafe_get !next_sibling (id :> int)

(* unsafe consumer code: *)
let weird = Array.init 1024 (fun _ ->
  Domain.spawn (fun () ->
    Idle_domains.(self () |> next)))

Yes, with the current implementation that would do nasty things. 👍

Of course, the idea is that Idle_domains.self () may only be called from a managed domain. So, to be safe, Idle_domains.self () should check that it is not called from other domains. Alternatively, Idle_domains.self could actually be removed from the API. It is convenient, but perhaps not strictly necessary.
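
A hypothetical sketch of such a check, purely for illustration; managed is a made-up registry standing in for however idle-domains actually tracks its managed domains, and the real self returns a managed_id rather than a Domain.id:

(* Made-up registry of managed domains; not the idle-domains API. *)
let managed : Domain.id list ref = ref []

let self () =
  let id = Domain.self () in
  if not (List.mem id !managed) then
    invalid_arg "Idle_domains.self: called from an unmanaged domain";
  id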

BTW, I'd actually love it if managed domain ids were in the [0, n[ range. It would eliminate the need to have Idle_domains.next and Idle_domains.all in the API (maybe adding e.g. Idle_domains.count () instead) and ensure that domain local storage could be as efficient as possible.

Thanks again for the feedback!

It was especially helpful for me to have to think about what to write in the first response, and it also gave me some potential ideas to explore next (i.e. on how to support async schedulers). Also happy to receive further feedback. I'm especially interested in thoughts on the API in general.

I understand the concerns you raised about the use of unsafe language features. Like I said, the implementation is written as it is mostly to understand whether it is actually feasible to add this kind of abstraction layer without having it cause prohibitive costs. It seems promising: the overheads seem very low, but it is not 100% free compared to the implementations in par-ml that don't use idle-domains. I plan to rewrite all the variants in par-ml to use idle-domains.

polytypic avatar Jan 17 '23 22:01 polytypic

My mental model of false sharing is that data races on a cache line (access from two CPUs, at least one of them writing (or CASing)) are expensive.

In the case of a domain-indexed array, the typical issue is that all elements of the array sit contiguously in memory and many share a cache line, so even if domains access separate elements they can race at the cache-line level.

My understanding of the current OCaml runtime is that values allocated by different domains will end up in disjoint pages during their lifetime (already in the minor heap, and then also in the major heap; allocators try to ensure that separate cores allocate memory separately for better locality and to avoid weird NUMA effects).

So for a domain-indexed array of references, if there is no write to the array (after the initialization phase) and each reference is indexed by the domain, my understanding is that there would be no false sharing between the references: as long as distinct domains access only "their own element", they are accessing them in disjoint cache lines and not creating races.

Now, in absence of padding, it remains of course possible that something else is placed on the same cache line as the "domain-indexed reference of a domain", for example an atomic variable that does get written from other domains, and that this would introduce false sharing -- but then shouldn't this atomic variable be padded instead?

gasche avatar Jan 18 '23 05:01 gasche

My mental model of false sharing is that data races on a cache line (access from two CPUs, at least one of them writing (or CASing)) are expensive.

In the case of a domain-indexed array, the typical issue is that all elements of the array sit contiguously in memory and many share a cache line, so even if domains access separate elements they can race at the cache-line level.

Yes, in terms of the simplified MSI model, the cache line becomes Invalid in all the cores except the one that wrote. As the other cores continue accessing the cache line, they cause expensive transitions from Invalid to other states.

My understanding of the current OCaml runtime is that values allocated by different domains will end up in disjoint pages during their lifetime (already in the minor heap, and then also in the major heap; allocators try to ensure that separate cores allocate memory separately for better locality and to avoid weird NUMA effects).

Yes, let's assume so.

So for a domain-indexed array of references, if there is no write to the array (after the initialization phase) and each reference is indexed by the domain, my understanding is that there would be no false sharing between the references: as long as distinct domains access only "their own element", they are accessing them in disjoint cache lines and not creating races.

With the assumption, yes.

Now, in absence of padding, it remains of course possible that something else is placed on the same cache line as the "domain-indexed reference of a domain", for example an atomic variable that does get written from other domains, and that this would introduce false sharing -- but then shouldn't this atomic variable be padded instead?

It is not even necessary for those other things to be written from other domains. You just need one frequent writer (and it might be the owner of the heap).

Consider the following diagram:

   0        8        16       24       32       40       48
   +--------+--------+--------+--------+--------+--------+
   | hdr bef  r or w | hdr ref   r & w | hdr aft  r or w |
   +--------+--------+--------+--------+--------+--------+

The idea is that there is a ref between two other objects. Let's even assume that the domain from whose heap that block of memory has been allocated is the only domain that writes to the ref. However, the two other objects, bef and aft, have been allocated from the same heap and are accessed by other domains. Now we still have false sharing.

Assuming that we do not pad the ref itself, adding padding to the bef object would help to avoid the false sharing with the ref; for the aft object it would not help, since aft's padding goes after it and aft still sits on the same cache line as the ref.

I would advise against relying too much on the current implementation details of OCaml's GC to avoid false sharing (or other costly shared memory access patterns). What happens, for example, if multicore OCaml starts to support compaction? The consistent use of padding on all long lived, frequently accessed objects would continue to work even then.

Ideally, I'd like there to be ways to specify both alignment and padding for heap allocated objects. Sometimes padding is needed (or is desirable) not just at the end of an object but also between fields of a record, for example.

Here is a less complete approach. The OCaml GC would be extended to support a cache line granularity bit in objects. It could be a single bit in the header that tells the GC that, during promotion, the object must be placed at a cache line aligned address and no other objects may be placed in the same cache line as that object. The Obj module would be extended with an operation that sets that bit:

val set_cache_line_granularity : t -> bool -> unit

One would then selectively set the cache line granularity bit for objects that should be placed on their own cache lines to avoid costly shared memory access patterns.
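
A hypothetical usage sketch of the proposal; set_cache_line_granularity does not exist in today's Obj module, it is only the operation suggested above:

(* Mark a long-lived, frequently written atomic so that, under the
   proposal, the GC would give it its own cache line when promoting it. *)
let hot_counter = Atomic.make 0
let () = Obj.set_cache_line_granularity (Obj.repr hot_counter) true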

The idea behind having that bit only be read at promotion time is to avoid making allocation any slower.

Since false sharing is most expensive for long lived objects, and one could assume those usually end up being promoted, this would likely be a fairly effective way to avoid false sharing.

polytypic avatar Jan 18 '23 11:01 polytypic

FYI, I wrote a gist with an explanation of the simplified MSI model and included a program that tries to quantify the costs of various transitions.

polytypic avatar Jan 18 '23 21:01 polytypic