moonpool
document the difference between `num_threads` on Moonpool and `num_domains` on Domainslib
With @clef-men I am trying to write benchmarks to compare concurrent schedulers, and we notice a "one-domain shift" between Domainslib and Moonpool: if D(N) is the performance of Domainslib with N domains, and M(N) is the performance of Moonpool with N "threads", then generally M(N) is noticeably worse than D(N), but in fact it is very close to D(N-1).
My understanding is that one of the following two hypotheses holds:
1. There is an unintended implementation bug in Moonpool where it uses "one less domain than expected"; for example, maybe the main domain sits idle instead of participating in task completion, as one might hope in CPU-bound workloads.
2. There is an intended difference in semantics between `Domainslib.Task.setup_pool ~num_domains:n` and `Ws_pool.create ~num_threads:n`, where the Domainslib parameter has to be understood as "number of extra domains, in addition to the main domain", and the Moonpool parameter has to be understood as "total number of domains that will participate in computation".
(2) sounds more likely, but I still consider it an issue, because this is not clearly documented and it results in confusing benchmark results. Given that Domainslib is the dominant scheduler for CPU-bound tasks, I think it would be nice, if Moonpool interprets its own parameter subtly differently, to document it clearly. Hence the present issue.
Two minor remarks:
- Even assuming that (2) holds, I remain uncertain about a sub-question: does this mean that the main domain intentionally does not participate in computations, or that it is included in `num_threads` and Moonpool will only spawn (n-1) domains? Given the sensitivity of the Multicore OCaml runtime to extra domains above the number of cores, I think it's important that Moonpool users know for sure how many domains in total are going to run when they pass a given `~num_threads` parameter.
- I tried to find a clear answer to the question of whether (1) or (2) holds by looking at the Moonpool codebase, and I failed to do so. There are many layers of stuff, with indirections via Picos. I don't know if there is any actionable feedback to extract from this remark, but maybe: if you add more complexity to the implementation, I think it would be nice to also make the documentation clearer and more complete.
A simple repro case
(* fibo.ml *)
let cutoff = 25
let input = 40

(* Sequential baseline, also used by the parallel versions below the cutoff. *)
let rec fibo_seq n =
  if n <= 1 then
    n
  else
    fibo_seq (n - 1) + fibo_seq (n - 2)

(* Domainslib version: [ctx] is a Domainslib.Task.pool. *)
let rec fibo_domainslib ctx n =
  if n <= cutoff then
    fibo_seq n
  else
    let open Domainslib in
    let fut1 = Task.async ctx (fun () -> fibo_domainslib ctx (n - 1)) in
    let fut2 = Task.async ctx (fun () -> fibo_domainslib ctx (n - 2)) in
    Task.await ctx fut1 + Task.await ctx fut2

(* Moonpool version: [ctx] is a Moonpool runner (here a Ws_pool). *)
let rec fibo_moonpool ctx n =
  if n <= cutoff then
    fibo_seq n
  else
    let open Moonpool in
    let fut1 = Fut.spawn ~on:ctx (fun () -> fibo_moonpool ctx (n - 1)) in
    let fut2 = Fut.spawn ~on:ctx (fun () -> fibo_moonpool ctx (n - 2)) in
    Fut.await fut1 + Fut.await fut2

let usage =
  "fibo.exe <num_domains> [ domainslib | moonpool | seq ]"

let num_domains =
  try int_of_string Sys.argv.(1)
  with _ -> failwith usage

let implem =
  try Sys.argv.(2)
  with _ -> failwith usage

let () =
  let output =
    match implem with
    | "domainslib" ->
      let open Domainslib in
      let pool = Task.setup_pool ~num_domains () in
      Task.run pool (fun () ->
        fibo_domainslib pool input
      )
    | "moonpool" ->
      let open Moonpool in
      (* The same integer is passed as ~num_threads here. *)
      let ctx = Ws_pool.create ~num_threads:num_domains () in
      Ws_pool.run_wait_block ctx (fun () ->
        fibo_moonpool ctx input
      )
    | "seq" ->
      fibo_seq input
    | _ -> failwith usage
  in
  print_int output;
  print_newline ()
$ ocamlfind ocamlopt -package domainslib,moonpool -linkpkg fibo.ml -o fibo.exe
$ hyperfine "./fibo.exe 4 domainslib"
Benchmark 1: ./fibo.exe 4 domainslib
Time (mean ± σ): 207.9 ms ± 3.2 ms [User: 999.8 ms, System: 8.3 ms]
Range (min … max): 199.8 ms … 214.5 ms 14 runs
$ hyperfine "./fibo.exe 4 moonpool"
Benchmark 1: ./fibo.exe 4 moonpool
Time (mean ± σ): 262.2 ms ± 3.3 ms [User: 1003.3 ms, System: 14.8 ms]
Range (min … max): 258.2 ms … 267.7 ms 11 runs
$ hyperfine "./fibo.exe 5 moonpool"
Benchmark 1: ./fibo.exe 5 moonpool
Time (mean ± σ): 211.1 ms ± 4.0 ms [User: 1002.0 ms, System: 16.6 ms]
Range (min … max): 204.9 ms … 216.7 ms 14 runs
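As a rough sanity check on these numbers, one can divide user CPU time by wall-clock time to estimate how many cores were busy on average. The sketch below does this for the three runs above. The helper name `busy_cores` is made up for illustration, and the ratio is only an approximation (it ignores system time and scheduling overhead).

```ocaml
(* Estimate the average number of busy cores as user time / wall time.
   [busy_cores] is a hypothetical helper, not part of any library. *)
let busy_cores ~user_ms ~wall_ms = user_ms /. wall_ms

(* Figures copied from the hyperfine output above:
   (label, user time in ms, mean wall time in ms). *)
let runs =
  [ "domainslib, 4", 999.8, 207.9;
    "moonpool, 4", 1003.3, 262.2;
    "moonpool, 5", 1002.0, 211.1 ]

let () =
  List.iter
    (fun (label, user_ms, wall_ms) ->
      Printf.printf "%-14s ~%.1f busy cores\n" label
        (busy_cores ~user_ms ~wall_ms))
    runs
```

By this estimate, Domainslib with 4 and Moonpool with 5 both keep a little under five cores busy, while Moonpool with 4 keeps a little under four, which fits the "one-domain shift" described at the top.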
Note: this repro case is pretty close to your own benchs/fib_rec.ml benchmark, but unfortunately in that benchmark you did not make the number of Domainslib domains a parameter (it only takes recommended_domain_count), and so you could not observe the difference at equal parameters.
https://github.com/c-cube/moonpool/blob/d957f7b54e7034f180d4a0921ccbce828e50f574/benchs/fib_rec.ml#L75-L79
Thank you for the report. I reproduced locally, and I think that 2. is correct.
It actually took me a little while to understand it because it seemed intuitive that n=4 meant that 4 domains participated in the computation! The mental model I have of moonpool is that there's N domains (N = recommended_domain_count()), with domains[0] = the main one, and then you build thread pools off of that.
The main places to look, if you want to follow what's going on, are:
- the domain pool `run_on` (creates a thread on `domains[idx]`, including the main domain at `domains[0]`)
- the admittedly complex initialization of `ws_pool`. We create `num_threads` threads on as many domains (possibly multiple threads per domain if you have more threads than domains).
I think documenting this is a good idea. If you don't specify a number it'll use recommended_domain_count and start one thread per domain, but still.
> Does this mean that the main domain intentionally does not participate in computations, or that it is included in `num_threads` and Moonpool will only spawn (n-1) domains? Given the sensitivity of the Multicore OCaml runtime to extra domains above the number of cores, I think it's important that Moonpool users know for sure how many domains in total are going to run when they pass a given `~num_threads` parameter.
Whatever you do, Moonpool will never start more than recommended_domain_count() domains. If you start 5 threads on a 2-core machine, it'll only start one additional domain and spread the threads between the main one and the additional one.
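To make the 5-threads-on-2-cores example concrete, here is a small model of how threads might be spread over a capped number of domains. It assumes a simple round-robin placement, which is an illustration only and not necessarily how Moonpool assigns threads. `spread_threads` and `max_domains` are names made up for this sketch; in practice the cap would be `Domain.recommended_domain_count ()`.

```ocaml
(* Model: place [num_threads] threads on at most [max_domains] domains,
   round-robin, with index 0 standing for the main domain.
   Round-robin placement is an assumption made for illustration. *)
let spread_threads ~max_domains ~num_threads =
  (* Never use more domains than the cap, nor more than there are threads. *)
  let used = max 1 (min max_domains num_threads) in
  let per_domain = Array.make used 0 in
  for i = 0 to num_threads - 1 do
    let d = i mod used in
    per_domain.(d) <- per_domain.(d) + 1
  done;
  per_domain

let () =
  (* 5 threads on a 2-core machine: two domains in use, i.e. the main one
     plus one additional domain. *)
  let per_domain = spread_threads ~max_domains:2 ~num_threads:5 in
  Array.iteri
    (fun d n -> Printf.printf "domain %d: %d thread(s)\n" d n)
    per_domain
```

In this model the number of domains never exceeds the cap, whatever `num_threads` is; only the number of threads per domain grows.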
In a very real way, Moonpool is my way to bypass the notion of domain and go back to a world where I can safely have many more threads than cores. :)
Thanks! Note that using recommended_domain_count () only works well in common cases, not all the time. For example, maybe I am running on a machine where I know that 3 cores will be busy with other work, and I want to use N-3 domains for my program. Or maybe I happen to know that the value of recommended_domain_count () is wrong on my machine and I need to set the value manually.
I found the place where you spawn domains, it is in run_on in moonpool_dpool.ml (with the main domain recruited in initialization code right above), and it works with a global array domains_ that always has size recommended_domain_count () but where not all elements are initialized (and so not all domains are spawned).
Yes, that's fair. I imagine it could be a global option that tells moonpool what size the main domain pool should be. If you want to go lower and only need one pool, that's already easy; otherwise there is really no way around explicitly changing the main domain pool.
I think that the current design of moonpool is okay in this respect, but I wish there was a clear mention in the documentation of the parameter num_threads, or maybe somewhere else, that this is different from Domainslib's num_domains as it also counts the calling domain, so the performance-preserving equivalent of ~num_domains:4 is ~num_threads:5.
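The correspondence described above can be written down as a pair of conversions. This is a sketch of the arithmetic only, based on the behaviour observed in this thread; the function names are invented for illustration and neither library provides them.

```ocaml
(* Domainslib: [num_domains] counts extra domains; the calling domain also
   works, so [num_domains + 1] domains take part in the computation. *)
let domainslib_workers ~num_domains = num_domains + 1

(* Moonpool: [num_threads] is taken to be the total number of workers
   (as understood in this thread), so it is the worker count itself. *)
let moonpool_workers ~num_threads = num_threads

(* The [~num_threads] value that gives Moonpool as many workers as
   Domainslib gets from a given [~num_domains]. *)
let num_threads_of_num_domains num_domains =
  domainslib_workers ~num_domains

let () =
  Printf.printf "~num_domains:4 corresponds to ~num_threads:%d\n"
    (num_threads_of_num_domains 4)
```

With such a helper, a benchmark could take a single "number of workers" parameter and derive each library's argument from it, rather than passing the same integer to both.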
Another situation where I don't want recommended_domain_count that just came to mind is if the calling program, which wants to use Moonpool, already uses K domains (in total) for some reason. My understanding is that in this case I want to pass (R-K+1) as the value of num_threads, where R is the recommended domain count, because the main domain will participate in the thread pool but the K-1 other domains will not.
Ideally the documentation would let me deduce this with confidence.
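The (R-K+1) arithmetic above can be sketched as a small helper. It encodes only the reasoning in this comment, not documented Moonpool behaviour. `num_threads_budget` is a hypothetical name, and the clamp to 1 is an added assumption so that the result is always a usable thread count.

```ocaml
(* [recommended] is R, e.g. the value of Domain.recommended_domain_count ();
   [in_use] is K, the total number of domains the program already uses,
   main domain included. The result is the candidate ~num_threads value. *)
let num_threads_budget ~recommended ~in_use =
  max 1 (recommended - in_use + 1)

let () =
  (* R = 8, K = 3: the main domain plus the 5 domains not yet spawned. *)
  Printf.printf "num_threads = %d\n"
    (num_threads_budget ~recommended:8 ~in_use:3)
```

The `+ 1` is there because the main domain is counted in K but would still take part in the pool.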
(For context: we spent some amount of time thinking that there was something wrong in the parallel-for implementation of Moonpool, that it would block the calling domain instead of having it participate in computations. It is only a while later that I understood that the observed performance difference probably only comes from a subtle API difference.)
> I think that the current design of moonpool is okay in this respect, but I wish there was a clear mention in the documentation of the parameter `num_threads`, or maybe somewhere else, that this is different from Domainslib's `num_domains` as it also counts the calling domain, so the performance-preserving equivalent of `~num_domains:4` is `~num_threads:5`.
That's fair. Do you think this is the kind of info that goes into the README? or in the scheduler's create function?
> Another situation where I don't want `recommended_domain_count` that just came to mind is if the calling program, which wants to use Moonpool, already uses K domains (in total) for some reason. My understanding is that in this case I want to pass (R-K+1) as the value of `num_threads`, where R is the recommended domain count, because the main domain will participate in the thread pool but the K-1 other domains will not. Ideally the documentation would let me deduce this with confidence.
Well this is absolutely something to avoid (having domains + moonpool). I think, and have thought for years, that this is a failure of the domains API (lack of composability — I'm writing sth about it right now actually), and moonpool is simply my way of avoiding it. If you mix other domains and moonpool it'll be a problem.
Underlining this in the docs is a good suggestion. Docs are not my forte but I'll make an effort.
The proper solution, imho, would be to chuck the domain API in the trash bin and provide a run_on_domain function like my domain pool does :).