Domains spawned by the Domain Manager do not inherit fiber-local state
Dear Eio maintainers,
I have been debugging a potentially surprising result this morning.
Consider a program with a fiber-local value i.
utop # #require "eio_main";;
utop # open Eio;;
utop # let i: int Fiber.key = Fiber.create_key ();;
val i : int Fiber.key = <abstr>
Quite sensibly, forked fibers inherit their parent's fiber-local state:
[code listing 1]
utop # Eio_main.run (fun env ->
         Switch.run (fun sw ->
           Fiber.with_binding i 42 (fun () ->
             Fiber.fork ~sw (fun () ->
               let o = Option.map (Printf.sprintf "%d") (Fiber.get i) in
               let v = Option.value o ~default:"???" in
               Flow.copy_string v (Stdenv.stdout env)))));;
42- : unit = ()
Now, what happens in a domain-spawning situation? One would reasonably expect the following Eio loop, which manually spawns a domain, to bomb out, because said spawn is happening outside the purview of Eio, and so there's no mechanism to associate the "parent" fiber's context with whatever state is inside the domain, much less handle any effects!
[code listing 2]
utop # Eio_main.run (fun env ->
         Switch.run (fun sw ->
           Fiber.with_binding i 42 (fun () ->
             Domain.join @@ Domain.spawn @@ fun () -> Fiber.get i)));;
Exception: Stdlib.Effect.Unhandled(Eio__core__Cancel.Get_context)
Of course, we have a solution for Eio and domains: Domain_manager! The docs for Domain_manager state that "[the function passed to Domain_manager.run] must only access thread-safe values from the calling domain", and fiber-local state is definitionally thread-safe. I would therefore expect a combination of the behaviour of code listing 1 and code listing 2: accessing a FLS value within a domain managed by Domain_manager should inherit the parent's value. It does not. Instead of Some 42, we get None back:
[code listing 3]
utop # Eio_main.run (fun env ->
         Switch.run (fun sw ->
           let domain_mgr = Eio.Stdenv.domain_mgr env in
           Fiber.with_binding i 42 (fun () ->
             Domain_manager.run domain_mgr @@ fun () -> Fiber.get i)));;
- : int option = None
This feels like an omission to me - clearly the spawned domain is able to handle effects (we do not fail to get the context like in code listing 2), and the fact that Eio uses lambda capabilities as a core design suggests that "lexical scoping" in this manner ought to produce the value I expected. What do the Eio maintainers think? (If you feel the current behaviour is the correct one, I'd suggest at minimum being explicit about this in the docs.)
Thanks!
Nathan
I found this comment in one of the tests, but it would be good to document it somewhere sensible:
https://github.com/ocaml-multicore/eio/blob/8f7f82d2c12076af8e9b8b365c58ebadaa963b8c/tests/domains.md#L194-L195
Possibly there should be a way to mark something as safe to share. What are you wanting to use it for?
> What are you wanting to use it for?
The context is that we are migrating a large legacy OCaml codebase from using fork-based coarse-grained parallelism to a fine-grained concurrency model with Eio. As you'd expect from a large legacy codebase, there are lots of occurrences of shared mutable state (that were previously race-free, owing to each child having its own address space) that we need to wrangle. Such pieces of mutable state are cross-cutting across the codebase and it unfortunately is not realistic to refactor the code to pass all of them around explicitly.
One stroke of luck: we previously had an Emacs excursion-like "scoped mutation" primitive (very much akin to Eio.Fiber.with_binding), so, when migrating to Eio, using fiber-local storage to ensure isolation between fibers seemed like a good fit.
There're some wrinkles with this as it relates to adopting FLS: in contrast to Domain.DLS, a Fiber.key can't have an initial value, so we have to maintain some additional state on our own; additionally, a common (but unfortunate, yes, I know) pattern is that such mutable values are initially parsed from command-line arguments, which require an unscoped set operation (that Eio does not expose). So, the implementation of this excursion-like primitive is a bit gnarlier than I would have liked, but that's "the cost of doing business" as they say.
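To make the two wrinkles concrete, here is a minimal sketch of what such a wrapper might look like. It is not the code from our codebase: the names `hook`, `create_hook`, `set_default`, `with_value` and `get` are hypothetical, and it assumes the stored values are safe to read from any domain. It pairs a Fiber.key (scoped binding) with an Atomic.t (initial value and unscoped set).

```ocaml
(* Hypothetical "hook" wrapper, not part of Eio. Requires the eio library.
   [get] and [with_value] must be called from inside an Eio event loop. *)
type 'a hook = {
  key : 'a Eio.Fiber.key;  (* scoped, per-fiber override *)
  default : 'a Atomic.t;   (* unscoped fallback, e.g. parsed from argv *)
}

(* Fiber.create_key takes no initial value, so the default is kept alongside it. *)
let create_hook (initial : 'a) : 'a hook =
  { key = Eio.Fiber.create_key (); default = Atomic.make initial }

(* Unscoped set: changes the fallback seen by every fiber without a binding. *)
let set_default h v = Atomic.set h.default v

(* Scoped set: the binding is visible in [f] and in fibers forked from it. *)
let with_value h v f = Eio.Fiber.with_binding h.key v f

(* Read the fiber-local binding if there is one, else the fallback. *)
let get h =
  match Eio.Fiber.get h.key with
  | Some v -> v
  | None -> Atomic.get h.default
```

Note that inside a domain started with Domain_manager.run, Fiber.get returns None (as in code listing 3), so `get` would silently fall back to the default there rather than to the parent's scoped value.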
So, that's the situation. Cases like code listing 3, where an operation executed in a thread pool depends on values set in fiber-local storage, are unfortunately very common for us.
> Possibly there should be a way to mark something as safe to share.
One wrinkle here is that possibly values might not be safe to share: for instance, the value stored in our "hook" datatypes might itself be mutable, like a Hashtbl (yes, yes, I know, it is not ideal). So, in a perfect world, Eio might expose something like Domain.DLS's split_from_parent, which could be id for an immutable value and something else otherwise.
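For reference, this is roughly how split_from_parent behaves in the standard library's Domain.DLS today. The sketch uses only the OCaml 5 stdlib (no Eio); the key name and table contents are made up for illustration. The function given as split_from_parent is applied to the parent's value when a child domain is spawned, so a mutable table can be copied rather than shared.

```ocaml
(* Standard library only (OCaml 5). A domain-local Hashtbl that each child
   domain receives as a copy of the parent's table at spawn time. *)
let table_key : (string, int) Hashtbl.t Domain.DLS.key =
  Domain.DLS.new_key ~split_from_parent:Hashtbl.copy
    (fun () -> Hashtbl.create 16)

let () =
  let parent_tbl = Domain.DLS.get table_key in
  Hashtbl.replace parent_tbl "x" 1;
  let child_saw_x =
    Domain.join (Domain.spawn (fun () ->
      let child_tbl = Domain.DLS.get table_key in
      let saw = Hashtbl.find_opt child_tbl "x" in
      (* This mutates the child's copy only. *)
      Hashtbl.replace child_tbl "y" 2;
      saw))
  in
  (* The child inherited the parent's entry... *)
  assert (child_saw_x = Some 1);
  (* ...but its own write did not leak back into the parent's table. *)
  assert (Hashtbl.find_opt parent_tbl "y" = None)
```

For an immutable value, passing `Fun.id` as split_from_parent would share the value unchanged.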
I understand that this would both add complexity to the API and have performance implications - spawning a fresh Fiber is cheap because the backing fiber-local hmap can start out empty.
If you have any thoughts here, I'd be very keen to hear them. A few things I am thinking about:
- I've thought about using Domain-local storage with split_from_parent to inherit the correct value in the executor; however, this feels problematic in the presence of fibers migrating between backing domains so I don't believe this would be correct (and would require fiddly invariants around "executor pools cannot be created until all Hook values are set, to ensure the correct values are inherited"). Broadly this would be barking up the wrong tree.
- We could "roll our own" fiber-local storage, except the notion of a "fiber ID" is hidden from us (I believe it's only exposed in the Trace module?) so we couldn't do that without upstream changes.
- I come from a more traditional concurrent programming background and am extremely far from an expert on effect systems; if we rolled our own set of effects, could we get access to the Eio context there, somehow? I suspect the type system would still hide the backing hmap that we would want to peek and poke at, but perhaps I'm wrong.
Thanks for your guidance! Nathan
Since Eio always uses a domain manager to spawn new domains, you could wrap the default with one that copies some or all of the fiber-local values across.
> this feels problematic in the presence of fibers migrating between backing domains
Note that Eio never migrates fibers between domains (although OCaml itself supports it).
> if we rolled our own set of effects, could we get access to the Eio context there, somehow?
You can get access using Eio.Private.Effects if you really want to (and call e.g. Private.Fiber_context.tid to get the ID). Though obviously you're more at risk of things breaking when new Eio versions come out then.
Belated thanks, Thomas.
> Note that Eio never migrates fibers between domains (although OCaml itself supports it).
Ah! I had missed this (I had only skimmed the scheduler implementation and thought it was centralized), but this is a very good fact to know.
> Since Eio always uses a domain manager to spawn new domains, you could wrap the default with one that copies some or all of the fiber-local values across.
Creating a wrapper domain manager is an interesting thought - I'd like to experiment with other things such as CPU affinity down the road so having a shim layer might be broadly useful for us.
However, regarding copying fiber-local values over: Can this be done without modifying Eio itself? From what I gather, I'd need to pull out vars from the fiber context in the domain manager, but the interface for Private.Fiber_context makes t opaque and does not expose {get|with}_vars. (If you're amenable to changing this, then I certainly see a path forward. If you're thinking of something different, could you elucidate?)
Thanks for all your replies in this thread. It's been tremendously helpful.
You probably don't want to copy all of vars anyway, just the ones you know are thread-safe. You could keep all of your program's vars in their own map and just copy that, for example.
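A rough sketch of that idea using only the public API, for a single key the program owns. `run_with_binding` is a hypothetical helper name, not something Eio provides, and the sketch assumes the bound value is thread-safe. It reads the binding in the calling fiber and re-establishes it inside the new domain; a full wrapper domain manager would do the same around every run.

```ocaml
(* Hypothetical helper, not part of Eio. Requires the eio library.
   Assumes the value bound to [key] is safe to share between domains. *)
let run_with_binding mgr (key : 'a Eio.Fiber.key) (f : unit -> 'b) : 'b =
  (* Fiber.get runs here, in the calling fiber, before the domain starts. *)
  match Eio.Fiber.get key with
  | None -> Eio.Domain_manager.run mgr f
  | Some v ->
    Eio.Domain_manager.run mgr (fun () ->
      (* Re-bind the captured value in the new domain's fiber. *)
      Eio.Fiber.with_binding key v f)
```

With the key i from the original report, `run_with_binding domain_mgr i (fun () -> Fiber.get i)` would be called in place of `Domain_manager.run domain_mgr` in code listing 3. This only works for keys the caller can name, so it does not help with keys hidden inside a third-party library.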
Here's another example of us running into this issue, this time in a 3rd party library:
The opentelemetry library needs to keep track of the current "scope" to do tracing, which is just the current context/span a function is in. It does this via opentelemetry_ambient_context_eio.ml, which stores this scope in the fiber keys. This makes sense for single domain programs, since it allows concurrent fibers to keep track of their own scope. But when we introduce multiple domains, this scope is not copied over, and so spans created in these new domains are orphaned, since they no longer have the ability to tell what their parent was.
Maybe the question here is: "how can consumers of eio keep track of some sort of context across fibers AND domains".
Looking at Go, there are a few libraries (gls, and Go's own context) that provide further motivation for this issue beyond the cases we've run into above. Google's motivation for adding goroutine context is that for web servers it's helpful to have a context per request.
We could imagine someone wanting to do something similar with eio: if someone creates a web server, where each request is handled by a fiber, we might want to store some context about the request (user id, ip address, etc.) instead of passing it around as a function parameter everywhere. With how fiber-local bindings currently work, this would break: the minute a fiber decided to spawn a new domain (perhaps to do something computationally heavy), we would lose this context.
To answer my own question, I would suggest we DO propagate fiber-local bindings when spawning fibers across domains. This seems like the easiest + most natural solution to this problem. The values may not be thread safe, but I think this is fine, as in my opinion the consumer of the library should be responsible for managing this. If a user spawns a fiber across domains, they're responsible for managing the thread safety of the values that fiber might close over. I think it's fair to ask that if a user uses a (relatively) obscure fiber interface, we can ask them to manage the thread safety of the values themselves too :).
@talex5 let me know what you think of this proposal, I think if it makes sense I'd be happy to make the change + update docs myself
Just to build a bit on top of @ajbt200128 's comment: this is an issue that cannot be solved by implementing a custom domain manager, since the fiber key is constructed in a third party dependency and is not exposed publicly so we don't have a handle to copy things over during domain spawning. I second @ajbt200128's proposal to make fiber values visible across domains since that's really the minimally-surprising behaviour (but might resubmit my proposal for a Domain.DLS-style split_from_parent argument too, if we really wanted this to be opt-in).
I solicited input from @patricoferris, who offered this helpful analysis (which I am sharing with their permission):
I see three paths moving forward
1. Via some mechanism we make it possible to plumb through the FLS across domains. This could either be by exposing a function to grab a fiber's complete storage and then another function (e.g. with_bindings) to set a fiber's storage to this value. To simplify this for users, perhaps Domain.run gets a ?with_fls argument that defaults to false that essentially does the previous bit for them.
2. Both semgrep and open-telemetry provide wrapper domain_mgrs that copy only their FLS bindings into domains -- these should compose just fine, but it might be easy to forget to do so.
3. Make FLS copied by default across domains.
The pro of (3) is that things would just work automatically; users would not have to do anything. But it may contradict what talex5 has in mind as the safer API.
The pros of (1) are that semgrep could just set this to true and the opentelemetry problem would be fixed (maybe the opentelemetry API can remind users that they must set that value to true).
The pros of (2) are only the minimal amount of bindings are copied across which is probably a little better in terms of memory usage and safety...
IMO, something like (1) sounds like the nicest API, and IIUC this would not impose any extra cost unless people opted to use it.
@talex5 would you welcome work on this, and if so, do you have a strong preference or opinion on the alternatives?
The simplest thing would be adding some kind of optional Fiber.create_key ~share_on_domain_spawn:true () flag. Though declaring that all keys must be thread-safe is another possibility.
One extra point though: people are talking about copying the state when spawning domains, but typically you create an Executor_pool at the start of your program and use Executor_pool.submit to run jobs. It seems more likely to me that you want some way of preserving the context when submitting a job, not when spawning the domain.
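As an illustration of preserving context at submission time, something like the following could be written in user code today for a key the caller knows. This is a sketch, not a proposed Eio API: `submit_with_binding` is a hypothetical name, it assumes an Eio version that provides Executor_pool.submit_exn, and it assumes the bound value is thread-safe.

```ocaml
(* Hypothetical helper, not part of Eio. Requires the eio library.
   Captures the submitting fiber's binding for [key] and re-binds it
   around the job when it runs on a worker domain. *)
let submit_with_binding pool ~weight (key : 'a Eio.Fiber.key)
    (f : unit -> 'b) : 'b =
  let job =
    match Eio.Fiber.get key with  (* evaluated in the submitting fiber *)
    | None -> f
    | Some v -> fun () -> Eio.Fiber.with_binding key v f
  in
  Eio.Executor_pool.submit_exn pool ~weight job
```

Because the binding is captured per job rather than per domain, a long-lived pool created at program start would still see the context of whichever fiber submitted each job.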