eio icon indicating copy to clipboard operation
eio copied to clipboard

Nested Fiber.any race condition

Open adamchol opened this issue 6 months ago • 3 comments

Hey all, thanks for the awesome work on Eio!

I have a weird race condition with nested Fiber.any and I don't quite understand it.

Here is a simplified example of the code that reproduces the situation:

Eio_main.run (fun env ->
    let promise_r, promise_u = Eio.Promise.create () in

    let res =
      Eio.Fiber.any
        ~combine:(fun x y -> x + y)
        [
          (fun () ->
            (* function 1 *)

            (* let the second function start *)
            Eio.Time.sleep env#clock 1.0;

            Eio.Promise.resolve promise_u 1;
            Printf.printf "Promise resolved\n%!";
            10);
          (fun () ->
            Eio.Fiber.any
              ~combine:(fun x y -> x + y)
              [
                (fun () ->
                  (* function 2 *)
                  Eio.Time.sleep env#clock 10.;
                  2);
                (fun () ->
                  (* function 3 *)
                  let x = Eio.Promise.await promise_r in
                  Printf.printf "Promise returned\n%!";
                  x);
              ]);
        ]
    in
    Printf.printf "Result: %d\n" res)

The output from this is:

Promise resolved
Promise returned
Result: 10

Although, what I would expect from this is Result: 11. From my understanding, when the promise is resolved and "function 1" returns and other function are being checked if they are "unlocked". The "function 3" is unlocked and it returns, so it would seem like the nested Fiber.any should return in a whole and combine on the first Fiber.any should be invoked with values 10 and 1.

But seems like the second Fiber.any is cancelled and only the first function actually returns.

Can someone explain what's happening here?

adamchol avatar May 26 '25 09:05 adamchol

It's due to Fiber.any explicitly checking whether it's been cancelled before returning:

https://github.com/ocaml-multicore/eio/blob/8f7f82d2c12076af8e9b8b365c58ebadaa963b8c/lib_eio/core/fiber.ml#L152-L154

I think it might be reasonable to change that to:

  match !r, Cancel.get_error parent_c with
  | OK r, _ -> r
  | New, Some ex -> raise ex

That would prioritise not losing a calculated value, rather than cancelling quickly.

talex5 avatar May 26 '25 16:05 talex5

This patch did solve our problem @talex5. Do you think there are any negative externalities of this change?

wokalski avatar May 27 '25 06:05 wokalski

I don't think so.

Originally, Fiber.any didn't have a combine argument, so it was always expected that it was OK to throw away either result, and aborting early made sense. But with combine that's no longer true.

talex5 avatar Jun 13 '25 08:06 talex5