ocaml icon indicating copy to clipboard operation
ocaml copied to clipboard

An uncaught exception inside `caml_callback` sometimes results in a fatal error instead of an exception

Open gadmm opened this issue 2 years ago • 17 comments

The toplevel sometimes fails to catch exceptions raised from finalisers, in an inconsistent manner. Example below:

OCaml version 5.0.1+dev0-2022-12-15
# let x = ref 0. in Gc.finalise (fun _ -> raise Exit) x;;
- : unit = ()
# Gc.full_major ();;
Exception: Stdlib.Exit.
# let x = ref 0. in Gc.finalise (fun _ -> raise Exit) x;;
- : unit = ()
# let x = ref 0. in Gc.finalise (fun _ -> raise Exit) x;;
- : unit = ()
# let x = ref 0. in Gc.finalise (fun _ -> raise Exit) x;;
- : unit = ()
# Gc.full_major ();;
Fatal error: exception Stdlib.Exit

Notice the Fatal error at the end: the toplevel stops.

Expected outcome: the last line is instead Exception: Stdlib.Exit. and the session continues.

This likely also affects other asynchronous callbacks, and it is unclear whether this is related at all to async callbacks specifically.

gadmm avatar Aug 02 '23 14:08 gadmm

Here is another testcase involving native compilation and not the toplevel:

let collect () =
  (try Gc.full_major() with
  | _ -> print_endline "Exception caught");
  print_endline "Still alive"

let () =
  let x = Sys.opaque_identity(ref 0.) in
  Gc.finalise (fun _ -> raise Exit) x

let () = collect ()

type _ Effect.t += Do

let () =
  let x = Sys.opaque_identity(ref 0.) in
  Gc.finalise (fun _ -> Effect.perform Do) x

let () = collect ()

Observed result

$ ocamlopt uncght.ml && ./a.out 
Exception caught
Still alive
Fatal error: exception Effect.Unhandled

In words, the first exception arises normally from the asynchronous callback, whereas the second one ends the program abruptly with a fatal error.

Expected result

Exception caught
Still alive
Exception caught
Still alive

The exception Effect.Unhandled should arise from the asynchronous callback.

Impact

Having an exception is important to let the program perform clean-up before shutting down.

This bug probably concerns other uses of caml_callback, which can be called from user libraries. The mixing of incompatible libraries, one using caml_callback with another one using effects, is likely to lead to legitimate bugs. If this bug manifests in a fatal error then the impact is likely to be worse.

(A bit of context: it is not expected that one will be able to catch C frames with effects, so raising an exception Effect.Unhandled is the best one can do.)

I think this issue can interest @kayceesrk (but let me know if I should have pinged someone else).

gadmm avatar Aug 02 '23 17:08 gadmm

Thanks for the issue. Pinging other folks who have expertise in this area: @NickBarnes @fabbing.

kayceesrk avatar Aug 03 '23 09:08 kayceesrk

As I just tested it again, this is still reproducible with current trunk.

gadmm avatar Jun 06 '24 14:06 gadmm

The control flow leading to Effect.unhandled is as follows: caml_gc_full_major -> gc_full_major_exn -> caml_process_pending_actions_exn -> caml_process_pending_actions_with_root_exn -> caml_do_pending_actions_exn -> caml_final_do_calls_exn which temporarily clears the parent stack (with alloc_and_clear_stack_parent) before invoking caml_callback_asm; the invoked callback is (fun _ -> Effects.perform Do) which calls caml_perform.

Because the parent stack has been cleared by caml_final_do_calls_exn, caml_perform decides immediately to raise Effect.unhandled.

I am not sure how this logic should be reworked to allow the effect to be run, but this explains why the behaviour is systematically reproducible.

ghost avatar Jun 07 '24 07:06 ghost

My understanding is that the fact that the effect is unhandled is not the bug that is worrying here -- this more of an implementation or design choice, due to the fact that we don't know how to have effect work across C calls which use the system call stack. The problem is that after the effect is turned into the Effect.Unhandled exception, this exception aborts the program with a fatal error instead of being raised through the program as one would expect. This is similar to the first example of @gadmm where the Exit exception called inside a finalizer turns into a fatal-error instead of being reraised by Gc.full_major ().

I tried to look yesterday at this to understand why this is happening. One explanation (but I'm not sure it is the only one) is that caml_raise calls caml_fatal_uncaught_exception directly. In bytecode this happens when Caml_state->external_raise is NULL (see fail_byt.c), in native code when Caml_state->c_stack is NULL (see fail_nat.c). But I don't understand what would make these conditions hold here.

gasche avatar Jun 07 '24 07:06 gasche

There are two reproducers, only one of which involves effects. My interpretation was that under some circumstances, an exception triggers a fatal error when it should not. But maybe these are two separate bugs.

gadmm avatar Jun 07 '24 12:06 gadmm

My understanding is that the fact that the effect is unhandled is not the bug that is worrying here -- this more of an implementation or design choice, due to the fact that we don't know how to have effect work across C calls which use the system call stack. The problem is that after the effect is turned into the Effect.Unhandled exception, this exception aborts the program with a fatal error instead of being raised through the program as one would expect. This is similar to the first example of @gadmm where the Exit exception called inside a finalizer turns into a fatal-error instead of being reraised by Gc.full_major ().

Oh, right, I was looking in the wrong direction.

I tried to look yesterday at this to understand why this is happening. One explanation (but I'm not sure it is the only one) is that caml_raise calls caml_fatal_uncaught_exception directly. In bytecode this happens when Caml_state->external_raise is NULL (see fail_byt.c), in native code when Caml_state->c_stack is NULL (see fail_nat.c). But I don't understand what would make these conditions hold here.

Apparently the failure comes from cache_named_exception inlined in caml_make_unhandled_effect_exn, used to create the actual exception. Investigating...

ghost avatar Jun 07 '24 13:06 ghost

...and it looks like caml_register_named_value has never been invoked on Effects.Unhandled, hence the failure to retrieve the exception.

ghost avatar Jun 07 '24 13:06 ghost

Ah, but this suggests that the bytecode issue in the toplevel needs a separate fix. If you were willing to send your amazing investigative powers in this direction as well, I think that it could be useful. (I suspect that @gadmm is interested in these questions to make memprof-limits work well in OCaml 5.x, and this is useful because the Coq/Rocq people are interested in using memprof-limits.)

gasche avatar Jun 07 '24 13:06 gasche

If I add a dummy

let _ = Effect.Continuation_already_resumed in

line prior to the Gc.finalise call, then everything works as intended. So it looks that the use of Effect.perform is not enough to cause the toplevel part of stdlib/effect.ml:

let _ = Callback.register_exception "Effect.Unhandled"
          (Unhandled Should_not_see_this__)
let _ = Callback.register_exception "Effect.Continuation_already_resumed"
          Continuation_already_resumed

to run. Or is it delayed and the exception occurs too early?

ghost avatar Jun 07 '24 13:06 ghost

That may very well be the problem: we are in trouble if the Effect module is not linked.

perform is declared as an external primitive in the .mli, I wonder if declaring it as val perform : 'a t -> 'a instead would be enough to fix this issue, to force linking when Effect.perform is used.

gasche avatar Jun 07 '24 13:06 gasche

But that wouldn't explain the toplevel issue seen with Stdlib.Exit?

ghost avatar Jun 07 '24 13:06 ghost

perform is declared as an external primitive in the .mli, I wonder if declaring it as val perform : 'a t -> 'a instead would be enough to fix this issue, to force linking when Effect.perform is used.

Shouldn't one fix the linking of external primitives instead then?

gadmm avatar Jun 11 '24 17:06 gadmm

Shouldn't one fix the linking of external primitives instead then?

That would make more sense to me.

ghost avatar Jun 12 '24 14:06 ghost

To my knowledge it is an explicit design choice of the compiler that mentioning an external declaration of a module does not create a linking dependency on that module. The choice may be debatable and occasionally leads to unpleasant consequences, but it is not identified as a bug that we could just fix.

gasche avatar Jun 12 '24 14:06 gasche

This issue has been open one year with no activity and without being identified as a bug. Consequently, it is being marked with the "stale" label to see if anyone has comments that provide new information on the issue. Is the issue still reproducible? Did it appear in other contexts? How critical is it? Did you miss this feature in a new setting? etc. Note that the issue will not be closed in the absence of new activity: the "stale" label is only used for triaging reason.

github-actions[bot] avatar Jun 18 '25 04:06 github-actions[bot]

There is no indication that anything has been fixed yet. You can easily try the two reproducers (which might correspond to two different bugs).

gadmm avatar Jun 18 '25 12:06 gadmm

bumping this as we have some effect handling we'd like to do from callbacks from c into OCaml

ajbt200128 avatar Sep 03 '25 22:09 ajbt200128