An uncaught exception inside `caml_callback` sometimes results in a fatal error instead of an exception
The toplevel sometimes fails to catch exceptions raised from finalisers, in an inconsistent manner. Example below:
OCaml version 5.0.1+dev0-2022-12-15
# let x = ref 0. in Gc.finalise (fun _ -> raise Exit) x;;
- : unit = ()
# Gc.full_major ();;
Exception: Stdlib.Exit.
# let x = ref 0. in Gc.finalise (fun _ -> raise Exit) x;;
- : unit = ()
# let x = ref 0. in Gc.finalise (fun _ -> raise Exit) x;;
- : unit = ()
# let x = ref 0. in Gc.finalise (fun _ -> raise Exit) x;;
- : unit = ()
# Gc.full_major ();;
Fatal error: exception Stdlib.Exit
Notice the Fatal error at the end: the toplevel stops.
Expected outcome: the last line is instead Exception: Stdlib.Exit. and the session continues.
This likely also affects other asynchronous callbacks, and it is unclear whether this is related at all to async callbacks specifically.
Here is another testcase involving native compilation and not the toplevel:
let collect () =
(try Gc.full_major() with
| _ -> print_endline "Exception caught");
print_endline "Still alive"
let () =
let x = Sys.opaque_identity(ref 0.) in
Gc.finalise (fun _ -> raise Exit) x
let () = collect ()
type _ Effect.t += Do
let () =
let x = Sys.opaque_identity(ref 0.) in
Gc.finalise (fun _ -> Effect.perform Do) x
let () = collect ()
Observed result
$ ocamlopt uncght.ml && ./a.out
Exception caught
Still alive
Fatal error: exception Effect.Unhandled
In words, the first exception arises normally from the asynchronous callback, whereas the second one ends the program abruptly with a fatal error.
Expected result
Exception caught
Still alive
Exception caught
Still alive
The exception Effect.Unhandled should arise from the asynchronous callback.
Impact
Having an exception is important to let the program perform clean-up before shutting down.
This bug probably concerns other uses of caml_callback, which can be called from user libraries. The mixing of incompatible libraries, one using caml_callback with another one using effects, is likely to lead to legitimate bugs. If this bug manifests in a fatal error then the impact is likely to be worse.
(A bit of context: it is not expected that one will be able to catch C frames with effects, so raising an exception Effect.Unhandled is the best one can do.)
I think this issue can interest @kayceesrk (but let me know if I should have pinged someone else).
Thanks for the issue. Pinging other folks who have expertise in this area: @NickBarnes @fabbing.
As I just tested it again, this is still reproducible with current trunk.
The control flow leading to Effect.unhandled is as follows:
caml_gc_full_major -> gc_full_major_exn -> caml_process_pending_actions_exn -> caml_process_pending_actions_with_root_exn -> caml_do_pending_actions_exn -> caml_final_do_calls_exn which temporarily clears the parent stack (with alloc_and_clear_stack_parent) before invoking caml_callback_asm; the invoked callback is (fun _ -> Effects.perform Do) which calls caml_perform.
Because the parent stack has been cleared by caml_final_do_calls_exn, caml_perform decides immediately to raise Effect.unhandled.
I am not sure how this logic should be reworked to allow the effect to be run, but this explains why the behaviour is systematically reproducible.
My understanding is that the fact that the effect is unhandled is not the bug that is worrying here -- this more of an implementation or design choice, due to the fact that we don't know how to have effect work across C calls which use the system call stack. The problem is that after the effect is turned into the Effect.Unhandled exception, this exception aborts the program with a fatal error instead of being raised through the program as one would expect. This is similar to the first example of @gadmm where the Exit exception called inside a finalizer turns into a fatal-error instead of being reraised by Gc.full_major ().
I tried to look yesterday at this to understand why this is happening. One explanation (but I'm not sure it is the only one) is that caml_raise calls caml_fatal_uncaught_exception directly. In bytecode this happens when Caml_state->external_raise is NULL (see fail_byt.c), in native code when Caml_state->c_stack is NULL (see fail_nat.c). But I don't understand what would make these conditions hold here.
There are two reproducers, only one of which involves effects. My interpretation was that under some circumstances, an exception triggers a fatal error when it should not. But maybe these are two separate bugs.
My understanding is that the fact that the effect is unhandled is not the bug that is worrying here -- this more of an implementation or design choice, due to the fact that we don't know how to have effect work across C calls which use the system call stack. The problem is that after the effect is turned into the
Effect.Unhandledexception, this exception aborts the program with a fatal error instead of being raised through the program as one would expect. This is similar to the first example of @gadmm where theExitexception called inside a finalizer turns into a fatal-error instead of being reraised byGc.full_major ().
Oh, right, I was looking in the wrong direction.
I tried to look yesterday at this to understand why this is happening. One explanation (but I'm not sure it is the only one) is that
caml_raisecallscaml_fatal_uncaught_exceptiondirectly. In bytecode this happens whenCaml_state->external_raiseis NULL (see fail_byt.c), in native code whenCaml_state->c_stackis NULL (see fail_nat.c). But I don't understand what would make these conditions hold here.
Apparently the failure comes from cache_named_exception inlined in caml_make_unhandled_effect_exn, used to create the actual exception. Investigating...
...and it looks like caml_register_named_value has never been invoked on Effects.Unhandled, hence the failure to retrieve the exception.
Ah, but this suggests that the bytecode issue in the toplevel needs a separate fix. If you were willing to send your amazing investigative powers in this direction as well, I think that it could be useful. (I suspect that @gadmm is interested in these questions to make memprof-limits work well in OCaml 5.x, and this is useful because the Coq/Rocq people are interested in using memprof-limits.)
If I add a dummy
let _ = Effect.Continuation_already_resumed in
line prior to the Gc.finalise call, then everything works as intended. So it looks that the use of Effect.perform is not enough to cause the toplevel part of stdlib/effect.ml:
let _ = Callback.register_exception "Effect.Unhandled"
(Unhandled Should_not_see_this__)
let _ = Callback.register_exception "Effect.Continuation_already_resumed"
Continuation_already_resumed
to run. Or is it delayed and the exception occurs too early?
That may very well be the problem: we are in trouble if the Effect module is not linked.
perform is declared as an external primitive in the .mli, I wonder if declaring it as val perform : 'a t -> 'a instead would be enough to fix this issue, to force linking when Effect.perform is used.
But that wouldn't explain the toplevel issue seen with Stdlib.Exit?
perform is declared as an external primitive in the .mli, I wonder if declaring it as val perform : 'a t -> 'a instead would be enough to fix this issue, to force linking when Effect.perform is used.
Shouldn't one fix the linking of external primitives instead then?
Shouldn't one fix the linking of external primitives instead then?
That would make more sense to me.
To my knowledge it is an explicit design choice of the compiler that mentioning an external declaration of a module does not create a linking dependency on that module. The choice may be debatable and occasionally leads to unpleasant consequences, but it is not identified as a bug that we could just fix.
This issue has been open one year with no activity and without being identified as a bug. Consequently, it is being marked with the "stale" label to see if anyone has comments that provide new information on the issue. Is the issue still reproducible? Did it appear in other contexts? How critical is it? Did you miss this feature in a new setting? etc. Note that the issue will not be closed in the absence of new activity: the "stale" label is only used for triaging reason.
There is no indication that anything has been fixed yet. You can easily try the two reproducers (which might correspond to two different bugs).
bumping this as we have some effect handling we'd like to do from callbacks from c into OCaml