dune
dune copied to clipboard
fsevents segfaulting under M1 mac
Hi,
I'm trying to make the test suite work on M1 macs.
At the moment, the inline tests for fsevents are triggering a segfault.
I reduced that to the following piece of code:
open Stdune
module Event = Fsevents.Event
let main () =
let dir = Temp.create Dir ~prefix:"fsevents_dune" ~suffix:"" in
Sys.chdir (Path.to_string dir);
let cwd = Sys.getcwd () in
let fsevent_r = ref None in
let fsevent =
Fsevents.create ~paths:[ cwd ] ~latency:0.0 ~f:(fun _ ->
Fsevents.stop (Option.value_exn !fsevent_r);
raise Exit)
in
fsevent_r := Some fsevent;
let runloop = Fsevents.RunLoop.in_current_thread () in
Fsevents.start fsevent runloop;
match Fsevents.RunLoop.run_current_thread runloop with
| Ok () -> ()
| Error e -> raise e
Here's the crash location:
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xbeadde8d2dc8)
frame #0: 0x00000001000f7d90 t.exe`dune_fsevents_callback(streamRef=<unavailable>, t=0x0000600000202dc0, numEvents=<unavailable>, eventPaths=1 element, eventFlags=0x000000010058c000, eventIds=0x0000000100588000) at fsevents_stubs.c:120:23 [opt]
117 }
118 v_res = caml_callback_exn(t->v_callback, v_events_xs);
119 if (Is_exception_result(v_res)) {
-> 120 t->runloop->v_exn = Extract_exception(v_res);
121 CFRunLoopStop(t->runloop->runloop);
122 }
123 CAMLdrop;
Target 0: (t.exe) stopped.
warning: t.exe was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) p/x t
(dune_fsevents_t *) $0 = 0x0000600000202dc0
(lldb) p/x t->runloop
(dune_runloop *) $1 = 0x0000beadde8d2dc0
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xbeadde8d2dc8)
* frame #0: 0x00000001000f7d90 t.exe`dune_fsevents_callback(streamRef=<unavailable>, t=0x0000600000202dc0, numEvents=<unavailable>, eventPaths=1 element, eventFlags=0x000000010058c000, eventIds=0x0000000100588000) at fsevents_stubs.c:120:23 [opt]
frame #1: 0x00000001bd6b3000 FSEvents`implementation_callback_rpc + 3560
frame #2: 0x00000001bd6b2190 FSEvents`_Xcallback_rpc + 220
frame #3: 0x00000001bd6b2088 FSEvents`FSEventsD2F_server + 72
frame #4: 0x00000001bd6b51cc FSEvents`FSEventsClientProcessMessageCallback + 68
frame #5: 0x00000001b6275858 CoreFoundation`__CFMachPortPerform + 260
frame #6: 0x00000001b62458c8 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 60
frame #7: 0x00000001b6245784 CoreFoundation`__CFRunLoopDoSource1 + 604
frame #8: 0x00000001b6243c18 CoreFoundation`__CFRunLoopRun + 2372
frame #9: 0x00000001b6242b34 CoreFoundation`CFRunLoopRunSpecific + 600
frame #10: 0x00000001b62cfcc8 CoreFoundation`CFRunLoopRun + 64
frame #11: 0x00000001000f78c8 t.exe`dune_fsevents_runloop_run(v_runloop=<unavailable>) at fsevents_stubs.c:59:3 [opt]
frame #12: 0x0000000100127c2c t.exe`caml_c_call + 28
frame #13: 0x0000000100068404 t.exe`camlFsevents__run_current_thread_481 + 52
frame #14: 0x0000000100005e10 t.exe`camlDune__exe__T__code_end + 360
frame #15: 0x0000000100005c98 t.exe`camlDune__exe__T__code_begin + 32
frame #16: 0x0000000100002ae4 t.exe`caml_program + 4716
frame #17: 0x0000000100127c9c t.exe`caml_start_program + 104
frame #18: 0x000000010010226c t.exe`caml_startup_common(argv=0x0000000100068404, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
frame #19: 0x00000001001022e0 t.exe`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
frame #20: 0x00000001001022d8 t.exe`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172:15 [opt]
frame #21: 0x00000001001022d8 t.exe`caml_main(argv=<unavailable>) at startup_nat.c:179:3 [opt]
frame #22: 0x0000000100102340 t.exe`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
frame #23: 0x000000010038508c dyld`start + 520
Here, t->runloop cannot be written to. I'm not too sure about what happens with memory management in the fsevent stubs. Printing the value of t->runloop before and after caml_callback_exn shows a different value. One thing I also noticed is that in dune_fsevents_flush_sync, we're releasing the runtime before FSEventStreamFlushSync, but that function runs callbacks which can call out to ocaml code. Even if the runtime is re-acquired in dune_fsevents_callback, I'm not sure that's valid.
Can you have a look @rgrinberg ? Thanks!
Do you get the segfault if you call Fsevents.stop outside of the callback (after Fsevents.run_current_thread returns)? I believe Fsevents.stop actually frees the associated data structure, so it shouldn't be done when the watcher is still in use.
Yes, that makes the segfault go away. In the test, it's being done in a separate thread. Do you have an idea how to fix this? Move stop to finally in the test (but the test seems to observe that behavior) or moving the deallocation in a finalizer maybe?
I managed to remove the segfault by moving caml_stat_free to a finalizer but the tests do not seem to agree between the committed version of expectations, M1 output and intel output. I'll open a PR with that to discuss.
That PR is https://github.com/ocaml/dune/pull/6215
Head up that I also encounter this segfault on an intel mac
Ah, that's great. Can you try #6215? I'm not super familiar with these stubs but there's definitely something wrong with memory management in there.
I tried https://github.com/ocaml/dune/pull/6215 and running the code from the top of the issues on an intel and m1 mac and I no longer get the seg fault (I do get Fatal error: exception Stdlib.Exit now but not sure if that's intentional).
Do you want to take over that PR and get it over the finish line?
That would be appreciated. I can guide you but I don't have a mac handy at the moment so I can't easily work on it.
Yeh sure
I did some digging today into why the crash happens when the fsevents callback calls Fsevents.stop.
The c function dune_fsevents_callback is called on each event, and invokes the user-provided callback (the ~f argument to Fsevents.create) before dereferencing a field of the dune_fsevents_t struct . The c function dune_fsevents_stop deallocates the dune_events_t struct and lucky for us the deallocator seems to corrupt its pointer fields. dune_fsevents_stop is called by the ocaml function Fsevents.stop. So when the fsevents callback calls Fsevents.stop dune_fsevents_callback tries to dereference a field of the dune_events_t after its pointer fields have been corrupted and a segfault happens.
This implies that we'd also see this crash if we ever started, stopped, and then re-started a Fsevents.t as stop causes the value to be deallocated but it's only allocated inside create, though is moot as the Fsevents.state state machine doesn't allow stopping and re-starting.
@emillon's fix (https://github.com/ocaml/dune/pull/6215) solves the problem by removing the call to caml_stat_free(t); from dune_fsevents_stop and registering a custom finalizer for the dune_fsevents_t object instead. I've copied to https://github.com/ocaml/dune/pull/7312 and rebased onto main.