dune icon indicating copy to clipboard operation
dune copied to clipboard

fsevents segfaulting under M1 mac

Open emillon opened this issue 3 years ago • 2 comments
trafficstars

Hi,

I'm trying to make the test suite work on M1 macs.

At the moment, the inline tests for fsevents are triggering a segfault. I reduced that to the following piece of code:

open Stdune
module Event = Fsevents.Event

let main () =
  let dir = Temp.create Dir ~prefix:"fsevents_dune" ~suffix:"" in
  Sys.chdir (Path.to_string dir);
  let cwd = Sys.getcwd () in
  let fsevent_r = ref None in
  let fsevent =
    Fsevents.create ~paths:[ cwd ] ~latency:0.0 ~f:(fun _ ->
        Fsevents.stop (Option.value_exn !fsevent_r);
        raise Exit)
  in
  fsevent_r := Some fsevent;
  let runloop = Fsevents.RunLoop.in_current_thread () in
  Fsevents.start fsevent runloop;
  match Fsevents.RunLoop.run_current_thread runloop with
  | Ok () -> ()
  | Error e -> raise e

Here's the crash location:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xbeadde8d2dc8)
    frame #0: 0x00000001000f7d90 t.exe`dune_fsevents_callback(streamRef=<unavailable>, t=0x0000600000202dc0, numEvents=<unavailable>, eventPaths=1 element, eventFlags=0x000000010058c000, eventIds=0x0000000100588000) at fsevents_stubs.c:120:23 [opt]
   117    }
   118    v_res = caml_callback_exn(t->v_callback, v_events_xs);
   119    if (Is_exception_result(v_res)) {
-> 120      t->runloop->v_exn = Extract_exception(v_res);
   121      CFRunLoopStop(t->runloop->runloop);
   122    }
   123    CAMLdrop;
Target 0: (t.exe) stopped.
warning: t.exe was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) p/x t
(dune_fsevents_t *) $0 = 0x0000600000202dc0
(lldb) p/x t->runloop
(dune_runloop *) $1 = 0x0000beadde8d2dc0
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xbeadde8d2dc8)
  * frame #0: 0x00000001000f7d90 t.exe`dune_fsevents_callback(streamRef=<unavailable>, t=0x0000600000202dc0, numEvents=<unavailable>, eventPaths=1 element, eventFlags=0x000000010058c000, eventIds=0x0000000100588000) at fsevents_stubs.c:120:23 [opt]
    frame #1: 0x00000001bd6b3000 FSEvents`implementation_callback_rpc + 3560
    frame #2: 0x00000001bd6b2190 FSEvents`_Xcallback_rpc + 220
    frame #3: 0x00000001bd6b2088 FSEvents`FSEventsD2F_server + 72
    frame #4: 0x00000001bd6b51cc FSEvents`FSEventsClientProcessMessageCallback + 68
    frame #5: 0x00000001b6275858 CoreFoundation`__CFMachPortPerform + 260
    frame #6: 0x00000001b62458c8 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 60
    frame #7: 0x00000001b6245784 CoreFoundation`__CFRunLoopDoSource1 + 604
    frame #8: 0x00000001b6243c18 CoreFoundation`__CFRunLoopRun + 2372
    frame #9: 0x00000001b6242b34 CoreFoundation`CFRunLoopRunSpecific + 600
    frame #10: 0x00000001b62cfcc8 CoreFoundation`CFRunLoopRun + 64
    frame #11: 0x00000001000f78c8 t.exe`dune_fsevents_runloop_run(v_runloop=<unavailable>) at fsevents_stubs.c:59:3 [opt]
    frame #12: 0x0000000100127c2c t.exe`caml_c_call + 28
    frame #13: 0x0000000100068404 t.exe`camlFsevents__run_current_thread_481 + 52
    frame #14: 0x0000000100005e10 t.exe`camlDune__exe__T__code_end + 360
    frame #15: 0x0000000100005c98 t.exe`camlDune__exe__T__code_begin + 32
    frame #16: 0x0000000100002ae4 t.exe`caml_program + 4716
    frame #17: 0x0000000100127c9c t.exe`caml_start_program + 104
    frame #18: 0x000000010010226c t.exe`caml_startup_common(argv=0x0000000100068404, pooling=<unavailable>) at startup_nat.c:160:9 [opt]
    frame #19: 0x00000001001022e0 t.exe`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:167:10 [opt]
    frame #20: 0x00000001001022d8 t.exe`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:172:15 [opt]
    frame #21: 0x00000001001022d8 t.exe`caml_main(argv=<unavailable>) at startup_nat.c:179:3 [opt]
    frame #22: 0x0000000100102340 t.exe`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #23: 0x000000010038508c dyld`start + 520

Here, t->runloop cannot be written to. I'm not too sure about what happens with memory management in the fsevent stubs. Printing the value of t->runloop before and after caml_callback_exn shows a different value. One thing I also noticed is that in dune_fsevents_flush_sync, we're releasing the runtime before FSEventStreamFlushSync, but that function runs callbacks which can call out to ocaml code. Even if the runtime is re-acquired in dune_fsevents_callback, I'm not sure that's valid.

Can you have a look @rgrinberg ? Thanks!

emillon avatar Sep 15 '22 09:09 emillon

Do you get the segfault if you call Fsevents.stop outside of the callback (after Fsevents.run_current_thread returns)? I believe Fsevents.stop actually frees the associated data structure, so it shouldn't be done when the watcher is still in use.

nojb avatar Sep 15 '22 09:09 nojb

Yes, that makes the segfault go away. In the test, it's being done in a separate thread. Do you have an idea how to fix this? Move stop to finally in the test (but the test seems to observe that behavior) or moving the deallocation in a finalizer maybe?

emillon avatar Sep 15 '22 09:09 emillon

I managed to remove the segfault by moving caml_stat_free to a finalizer but the tests do not seem to agree between the committed version of expectations, M1 output and intel output. I'll open a PR with that to discuss.

emillon avatar Sep 29 '22 12:09 emillon

That PR is https://github.com/ocaml/dune/pull/6215

emillon avatar Oct 11 '22 08:10 emillon

Head up that I also encounter this segfault on an intel mac

gridbugs avatar Nov 01 '22 11:11 gridbugs

Ah, that's great. Can you try #6215? I'm not super familiar with these stubs but there's definitely something wrong with memory management in there.

emillon avatar Feb 22 '23 10:02 emillon

I tried https://github.com/ocaml/dune/pull/6215 and running the code from the top of the issues on an intel and m1 mac and I no longer get the seg fault (I do get Fatal error: exception Stdlib.Exit now but not sure if that's intentional).

gridbugs avatar Mar 06 '23 06:03 gridbugs

Do you want to take over that PR and get it over the finish line?

rgrinberg avatar Mar 06 '23 23:03 rgrinberg

That would be appreciated. I can guide you but I don't have a mac handy at the moment so I can't easily work on it.

emillon avatar Mar 07 '23 13:03 emillon

Yeh sure

gridbugs avatar Mar 08 '23 00:03 gridbugs

I did some digging today into why the crash happens when the fsevents callback calls Fsevents.stop.

The c function dune_fsevents_callback is called on each event, and invokes the user-provided callback (the ~f argument to Fsevents.create) before dereferencing a field of the dune_fsevents_t struct . The c function dune_fsevents_stop deallocates the dune_events_t struct and lucky for us the deallocator seems to corrupt its pointer fields. dune_fsevents_stop is called by the ocaml function Fsevents.stop. So when the fsevents callback calls Fsevents.stop dune_fsevents_callback tries to dereference a field of the dune_events_t after its pointer fields have been corrupted and a segfault happens.

This implies that we'd also see this crash if we ever started, stopped, and then re-started a Fsevents.t as stop causes the value to be deallocated but it's only allocated inside create, though is moot as the Fsevents.state state machine doesn't allow stopping and re-starting.

@emillon's fix (https://github.com/ocaml/dune/pull/6215) solves the problem by removing the call to caml_stat_free(t); from dune_fsevents_stop and registering a custom finalizer for the dune_fsevents_t object instead. I've copied to https://github.com/ocaml/dune/pull/7312 and rebased onto main.

gridbugs avatar Mar 14 '23 06:03 gridbugs