postgresql-ocaml icon indicating copy to clipboard operation
postgresql-ocaml copied to clipboard

sigsegv using code from example together with lwt and caqti

Open tcoopman opened this issue 5 years ago • 8 comments

I've adapted a part of the prompt.ml example to be able to get postgres notifications into this code:

  let run () =
    let rec dump_notification conn =
      match conn#notifies with
      | Some { Postgresql.Notification.name; pid; extra } ->
          let () =
            Lwt_preemptive.run_in_main (fun () ->
                let () =
                  Logs.debug @@ fun m -> m "Notication from backend %i: [%s] [%s]\n" pid name extra
                in
                Lwt_mvar.put mvar (name, pid, extra))
          in
          dump_notification conn
      | None ->
          ()
    in
    let listener conn =
      try
        while true do
          let socket : Unix.file_descr = Caml.Obj.magic conn#socket in
          let _ = Unix.select [ socket ] [] [] 1. in
          let () = conn#consume_input in
          dump_notification conn
        done
      with
      | e ->
          Logs.err @@ fun m -> m "%s" (Exn.to_string e)
    in
    Lwt.async (fun () ->
        let conninfo = "postgresql://postgres:postgres@localhost:5432/board_games_dev" in
        let conn = new Postgresql.connection ~conninfo () in
        conn#exec ~expect:[ Command_ok ] "listen events" |> ignore ;
        conn#set_notice_processor (fun s -> Fmt.pr "postgresql error [%s]\n" s) ;
        Lwt_preemptive.detach listener conn) ;

And this crashes my program. I suspect that the consume_input is where the crash happens although I'm not sure. Without the dump_notification I can still get the code to crash although it seems that it takes longer to crash in that case. Just to be clear, I'm not sure this is a bug in postgresql-ocaml or in my code or in a combination of usage with lwt and caqti.

And this is the backtrace I've been able to get:

Thread 1 "server.exe" received signal SIGSEGV, Segmentation fault.
0x0000000000000003 in ?? ()
(gdb) bt
#0  0x0000000000000003 in ?? ()
#1  0x00005555558bb043 in camlPostgresql__protectx_967 ()
#2  0x0000000000000001 in ?? ()
#3  0x00007ffff75c4bf8 in ?? ()
#4  0x00007ffff73ba940 in ?? ()
#5  0x00005555558ad08d in camlCaqti_driver_postgresql__aux_1952 ()
#6  0x00007fffffffdcb0 in ?? ()
#7  0x00005555558ad0b4 in camlCaqti_driver_postgresql__aux_1952 ()
#8  0x00007ffff75d6310 in ?? ()
#9  0x00007ffff75d64d0 in ?? ()
#10 0x0000000000000001 in ?? ()
#11 0x0000555555a4bfcb in camlLwt__callback_1225 ()
#12 0x00007fffffffe080 in ?? ()
#13 0x0000555555a4bfd4 in camlLwt__callback_1225 ()
#14 0x00007ffff75d5928 in ?? ()
#15 0xa87add3fe9485300 in ?? ()
#16 0x0000000000000001 in ?? ()
#17 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#18 0x00007ffff75c4d70 in ?? ()
#19 0x0000000000000001 in ?? ()
#20 0x0000000000000017 in ?? ()
#21 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#22 0x00007ffff75c4db0 in ?? ()
#23 0x0000000000000001 in ?? ()
#24 0x00007ffff75d5a30 in ?? ()
#25 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#26 0x0000555555cd4ea8 in camlLwt__20 ()
#27 0x00007ffff75d5a30 in ?? ()
#28 0x00007fffffffe080 in ?? ()
#29 0x0000555555a4bff5 in camlLwt__callback_1225 ()
#30 0x00007ffff75d5a00 in ?? ()
#31 0x00007ffff75c4dd8 in ?? ()
#32 0x0000000000000001 in ?? ()
#33 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#34 0x00007ffff75c4e18 in ?? ()
#35 0x0000000000000001 in ?? ()
#36 0x00007ffff75c5ad0 in ?? ()
#37 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#38 0x00007ffff75c4e58 in ?? ()
#39 0x0000000000000001 in ?? ()
#40 0x00007ffff75d5b08 in ?? ()
#41 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#42 0x0000555555cd4ea8 in camlLwt__20 ()
#43 0x00007ffff75d5b08 in ?? ()
#44 0x0000000000000002 in ?? ()
#45 0x0000555555a4cc97 in camlLwt__callback_1343 ()
#46 0x00007ffff75c4ee8 in ?? ()
#47 0x0000555555a4acfe in camlLwt__leave_resolution_loop_894 ()
#48 0x0000000000000001 in ?? ()
#49 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#50 0x00007ffff75c4e80 in ?? ()
#51 0x0000000000000001 in ?? ()
#52 0x00007ffff75c5ba8 in ?? ()
#53 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#54 0x00007ffff75c4ec0 in ?? ()
#55 0x0000000000000001 in ?? ()
#56 0x00007ffff75d5be0 in ?? ()
#57 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#58 0x0000555555cd4ea8 in camlLwt__20 ()
#59 0x00007ffff75d5be0 in ?? ()
#60 0x00007fffffffe080 in ?? ()
#61 0x0000555555a4c82f in camlLwt__callback_1304 ()
#62 0x00007ffff75d5bb0 in ?? ()
#63 0x00007ffff75c4ee8 in ?? ()
#64 0x0000000000000001 in ?? ()
#65 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#66 0x00007ffff75c4ef8 in ?? ()
#67 0x0000000000000001 in ?? ()
#68 0x00007ffff75c5c80 in ?? ()
#69 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#70 0x00007ffff75c4f38 in ?? ()
#71 0x0000000000000001 in ?? ()
#72 0x00007ffff75d5d40 in ?? ()
#73 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#74 0x0000555555cd4ea8 in camlLwt__20 ()
#75 0x00007ffff75d5d40 in ?? ()
#76 0x00007ffff75d5ce0 in ?? ()
#77 0x0000555555a4f98b in camlLwt__callback_1848 ()
#78 0x00007ffff75c5030 in ?? ()
#79 0x0000555555a4abf8 in camlLwt__iter_callback_list_848 ()
#80 0x00007ffff75c4f60 in ?? ()
#81 0x0000000000000001 in ?? ()
#82 0x0000000000000001 in ?? ()
#83 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#84 0x00007ffff75c4fa0 in ?? ()
#85 0x0000000000000001 in ?? ()
#86 0x00007ffff75d5e48 in ?? ()
#87 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#88 0x0000555555cd4ea8 in camlLwt__20 ()
#89 0x00007ffff75d5e48 in ?? ()
#90 0x00007ffff75c6168 in ?? ()
#91 0x0000555555a4cc97 in camlLwt__callback_1343 ()
#92 0x00007ffff75c5030 in ?? ()
#93 0x0000555555a4b9b8 in camlLwt__add_task_r_1142 ()
#94 0x00007ffff733d6c0 in ?? ()
#95 0x0000555555a4abc8 in camlLwt__iter_callback_list_848 ()
#96 0x00007ffff75c4fc8 in ?? ()
#97 0x0000000000000001 in ?? ()
#98 0x00007ffff73db6a0 in ?? ()
#99 0x0000555555a4ad39 in camlLwt__run_in_resolution_loop_899 ()
#100 0x00007ffff75c5008 in ?? ()
#101 0x0000000000000001 in ?? ()
#102 0x00007ffff75d60d8 in ?? ()
#103 0x0000555555a4aeeb in camlLwt__resolve_916 ()
#104 0x0000555555cd4ea8 in camlLwt__20 ()
#105 0x00007ffff75d60d8 in ?? ()
#106 0x00007ffff75d60b0 in ?? ()
#107 0x0000555555a4b120 in camlLwt__wakeup_general_951 ()
#108 0x00007ffff75c5040 in ?? ()

I've posted the issue on discuss.ocaml as well for more context: https://discuss.ocaml.org/t/how-to-debug-a-terminated-by-signal-sigsegv-address-boundary-error/5936

tcoopman avatar Jun 08 '20 09:06 tcoopman

I can confirm the same bug in a similar environment but using caqti + Async. My stack trace is also very similar:

(gdb) bt
#0  0x00000000000010f7 in ?? ()
#1  0x00000000006d8253 in camlPostgresql__protectx_980 () at src/postgresql.ml:763
#2  0x00000000006ca893 in camlCaqti_driver_postgresql__retry_1982 () at lib-driver/caqti_driver_postgresql.ml:418
#3  0x00000000006c5a34 in camlCaqti_async__wrap_fd_1258 () at lib-async/caqti_async.ml:46
#4  0x00000000006ca770 in camlCaqti_driver_postgresql__get_next_result_1977 () at lib-driver/caqti_driver_postgresql.ml:423
#5  0x00000000006caaf3 in camlCaqti_driver_postgresql__fun_4740 () at lib-driver/caqti_driver_postgresql.ml:435
#6  0x00000000008cb1c1 in camlAsync_kernel__Deferred0__fun_1567 () at src/deferred0.ml:54
#7  0x00000000008b03ff in camlAsync_kernel__Job_queue__run_jobs_5120 () at src/job_queue.ml:167
#8  0x00000000008c6054 in camlAsync_kernel__Scheduler1__run_jobs_20773 () at src/scheduler1.ml:329
#9  0x00000000008e1d3e in camlAsync_kernel__Scheduler__run_jobs_19066 () at src/scheduler.ml:160
#10 0x00000000008e1ba6 in camlAsync_kernel__Scheduler__run_cycle_18113 () at src/scheduler.ml:168
#11 0x000000000076289d in camlAsync_unix__Raw_scheduler__have_lock_do_cycle_59828 () at src/raw_scheduler.ml:631
#12 0x0000000000adeded in camlBase__Exn__protectx_1290 () at src/exn.ml:66
#13 0x00000000007eb562 in camlThread_pool__loop_85482 () at thread_pool/src/thread_pool.ml:449
#14 0x0000000000adf05b in camlBase__Exn__handle_uncaught_aux_1678 () at src/exn.ml:111
#15 0x000000000089a245 in camlThread__fun_308 () at thread.ml:39
#16 0x0000000000c0e07c in caml_start_program ()
#17 0x0000000000c0608d in caml_callback_exn (closure=<optimized out>, arg=1) at callback.c:145
#18 0x0000000000be701e in caml_thread_start ()
#19 0x00007f190fd8bea5 in start_thread () from /lib64/libpthread.so.0
#20 0x00007f190eab88dd in clone () from /lib64/libc.so.6

Operating System: CentOS 7 PostgreSQL library: 9.2.24 OCaml: 4.10.0 postgresql-ocaml: 4.6.0 caqti-driver-postgresql: 1.2.4

pdonadeo avatar Jul 15 '20 14:07 pdonadeo

Adding random info for @mmottl: doing a very quick test I cannot reproduce the bug with the previous version, 4.5.2.

Unfortunately I can't provide a minimal test case in this very moment, like @tcoopman did (thanks!) but if any help is needed let me know.

pdonadeo avatar Jul 15 '20 14:07 pdonadeo

I'm afraid, but notice processors do not currently work as they should due to some limitations of the OCaml runtime. You may want to read issues https://github.com/mmottl/postgresql-ocaml/issues/37 and https://github.com/paurkedal/ocaml-caqti/issues/33

Basically, I'd have to be able to determine whether a thread is already holding the runtime lock or not to safely call into OCaml with notifications. The OCaml runtime currently does not support this. These calls can apparently be triggered by just about any call into the PostgreSQL library. I hope that maybe multicore support for OCaml will eventually fix this, but this is likely not going to happen in the near future.

I'll leave this issue open until some sort of fix is possible.

mmottl avatar Jul 30 '20 02:07 mmottl

Thanks for the info. That's unfortunate to read though...

Shouldn't this be reported to ocaml as a bug so that they are at least aware it with the use case as well?

@pdonadeo mentioned that this couldn't be reproduced on the previous version, given your explanation above that seems weird, no? I haven't tried to reproduce on that version, maybe I will but I don't have the time for it at the moment

tcoopman avatar Jul 30 '20 05:07 tcoopman

It usually takes at least many months for features to get into the OCaml runtime. It would be especially hard right now due to the introduction of multicore, from which such a feature might distract.

I didn't see anything obvious why the previous version should work better. GC and/or locking issues can be really tricky to replicate, especially if there are layers of other complex libraries (Lwt, Async) involved. Please let me know if you find a fixable bug, but for now it is not advisable to use notifications in any case unless you let the library write them to stderr without callbacks.

mmottl avatar Jul 30 '20 15:07 mmottl

I understand that it won't be fixed (and I'm a bit sad about it, because it blocks me kind of hard - but it was just on a hobby project...) and I don't understand nearly enough of the runtime to give any useful input into this, but given that you write that you hope that multicore will fix this, isn't this at least something worth of tracking?

tcoopman avatar Jul 31 '20 20:07 tcoopman

Since there won't be any global lock on the runtime anymore, multicore should make it possible for any thread to enter OCaml land without lock acquisitions.

I could certainly imagine workarounds for the current situation, but they are complicated and would require more time than I have right now. E.g. you could have a queue in C land that records all notifications, and a dedicated user thread in OCaml land that will wait for such notifications on a condition variable and execute the callbacks as they become available. Feel free to give it a try, but it certainly requires some care and effort.

mmottl avatar Jul 31 '20 21:07 mmottl

I'm sorry for the delay but I'm having a very busy period. From this weekend I will be in vacation for one week and I'm going to work for a minimal program reproducing the bug: the two executables (a client and a server) in which I found the segfault are far too complicated to be debugged.

What I can say now is that the two programs has been compiled with exactly the same "opam switch" under Fedora 32 (my workstation), CentOS 8 (the server) and CentOS 7 (the client): both the programs, in three OSs, crash immediately with version 4.6.0, while they seem to work with 4.5.2. Maybe it's only a strange coincidence, I don't really know.

Below the output of opam switch export.

4.10.0-SCADA.txt

pdonadeo avatar Aug 05 '20 20:08 pdonadeo