load_cache_1: random Segmentation faults
Describe the bug
I built a debug ERTS on macOS, and I intermittently get a segmentation fault at application startup when a particular NIF library (enif_protobuf) is loaded.
To Reproduce
I use enif_protobuf in a company project, so I expect this to be hard to reproduce on other machines.
All I do is run Elixir's mix test under LLDB to catch the segmentation fault.
$ cerl -debug -lldb -pa /Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/eex/ebin /Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/elixir/ebin /Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/ex_unit/ebin /Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/iex/ebin /Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/logger/ebin /Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/mix/ebin -elixir ansi_enabled true -noshell -s elixir start_cli -extra /Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/mix test
(lldb) target create "beam.debug.smp"
Current executable set to 'beam.debug.smp' (x86_64).
(lldb) settings set -- target.run-args "--" "-root" "/Users/y/.asdf/plugins/erlang/kerl-home/builds/asdf_24.0.1/otp_src_24.0.1" "-progname" "/Users/y/.asdf/plugins/erlang/kerl-home/builds/asdf_24.0.1/otp_src_24.0.1/bin/cerl" "-debug" "--" "-home" "/Users/y" "--" "-pa" "/Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/eex/ebin" "/Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/elixir/ebin" "/Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/ex_unit/ebin" "/Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/iex/ebin" "/Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/logger/ebin" "/Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/../lib/mix/ebin" "-elixir" "ansi_enabled" "true" "-noshell" "-s" "elixir" "start_cli" "-extra" "/Users/y/.asdf/installs/elixir/1.12.0-otp-24/bin/mix" "test"
(lldb) command source -s 0 '/tmp/.cerllldb.89302'
Executing commands in '/tmp/.cerllldb.89302'.
(lldb) env TERM=dumb
(lldb) command script import /Users/y/.asdf/plugins/erlang/kerl-home/builds/asdf_24.0.1/otp_src_24.0.1/erts/etc/unix/etp.py
(lldb) run
Process 89458 launched: '/Users/y/.asdf/plugins/erlang/kerl-home/builds/asdf_24.0.1/otp_src_24.0.1/bin/x86_64-apple-darwin20.2.0/beam.debug.smp' (x86_64)
librdkafka fork already exist. delete deps/librdkafka for a fresh checkout ...
concurrentqueue fork already exist. delete deps/concurrentqueue for a fresh checkout ...
make[1]: `/Users/y/sportening/superbet_erlkaf/c_src/../priv/erlkaf_nif.so' is up to date.
===> Analyzing applications...
===> Compiling erlkaf
Loading library: "/Users/y/sportening/we-api-user-account/_build/test/lib/erlkaf/priv/erlkaf_nif"
15:54:27.885 [debug] :metrics_ex enabled=false, port=8088}
15:54:28.260 [info] persistent queue path: "/Users/y/sportening/we-api-user-account/_build/test/lib/erlkaf/priv/client"
15:54:28.260 [warn] rdkafka#producer-1 CONFWARN [thrd:app]: Configuration property enable.auto.commit is a consumer property and will be ignored by this producer instance
15:54:28.260 [warn] rdkafka#producer-1 CONFWARN [thrd:app]: Configuration property enable.auto.offset.store is a consumer property and will be ignored by this producer instance
15:54:28.261 [warn] rdkafka#producer-1 CONFWARN [thrd:app]: Configuration property enable.partition.eof is a consumer property and will be ignored by this producer instance
15:54:28.262 [info] Producer client created with config: [bootstrap_servers: "kafka:19092", delivery_report_only_error: true, delivery_report_callback: &PrettyKafkaClient.Producer.delivery_report/2, message_max_bytes: 52428800, socket_timeout_ms: 120000, queue_buffering_max_ms: 1, queue_buffering_overflow_strategy: :block_calling_process]
Process 89458 stopped
* thread #6, name = '2_scheduler', stop reason = EXC_BAD_ACCESS (code=1, address=0x142651038)
frame #0: 0x0000000148033271 enif_protobuf.so`stack_ensure_all(env=0x0000700001003ca0, cache=0x00000001426483a0) at ep_node.c:762:39
759 for (j = spot->pos; j < (size_t) (spot->node->size); j++) {
760 spot->pos = j + 1;
761 field = ((ep_field_t *) (spot->node->fields)) + j;
-> 762 if (field->o_type == occurrence_repeated) {
763 if (field->type == field_msg || field->type == field_map) {
764 spot++;
765 stack_ensure(env, stack, &spot);
Target 0: (beam.debug.smp) stopped.
(lldb) bt
* thread #6, name = '2_scheduler', stop reason = EXC_BAD_ACCESS (code=1, address=0x142651038)
* frame #0: 0x0000000148033271 enif_protobuf.so`stack_ensure_all(env=0x0000700001003ca0, cache=0x00000001426483a0) at ep_node.c:762:39
frame #1: 0x000000014803b002 enif_protobuf.so`load_cache_1(env=0x0000700001003ca0, argc=1, argv=0x0000700001003dc0) at enif_protobuf.c:252:5
frame #2: 0x0000000100031c8a beam.debug.smp`beam_jit_call_nif(c_p=0x00000001426f0638, I=0x000000014991d0c0, reg=0x0000700001003dc0, fp=(enif_protobuf.so`load_cache_1 at enif_protobuf.c:154), NifMod=0x000000014263c308) at beam_jit_common.c:117:26
Expected behavior
No random segmentation faults on ERTS startup.
Affected versions
$ cerl
Erlang/OTP 24 [erts-12.0.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]
Additional context
$ uname -v
Darwin Kernel Version 20.2.0: Wed Dec 2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64
I am also running into this: https://gist.github.com/JayKickliter/9b05e6218df6ec6dc6b5bec8e10b7dd4#file-gistfile1-txt-L292
Hi,
I am also running into a similar issue. The bug appears to be in stack_ensure_all in ep_node.c, specifically at if (field->o_type == occurrence_repeated). The segfault occurs when field is dereferenced, and only on the first run; all subsequent runs are fine.
I printed the pointer addresses around that line, which confirms this:
Segfault case (occurs only on the first run / clean build): the pointer address jumps from the form 0x151544848 to 0x3f6b8b.
Passing case (occurs after the first failed run): the pointer addresses stay consistently within the same range as one another (eleven digits).
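For anyone who wants to repeat this kind of check, here is a minimal sketch of a pointer trace that could be placed just before the dereference. It is not taken from enif_protobuf: demo_field_t is a hypothetical stand-in for ep_field_t, and field_in_bounds and trace_field are made-up helper names. It only reports whether a field pointer falls inside the array it is supposed to index; it does not explain why the pointer goes bad.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for ep_field_t; the real struct has more members. */
typedef struct {
    int o_type;
    int type;
} demo_field_t;

/* Returns 1 if `field` points exactly at one of the `count` elements
 * that start at `base`, and 0 otherwise. */
static int field_in_bounds(const demo_field_t *base, size_t count,
                           const demo_field_t *field)
{
    uintptr_t lo = (uintptr_t) base;
    uintptr_t hi = lo + count * sizeof(demo_field_t);
    uintptr_t p  = (uintptr_t) field;

    if (base == NULL || p < lo || p >= hi) {
        return 0;
    }
    /* Also reject pointers that land in the middle of an element. */
    return (p - lo) % sizeof(demo_field_t) == 0;
}

/* Prints the addresses involved to stderr and returns the bounds result,
 * so the caller can skip the dereference when the pointer looks wrong. */
static int trace_field(const demo_field_t *base, size_t count,
                       const demo_field_t *field)
{
    int ok = field_in_bounds(base, count, field);

    fprintf(stderr, "base=%p count=%zu field=%p in_bounds=%d\n",
            (void *) base, count, (void *) field, ok);
    return ok;
}
```

In the loop shown in the backtrace, the equivalent call would take spot->node->fields as the base, spot->node->size as the count, and field as the pointer under test. An in_bounds=0 line right before the crash would suggest that the node's fields or size are already inconsistent at that point.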
Hi @yossarin, could you check whether the issue still exists?