bdwgc icon indicating copy to clipboard operation
bdwgc copied to clipboard

SIGSEGV in GC_is_marked() running ECL on FreeBSD 13 amd64

Open Partmedia opened this issue 3 years ago • 11 comments

ECL crashes in GC_is_marked() (bdwgc 8.0.6) with SIGSEGV (caught by ECL) on FreeBSD 13 amd64.

#0  0x0000000800636f3b in sigsegv_handler (sig=sig@entry=11, info=info@entry=0x7fffde7f0470, aux=aux@entry=0x7fffde7f0100)
    at /home/kevinz/workspace/ecl/src/c/unixint.d:853
#1  0x000000080025fe0e in handle_signal (actp=actp@entry=0x7fffde7f0080, sig=sig@entry=11, info=info@entry=0x7fffde7f0470, 
    ucp=ucp@entry=0x7fffde7f0100) at /usr/src/lib/libthr/thread/thr_sig.c:301
#2  0x000000080025f3cf in thr_sighandler (sig=11, info=0x7fffde7f0470, _ucp=0x7fffde7f0100)
    at /usr/src/lib/libthr/thread/thr_sig.c:246
#3  <signal handler called>
#4  0x000000080029ad25 in GC_is_marked (p=0x800698820 <cl_symbols+114816>) at mark.c:231
#5  0x00000008002956e5 in GC_make_disappearing_links_disappear (dl_hashtbl=0x8002afaa0 <GC_dl_hashtbl>, is_remove_dangling=0)
    at finalize.c:972
#6  0x0000000800294e42 in GC_finalize () at finalize.c:1020
#7  0x000000080028d9c3 in GC_finish_collection () at alloc.c:1045
#8  0x000000080028d3ba in GC_try_to_collect_inner (stop_func=0x80028ca60 <GC_never_stop_func>) at alloc.c:553
#9  0x000000080028ee9d in GC_collect_or_expand (needed_blocks=72, ignore_off_page=1, retry=0) at alloc.c:1443
#10 0x000000080029826e in GC_alloc_large (lb=294912, k=0, flags=1) at malloc.c:64
#11 0x0000000800299b17 in GC_generic_malloc_ignore_off_page (lb=294912, k=0) at mallocx.c:218
#12 0x0000000800299c97 in GC_malloc_atomic_ignore_off_page (lb=294912) at mallocx.c:253
#13 0x0000000800650530 in ecl_alloc_atomic_unprotected (n=294912) at /home/kevinz/workspace/ecl/src/c/alloc_2.d:694
#14 ecl_alloc_atomic (n=34366652416) at /home/kevinz/workspace/ecl/src/c/alloc_2.d:714
#15 0x0000000800635700 in init_stacks (env=env@entry=0x805f4b000) at /home/kevinz/workspace/ecl/src/c/stacks.d:717

In frame 4:

 225│ GC_API int GC_CALL GC_is_marked(const void *p)
 226│ {
 227│     struct hblk *h = HBLKPTR(p);
 228│     hdr * hhdr = HDR(h);
 229│     word bit_no = MARK_BIT_NO((ptr_t)p - (ptr_t)h, hhdr -> hb_sz);
 230│
 231├───> return (int)mark_bit_from_hdr(hhdr, bit_no); /* 0 or 1 */
 232│ }
(gdb) frame 4
#4  0x000000080029ad25 in GC_is_marked (p=0x800698820 <cl_symbols+114816>) at mark.c:231
(gdb) print p
$1 = (const void *) 0x800698820 <cl_symbols+114816>
(gdb) print h
$2 = (struct hblk *) 0x800698000 <cl_symbols+112736>
(gdb) print hhdr
$3 = (hdr *) 0x0
(gdb) print bit_no
$4 = 130
(gdb)

Based on this, does the problem seem more likely in ECL or bdwgc? If the latter, could you suggest additional debugging steps I can try?

Partmedia avatar Jan 14 '22 19:01 Partmedia

  1. It is better to check master.
  2. Compile bdgwc w/o optimizations e.g.: ./configure --disable-thread-local-alloc --disable-parallel-mark --disable-munmap && make CFLAGS_EXTRA=-DGC_DISABLE_INCREMENTAL

ivmai avatar Jan 14 '22 21:01 ivmai

Reproduced the crash with master (457b0241) and the flags given here.

Partmedia avatar Jan 14 '22 22:01 Partmedia

Is p (passed to GC_is_marked) a valid heap pointer?

ivmai avatar Jan 16 '22 11:01 ivmai

p is a pointer to an element of a global array cl_symbols:

#4  0x000000080062bde9 in GC_is_marked (p=0x8005e5b70 <cl_symbols+91072>)

Each element of cl_symbols has the definition:

typedef union {
        struct {
                const char *name;
                const char *translation;
                void *fun;
                int narg;
                int type;
                cl_object value;
        } init;
        struct ecl_symbol data;
} cl_symbol_initializer;

Because it is possible for symbols to contain pointers to things elsewhere:

GC_push_other_roots = stacks_scanner;

static void
stacks_scanner()
{
  // snip
  GC_push_all((void *)cl_symbols, (void *)(cl_symbols + cl_num_symbols_in_core));
  // more snip
  if (old_GC_push_other_roots)
    (*old_GC_push_other_roots)();
}

Instead of GC_push_all, is GC_add_roots the right thing to use here?

Partmedia avatar Jan 16 '22 20:01 Partmedia

GC_push_all((void *)cl_symbols, (void *)(cl_symbols + cl_num_symbols_in_core))

It's OK. (e.g. GC_push_all is used by GC to push some globals)

ivmai avatar Jan 17 '22 21:01 ivmai

Is the bug always reproduced?

ivmai avatar Jan 18 '22 05:01 ivmai

According to GC_is_marked contract, its argument should be the beginning of the object previously allocated by GC, i.e. the following condition should be true: p != 0 && GC_base(p) == p.

ivmai avatar Jan 18 '22 06:01 ivmai

I can reproduce the crash reliably on FreeBSD 13 amd64 (on two different machines), but the steps to reproduce a bit involved, namely:

  1. Install quicklisp (Common Lisp package manger)
  2. Install vlime (a Vim language server extension)
  3. Start vlime server with ECL (ECL uses boehm-gc)
  4. Connect vim to vlime via TCP
  5. ECL crashes with SIGSEGV in GC_is_marked()

Are there some heap memory checks that can be turned on in boehm-gc to check if some memory is being clobbered in ECL?

Partmedia avatar Jan 18 '22 07:01 Partmedia

Just to be clear cl_symbols is not heap-allocated. Therefore GC_is_marked() should never be called on it, correct?

Partmedia avatar Jan 18 '22 08:01 Partmedia

Just to be clear cl_symbols is not heap-allocated. Therefore GC_is_marked() should never be called on it, correct?

Correct. I've added a comment about it to gc_mark.h today.

ivmai avatar Jan 18 '22 17:01 ivmai

Reproduced the crash with master (457b024) and the flags given here.

It is better to debug the issue on master. To turn on internal checks, add --enable-gc-assertions (to the mentioned flags).

ivmai avatar Jan 18 '22 17:01 ivmai

I've since updated to the latest master and re-compiled bdwgc with all the flags you suggested. I can still reproduce the crash, although the backtrace looks a bit different now.

Note: It looks like I'm using the bundled (older) bdwgc, but what I've actually done here is symlinked the bundled bdwgc to the latest master from Git. I've also completely uninstalled the system libgc.so.1 so I'm fairly confident that I'm actually running the latest bdwgc.

#0  0x000000080096a33a in thr_kill () from /lib/libc.so.7
#1  0x00000008008e2c74 in raise () from /lib/libc.so.7
#2  0x0000000800994109 in abort () from /lib/libc.so.7
#3  0x0000000800663620 in GC_register_disappearing_link_inner () from /opt/ecl/lib//libecl.so.21.2
#4  0x0000000800663514 in GC_general_register_disappearing_link () from /opt/ecl/lib//libecl.so.21.2
#5  0x000000080065aee2 in ecl_alloc_weak_pointer () from /opt/ecl/lib//libecl.so.21.2
#6  0x000000080065af15 in si_make_weak_pointer () from /opt/ecl/lib//libecl.so.21.2
#7  0x000000080064332f in _ecl_sethash_weak () from /opt/ecl/lib//libecl.so.21.2
#8  0x000000080064229d in ecl_sethash () from /opt/ecl/lib//libecl.so.21.2
#9  0x0000000800644fe5 in si_hash_set () from /opt/ecl/lib//libecl.so.21.2
#10 0x00000008050d45a0 in L9form_to_json ()
   from /var/cache/usercache/common-lisp/ecl-21.2.1-03ff4c16-bsd-x64/.vim/pack/start/vlime/lisp/src/vlime
-protocol.fas
snipped for brevity

The function in frame 5 is defined as:

/**********************************************************************
 * WEAK POINTERS
 */

cl_object
ecl_alloc_weak_pointer(cl_object o)
{
  const cl_env_ptr the_env = ecl_process_env();
  struct ecl_weak_pointer *obj;
  ecl_disable_interrupts_env(the_env);
  obj = GC_MALLOC_ATOMIC(sizeof(struct ecl_weak_pointer));
  ecl_enable_interrupts_env(the_env);
  obj->t = t_weak_pointer;
  obj->value = o;
  if (!ECL_IMMEDIATE(o)) {
    GC_GENERAL_REGISTER_DISAPPEARING_LINK((void**)&(obj->value), (void*)o);
    si_set_finalizer((cl_object)obj, ECL_T);
  }
  return (cl_object)obj;
}

I'm not familiar enough with the libgc API, does this seem like an issue with the application or the garbage collector?

Partmedia avatar Mar 18 '23 04:03 Partmedia

I wasn't able to find documentation for GC_register_disappearing_link besides the page on finalization and the header file. Does the obj argument have to be a heap (and not a data) pointer allocated by GC?

Partmedia avatar Mar 18 '23 04:03 Partmedia

I will improve the documentation about GC_general_register_disappearing_link - it is OK to pass any link there, not only inside GC-allocated objects. Eg. gctest passes a link inside static data roots.

ivmai avatar Mar 18 '23 09:03 ivmai

  obj = GC_MALLOC_ATOMIC(sizeof(struct ecl_weak_pointer));
  ...
  obj->t = t_weak_pointer;
  obj->value = o;
  ...
    GC_GENERAL_REGISTER_DISAPPEARING_LINK((void**)&(obj->value), (void*)o);

This looks like correct usage of the bdwgc disappearing-link functionality. So, as of now, I don't understand the root cause of the SIGSEGV.

p is a pointer to an element of a global array cl_symbols: # 4 0x000000080062bde9 in GC_is_marked (p=0x8005e5b70 <cl_symbols+91072>)

I don't understand why p belongs to cl_symbols. p should be GC_base(&(obj->value))

ivmai avatar Mar 18 '23 09:03 ivmai

p is a pointer to an element of a global array cl_symbols: # 4 0x000000080062bde9 in GC_is_marked (p=0x8005e5b70 <cl_symbols+91072>)

I don't understand why p belongs to cl_symbols.

Please figure out how this happens.

ivmai avatar Mar 18 '23 10:03 ivmai

I can still reproduce the crash, although the backtrace looks a bit different now. # 2 0x0000000800994109 in abort () from /lib/libc.so.7 # 3 0x0000000800663620 in GC_register_disappearing_link_inner () from /opt/ecl/lib//libecl.so.21.2 # 4 0x0000000800663514 in GC_general_register_disappearing_link () from /opt/ecl/lib//libecl.so.21.2 # 5 0x000000080065aee2 in ecl_alloc_weak_pointer () from /opt/ecl/lib//libecl.so.21.2

There's only one assert which turn into abort in GC_register_disappearing_link_inner (this assertion was added in the latest bdwgc release if I remember correctly): GC_ASSERT(obj != NULL && GC_base_C(obj) == obj);

It looks like o passed to GC_GENERAL_REGISTER_DISAPPEARING_LINK() (obj argument) does not point to the beginning of the allocated object. Please check it. If it is intentional then use GC_base(o) instead.

ivmai avatar Mar 18 '23 11:03 ivmai

ECL crashes in GC_is_marked() (bdwgc 8.0.6) with SIGSEGV (caught by ECL) on FreeBSD 13 amd64. # 3 # 4 0x000000080029ad25 in GC_is_marked (p=0x800698820 <cl_symbols+114816>) at mark.c:231 # 5 0x00000008002956e5 in GC_make_disappearing_links_disappear (dl_hashtbl=0x8002afaa0 <GC_dl_hashtbl>, is_remove_dangling=0) at finalize.c:972

GC_is_marked is called from this line: if (EXPECT(GC_is_marked((ptr_t)GC_REVEAL_POINTER(curr_dl->dl_hidden_obj)), TRUE)) { The SIGSEGV is caused by the fact that GC_REVEAL_POINTER(curr_dl->dl_hidden_obj) does not point to the beginning of the bdwgc-allocated object (as I wrote above now obj is arg is verified to point to the beginning of an object in the assertion during link registration).

This is the root cause of the issue. It should be fixed on the client side (e.g. in ecl_alloc_weak_pointer).

ivmai avatar Mar 18 '23 12:03 ivmai

Try replacing GC_GENERAL_REGISTER_DISAPPEARING_LINK((void**)&(obj->value), (void*)o); to GC_general_register_disappearing_link((void**)&(obj->value), GC_base((void*)o));

ivmai avatar Mar 18 '23 15:03 ivmai

It looks like GC_base((void*)o) is now NULL.

In the calling frame:

(gdb) print o
$7 = (cl_object) 0x8006afae8 <cl_symbols+75768>

In the frame with GC_general_register_disappearing_link:

#4  0x000000080065f32d in GC_general_register_disappearing_link (link=0x21e4ae8, obj=0x0)
242         return GC_register_disappearing_link_inner(&GC_dl_hashtbl, link, obj,

Which indicates that the second argument has become NULL.

Partmedia avatar Mar 18 '23 17:03 Partmedia

Why is GC_general_register_disappearing_link called with o not a GC-allocated object?

ivmai avatar Mar 18 '23 21:03 ivmai

I'm still investigating, but it looks like because this is the implementation of a weak hash table which points to a general object, which could be allocated on the heap, but sometimes is an object in data.

Partmedia avatar Mar 18 '23 21:03 Partmedia

I wasn't able to find documentation for GC_register_disappearing_link besides the page on finalization and the header file. Does the obj argument have to be a heap (and not a data) pointer allocated by GC?

Obj argument should be a pointer to an object allocated by bdwgc, to the beginning of the object.

ivmai avatar Mar 18 '23 21:03 ivmai