bdwgc
bdwgc copied to clipboard
SIGSEGV in GC_is_marked() running ECL on FreeBSD 13 amd64
ECL crashes in GC_is_marked()
(bdwgc 8.0.6) with SIGSEGV (caught by ECL) on FreeBSD 13 amd64.
#0 0x0000000800636f3b in sigsegv_handler (sig=sig@entry=11, info=info@entry=0x7fffde7f0470, aux=aux@entry=0x7fffde7f0100)
at /home/kevinz/workspace/ecl/src/c/unixint.d:853
#1 0x000000080025fe0e in handle_signal (actp=actp@entry=0x7fffde7f0080, sig=sig@entry=11, info=info@entry=0x7fffde7f0470,
ucp=ucp@entry=0x7fffde7f0100) at /usr/src/lib/libthr/thread/thr_sig.c:301
#2 0x000000080025f3cf in thr_sighandler (sig=11, info=0x7fffde7f0470, _ucp=0x7fffde7f0100)
at /usr/src/lib/libthr/thread/thr_sig.c:246
#3 <signal handler called>
#4 0x000000080029ad25 in GC_is_marked (p=0x800698820 <cl_symbols+114816>) at mark.c:231
#5 0x00000008002956e5 in GC_make_disappearing_links_disappear (dl_hashtbl=0x8002afaa0 <GC_dl_hashtbl>, is_remove_dangling=0)
at finalize.c:972
#6 0x0000000800294e42 in GC_finalize () at finalize.c:1020
#7 0x000000080028d9c3 in GC_finish_collection () at alloc.c:1045
#8 0x000000080028d3ba in GC_try_to_collect_inner (stop_func=0x80028ca60 <GC_never_stop_func>) at alloc.c:553
#9 0x000000080028ee9d in GC_collect_or_expand (needed_blocks=72, ignore_off_page=1, retry=0) at alloc.c:1443
#10 0x000000080029826e in GC_alloc_large (lb=294912, k=0, flags=1) at malloc.c:64
#11 0x0000000800299b17 in GC_generic_malloc_ignore_off_page (lb=294912, k=0) at mallocx.c:218
#12 0x0000000800299c97 in GC_malloc_atomic_ignore_off_page (lb=294912) at mallocx.c:253
#13 0x0000000800650530 in ecl_alloc_atomic_unprotected (n=294912) at /home/kevinz/workspace/ecl/src/c/alloc_2.d:694
#14 ecl_alloc_atomic (n=34366652416) at /home/kevinz/workspace/ecl/src/c/alloc_2.d:714
#15 0x0000000800635700 in init_stacks (env=env@entry=0x805f4b000) at /home/kevinz/workspace/ecl/src/c/stacks.d:717
In frame 4:
225│ GC_API int GC_CALL GC_is_marked(const void *p)
226│ {
227│ struct hblk *h = HBLKPTR(p);
228│ hdr * hhdr = HDR(h);
229│ word bit_no = MARK_BIT_NO((ptr_t)p - (ptr_t)h, hhdr -> hb_sz);
230│
231├───> return (int)mark_bit_from_hdr(hhdr, bit_no); /* 0 or 1 */
232│ }
(gdb) frame 4
#4 0x000000080029ad25 in GC_is_marked (p=0x800698820 <cl_symbols+114816>) at mark.c:231
(gdb) print p
$1 = (const void *) 0x800698820 <cl_symbols+114816>
(gdb) print h
$2 = (struct hblk *) 0x800698000 <cl_symbols+112736>
(gdb) print hhdr
$3 = (hdr *) 0x0
(gdb) print bit_no
$4 = 130
(gdb)
Based on this, does the problem seem more likely in ECL or bdwgc? If the latter, could you suggest additional debugging steps I can try?
- It is better to check master.
- Compile bdgwc w/o optimizations e.g.: ./configure --disable-thread-local-alloc --disable-parallel-mark --disable-munmap && make CFLAGS_EXTRA=-DGC_DISABLE_INCREMENTAL
Reproduced the crash with master (457b0241) and the flags given here.
Is p (passed to GC_is_marked) a valid heap pointer?
p is a pointer to an element of a global array cl_symbols:
#4 0x000000080062bde9 in GC_is_marked (p=0x8005e5b70 <cl_symbols+91072>)
Each element of cl_symbols has the definition:
typedef union {
struct {
const char *name;
const char *translation;
void *fun;
int narg;
int type;
cl_object value;
} init;
struct ecl_symbol data;
} cl_symbol_initializer;
Because it is possible for symbols to contain pointers to things elsewhere:
GC_push_other_roots = stacks_scanner;
static void
stacks_scanner()
{
// snip
GC_push_all((void *)cl_symbols, (void *)(cl_symbols + cl_num_symbols_in_core));
// more snip
if (old_GC_push_other_roots)
(*old_GC_push_other_roots)();
}
Instead of GC_push_all
, is GC_add_roots
the right thing to use here?
GC_push_all((void *)cl_symbols, (void *)(cl_symbols + cl_num_symbols_in_core))
It's OK. (e.g. GC_push_all is used by GC to push some globals)
Is the bug always reproduced?
According to GC_is_marked contract, its argument should be the beginning of the object previously allocated by GC, i.e. the following condition should be true: p != 0 && GC_base(p) == p.
I can reproduce the crash reliably on FreeBSD 13 amd64 (on two different machines), but the steps to reproduce a bit involved, namely:
- Install quicklisp (Common Lisp package manger)
- Install vlime (a Vim language server extension)
- Start vlime server with ECL (ECL uses boehm-gc)
- Connect vim to vlime via TCP
- ECL crashes with SIGSEGV in
GC_is_marked()
Are there some heap memory checks that can be turned on in boehm-gc to check if some memory is being clobbered in ECL?
Just to be clear cl_symbols is not heap-allocated. Therefore GC_is_marked() should never be called on it, correct?
Just to be clear cl_symbols is not heap-allocated. Therefore GC_is_marked() should never be called on it, correct?
Correct. I've added a comment about it to gc_mark.h today.
Reproduced the crash with master (457b024) and the flags given here.
It is better to debug the issue on master. To turn on internal checks, add --enable-gc-assertions (to the mentioned flags).
I've since updated to the latest master
and re-compiled bdwgc with all the flags you suggested. I can still reproduce the crash, although the backtrace looks a bit different now.
Note: It looks like I'm using the bundled (older) bdwgc, but what I've actually done here is symlinked the bundled bdwgc to the latest master
from Git. I've also completely uninstalled the system libgc.so.1
so I'm fairly confident that I'm actually running the latest bdwgc.
#0 0x000000080096a33a in thr_kill () from /lib/libc.so.7
#1 0x00000008008e2c74 in raise () from /lib/libc.so.7
#2 0x0000000800994109 in abort () from /lib/libc.so.7
#3 0x0000000800663620 in GC_register_disappearing_link_inner () from /opt/ecl/lib//libecl.so.21.2
#4 0x0000000800663514 in GC_general_register_disappearing_link () from /opt/ecl/lib//libecl.so.21.2
#5 0x000000080065aee2 in ecl_alloc_weak_pointer () from /opt/ecl/lib//libecl.so.21.2
#6 0x000000080065af15 in si_make_weak_pointer () from /opt/ecl/lib//libecl.so.21.2
#7 0x000000080064332f in _ecl_sethash_weak () from /opt/ecl/lib//libecl.so.21.2
#8 0x000000080064229d in ecl_sethash () from /opt/ecl/lib//libecl.so.21.2
#9 0x0000000800644fe5 in si_hash_set () from /opt/ecl/lib//libecl.so.21.2
#10 0x00000008050d45a0 in L9form_to_json ()
from /var/cache/usercache/common-lisp/ecl-21.2.1-03ff4c16-bsd-x64/.vim/pack/start/vlime/lisp/src/vlime
-protocol.fas
snipped for brevity
The function in frame 5 is defined as:
/**********************************************************************
* WEAK POINTERS
*/
cl_object
ecl_alloc_weak_pointer(cl_object o)
{
const cl_env_ptr the_env = ecl_process_env();
struct ecl_weak_pointer *obj;
ecl_disable_interrupts_env(the_env);
obj = GC_MALLOC_ATOMIC(sizeof(struct ecl_weak_pointer));
ecl_enable_interrupts_env(the_env);
obj->t = t_weak_pointer;
obj->value = o;
if (!ECL_IMMEDIATE(o)) {
GC_GENERAL_REGISTER_DISAPPEARING_LINK((void**)&(obj->value), (void*)o);
si_set_finalizer((cl_object)obj, ECL_T);
}
return (cl_object)obj;
}
I'm not familiar enough with the libgc API, does this seem like an issue with the application or the garbage collector?
I wasn't able to find documentation for GC_register_disappearing_link
besides the page on finalization and the header file. Does the obj argument have to be a heap (and not a data) pointer allocated by GC?
I will improve the documentation about GC_general_register_disappearing_link - it is OK to pass any link there, not only inside GC-allocated objects. Eg. gctest passes a link inside static data roots.
obj = GC_MALLOC_ATOMIC(sizeof(struct ecl_weak_pointer));
...
obj->t = t_weak_pointer;
obj->value = o;
...
GC_GENERAL_REGISTER_DISAPPEARING_LINK((void**)&(obj->value), (void*)o);
This looks like correct usage of the bdwgc disappearing-link functionality. So, as of now, I don't understand the root cause of the SIGSEGV.
p is a pointer to an element of a global array cl_symbols: # 4 0x000000080062bde9 in GC_is_marked (p=0x8005e5b70 <cl_symbols+91072>)
I don't understand why p belongs to cl_symbols. p should be GC_base(&(obj->value))
p is a pointer to an element of a global array cl_symbols: # 4 0x000000080062bde9 in GC_is_marked (p=0x8005e5b70 <cl_symbols+91072>)
I don't understand why p belongs to cl_symbols.
Please figure out how this happens.
I can still reproduce the crash, although the backtrace looks a bit different now. # 2 0x0000000800994109 in abort () from /lib/libc.so.7 # 3 0x0000000800663620 in GC_register_disappearing_link_inner () from /opt/ecl/lib//libecl.so.21.2 # 4 0x0000000800663514 in GC_general_register_disappearing_link () from /opt/ecl/lib//libecl.so.21.2 # 5 0x000000080065aee2 in ecl_alloc_weak_pointer () from /opt/ecl/lib//libecl.so.21.2
There's only one assert which turn into abort in GC_register_disappearing_link_inner (this assertion was added in the latest bdwgc release if I remember correctly):
GC_ASSERT(obj != NULL && GC_base_C(obj) == obj);
It looks like o
passed to GC_GENERAL_REGISTER_DISAPPEARING_LINK()
(obj argument) does not point to the beginning of the allocated object. Please check it. If it is intentional then use GC_base(o)
instead.
ECL crashes in GC_is_marked() (bdwgc 8.0.6) with SIGSEGV (caught by ECL) on FreeBSD 13 amd64. # 3
# 4 0x000000080029ad25 in GC_is_marked (p=0x800698820 <cl_symbols+114816>) at mark.c:231 # 5 0x00000008002956e5 in GC_make_disappearing_links_disappear (dl_hashtbl=0x8002afaa0 <GC_dl_hashtbl>, is_remove_dangling=0) at finalize.c:972
GC_is_marked is called from this line: if (EXPECT(GC_is_marked((ptr_t)GC_REVEAL_POINTER(curr_dl->dl_hidden_obj)), TRUE)) {
The SIGSEGV is caused by the fact that GC_REVEAL_POINTER(curr_dl->dl_hidden_obj)
does not point to the beginning of the bdwgc-allocated object (as I wrote above now obj is arg is verified to point to the beginning of an object in the assertion during link registration).
This is the root cause of the issue. It should be fixed on the client side (e.g. in ecl_alloc_weak_pointer
).
Try replacing
GC_GENERAL_REGISTER_DISAPPEARING_LINK((void**)&(obj->value), (void*)o);
to
GC_general_register_disappearing_link((void**)&(obj->value), GC_base((void*)o));
It looks like GC_base((void*)o)
is now NULL.
In the calling frame:
(gdb) print o
$7 = (cl_object) 0x8006afae8 <cl_symbols+75768>
In the frame with GC_general_register_disappearing_link:
#4 0x000000080065f32d in GC_general_register_disappearing_link (link=0x21e4ae8, obj=0x0)
242 return GC_register_disappearing_link_inner(&GC_dl_hashtbl, link, obj,
Which indicates that the second argument has become NULL.
Why is GC_general_register_disappearing_link called with o
not a GC-allocated object?
I'm still investigating, but it looks like because this is the implementation of a weak hash table which points to a general object, which could be allocated on the heap, but sometimes is an object in data.
I wasn't able to find documentation for
GC_register_disappearing_link
besides the page on finalization and the header file. Does the obj argument have to be a heap (and not a data) pointer allocated by GC?
Obj argument should be a pointer to an object allocated by bdwgc, to the beginning of the object.