
Early boot memory corruption sometimes causes chain crashes

Open alfreb opened this issue 1 year ago • 2 comments

The best repro case was found with https://github.com/includeos/IncludeOS/pull/2251 and is preserved, until fixed, in https://github.com/alfreb/IncludeOS/tree/memory-ghost-repro . On that branch, starting at commit e81fb7c7da96b8cae8b43d406b6d868b7d09b66e, reproduce with:

nix-shell --argstr unikernel ./test/net/integration/tcp/ --run "./test.py"

(Requires https://github.com/includeos/vmrunner )

The backtrace below was fetched from gdb after rebuilding musl with debug symbols and hitting the same issue:

#0  0x0000000000329bc2 in a_crash ()
#1  0x000000000032895e in enframe ()
#2  0x0000000000329840 in alloc_group ()
#3  0x0000000000328853 in alloc_slot ()
#4  0x00000000003297df in alloc_group ()
#5  0x0000000000328853 in alloc_slot ()
#6  0x00000000003297df in alloc_group ()
#7  0x0000000000328853 in alloc_slot ()
#8  0x00000000003285eb in __libc_malloc_impl ()
#9  0x00000000003267a5 in malloc ()
#10 0x000000000023f36b in strdup ()
#11 0x0000000000246f1d in x86::init_libc (magic=<optimized out>, addr=<optimized out>) at /build/source/src/platform/x86_pc/init_libc.cpp:107
#12 0x000000000024769a in long_mode ()
#13 0x0000000000000000 in ?? ()

The call to strdup in init_libc causes a crash in libc during malloc. Our heap should be ready at that time, since this is after init_heap.
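To make the path from frame #10 to frame #9 concrete, here is a minimal sketch of what a typical strdup does. The name sketch_strdup is hypothetical, and this is not the IncludeOS or musl source; it only illustrates that any strdup has to call malloc, so it is one of the first places early heap corruption can surface.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strdup; not copied from IncludeOS or musl. */
static char *sketch_strdup(const char *s)
{
    size_t len = strlen(s) + 1;   /* include the terminating NUL */
    char *copy = malloc(len);     /* the heap allocation that enters the allocator */
    if (copy == NULL)
        return NULL;              /* out of memory: nothing to copy into */
    memcpy(copy, s, len);
    return copy;
}
```

If allocator state has already been damaged by that point, the malloc call inside a function like this is where the damage would be noticed, even though strdup itself is not at fault.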

Possible culprit:

  • enframe asserts: https://git.musl-libc.org/cgit/musl/tree/src/malloc/mallocng/meta.h?h=v1.2.4#n205
    • assert calls abort https://git.musl-libc.org/cgit/musl/tree/src/exit/assert.c , although only after an fprintf. The output of that fprintf must have been lost in this case (possibly because a system call to validate file descriptors failed), since nothing is printed.
    • abort calls a_crash https://git.musl-libc.org/cgit/musl/tree/src/exit/abort.c?h=v1.2.5#n27, after some system calls.
  • alloc_group calls enframe: https://git.musl-libc.org/cgit/musl/tree/src/malloc/mallocng/malloc.c#n267
  • alloc_group entry: https://git.musl-libc.org/cgit/musl/tree/src/malloc/mallocng/malloc.c#n174
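As a rough illustration of why corruption that happens before the first malloc would show up inside enframe, here is a minimal sketch of an allocator checking in-band metadata before handing out a slot. Everything in it is an assumption made for illustration: the names sketch_header_intact and sketch_enframe, the 4-byte header, and the all-zero rule are hypothetical and do not reproduce mallocng's actual layout or its assert condition.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: SKETCH_HDR header bytes sit directly in front of
 * the user-visible slot. Not mallocng's real layout. */
enum { SKETCH_HDR = 4, SKETCH_SLOT = 16 };

/* True while every header byte in front of the slot is still zero. */
static bool sketch_header_intact(const unsigned char *slot)
{
    for (int i = 1; i <= SKETCH_HDR; i++)
        if (slot[-i] != 0)
            return false;
    return true;
}

/* Hand out the slot only if its header is untouched. A hardened
 * allocator would crash at this point; the sketch returns NULL so the
 * outcome can be observed. */
static void *sketch_enframe(unsigned char *storage)
{
    unsigned char *slot = storage + SKETCH_HDR;
    if (!sketch_header_intact(slot))
        return NULL;
    return slot;
}
```

With a zeroed buffer, sketch_enframe returns a pointer just past the header; after a single stray write into the header bytes it returns NULL. In this sketch the failure is reported by the allocator even though the stray write happened earlier and elsewhere, which is the pattern the backtrace suggests.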

Note that I think this bug is also present on master, and it is possibly the main reason master is not booting at the moment.

Things I've tried

  • Removed the calls to strdup. This causes another chain crash a bit later, this time without halting, so in that case it's not libc emitting the crash.

alfreb avatar Jun 19 '24 08:06 alfreb

Some additional references:

  • Call to strdup from init_libc: https://github.com/includeos/IncludeOS/blob/v0.16.0-release/src/platform/x86_pc/init_libc.cpp#L106
  • strdup implementation: https://github.com/includeos/IncludeOS/blob/v0.16.0-release/src/crt/string.c#L23

MagnusS avatar Aug 25 '24 10:08 MagnusS

This may be resolved with #2273

MagnusS avatar Sep 04 '24 11:09 MagnusS