uftrace icon indicating copy to clipboard operation
uftrace copied to clipboard

SEGFAULT recording uftrace on Asahi Linux (Apple M1 ported)

Open JSYoo5B opened this issue 2 years ago • 6 comments

I reported this issue in #1470, but this seems different issue from postgres. So I'm reporting this separately.

I've tried to run uftrace on Asahi Linux, the Linux ported on Apple Silicon. As the Asahi Linux is the alpha release, it may be unstable. But as I know, main reasons for alpha release are device drivers. (especially graphic processing, they're implementing GUI by framebuffer now)

Tried foobar example in the uftrace.github.io/slide/

void bar() {

}

void foo() {
  bar();
}

int main() {
  foo();
  bar();
}
$ gcc -pg -g -o foobar foobar.c
$ uftrace ./foobar
WARN: child terminated by signal: 11: Segmentation fault
WARN: cannot open record data: /tmp/uftrace-live-nMiZjE: No data available

$ uftrace --no-libcall ./foobar
# DURATION     TID     FUNCTION
            [1069565] | main() {
            [1069565] |   foo() {
   0.084 us [1069565] |     bar();
   0.875 us [1069565] |   } /* foo */
   0.041 us [1069565] |   bar();
   1.250 us [1069565] | } /* main */

Testing environment information

  • Device: Apple MacBook Air M1, 8GB ram, 512GB ssd
  • Linux Distribution: Asahi linux minimal (https://asahilinux.org/2022/03/asahi-linux-alpha-release/)
  • Kernel version(uname -a): Linux JSYoo5B-Sombre 5.19.0-asahi-5-1-ARCH #1 SMP PREEMPT_DYNAMIC Sat, 20 Aug 2022 09:23:11 +0000 aarch64 GNU/Linux
  • Other related version infos:
$ ldd foobar
        linux-vdso.so.1 (0x0000ffff8f0e0000)
        libc.so.6 => /usr/lib/libc.so.6 (0x0000ffff8eed0000)
        /lib/ld-linux-aarch64.so.1 => /usr/lib/ld-linux-aarch64.so.1 (0x0000ffff8f0ac000)

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-unknown-linux-gnu/12.1.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /build/gcc/src/gcc/configure --enable-languages=c,c++,fortran,go,lto,objc,obj-c++ --enable-bootstrap --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://github.com/archlinuxarm/PKGBUILDs/issues --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-multilib --disable-werror --host=aarch64-unknown-linux-gnu --build=aarch64-unknown-linux-gnu --with-arch=armv8-a --enable-fix-cortex-a53-835769 --enable-fix-cortex-a53-843419
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.1.0 (GCC)

$ /usr/lib/libc.so.6
GNU C Library (GNU libc) stable release version 2.35.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 12.1.0.
libc ABIs: UNIQUE ABSOLUTE
For bug reporting instructions, please see:
<https://github.com/archlinuxarm/PKGBUILDs/issues>.

JSYoo5B avatar Sep 12 '22 09:09 JSYoo5B

By running with gdb, it seems SEGFAULT invoked by LD

(gdb) r foobar
Starting program: /usr/local/bin/uftrace foobar
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[Attaching after Thread 0xfffff7fb4140 (LWP 3968) fork to child process 3970]
[New inferior 2 (process 3970)]
[Detaching after fork from parent process 3968]
[Inferior 1 (process 3968) detached]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
process 3970 is executing new program: /home/jsyoo5b/Workspace/uftrace/examples/foobar
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Thread 2.1 "foobar" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfffff7f5bea0 (LWP 3970)]
0x0000fffff7fd04d4 in ?? () from /lib/ld-linux-aarch64.so.1
(gdb) bt
#0  0x0000fffff7fd04d4 in ?? () from /lib/ld-linux-aarch64.so.1
#1  0x0000fffff7fd2604 in ?? () from /lib/ld-linux-aarch64.so.1
#2  0x0000aaaaaaaa0770 in _start () at ../sysdeps/aarch64/start.S:81
(gdb)

JSYoo5B avatar Sep 12 '22 09:09 JSYoo5B

Maybe this could be issue with paging size. It uses 16K page due to M1's IOMMU.

Maybe I need to test similar condition with RPi with 16K paging.

Whenever you need additional experiment or research about M1 linux, (without installing it on your devices) feel free to ask me.

JSYoo5B avatar Sep 12 '22 11:09 JSYoo5B

Thanks for doing this analysis. We currently use page size in a mixed way with predefined PAGE_SIZE 4096 and getpagesize() as follows.

$ git grep "#define PAGE_SIZE"
arch/x86_64/mcount-event.c:#define PAGE_SIZE 4096
libmcount/dynamic.c:#define PAGE_SIZE 4096
libmcount/dynamic.h:#define PAGE_SIZE 4096

$ git grep getpagesize
libmcount/mcount.c:     page_size_in_kb = getpagesize() / KB;
libmcount/plthook.c:                    page_size = getpagesize();
uftrace.c:              if (opts->bufsize & (getpagesize() - 1)) {
uftrace.c:                      opts->bufsize = ROUND_UP(opts->bufsize, getpagesize());
uftrace.c:              if (opts->kernel_bufsize & (getpagesize() - 1)) {
uftrace.c:                      opts->kernel_bufsize = ROUND_UP(opts->kernel_bufsize, getpagesize());
utils/kernel.c: fprintf(fp, "PAGE_SIZE: %d\n", getpagesize());
utils/kernel.c: pevent_set_page_size(pevent, getpagesize());

Could you modify the predefined PAGE_SIZE to 16K or use getpagesize() where it's being used and test it again?

honggyukim avatar Sep 12 '22 18:09 honggyukim

I tried by replacing hard-coded 4096 to getpagesize() calls, but it still invokes SEGFAULT.

As the gdb backtrace says, it invokes on _start procedure, as I know, it's early codes to prepare C runtime.

The replacement changes paging size on dynamic tracing. (especially, on mmap()) Though it didn't solve this issue, hard-coded 4096 shall be changed.

Maybe we need to consider which way would be more portable, sysconf(3) or getpagesize(2)

JSYoo5B avatar Sep 14 '22 11:09 JSYoo5B

Any reason do you think it's related to page size? It'd be nice if you can have symbol in the dynamic loader ('/lib/ld-linux-aarch64.so.1').

namhyung avatar Sep 15 '22 06:09 namhyung

Any reason do you think it's related to page size? It'd be nice if you can have symbol in the dynamic loader ('/lib/ld-linux-aarch64.so.1').

Well, page size was known difference and they says some linux software has problems. So I doubt about page size.

I'd like to give information about '/lib/ld-linux-aarch64.so.1', but I'm not sure what & how to report. Can I ask which command to run for reading symbols?

JSYoo5B avatar Sep 16 '22 04:09 JSYoo5B

I guess this is because 16k page optimize or something (I can't find the exact naming). When I debug the error, I get a segfault on ld-linux-aarch64.so.1 with this callstack:

#0  0x0000fffff7fd04d4 in ?? () from /lib/ld-linux-aarch64.so.1
#1  0x0000fffff7fd2604 in ?? () from /lib/ld-linux-aarch64.so.1
#2  0x0000aaaaaaaa0770 in _start () at ../sysdeps/aarch64/start.S:81

And it try to write on 0x00aaaaaaab1000, which is the [email protected]. There is no reason to raise segfault, so I run the vmmap, and see the entire 16k page is readonly.

Start              End                Offset             Perm Path
0x00aaaaaaaa0000 0x00aaaaaaaa4000 0x00000000000000 r-x /home/kyuwoncho18/test
0x00aaaaaaab0000 0x00aaaaaaab4000 0x00000000000000 r-- /home/kyuwoncho18/test
0x00aaaaaaab4000 0x00aaaaaab00000 0x00000000000000 rw- [heap]

I'm not sure it's a common ELF loading mechanism on 16k paging machine, but at least Asahi Linux use only one page for entire global data including .bss, .data, and .got/.plt. So that page should never be marked with readonly I guess.

ChoKyuWon avatar Nov 08 '22 04:11 ChoKyuWon

I test with various linkers like lld, gold and mold with following command:

$ gcc -pg -g ./test.c -fuse-ld=<LINKER>

Then uftrace run without segfault. As I mentioned in PR, it's GNU ld's bug, it does not follow the RELRO rule in ELF. See also bug 29639 and 28824 in binutils bugzila.

For lld, 35.29% of tests success :

detail result
runtime test stats
====================
total  5460  Tests executed (success: 35.29%)
  OK:  1900  Test succeeded
  OK:    27  Test succeeded (with some fixup)
  NG:   232  Different test result
  NZ:   168  Non-zero return value
  SG:    28  Abnormal exit by signal
  TM:    50  Test ran too long
  BI:  2425  Build failed
  LA:     0  Unsupported Language
  SK:   630  Skipped

(I don't know why, but gcc always build failed with lld).

For gold, 74.21% of tests success

detail result
runtime test stats
====================
total  5460  Tests executed (success: 74.21%)
  OK:  3985  Test succeeded
  OK:    67  Test succeeded (with some fixup)
  NG:   495  Different test result
  NZ:    80  Non-zero return value
  SG:    53  Abnormal exit by signal
  TM:     0  Test ran too long
  BI:     0  Build failed
  LA:     0  Unsupported Language
  SK:   780  Skipped

For mold, 70.59% of tests success

detail result
runtime test stats
====================
total  5460  Tests executed (success: 70.59%)
  OK:  3790  Test succeeded
  OK:    64  Test succeeded (with some fixup)
  NG:   515  Different test result
  NZ:   256  Non-zero return value
  SG:    55  Abnormal exit by signal
  TM:     0  Test ran too long
  BI:     0  Build failed
  LA:     0  Unsupported Language
  SK:   780  Skipped

So if you want to use uftrace in Asahi Linux, you'd better use gold as a linker.

ChoKyuWon avatar Nov 13 '22 17:11 ChoKyuWon

Thanks for the test and the info! I'm glad to see ld is fixed.

namhyung avatar Nov 15 '22 01:11 namhyung