uftrace
uftrace copied to clipboard
SEGFAULT recording uftrace on Asahi Linux (Apple M1 ported)
I reported this issue in #1470, but this seems different issue from postgres. So I'm reporting this separately.
I've tried to run uftrace on Asahi Linux, the Linux ported on Apple Silicon. As the Asahi Linux is the alpha release, it may be unstable. But as I know, main reasons for alpha release are device drivers. (especially graphic processing, they're implementing GUI by framebuffer now)
Tried foobar example in the uftrace.github.io/slide/
void bar() {
}
void foo() {
bar();
}
int main() {
foo();
bar();
}
$ gcc -pg -g -o foobar foobar.c
$ uftrace ./foobar
WARN: child terminated by signal: 11: Segmentation fault
WARN: cannot open record data: /tmp/uftrace-live-nMiZjE: No data available
$ uftrace --no-libcall ./foobar
# DURATION TID FUNCTION
[1069565] | main() {
[1069565] | foo() {
0.084 us [1069565] | bar();
0.875 us [1069565] | } /* foo */
0.041 us [1069565] | bar();
1.250 us [1069565] | } /* main */
Testing environment information
- Device: Apple MacBook Air M1, 8GB ram, 512GB ssd
- Linux Distribution: Asahi linux minimal (https://asahilinux.org/2022/03/asahi-linux-alpha-release/)
- Kernel version(uname -a):
Linux JSYoo5B-Sombre 5.19.0-asahi-5-1-ARCH #1 SMP PREEMPT_DYNAMIC Sat, 20 Aug 2022 09:23:11 +0000 aarch64 GNU/Linux
- Other related version infos:
$ ldd foobar
linux-vdso.so.1 (0x0000ffff8f0e0000)
libc.so.6 => /usr/lib/libc.so.6 (0x0000ffff8eed0000)
/lib/ld-linux-aarch64.so.1 => /usr/lib/ld-linux-aarch64.so.1 (0x0000ffff8f0ac000)
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-unknown-linux-gnu/12.1.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /build/gcc/src/gcc/configure --enable-languages=c,c++,fortran,go,lto,objc,obj-c++ --enable-bootstrap --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://github.com/archlinuxarm/PKGBUILDs/issues --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-multilib --disable-werror --host=aarch64-unknown-linux-gnu --build=aarch64-unknown-linux-gnu --with-arch=armv8-a --enable-fix-cortex-a53-835769 --enable-fix-cortex-a53-843419
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.1.0 (GCC)
$ /usr/lib/libc.so.6
GNU C Library (GNU libc) stable release version 2.35.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 12.1.0.
libc ABIs: UNIQUE ABSOLUTE
For bug reporting instructions, please see:
<https://github.com/archlinuxarm/PKGBUILDs/issues>.
By running with gdb, it seems SEGFAULT invoked by LD
(gdb) r foobar
Starting program: /usr/local/bin/uftrace foobar
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[Attaching after Thread 0xfffff7fb4140 (LWP 3968) fork to child process 3970]
[New inferior 2 (process 3970)]
[Detaching after fork from parent process 3968]
[Inferior 1 (process 3968) detached]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
process 3970 is executing new program: /home/jsyoo5b/Workspace/uftrace/examples/foobar
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Thread 2.1 "foobar" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfffff7f5bea0 (LWP 3970)]
0x0000fffff7fd04d4 in ?? () from /lib/ld-linux-aarch64.so.1
(gdb) bt
#0 0x0000fffff7fd04d4 in ?? () from /lib/ld-linux-aarch64.so.1
#1 0x0000fffff7fd2604 in ?? () from /lib/ld-linux-aarch64.so.1
#2 0x0000aaaaaaaa0770 in _start () at ../sysdeps/aarch64/start.S:81
(gdb)
Maybe this could be issue with paging size. It uses 16K page due to M1's IOMMU.
Maybe I need to test similar condition with RPi with 16K paging.
Whenever you need additional experiment or research about M1 linux, (without installing it on your devices) feel free to ask me.
Thanks for doing this analysis. We currently use page size in a mixed way with predefined PAGE_SIZE 4096
and getpagesize()
as follows.
$ git grep "#define PAGE_SIZE"
arch/x86_64/mcount-event.c:#define PAGE_SIZE 4096
libmcount/dynamic.c:#define PAGE_SIZE 4096
libmcount/dynamic.h:#define PAGE_SIZE 4096
$ git grep getpagesize
libmcount/mcount.c: page_size_in_kb = getpagesize() / KB;
libmcount/plthook.c: page_size = getpagesize();
uftrace.c: if (opts->bufsize & (getpagesize() - 1)) {
uftrace.c: opts->bufsize = ROUND_UP(opts->bufsize, getpagesize());
uftrace.c: if (opts->kernel_bufsize & (getpagesize() - 1)) {
uftrace.c: opts->kernel_bufsize = ROUND_UP(opts->kernel_bufsize, getpagesize());
utils/kernel.c: fprintf(fp, "PAGE_SIZE: %d\n", getpagesize());
utils/kernel.c: pevent_set_page_size(pevent, getpagesize());
Could you modify the predefined PAGE_SIZE
to 16K or use getpagesize()
where it's being used and test it again?
I tried by replacing hard-coded 4096
to getpagesize()
calls, but it still invokes SEGFAULT.
As the gdb backtrace says, it invokes on _start
procedure, as I know, it's early codes to prepare C runtime.
The replacement changes paging size on dynamic tracing. (especially, on mmap()
)
Though it didn't solve this issue, hard-coded 4096
shall be changed.
Maybe we need to consider which way would be more portable, sysconf(3)
or getpagesize(2)
Any reason do you think it's related to page size? It'd be nice if you can have symbol in the dynamic loader ('/lib/ld-linux-aarch64.so.1').
Any reason do you think it's related to page size? It'd be nice if you can have symbol in the dynamic loader ('/lib/ld-linux-aarch64.so.1').
Well, page size was known difference and they says some linux software has problems. So I doubt about page size.
I'd like to give information about '/lib/ld-linux-aarch64.so.1', but I'm not sure what & how to report. Can I ask which command to run for reading symbols?
I guess this is because 16k page optimize or something (I can't find the exact naming).
When I debug the error, I get a segfault on ld-linux-aarch64.so.1
with this callstack:
#0 0x0000fffff7fd04d4 in ?? () from /lib/ld-linux-aarch64.so.1
#1 0x0000fffff7fd2604 in ?? () from /lib/ld-linux-aarch64.so.1
#2 0x0000aaaaaaaa0770 in _start () at ../sysdeps/aarch64/start.S:81
And it try to write on 0x00aaaaaaab1000
, which is the [email protected]
. There is no reason to raise segfault, so I run the vmmap, and see the entire 16k page is readonly.
Start End Offset Perm Path
0x00aaaaaaaa0000 0x00aaaaaaaa4000 0x00000000000000 r-x /home/kyuwoncho18/test
0x00aaaaaaab0000 0x00aaaaaaab4000 0x00000000000000 r-- /home/kyuwoncho18/test
0x00aaaaaaab4000 0x00aaaaaab00000 0x00000000000000 rw- [heap]
I'm not sure it's a common ELF loading mechanism on 16k paging machine, but at least Asahi Linux use only one page for entire global data including .bss, .data, and .got/.plt. So that page should never be marked with readonly I guess.
I test with various linkers like lld
, gold
and mold
with following command:
$ gcc -pg -g ./test.c -fuse-ld=<LINKER>
Then uftrace run without segfault.
As I mentioned in PR, it's GNU ld
's bug, it does not follow the RELRO rule in ELF.
See also bug 29639 and 28824 in binutils bugzila.
For lld
, 35.29% of tests success :
detail result
runtime test stats
====================
total 5460 Tests executed (success: 35.29%)
OK: 1900 Test succeeded
OK: 27 Test succeeded (with some fixup)
NG: 232 Different test result
NZ: 168 Non-zero return value
SG: 28 Abnormal exit by signal
TM: 50 Test ran too long
BI: 2425 Build failed
LA: 0 Unsupported Language
SK: 630 Skipped
(I don't know why, but gcc
always build failed with lld
).
For gold
, 74.21% of tests success
detail result
runtime test stats
====================
total 5460 Tests executed (success: 74.21%)
OK: 3985 Test succeeded
OK: 67 Test succeeded (with some fixup)
NG: 495 Different test result
NZ: 80 Non-zero return value
SG: 53 Abnormal exit by signal
TM: 0 Test ran too long
BI: 0 Build failed
LA: 0 Unsupported Language
SK: 780 Skipped
For mold
, 70.59% of tests success
detail result
runtime test stats
====================
total 5460 Tests executed (success: 70.59%)
OK: 3790 Test succeeded
OK: 64 Test succeeded (with some fixup)
NG: 515 Different test result
NZ: 256 Non-zero return value
SG: 55 Abnormal exit by signal
TM: 0 Test ran too long
BI: 0 Build failed
LA: 0 Unsupported Language
SK: 780 Skipped
So if you want to use uftrace in Asahi Linux, you'd better use gold
as a linker.
Thanks for the test and the info! I'm glad to see ld
is fixed.