dynamorio icon indicating copy to clipboard operation
dynamorio copied to clipboard

CRASH from new glibc 2.35 rseq on any app (-disable_rseq solves)

Open derchr opened this issue 3 years ago • 11 comments
trafficstars

Describe the bug Running DynamoRIO to instrument any application, a crash occurs. ./drrun -- grep (no client at all) results in:

<Starting application /usr/bin/grep (331303)>
<Initial options = -no_dynamic_options -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/home/derek/Git/dynamorio/build/lib64/debug/libdynamorio.so' 0x00007f972794a000
>
<(1+x) Handling our fault in a TRY at 0x00007f9727bbd98c>
<spurious rep/repne prefix @0x00007f97278a9650 (f3 0f 1e fa): >
<Application /usr/bin/grep (331303).  DynamoRIO internal crash at PC 0x00007f9727c26fda.  Please report this at http://dynamorio.org/issues/.  Program aborted.
Received SIGSEGV at pc 0x00007f9727c26fda in thread 331303
Base: 0x00007f97278ff000
Registers:eax=0x00007f94e391c0a0 ebx=0x0000000000050e27 ecx=0x00007f9727c29258 edx=0x00000000000009a0
        esi=0x00007f94e392b948 edi=0x0000000000001003 esp=0x00007f94e392b9d0 ebp=0x00007f94e392b9f0
        r8 =0x0000000000000000 r9 =0x0000000000000000 r10=0x00007f94e392b948 r11=0x0000000000000246
        r12=0x00007f97238c66d0 r13=0x00007f97238c66c7 r14=0x00007f97238c66cf r15=0x00007f97238c66dd
        eflags=0x0000000000010206
version 9.0.19068, custom build
-no_dynamic_options -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct 
0x00007f94e392b9f0 0x00007f9727c27254
0x00007f94e392ba10 0x00007f972798076b
0x00007f94e392ba70 0x00007f9727981bad
0x00007f94e392bab0 0x00007f9727ab66f9
0x00007f94e392bb10 0x00007f9727ab6bee
0x00007f94e392bb60 0x00007f9727978977
0x00007f94e392bba0 0x00007f9727a986a2
0x00007f94e392bd10 0x00007f9727a9b147
0x00007f94e392bf10 0x00007f97279cff0b
0x00007f94e392bff0 0x00007f96e388bf05
0x00007ffe2cb9c140 0x00007f9727639e0c
0x00007f97238c64e0 0x00007f97238d0f54>

Note that running trivial applications like ls or a simple hello world program does not result in a crash. Small list of applications that also do not work: vim, vi (it crashes at the moment you type anything), less, more.

Small list of applications that do work: ls, uname, cat.

To Reproduce Steps to reproduce the behavior:

  1. Pointer to a minimized application (ideally the source code for it and instructions on which toolchain it was built with). grep that comes with every Linux distribution.
  2. Precise command line for running the application. ./drrun -- grep
  3. Exact output or incorrect behavior. See above

I can reproduce on a fresh Arch Linux environment.

  • What happens when you run without any client? Same result with or without client: crash
  • What happens when you run with debug build ("-debug" flag to drrun/drconfig/drinject)? Same result

Expected behavior No crash, correct instrumentation.

Versions

  • What version of DynamoRIO are you using? Tested the 9.0.1 release and also a fresh build on master.

  • Does the latest build from https://github.com/DynamoRIO/dynamorio/releases solve the problem? No

  • What operating system version are you running on? Manjaro Linux (derivative of Arch Linux)

  • Is your application 32-bit or 64-bit? 64 bit

Operating System: Manjaro Linux KDE Plasma Version: 5.24.3 KDE Frameworks Version: 5.91.0 Qt Version: 5.15.3 Kernel Version: 5.16.14-1-MANJARO (64-bit) Graphics Platform: X11 Processors: 8 × Intel® Core™ i5-8250U CPU @ 1.60GHz Memory: 7.6 GiB of RAM Graphics Processor: Mesa Intel® UHD Graphics 620

Additional context This is the same bug as described in https://groups.google.com/g/dynamorio-users/c/eq5zD824QwY The problem might be related to rseq.

Also, one observation I made is that could be related to the recent update of Arch Linux to glibc version 2.35. For a small test I downgraded to 2.33 and the crash did not occur. However, this is not a solution as it breaks almost all applications that need the new version to run.

Running drrun with -disable_rseq also fixes the problem. However, with this flag the instrumentation is dead slow to say the least.

Logs and backtrace of the crash: log.0.32805.txt grep.0.32805.txt 'bt' and 'bt full'.txt

derchr avatar Mar 25 '22 13:03 derchr

Inlining stacktrace for searching:

(gdb) bt
#0  0x00007ffff7e8cdc8 in rseq_clear_tls_ptr (dcontext=0x7ffdb3b65080)
    at /home/derek/Git/dynamorio/core/unix/rseq_linux.c:221
#1  0x00007ffff7e8d042 in rseq_shared_fragment_flushtime_update (dcontext=0x7ffdb3b65080)
    at /home/derek/Git/dynamorio/core/unix/rseq_linux.c:280
#2  0x00007ffff7be676b in check_flush_queue (dcontext=0x7ffdb3b65080, was_I_flushed=0x0)
    at /home/derek/Git/dynamorio/core/fragment.c:5436
#3  0x00007ffff7be7bad in enter_couldbelinking (dcontext=0x7ffdb3b65080, was_I_flushed=0x0,
    cache_transition=false) at /home/derek/Git/dynamorio/core/fragment.c:5784
#4  0x00007ffff7d1c605 in move_lazy_list_to_pending_delete (dcontext=0x7ffdb3b65080)
    at /home/derek/Git/dynamorio/core/vmareas.c:9226
#5  0x00007ffff7d1cafa in add_to_lazy_deletion_list (dcontext=0x7ffdb3b65080, f=0x7ffdb3c3cb80)
    at /home/derek/Git/dynamorio/core/vmareas.c:9313
#6  0x00007ffff7bde977 in fragment_remove_shared_no_flush (dcontext=0x7ffdb3b65080, f=0x7ffdb3c3cb80)
    at /home/derek/Git/dynamorio/core/fragment.c:3222
#7  0x00007ffff7cfe5ae in end_and_emit_trace (dcontext=0x7ffdb3b65080, cur_f=0x7ffdb3c40408)
    at /home/derek/Git/dynamorio/core/monitor.c:1520
#8  0x00007ffff7d01053 in monitor_cache_enter (dcontext=0x7ffdb3b65080, f=0x7ffdb3c40408)
    at /home/derek/Git/dynamorio/core/monitor.c:2030
#9  0x00007ffff7c35f0b in d_r_dispatch (dcontext=0x7ffdb3b65080)
    at /home/derek/Git/dynamorio/core/dispatch.c:193

derekbruening avatar Mar 25 '22 14:03 derekbruening

Running drrun with -disable_rseq also fixes the problem. However, with this flag the instrumentation is dead slow to say the least.

Could you clarify: which instrumentation is slow? This should not affect tool-added instrumentation: it affects the application's code paths, disabling things like per-cpu caches, which can affect performance by forcing per-thread caches: but we're talking small percentages on large many-threaded applications. I would expect zero observable impact on the apps listed above. Could you show precise command lines and times?

derekbruening avatar Mar 25 '22 14:03 derekbruening

Could you clarify: which instrumentation is slow?

I'm currently testing ./drrun without any tool / client active.

I must correct myself, the slowdown was because I was still running a debug build of DynamoRIO. Now with the release build, I don't experience much difference in performance between my host machine running ./drrun -disable_rseq -- vim and my docker container (Debian) running ./drrun -- vim. With -disable_rseq I'm now even able to run graphical applications without significant slowdown.

derchr avatar Mar 25 '22 16:03 derchr

With -disable_rseq I'm now even able to run graphical applications without significant slowdown.

I should have suggested -disable_rseq earlier: glad you found it. So there is a workaround. I would say that this gives pretty high confidence this is an rseq issue with the new glibc => updating the title.

derekbruening avatar Mar 25 '22 16:03 derekbruening

I can reproduce this on Ubuntu 22.04 on some apps, like grep as reported, but no on all apps: ls and hello,world work fine. Will investigate further.

derekbruening avatar Mar 27 '22 23:03 derekbruening

Here's the stack trace for the crash:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7e89909 in rseq_clear_tls_ptr (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/unix/rseq_linux.c:221
221         if (is_dynamo_address((byte *)(ptr_uint_t)app_rseq->rseq_cs.ptr64))
(gdb) bt
#0  0x00007ffff7e89909 in rseq_clear_tls_ptr (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/unix/rseq_linux.c:221
#1  0x00007ffff7e89b97 in rseq_shared_fragment_flushtime_update (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/unix/rseq_linux.c:280
#2  0x00007ffff7bdfb7a in check_flush_queue (dcontext=0x7ffdb3b86080, was_I_flushed=0x0) at /home/abhinav92003/dynamorio-master/core/fragment.c:5436
#3  0x00007ffff7be0fe4 in enter_couldbelinking (dcontext=0x7ffdb3b86080, was_I_flushed=0x0, cache_transition=false) at /home/abhinav92003/dynamorio-master/core/fragment.c:5784
#4  0x00007ffff7d1671a in move_lazy_list_to_pending_delete (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/vmareas.c:9229
#5  0x00007ffff7d16c13 in add_to_lazy_deletion_list (dcontext=0x7ffdb3b86080, f=0x7ffdb3c1f928) at /home/abhinav92003/dynamorio-master/core/vmareas.c:9316
#6  0x00007ffff7bd7d02 in fragment_remove_shared_no_flush (dcontext=0x7ffdb3b86080, f=0x7ffdb3c1f928) at /home/abhinav92003/dynamorio-master/core/fragment.c:3222
#7  0x00007ffff7cf842c in end_and_emit_trace (dcontext=0x7ffdb3b86080, cur_f=0x7ffdb3c20798) at /home/abhinav92003/dynamorio-master/core/monitor.c:1520
#8  0x00007ffff7cfaee5 in monitor_cache_enter (dcontext=0x7ffdb3b86080, f=0x7ffdb3c20798) at /home/abhinav92003/dynamorio-master/core/monitor.c:2030
#9  0x00007ffff7c2f6fe in d_r_dispatch (dcontext=0x7ffdb3b86080) at /home/abhinav92003/dynamorio-master/core/dispatch.c:193
#10 0x00007fffb3afef05 in ?? ()
#11 0x0000000000000000 in ?? ()

app_rseq is not readable, which causes the crash.

217         struct rseq *app_rseq = (struct rseq *)(base + rseq_tls_offset);
...
(gdb) p base
$1 = (byte *) 0x7ffdb3b9c700 ""
(gdb) p app_rseq
$2 = (struct rseq *) 0x7ffdb3b9d0a0
(gdb) memquery app_rseq
7ffdb3b9d000-7ffdb3b9f000 ---p 00000000 00:00 0
(gdb) memquery base
7ffdb3b9b000-7ffdb3b9d000 rw-p 00000000 00:00 0

I think there might be an issue in our heuristics for locating the thread's struct rseq (rseq_tls_offset), in rseq_locate_tls_offset.

abhinav92003 avatar Apr 25 '22 15:04 abhinav92003

I see that we set rseq_tls_offset at two places:

In rseq_locate_rseq_regions during init/takeover: https://github.com/DynamoRIO/dynamorio/blob/c233329152742864f3b1b72e4faf754fd2b1c65c/core/unix/rseq_linux.c#L732

In rseq_process_syscall post an rseq syscall: https://github.com/DynamoRIO/dynamorio/blob/c233329152742864f3b1b72e4faf754fd2b1c65c/core/unix/rseq_linux.c#L675

Also, we expect both of them to give the same value, and if they don't we throw this error: https://github.com/DynamoRIO/dynamorio/blob/c233329152742864f3b1b72e4faf754fd2b1c65c/core/unix/rseq_linux.c#L687

I found the following glibc thread that discusses "put the rseq area into struct pthread, not into a initial-exec TLS" https://sourceware.org/pipermail/libc-alpha/2021-November/133221.html. I wonder if this is causing some assumption to break on our side.

abhinav92003 avatar Apr 25 '22 19:04 abhinav92003

Are they at least following the convention of having __rseq_cs or __rseq_table sections in the ELF file (https://dynamorio.org/page_rseq.html#autotoc_md251)?

Static TLS is assumed for marking our own region restartable (https://dynamorio.org/page_rseq.html#autotoc_md267) and that assumption is supposedly checked (I think the error you point to above) -- why didn't the error check fire?

derekbruening avatar Apr 25 '22 21:04 derekbruening

rseq_process_syscall and rseq_locate_tls_offset compute rseq_tls_offset relative to get_app_segment_base, however rseq_clear_tls_ptr adds it to get_segment_base, which is wrong as we need the app's seg base. It works when I change it to get_app_segment_base.

Are they at least following the convention of having __rseq_cs or __rseq_table sections in the ELF file (https://dynamorio.org/page_rseq.html#autotoc_md251)?

Probably not. linux.rseq is still failing because rseq_process_elf_sections reports "__rseq_cs_ptr_array's entries are not in a loaded segment"

I'm also seeing linux.rseq_table and linux.rseq_noarray fail with "Rseq signature is unreadable".

abhinav92003 avatar Apr 26 '22 20:04 abhinav92003

That is not good: if the conventions are not followed we might have no hope of handling it at all.... we would need to push for changes to glibc. Wondering whether Mathieu Desnoyers or others who set up the conventions were involved in any of the glibc decisions.

derekbruening avatar Apr 26 '22 21:04 derekbruening

Re-examining this: the original conventions may have only said the data had to be in the ELF file in that section, not inside a loadable segment. Is there such a section here in the file, and the problem is just that the section is not in a loaded segment?

derekbruening avatar May 25 '22 18:05 derekbruening

A few observations:

  • Another, probably cleaner workaround than -disable_rseq is:
GLIBC_TUNABLES=glibc.pthread.rseq=0
export GLIBC_TUNABLES

This prevents glibc from registering its own rseq_cs; essentially disables the glibc rseq support. The rseq and burst tests work fine with this.

Looking at the suite/tests/bin/linux.rseq ELF file on an older glibc (2.31) vs newer glibc (2.35), I can see that the relevant sections are still there.

Old:

$ objdump -s -j __rseq_cs_ptr_array suite/tests/bin/linux.rseq

suite/tests/bin/linux.rseq:     file format elf64-x86-64

Contents of section __rseq_cs_ptr_array:
 50a0 20500000 00000000 40500000 00000000   P......@P......
 50b0 60500000 00000000 80500000 00000000  `P.......P......

$ objdump -s -j __rseq_cs suite/tests/bin/linux.rseq

suite/tests/bin/linux.rseq:     file format elf64-x86-64

Contents of section __rseq_cs:
 5020 00000000 00000000 9a170000 00000000  ................
 5030 34000000 00000000 d3170000 00000000  4...............
 5040 00000000 00000000 2f190000 00000000  ......../.......
 5050 37000000 00000000 6c190000 00000000  7.......l.......
 5060 00000000 00000000 8f1a0000 00000000  ................
 5070 12000000 00000000 a71a0000 00000000  ................
 5080 00000000 00000000 181b0000 00000000  ................
 5090 48000000 00000000 661b0000 00000000  H.......f.......

New:

$ objdump -s -j __rseq_cs_ptr_array suite/tests/bin/linux.rseq

suite/tests/bin/linux.rseq:     file format elf64-x86-64

Contents of section __rseq_cs_ptr_array:
 5140 c0500000 00000000 e0500000 00000000  .P.......P......
 5150 00510000 00000000 20510000 00000000  .Q...... Q......

$ objdump -s -j __rseq_cs suite/tests/bin/linux.rseq

suite/tests/bin/linux.rseq:     file format elf64-x86-64

Contents of section __rseq_cs:
 50c0 00000000 00000000 6d170000 00000000  ........m.......
 50d0 34000000 00000000 a6170000 00000000  4...............
 50e0 00000000 00000000 14190000 00000000  ................
 50f0 37000000 00000000 51190000 00000000  7.......Q.......
 5100 00000000 00000000 871a0000 00000000  ................
 5110 12000000 00000000 9f1a0000 00000000  ................
 5120 00000000 00000000 151b0000 00000000  ................
 5130 48000000 00000000 631b0000 00000000  H.......c.......

  • For the "__rseq_cs_ptr_array's entries are not in a loaded segment" error: I was able to get the rseq_cs by hacking around in rseq_process_elf_sections, specifically by forcing this statement to execute always: https://github.com/DynamoRIO/dynamorio/blob/246ddb28e7848b2d09d2b9909f99a6da9b2ce35e/core/unix/rseq_linux.c#L472

  • The final issue I'm running into is I think because the struct rseq is not in the static TLS anymore. (It causes an EINVAL in rseq tests and "struct rseq is not in static thread-local storage" in burst tests). This is as documented at https://sourceware.org/pipermail/libc-alpha/2021-November/133221.html. Also, https://lwn.net/Articles/883104/ says that the rseq registered by glibc is stored in the Thread Control Block maintained by glibc. I'm working on modifying rseq_locate_tls_offset for this.

To support older glibcs, I think some of this fix needs to be guarded based on the value set for glibc.pthread.rseq.

abhinav92003 avatar Oct 25 '22 21:10 abhinav92003

Update: I have a local fix where I detect whether glibc's rseq support is enabled, and if it is, adjust some logic of locating the struct rseq and processing modules. I'll try pushing it for review today.

abhinav92003 avatar Oct 28 '22 14:10 abhinav92003