server icon indicating copy to clipboard operation
server copied to clipboard

MDEV-31151: Fix a stack overflow in pinbox allocator

Open HugoWenTD opened this issue 1 year ago • 14 comments

Description

MariaDB supports a "wait-free concurrent allocator based on pinning addresses". In lf_pinbox_real_free() it tries to sort the pinned addresses for better performance to use binary search during "real free". alloca() was used to allocate stack memory and copy addresses.

To prevent a stack overflow when allocating the stack memory the function checks if there's enough stack space. However, the available stack size was calculated inaccurately which eventually caused database crash due to stack overflow.

The crash was seen on MariaDB 10.6.11 but the same code defect exists on all MariaDB versions.

Crash stack trace example:

#0  msort_with_tmp (p=0x40333f0790a0, b=0x40333f0790d0, n=3) at msort.c:40
#1  0x000040000456c86c in msort_with_tmp (n=3, b=0x40333f0790d0, p=0x40333f079100) at msort.c:45
#2  __GI___qsort_r (b=b@entry=0x40333f0790d0, n=n@entry=3, s=s@entry=8, cmp=cmp@entry=0xaaaac50af0e0 <ptr_cmp>, arg=arg@entry=0x0) at msort.c:297
#3  0x000040000456c968 in __GI_qsort (b=b@entry=0x40333f0790d0, n=n@entry=3, s=s@entry=8, cmp=cmp@entry=0xaaaac50af0e0 <ptr_cmp>) at msort.c:308
#4  0x0000aaaac50af498 in lf_pinbox_real_free (pins=0x4032bb19b610) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/mysys/lf_alloc-pin.c:358
#5  0x0000aaaac50af74c in lf_pinbox_free (pins=pins@entry=0x4032bb19b610, addr=<optimized out>) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/mysys/lf_alloc-pin.c:271
#6  0x0000aaaac50b0fc4 in l_delete (pins=0x4032bb19b610, keylen=<optimized out>, key=0x4033257b3b40 "\002dlrtms_lt", hashnr=4146604099, cs=0xaaaac5aeb5e0 <my_charset_bin>, head=0x400014a82098) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/mysys/lf_hash.cc:258
#7  lf_hash_delete (hash=0xaaaac5b74500 <mdl_locks>, pins=0x4032bb19b610, key=0x4033257b3b40, keylen=<optimized out>) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/mysys/lf_hash.cc:467
#8  0x0000aaaac48577c4 in MDL_context::release_lock (this=<optimized out>, duration=<optimized out>, ticket=0x403289fb0620) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/mdl.cc:2886
#9  0x0000aaaac4857848 in MDL_context::release_locks_stored_before (this=this@entry=0x40332803d370, duration=duration@entry=MDL_TRANSACTION, sentinel=sentinel@entry=0x0) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/mdl.cc:2936
#10 0x0000aaaac4857d24 in MDL_context::release_transactional_locks (this=this@entry=0x40332803d370, thd=thd@entry=0x40332803d218) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/mdl.cc:3122
#11 0x0000aaaac476fa00 in THD::release_transactional_locks (this=<optimized out>) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/sql_class.h:5071
#12 mysql_execute_command (thd=thd@entry=0x40332803d218, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/sql_parse.cc:6150
#13 0x0000aaaac4774f0c in mysql_parse (thd=0x40332803d218, rawbuf=<optimized out>, length=<optimized out>, parser_state=<optimized out>) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/sql_parse.cc:8123
#14 0x0000aaaac476d050 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x40332803d218, packet=packet@entry=0x403328054359 "", packet_length=packet_length@entry=2581, blocking=125, blocking@entry=true)
    at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/sql_parse.cc:1907
#15 0x0000aaaac476c244 in do_command (thd=0x40332803d218, blocking=blocking@entry=true) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/sql_parse.cc:1417
#16 0x0000aaaac484eb44 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x40330f279dd8, put_in_cache=put_in_cache@entry=true) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/sql_connect.cc:1459
#17 0x0000aaaac484eec8 in handle_one_connection (arg=arg@entry=0x40330f279dd8) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/sql/sql_connect.cc:1361
#18 0x0000aaaac4dad8f0 in pfs_spawn_thread (arg=0x40330f367018) at /local/p4clients/pkgbuild-QlPbg/workspace/src/MariaDB/storage/perfschema/pfs.cc:2201
#19 0x000040000450a22c in start_thread (arg=0x400004533000) at pthread_create.c:465
#20 0x000040000460ca1c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:80

Why was the available stack size calculation inaccurate?

  • MariaDB defines the thd->thread_stack to be used as a stack thread pointer. However, that's not the accurate value of the stack starting address instead it's the address of the thd object. thd->thread_stack= (char*) &thd;

  • That inaccurate value is then used to calculate the variable of stack_ends_here which is eventually used in function lf_pinbox_real_free() to get the available stack size.

  • The end result is inaccurate and could cause stack overflow in the qsort() function, especially when there are high connection numbers. e.g. 8000-12000 concurrent connections with MariaDB system variable thread_stack=262144. https://mariadb.com/kb/en/server-system-variables/#thread_stack

  • Additionally it's easier to be reproduced on ARM instances, it's likely related to there are more registers to save/restore on ARM64 and the structure alignment on ARM tends to have stronger alignment requirements. Those differences cause slightly different stack frame sizes, and with the higher performance of AWS r6g.8xlarge instances, it may cause higher number of pinned addresses in the pinbox.

A similar issue happened previously and the fix in fc2c1e43 was to add a ALLOCA_SAFETY_MARGIN which is 8192 bytes. However, that safety margin is not enough during high connection workloads.

MySQL also had a similar issue and the fix https://github.com/mysql/mysql-server/commit/b086fda was to remove the use of alloca and replace qsort approach by a linear scan through all pointers (pins) owned by each thread.

This commit is mostly the same as it is the only way to solve this issue as:

  1. Frame sizes in different architecture can be different.
  2. Number of active (non-null) pinned addresses varies, so the frame size for the recursive sorting function msort_with_tmp is also hard to predict. Note the number of active pinned addresses usually is very small, with 8000 active connections on an AWS r6g.8xlarge instance it's about less than 10. But we don't know if there's edge case that the value could be bigger.
  3. Allocating big memory blocks in stack doesn't seem to be a very good practice.

For further details see the mentioned commit in MySQL and the inline comments.

In addition, thd->thread_stack and stack_ends_here should not be used to calculate if we need an accurate value related to the stack size. However stack_ends_here is already used in many places so I would not deprecate or change it in this commit.

Mini-benchmarking test had been done with script https://github.com/MariaDB/server/blob/10.11/support-files/mini-benchmark.sh and no performance regression was seen.

How can this PR be tested?

MTR and mini-benchmark tests had been done to verify there's no regression.

Basing the PR against the correct MariaDB version

  • [x] This is a bug fix and the PR is based against the earliest branch in which the bug can be reproduced

Copyright

All new code of the whole pull request, including one or several files that are either new files or modified ones, are contributed under the BSD-new license. I am contributing on behalf of my employer Amazon Web Services, Inc.

HugoWenTD avatar Mar 11 '23 00:03 HugoWenTD

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Mar 11 '23 00:03 CLAassistant

@LinuxJedi @grooverdan This fixes a rather serious issue. There is no testable functional or performance regression, and a similar fix has been in MySQL already for a long time. Could you please review it? Thanks!

ottok avatar Mar 17 '23 20:03 ottok

Could you elaborate on why the stack size calculation is inaccurate? In particular, why thd->thread_stack is wrong?

vuvova avatar Mar 17 '23 21:03 vuvova

@vuvova Thank you for checking the PR.

Could you elaborate on why the stack size calculation is inaccurate? In particular, why thd->thread_stack is wrong?

Sure. Let me know if I missed anything but from my understanding, as i pointed in the PR description,

MariaDB defines the thd->thread_stack to be used as a stack thread pointer. However, that's not the accurate value of the stack starting address instead it's the address of the thd object. thd->thread_stack= (char*) &thd;

The address of thd object is theoretically close to the start of the stack but it's not accurate.

The corresponding code is https://github.com/MariaDB/server/blob/mariadb-10.6.11/sql/sql_connect.cc#L1390-L1398.

The thd object is defined here https://github.com/MariaDB/server/blob/mariadb-10.6.11/sql/sql_connect.cc#L1361 . But it is not the beginning of the stack. Also consider there are several other frames before that (see stack trace).


Here's related code to allocate the stack size during new connection:

  • Set stack size attribute - https://github.com/MariaDB/server/blob/mariadb-10.6.11/sql/mysqld.cc#L5667-L5677
  • Start a thread - https://github.com/MariaDB/server/blob/mariadb-10.6.11/sql/mysqld.cc#L5992-L5994

You can simply verify this inaccurate value with following steps:

  • Start the server, print the process maps.

    [23:16:21][root][~]$ cat /proc/26150/maps > maps-before-connection
    
  • Start a connection, keep it running.

    [23:16:38][root][~]$ /mysql/bin/mysql -uuser -p -e "select sleep(10000)" &
    [2] 26302
    
  • Print the process maps again.

    [23:16:41][root][~]$ cat /proc/26150/maps > maps-after-connection
    
  • Compare the maps you'll see the new stack created. In this case it's 40002bd8a000-40002bdcb000

    [23:16:51][root][~]$ diff maps-before-connection maps-after-connection 
    49a50,51
    > 40002bd89000-40002bd8a000 ---p 00000000 00:00 0 
    > 40002bd8a000-40002bdcb000 rw-p 00000000 00:00 0 
    
  • Use gdb to attach to the server

  • Go to the corresponding thread.

  • Go to the corresponding frame.

  • Print the address of &thd and thd->thread_stack. You will see they are the same address which is a random address 0x40002bdc80b8 between 40002bd8a000-40002bdcb000. But you can not tell the accurate stack start or end address using 0x40002bdc80b8 or with the thread_stack system variable. In my case the server started with option value --thread_stack='262144'.

(gdb) bt
#0  0x000040002adef180 in pthread_cond_timedwait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x0000aaaad05a0eec in inline_mysql_cond_timedwait (src_file=0xaaaad0f12e28 "/local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/item_func.cc", src_line=4044, abstime=0x40002bdc6340, mutex=0xaaaad1767b18 <LOCK_item_func_sleep>, that=0x40002bdc63d0)
    at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/include/mysql/psi/mysql_thread.h:1088
#2  Interruptible_wait::wait (this=this@entry=0x40002bdc63b8, cond=cond@entry=0x40002bdc63d0, mutex=mutex@entry=0xaaaad1767b18 <LOCK_item_func_sleep>) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/item_func.cc:4044
#3  0x0000aaaad05a1188 in Item_func_sleep::val_int (this=<optimized out>) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/item_func.cc:4610
#4  0x0000aaaad049cfd0 in Type_handler::Item_send_long (this=<optimized out>, item=0x40003127f4a0, protocol=0x400031255170, buf=<optimized out>) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_type.cc:7490
#5  0x0000aaaad02a1d74 in Protocol::send_result_set_row (this=this@entry=0x400031255170, row_items=row_items@entry=0x40003127f1e8) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/protocol.cc:1328
#6  0x0000aaaad030883c in select_send::send_data (this=0x40003127fe38, items=...) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_class.cc:3123
#7  0x0000aaaad039c034 in select_result_sink::send_data_with_check (sent=0, u=<optimized out>, items=..., this=<optimized out>) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_class.h:5745
#8  JOIN::exec_inner (this=this@entry=0x40003127fe60) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_select.cc:4671
#9  0x0000aaaad039c2b8 in JOIN::exec (this=this@entry=0x40003127fe60) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_select.cc:4583
#10 0x0000aaaad039abd0 in mysql_select (thd=thd@entry=0x400031254bd8, tables=<optimized out>, fields=..., conds=0x0, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=<optimized out>, result=result@entry=0x40003127fe38, unit=0x400031258d88, 
    unit@entry=0x400031257bd8, select_lex=select_lex@entry=0x40003127ef30) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_select.cc:5062
#11 0x0000aaaad039b10c in handle_select (thd=thd@entry=0x400031254bd8, lex=lex@entry=0x400031258cc0, result=result@entry=0x40003127fe38, setup_tables_done_option=70369480045320, setup_tables_done_option@entry=0)
    at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_select.cc:554
#12 0x0000aaaad034b760 in execute_sqlcom_select (thd=thd@entry=0x400031254bd8, all_tables=0x0) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_parse.cc:6297
#13 0x0000aaaad0348e14 in mysql_execute_command (thd=thd@entry=0x400031254bd8, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_parse.cc:3985
#14 0x0000aaaad034dea4 in mysql_parse (thd=0x400031254bd8, rawbuf=<optimized out>, length=<optimized out>, parser_state=<optimized out>) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_parse.cc:8128
#15 0x0000aaaad0345fb0 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x400031254bd8, packet=packet@entry=0x40003126e0d9 "", packet_length=packet_length@entry=19, blocking=43, blocking@entry=true)
    at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_parse.cc:1907
#16 0x0000aaaad03451a4 in do_command (thd=0x400031254bd8, blocking=blocking@entry=true) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_parse.cc:1417
#17 0x0000aaaad04281f4 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x4000315fd858, put_in_cache=put_in_cache@entry=true) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_connect.cc:1459
#18 0x0000aaaad04285a8 in handle_one_connection (arg=arg@entry=0x4000315fd858) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_connect.cc:1361
#19 0x0000aaaad0988730 in pfs_spawn_thread (arg=0x400031606b18) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/storage/perfschema/pfs.cc:2201
#20 0x000040002ade822c in start_thread () from /lib64/libpthread.so.0
#21 0x000040002aeeaa1c in thread_start () from /lib64/libc.so.6
(gdb) f 17
#17 0x0000aaaad04281f4 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x4000315fd858, put_in_cache=put_in_cache@entry=true) at /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_connect.cc:1459
1459	in /local/p4clients/pkgbuild-_nuWH/workspace/src/MariaDB/sql/sql_connect.cc
(gdb) p &thd
$2 = (THD **) 0x40002bdc80b8
(gdb) p thd->thread_stack
$3 = 0x40002bdc80b8 "\330K%1"
(gdb) 

HugoWenTD avatar Mar 17 '23 23:03 HugoWenTD

smap info about the stack of 40002bd8a000-40002bdcb000 in last comment: - MariaDB server started with --thread_stack='262144' (256kB) .

40002bd8a000-40002bdcb000 rw-p 00000000 00:00 0
Size:                260 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  28 kB
Pss:                  28 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        28 kB
Referenced:           28 kB
Anonymous:            28 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    0      
VmFlags: rd wr mr mw me ac 

HugoWenTD avatar Mar 18 '23 02:03 HugoWenTD

MTR on builder amd64-fedora-36 failed on:

main.func_math                           w6 [ fail ]
        Test ended at 2023-03-11 01:00:30
CURRENT_TEST: main.func_math
mysqltest: At line 431: query 'SELECT 9223372036854775807 + 9223372036854775807' succeeded - should have failed with errno 1690

Seems unrelated to the change in this PR.

ottok avatar Mar 18 '23 20:03 ottok

I backported this to 10.11 and tested running the full MTR (including --big-test) on amd64/arm64/ppc64el/s390x, and I also included this in a package for which I ran extensive install/upgrade testing on Debian. I did not come across any issues introduced by this change, so all this testing is in favor of merging this.

ottok avatar Mar 20 '23 15:03 ottok

@LinuxJedi @grooverdan This fixes a rather serious issue. There is no testable functional or performance regression, and a similar fix has been in MySQL already for a long time. Could you please review it? Thanks!

We've both got not infrequent commitments beyond PRs so response time can be vary and with the current volume its easy to miss.

For the most serious regressions or crashing issues can you use a JIRA Issue with a critical/blocker priority and an assigned Fix versions.

Thanks for the clear explanation @HugoWenTD.

Leaving it with @vuvova for now since he's looked at stack issues far more than me, but I will catch up sometime.

grooverdan avatar Mar 24 '23 03:03 grooverdan

This was cherry-picked to the official Debian/Ubuntu package of MariaDB 10.11.2 in https://salsa.debian.org/mariadb-team/mariadb-server/-/commit/d1ea0bda86b99a499d91b40496fefac2a250695c

ottok avatar Mar 25 '23 20:03 ottok

Created Jira https://jira.mariadb.org/browse/MDEV-31151 for better tracking this issue as suggested by @grooverdan .

HugoWenTD avatar Apr 28 '23 19:04 HugoWenTD

My interpretation of https://stackoverflow.com/questions/53827/checking-available-stack-size-in-c agrees with the Description. Related to this, in d0abbdf56e11ccc88447c1dc80caaf355c94be3b I replaced cmake/stack_direction.c with an ISA based CMake rule, because that code would produce the wrong result for STACK_DIRECTION when compiled for AMD64 by clang-16.

dr-m avatar Feb 09 '24 07:02 dr-m

I can think of a more reliable way of implementing a more accurate stack size check. The idea would be to assign the stack pointer to a thread-local variable (not THD::thread_stack) at the time of thread creation. I tried the following on AMD64:

#include <stdio.h>

inline size_t read_sp()
{
  size_t sp;
  __asm__("mov %%rsp,%0" : "=r" (sp));
  return sp;
}

static thread_local size_t stack_start;
#ifdef __clang__
__attribute__((optnone))
#elif defined __GNUC__
__attribute__((optimize(0)))
#endif
static void f(int i)
{
  if (stack_start - read_sp() < 1000)
    f(i + 1);
  else
    printf("%d\n", i);
}

int main ()
{
  stack_start = read_sp();
  f(0);
}

Optimizations on the recursive function had to be disabled to avoid an infinite loop or an immediate termination. The recursion depth would reach 20 or 30 when compiled with clang or gcc.

An additional problem with THD::thread_stack is that sometimes it is initialized nowhere near the start of a thread. An example would be Events::init(THD*, bool) when called with a nullptr first argument. I could not determine if that is actually dead code.

Possibly this idea could actually be implemented in a rather portable way:

thread_local char *stack_start;

pthread_handler_t thread_handler(void *arg)
{
  char dummy;
  stack_start = &dummy;
  // the rest of the thread routine
}

Problems around stack frame alignment would still remain, and we’d have to reliably calculate the stack pointer right before the alloca() call. If there are multiple local variables declared in a stack frame, there might be no portable solution. Also, techniques like link-time optimization could violate the underlying assumptions.

If it can be demonstrated that there is no performance degradation, merging this pull would seem to be the cleanest solution.

dr-m avatar Feb 09 '24 08:02 dr-m

@dr-m Thank you for reviewing this pull request.

If it can be demonstrated that there is no performance degradation, merging this pull would seem to be the cleanest solution.

Although comprehensive performance verification for non-regression is challenging, as mentioned in the PR description, I had run the mini-benchmark script to validate the change.

Additionally, the commit of this pull request has been merged into AWS RDS MariaDB 10.6+ branches since March 2023. We have not encountered the same crash after applying this fix, and our RDS benchmarking system has not detected any regressions.

HugoWenTD avatar Apr 08 '24 18:04 HugoWenTD

Rebased to 10.5.

HugoWenTD avatar May 07 '24 16:05 HugoWenTD

@HugoWenTD , do you want to resolve the conflicts? Happy to help as you've obviously rewritten it all.

grooverdan avatar Jul 04 '24 07:07 grooverdan

Thank you very much @HugoWenTD.

grooverdan avatar Jul 05 '24 03:07 grooverdan

Thank you so much for reviewing and merging it!

HugoWenTD avatar Jul 05 '24 03:07 HugoWenTD