server
server copied to clipboard
MDEV-35049: Improve adaptive hash index scalability
- [x] The Jira issue number for this PR is: MDEV-35049
Description
MEM_HEAP_BTR_SEARCH: Remove. Let us handle this special type of mem_heap_t allocations in the only compilation unit, btr0sea.cc, specifically, btr_search_sys_t::partition::insert(), which replaces the function ha_insert_for_fold().
mem_block_info_t::ahi_block: Replaces free_block. This caches one buffer page for use in adaptive hash index allocations. This is protected by btr_search_sys_t::partition::latch. It only is Atomic_relaxed because btr_search_free_space() is following a pattern of test, lock, and test.
btr_search_check_free_space(): Protect the ahi_block with a shared AHI partition latch instead of an exclusive one. We must recheck btr_search_enabled after acquiring the latch in order to avoid a race condition with btr_search_disable(). Using a shared latch instead of an exclusive one should reduce contention with btr_search_guess_on_hash() and other operations when running with innodb_adaptive_hash_index=ON.
Release Notes
The performance of innodb_adaptive_hash_index with larger numbers of threads was improved at high concurrency.
How can this PR be tested?
./mtr --parallel=auto --big-test --force --mysqld=--loose-innodb-adaptive-hash-index=ON --max-test-fail=0
I tested this with and without cmake -DWITH_INNODB_AHI=OFF.
Using the innodb-hashtest.sh I still see some waits for an AHI partition latch. Here are all functions that exceed 1% of perf record samples:
27,97% mariadbd mariadbd [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, mtr_t*)
21,10% mariadbd mariadbd [.] ssux_lock_impl<true>::rd_wait()
2,93% mariadbd mariadbd [.] btr_search_sys_t::partition::insert(unsigned long, unsigned char const*)
2,89% mariadbd mariadbd [.] row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)
2,17% mariadbd mariadbd [.] void rec_init_offsets_comp_ordinary<false, false>(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, dict_col_t::def_t const*, rec_leaf_format)
1,92% mariadbd [kernel.kallsyms] [k] psi_group_change
1,75% mariadbd [kernel.kallsyms] [k] try_to_wake_up
1,47% mariadbd mariadbd [.] btr_search_drop_page_hash_index(buf_block_t*, bool)
1,36% mariadbd mariadbd [.] ssux_lock_impl<true>::wr_wait(unsigned int)
1,34% mariadbd mariadbd [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, unsigned long, mem_block_info_t**)
1,31% mariadbd mariadbd [.] ha_delete_hash_node(hash_table_t*, mem_block_info_t*, ha_node_t*)
1,31% mariadbd mariadbd [.] Row_sel_get_clust_rec_for_mysql::operator()(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned short**, mem_block_info_t**, dtuple_t**, mtr_t*)
1,22% mariadbd mariadbd [.] btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
1,16% mariadbd mariadbd [.] evaluate_join_record(JOIN*, st_join_table*, int)
1,15% mariadbd mariadbd [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned short const*, unsigned long, unsigned long*)
1,06% mariadbd [vdso] [.] __vdso_gettimeofday
It should be noted that conflicts are inevitable under this workload, because ha_delete_hash_node() is covered by an exclusive AHI partition latch.
I repeated the experiment with an additional patch that disables the spin loops in this subsystem:
diff --git a/storage/innobase/btr/btr0cur.cc b/storage/innobase/btr/btr0cur.cc
index 9fb5bb4597a..f8c19922729 100644
--- a/storage/innobase/btr/btr0cur.cc
+++ b/storage/innobase/btr/btr0cur.cc
@@ -2620,7 +2620,7 @@ btr_cur_optimistic_insert(
ut_ad(flags == BTR_NO_LOCKING_FLAG);
} else if (index->table->is_temporary()) {
} else {
- srw_spin_lock* ahi_latch = btr_search_sys.get_latch(*index);
+ auto ahi_latch = btr_search_sys.get_latch(*index);
if (!reorg && cursor->flag == BTR_CUR_HASH) {
btr_search_update_hash_node_on_insert(
cursor, ahi_latch);
@@ -3402,7 +3402,7 @@ btr_cur_update_in_place(
#ifdef BTR_CUR_HASH_ADAPT
{
- srw_spin_lock* ahi_latch = block->index
+ auto ahi_latch = block->index
? btr_search_sys.get_latch(*index) : NULL;
if (ahi_latch) {
/* TO DO: Can we skip this if none of the fields
diff --git a/storage/innobase/btr/btr0sea.cc b/storage/innobase/btr/btr0sea.cc
index 968ca1993d2..4fd63a4c31e 100644
--- a/storage/innobase/btr/btr0sea.cc
+++ b/storage/innobase/btr/btr0sea.cc
@@ -34,6 +34,8 @@ Created 2/17/1996 Heikki Tuuri
#include "btr0btr.h"
#include "srv0mon.h"
+#define srw_spin_lock srw_lock
+
/** Is search system enabled.
Search system is protected by array of latches. */
char btr_search_enabled;
diff --git a/storage/innobase/handler/handler0alter.cc b/storage/innobase/handler/handler0alter.cc
index be721a51f3c..6368ec62909 100644
--- a/storage/innobase/handler/handler0alter.cc
+++ b/storage/innobase/handler/handler0alter.cc
@@ -5898,7 +5898,7 @@ static bool innobase_instant_try(
#ifdef BTR_CUR_HASH_ADAPT
/* Acquire the ahi latch to avoid a race condition
between ahi access and instant alter table */
- srw_spin_lock* ahi_latch = btr_search_sys.get_latch(*index);
+ auto ahi_latch = btr_search_sys.get_latch(*index);
ahi_latch->wr_lock(SRW_LOCK_CALL);
#endif /* BTR_CUR_HASH_ADAPT */
const bool metadata_changed = ctx->instant_column();
diff --git a/storage/innobase/include/btr0sea.h b/storage/innobase/include/btr0sea.h
index a8d746bf8ce..8feb892395a 100644
--- a/storage/innobase/include/btr0sea.h
+++ b/storage/innobase/include/btr0sea.h
@@ -36,6 +36,8 @@ Created 2/17/1996 Heikki Tuuri
extern mysql_pfs_key_t btr_search_latch_key;
#endif /* UNIV_PFS_RWLOCK */
+#define srw_spin_lock srw_lock
+
#define btr_search_sys_create() btr_search_sys.create()
#define btr_search_sys_free() btr_search_sys.free()
@@ -354,6 +356,7 @@ again set this much timeout. This is to reduce contention. */
#define BTR_SEA_TIMEOUT 10000
#endif /* BTR_CUR_HASH_ADAPT */
+#undef srw_spin_lock
#include "btr0sea.inl"
#endif
With this additional patch, the spin loop is gone:
28,01% mariadbd mariadbd [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, mtr_t*)
19,94% mariadbd mariadbd [.] btr_search_sys_t::partition::insert(unsigned long, unsigned char const*)
15,99% mariadbd mariadbd [.] btr_search_drop_page_hash_index(buf_block_t*, bool)
15,03% mariadbd mariadbd [.] ha_delete_hash_node(hash_table_t*, mem_block_info_t*, ha_node_t*)
1,25% mariadbd mariadbd [.] void rec_init_offsets_comp_ordinary<false, false>(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, dict_col_t::def_t const*, rec_leaf_format)
1,23% mariadbd mariadbd [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, unsigned long, mem_block_info_t**)
0,88% mariadbd mariadbd [.] row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)
0,86% mariadbd mariadbd [.] btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
0,78% mariadbd libc.so.6 [.] __memcmp_avx2_movbe
0,66% mariadbd mariadbd [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned short const*, unsigned long, unsigned long*)
0,52% mariadbd mariadbd [.] cmp_data(unsigned long, unsigned long, unsigned char const*, unsigned long, unsigned char const*, unsigned long)
0,52% mariadbd mariadbd [.] mtr_memo_slot_t::release() const
0,47% mariadbd mariadbd [.] trx_purge(unsigned long, unsigned long)
0,45% mariadbd mariadbd [.] ssux_lock_impl<false>::rd_wait()
0,44% mariadbd mariadbd [.] btr_search_check_guess(btr_cur_t*, bool, dtuple_t const*, unsigned long) [clone .constprop.0]
0,42% mariadbd mariadbd [.] mtr_t::memset(buf_block_t const*, unsigned long, unsigned long, unsigned char)
0,41% mariadbd [vdso] [.] __vdso_gettimeofday
The ssux_lock_impl<false>::rd_wait() above will involve context switches due to futex system calls.
This needs to be tested with a larger workload that originally reproduced the scalability issue.
Basing the PR against the correct MariaDB version
- [ ] This is a new feature or a refactoring, and the PR is based against the
mainbranch. - [x] This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.
This should be directly applicable to 10.5, but the patch is currently is based on the 10.6 branch. For 10.5, we might also limit the change to btr_search_check_free_space_in_heap().
PR quality check
- [x] I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
- [ ] For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.