Implement vblock split write amp reduction algo
During a vblock split, we can save write amplification by using mpool_mblock_punch() (i.e., FALLOC_FL_PUNCH_HOLE) to deallocate the portion of the mblock that each of the left- and right-hand-side destination mblocks no longer needs.
Signed-off-by: Tristan Partin [email protected]
Co-authored-by: Nabeel Meeramohideen Mohamed [email protected]
Workload I used for testing:
kmt -j64 -i2g -Rx -s1 -f %023lu -l977 -okeydist=0 /mnt/hse kvs1 kvs-oparams cn_dsplit_size=32
With this, I notice a 4.68% reduction in KVDB size at the end of the run.
You should also have seen no gc operations...
Greg how could I double check that? Is there a log message I should look out for?
It doesn't work though 😕
Ok, so here are the facts. If I am remembering correctly from when I was last touching this PR, you can end up in one of two situations. One is a segfault due to a null pointer dereference, but I couldn't tell you what the backtrace is. The other situation is the following:
kmt: lib/cn/kvset_split.c:781: vblocks_split: Assertion `blks_right->bl_vtotal >= blks_right->bl_vused' failed.
Process 29455 stopped
* thread #19, name = 'hse_cn_io', stop reason = signal SIGABRT
frame #0: 0x00007ffff7b92e5c libc.so.6`__pthread_kill_implementation + 268
libc.so.6`__pthread_kill_implementation:
-> 0x7ffff7b92e5c <+268>: movl %eax, %ebp
0x7ffff7b92e5e <+270>: negl %ebp
0x7ffff7b92e60 <+272>: cmpl $0xfffff000, %eax ; imm = 0xFFFFF000
0x7ffff7b92e65 <+277>: movl $0x0, %eax
(lldb) bt
* thread #19, name = 'hse_cn_io', stop reason = signal SIGABRT
* frame #0: 0x00007ffff7b92e5c libc.so.6`__pthread_kill_implementation + 268
frame #1: 0x00007ffff7b42a76 libc.so.6`raise + 22
frame #2: 0x00007ffff7b2c7fc libc.so.6`abort + 215
frame #3: 0x00007ffff7b2c71b libc.so.6`__assert_fail_base.cold + 15
frame #4: 0x00007ffff7b3b656 libc.so.6`__assert_fail + 70
frame #5: 0x00000000005fb613 kmt`vblocks_split(ks=0x00007ff613fec4c0, split_key=0x00007eef78bc24f0, work=0x00007fff7bbb7670, pc=0x0000000000887b28, result=0x00007ef378019b60) at kvset_split.c:781:5
frame #6: 0x00000000005fbf73 kmt`kvset_split(ks=0x00007ff613fec4c0, split_key=0x00007eef78bc24f0, pc=0x0000000000887b28, result=0x00007ef378019b60) at kvset_split.c:880:11
frame #7: 0x00000000005fc09b kmt`kvset_split_worker(work=0x00007ef378019b20) at kvset_split.c:904:18
frame #8: 0x00000000004b6d1c kmt`worker_thread(arg=0x000000000083b280) at workqueue.c:456:13
frame #9: 0x00007ffff7b9112d libc.so.6`start_thread + 717
frame #10: 0x00007ffff7c12bc0 libc.so.6`__clone3 + 48
My steps for reproducing:
$ ./build/cli/hse kvdb create /mnt/hse
$ ./build/cli/hse kvs create /mnt/hse kvs1
$ lldb -- ./build/tools/kmt -j64 -i2g -Rx -s1 -f%023lu -l997 -okeydist=0 /mnt/hse kvs1 kvdb-oparams cn_io_threads=1 csched_debug_mask=$((0x80)) kvs-oparams cn_split_size=19 cn_dsplit_size=19
It's been a while since I tried this workflow with gdb, since I kept hitting an internal gdb bug. I am using a debug build, and the run will usually fail at >= 3.10 space amp, so pretty far into the load.
And this is the backtrace in the other scenario:
Thread 19 "hse_cn_io" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff7bf77000 (LWP 382640)]
0x000000000057fd92 in mbset_get_udata (blk_num=0, self=0x0) at ../../../home/gramdasi/hse-master/hse/lib/cn/mbset.h:111
111 bool valid = blk_num < self->mbs_mblkc;
(gdb) bt
#0 0x000000000057fd92 in mbset_get_udata (blk_num=0, self=0x0) at ../../../home/gramdasi/hse-master/hse/lib/cn/mbset.h:111
#1 lvx2vbd (i=104, ks=0x7fffec25a940) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset.c:138
#2 kvset_get_nth_vblock_alen (ks=0x7fffec25a940, index=104) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset.c:1767
#3 0x00000000005fd516 in vblocks_split (ks=0x7fffec25a940, split_key=0x7ef049fbd570, work=0x7fff7bf326f0, pc=0x8bf358, result=0x7ef02c05e100) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset_split.c:750
#4 0x00000000005fe164 in kvset_split (ks=0x7fffec25a940, split_key=0x7ef049fbd570, pc=0x8bf358, result=0x7ef02c05e100) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset_split.c:880
#5 0x00000000005fe28c in kvset_split_worker (work=0x7ef02c05e0c0) at ../../../home/gramdasi/hse-master/hse/lib/cn/kvset_split.c:904
#6 0x00000000004b8b62 in worker_thread (arg=0x872b00) at ../../../home/gramdasi/hse-master/hse/lib/util/lib/workqueue.c:456
#7 0x00007ffff7be6b17 in start_thread () from /lib64/libc.so.6
#8 0x00007ffff7c6b6a0 in clone3 () from /lib64/libc.so.6
I have some pretty spammy logs in this diff. So what you'll see when you encounter either the segfault or the assertion failure is something like the following:
left: punched id=0x30703400159 vgidx=20 vblkidx=256 alen=4096
The problematic part is the alen of 4096, which is almost certainly wrong. Following the ancestries of blocks like this, you can see that pretty much the entire block is a hole due to cloning at an offset and then punching. Right now, I am trying to understand whether the min and max keys of the vblocks need to change after cloning/punching.
hsettp column headers
T NODE IDX DGEN COMP KEYS TOMBS PTOMBS HWLEN KWLEN VWLEN VGARB HBLKS KBLKS VBLKS VGRPS RULE STATE...
tristan957/nabeel-algo t 193 834 1123 3 2.1g 0 0 20.4m 43.9g 2.2t 52.2m 834 1900 74390 12512 - - 1 kvs1 1.002
master t 193 957 1122 3 2.1g 0 0 23.5m 43.8g 2.3t 161.0g 957 2045 74803 12937 - - 1 kvs1 1.077
I believe the last column is space amp (Greg and unstructured data 😄). Some quick math indicates vgarb saw a 99.97% reduction with this patch using the aforementioned workload.
[0 0 09:44:10] [tpartin@campise] [hse] [tristan957/nabeel-algo=]
+ $ du /mnt/hse
2146631940 /mnt/hse/capacity
2146631944 /mnt/hse
[0 0 02:38:09] [tpartin@campise] [hse] [master=]
+ $ du /mnt/hse
2298648080 /mnt/hse/capacity
2298648084 /mnt/hse
Looking at the du output, something seems off with the reported stats vs the actual usage.
yeah, i was having issues with that when running db-bench... what stats are you looking at? in general, i don't think our stats include the garbage, which is why i asked nabeel to create the vgarb stat..
sadly, our stats are reported in SI units, and du is reported in powers of two, does that make a difference? n/m, you're looking at byte counts...
Well more debugging to do...sigh
This PR has at least two issues that I think were documented in JIRA, but since we no longer have JIRA here we are :).
- mblocks should not be punched in vblocks_split(), but instead after the entire split operation has finished. If an mblock is punched and then some sort of failure occurs before the split has completed, the KVDB will be unable to open.
- The min and max keys of vblocks have to change. When a vblock is punched/cloned, the split moves keys out of it, so its max key (left-hand side) or min key (right-hand side) should shrink accordingly. Since we don't update the vblock's bounds, a later split operation may inadvertently choose the same vblock as a punch/clone candidate again, because its stale bounds incorrectly indicate that it overlaps the split key.
Issue 1 manifests if you abort at the end of the vblocks_split() function. Issue 2 is harder to notice, but it trips the assertion because the vblock's total bytes end up less than its used bytes. You would also notice a vblock with an alen of a single page (4096).
I think @gsramdasi noticed one other issue similar to how issue 2 manifests even when a vblock was going through its first split. This might be an indicator of a third undiagnosed issue.