
Persist AOF file by io_uring

Open Wenwen-Chen opened this issue 1 year ago • 12 comments

Description
Persisting write commands to the AOF file is one of Valkey's mechanisms for ensuring high reliability. When the user turns on AOF and sets appendfsync to always, the speed of writing data to disk is critical, because the write operation is synchronous and the Valkey server cannot respond to other client requests until it completes. io_uring is a powerful asynchronous I/O API for Linux. This patch improves Valkey's performance by replacing the traditional write interface with io_uring when persisting the AOF file to disk.
We tested the performance with the valkey-benchmark tool. The patch improves performance by 29.24%. Baseline: 48,847.20 Qps -> Optimized: 63,130.57 Qps
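For illustration, here is a minimal sketch of an io_uring-based AOF write path using liburing. It is not the patch's actual code (the thread later refers to the patch's wrapper as aofWriteByIOUring); all names, the ring size, and the file name are illustrative. It submits one write and waits for its completion, mimicking the synchronous semantics of the existing AOF write path while replacing the write() syscall with an io_uring submission.

```c
/* Sketch only, not the patch's implementation.
 * Build with: gcc aof_uring_demo.c -o aof_uring_demo -luring */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static struct io_uring ring; /* one ring, initialized once at startup */

static int aof_uring_init(void) {
    /* A few SQ entries suffice: we submit one write at a time. */
    return io_uring_queue_init(8, &ring, 0);
}

/* Submit one write and block until it completes. */
static ssize_t aof_uring_write(int fd, const char *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) return -1;
    io_uring_prep_write(sqe, fd, buf, len, -1); /* -1: use current file offset */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) < 0) return -1;
    ssize_t written = cqe->res;        /* bytes written, or -errno on failure */
    io_uring_cqe_seen(&ring, cqe);     /* mark the completion as consumed */
    return written;
}

int main(void) {
    if (aof_uring_init() < 0) return 1;
    int fd = open("appendonly.aof", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;
    const char *cmd = "*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n";
    printf("wrote %zd bytes\n", aof_uring_write(fd, cmd, strlen(cmd)));
    close(fd);
    return 0;
}
```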

Test Environment
OPERATING SYSTEM: Ubuntu
Kernel: 6.5.0
DISK: SATA SSD
PROCESSOR: Intel(R) Xeon(R) Gold 6152 CPU (88 threads total, 2 sockets, 22 cores per socket, 2 threads per core)
NUMA info of the processor:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,...,86
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,...,87
Base: #741
Server and valkey-benchmark run on the same socket.

Server config
port 9876
bind 127.0.0.1
appendonly yes
appendfsync always
no-appendfsync-on-rewrite no
aof-use-rdb-preamble no
daemonize no
protected-mode no
databases 16
latency-monitor-threshold 1
repl-diskless-sync-delay 0
save
io-uring-enabled yes

Test step

  1. Start server with taskset -c 12,14,16,18 src/valkey-server valkey.conf
  2. Start benchmark using single thread: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q
  3. Start benchmark using multiple threads: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4
For both the single-thread and multi-thread runs, I tested each case 3 times. The average performance is summarized in the following table:
Mode Baseline Optimized Performance Improvement
Single Thread 48847.2 63130.57 29.24%
Multiple Threads 59992.36 72723.67 21.22%

Wenwen-Chen avatar Jul 05 '24 07:07 Wenwen-Chen

Codecov Report

Attention: Patch coverage is 19.04762% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.34%. Comparing base (b728e41) to head (c87f7de).

Files Patch % Lines
src/io_uring.c 0.00% 11 Missing :warning:
src/server.c 20.00% 4 Missing :warning:
src/aof.c 60.00% 2 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #750      +/-   ##
============================================
- Coverage     70.40%   70.34%   -0.06%     
============================================
  Files           112      113       +1     
  Lines         61467    61487      +20     
============================================
- Hits          43275    43253      -22     
- Misses        18192    18234      +42     
Files Coverage Δ
src/config.c 78.69% <ø> (ø)
src/server.h 100.00% <ø> (ø)
src/aof.c 79.97% <60.00%> (-0.17%) :arrow_down:
src/server.c 88.45% <20.00%> (-0.11%) :arrow_down:
src/io_uring.c 0.00% <0.00%> (ø)

... and 10 files with indirect coverage changes

codecov[bot] avatar Jul 05 '24 07:07 codecov[bot]

Hello. Are you working with @lipzhu? If we do the write with io_uring, we could also do the fsync in the same ring without an extra syscall?

No, I am not working with @lipzhu, but I have been following #599 for a long time.
In my opinion, the AOF write and fsync can share the same io_uring instance in a time-multiplexed manner.
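If it helps the discussion, here is a hedged sketch (assuming liburing; the function name and structure are illustrative, not taken from this patch) of chaining the AOF write and an fdatasync-equivalent in one submission with IOSQE_IO_LINK, so both operations are issued with a single io_uring_submit call:

```c
/* Sketch only: chain write + fdatasync in a single io_uring submission.
 * IOSQE_IO_LINK makes the fsync run only after the write succeeds. */
#include <liburing.h>

static int aof_write_and_fsync(struct io_uring *ring, int fd,
                               const void *buf, unsigned len) {
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);       /* assumes the SQ has room */
    io_uring_prep_write(sqe, fd, buf, len, -1);
    sqe->flags |= IOSQE_IO_LINK;        /* link to the next SQE */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC); /* fdatasync-like */

    io_uring_submit(ring);              /* one syscall submits both ops */

    /* Reap both completions; cqe->res < 0 is a -errno failure. */
    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0) return -1;
        int res = cqe->res;
        io_uring_cqe_seen(ring, cqe);
        if (res < 0) return res;
    }
    return 0;
}
```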

29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing fewer syscalls? Without io_uring we do the write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?

Yes, the performance improvement comes from io_uring issuing fewer syscalls. I ran an extra experiment for the scenario without io_uring. To count how many times write is called in each aofWrite invocation, I added some logging to aofWrite and ran the same test case. I found that each aofWrite calls write only once, so I didn't replace write with writev.
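For context, a simplified sketch of such a short-write retry loop (illustrative only, not Valkey's actual aofWrite) shows why write and writev behave the same here: with these payload sizes the loop body runs exactly once, so the syscall count does not change.

```c
/* Simplified short-write retry loop (not Valkey's aofWrite). */
#include <unistd.h>
#include <errno.h>

static ssize_t write_fully(int fd, const char *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = write(fd, buf + total, len - total);
        if (n < 0) {
            if (errno == EINTR) continue;       /* retry on signal interruption */
            return total ? (ssize_t)total : -1; /* report what was written */
        }
        total += (size_t)n;                     /* short write: loop again */
    }
    return (ssize_t)total;
}
```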

Wenwen-Chen avatar Jul 09 '24 08:07 Wenwen-Chen

29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing fewer syscalls?

Without io_uring we do the write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?

Echoing @zuiderkwast, I am also curious why io_uring could boost performance in this kind of case. @Wenwen-Chen, do you mind taking a look at the before/after CPU utilization and IPC? To keep it simpler, let's also disable the rewrite process while AOF is enabled.

zhulipeng avatar Jul 10 '24 00:07 zhulipeng

@Wenwen-Chen, do you mind taking a look at the before/after CPU utilization and IPC?

OK, I will do these tests ASAP.

To keep it simpler, let's also disable the rewrite process while AOF is enabled.

I am sorry, I don't know how to disable the rewrite process. Should I set some config item, or change the source code?

Wenwen-Chen avatar Jul 10 '24 05:07 Wenwen-Chen

To keep it simpler, let's also disable the rewrite process while AOF is enabled.

I am sorry, I don't know how to disable the rewrite process. Should I set some config item, or change the source code?

Through config auto-aof-rewrite-min-size 64gb.

zhulipeng avatar Jul 10 '24 07:07 zhulipeng

Through config auto-aof-rewrite-min-size 64gb.

Thank you very much.

I did some extra experiments

  • Persisting the AOF file with io_uring brings a bit of CPU overhead compared with the traditional write system call.
  • Why does performance improve when using io_uring? It is related to the rewrite feature, but I don't know the root cause. The detailed test results are shown below.

1. Performance comparison
I compared the performance with the rewrite feature enabled and disabled. Test command: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4
It shows that io_uring enabled + rewrite enabled gives the best performance.

Rewrite Baseline (use write SYSCALL) Optimized (use io_uring) Performance Improvement
Disable 61722.51 60336.46 -2.25%
Enable 59576.85 72835.51 22.25%

2. CPU utilization comparison
perf stat -p 'pid of valkey-server' sleep 10

  • Disable Rewrite
Index Baseline (use write SYSCALL) Optimized (use io_uring) Optimized/Baseline - 1
Cycles 21,496,813,799 22,242,600,805 3.47%
Instructions 21,470,082,059 24,364,695,157 13.48%
Insn Per Cycle 1 1.1 -17.29%
CPU utilized 0.653 0.683 4.59%
  • Enable Rewrite
Index Baseline (use write SYSCALL) Optimized (use io_uring) Optimized/Baseline - 1
Cycles 24,055,924,761 27,149,142,327 12.86%
Instructions 23,769,267,973 30,960,818,308 30.26%
Insn Per Cycle 0.99 1.14 15.42%
CPU utilized 0.732 0.859 17.35%

Wenwen-Chen avatar Jul 10 '24 09:07 Wenwen-Chen

With io_uring, the kernel can use kernel threads? Maybe that's why it's faster but uses more CPU?

Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?

With higher throughput, we can handle more traffic. It's OK to use more CPU for more traffic, but for the same traffic I hope we don't use very much more CPU.

zuiderkwast avatar Jul 10 '24 11:07 zuiderkwast

With io_uring, the kernel can use kernel threads?

It depends on the I/O traffic. If the application's I/O traffic is low, the kernel will not use kernel threads. Otherwise, the kernel obtains I/O worker threads (kernel threads) from io_uring's worker pool to process the I/O.
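If it helps to isolate the kernel-thread effect, liburing provides io_uring_register_iowq_max_workers() (liburing >= 2.1, kernel >= 5.15) to query or cap the io-wq worker pool for a ring; a hedged sketch with illustrative limits:

```c
/* Sketch: cap the io-wq kernel worker threads for a ring, so their CPU
 * contribution can be isolated in before/after measurements. */
#include <liburing.h>
#include <stdio.h>

static int limit_iowq_workers(struct io_uring *ring) {
    /* values[0] = bounded workers (regular file I/O),
     * values[1] = unbounded workers; 0 means "leave unchanged". */
    unsigned int values[2] = {1, 0};
    int ret = io_uring_register_iowq_max_workers(ring, values);
    if (ret < 0) return ret;
    /* On return, the array holds the previous limits. */
    printf("previous limits: bounded=%u unbounded=%u\n", values[0], values[1]);
    return 0;
}
```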

Maybe that's why it's faster but uses more CPU?

Yes.
However, why does enabling io_uring + enabling rewrite give the best performance? I don't know the root cause.
I will dig into Valkey's rewrite feature to figure it out. I would be very grateful if someone could suggest a way to analyze the problem.

Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?

For a fixed duration (10 s). They were measured with the command perf stat -p 'pid of valkey-server' sleep 10.

Wenwen-Chen avatar Jul 11 '24 07:07 Wenwen-Chen

I am also curious why io_uring could boost performance in this kind of case. @Wenwen-Chen, do you mind taking a look at the before/after CPU utilization and IPC? To keep it simpler, let's also disable the rewrite process while AOF is enabled.

@lipzhu I have posted the test results. Do you have any comments on them? I am wondering why enabling io_uring + enabling rewrite gives the best performance. According to my understanding, the rewrite feature forks a child process which stores the latest KV data to disk. Is there anything special about the rewrite feature that correlates strongly with io_uring?

Wenwen-Chen avatar Jul 19 '24 09:07 Wenwen-Chen

Hi @Wenwen-Chen, sorry for the late response. Per my understanding, io_uring should not bring a benefit for this kind of case; your disable-rewrite result has proven that.

I am wondering why enabling io_uring + enabling rewrite gives the best performance.

This is also my question. Can you help do some analysis of why enabling rewrite could help performance and what the cost is, or just assign a single CPU to the server?

zhulipeng avatar Jul 31 '24 01:07 zhulipeng

@lipzhu

This is also my question. Can you help do some analysis of why enabling rewrite could help performance and what the cost is, or just assign a single CPU to the server?

  1. Function flushAppendOnlyFile() persists aof_buf to disk using either the write system call or aofWriteByIOUring (the io_uring wrapper introduced in this patch). I analyzed the execution time of flushAppendOnlyFile() under 3 scenarios using the uftrace tool https://github.com/namhyung/uftrace:
  • Optimized: enable io_uring + enable rewrite,
  • Baseline 1: disable io_uring + enable rewrite,
  • Baseline 2: enable io_uring + disable rewrite.
  2. The optimized scenario gets the best performance because it spends the least time in flushAppendOnlyFile(). The execution time of flushAppendOnlyFile() mainly comes from fdatasync(). The optimized scenario reduces the execution time of fdatasync() by 50.3%/44.5% compared with Baseline 1 and Baseline 2.
Type io-uring-enabled Rewrite enabled Performance (Qps) Time of flushAppendOnlyFile (s) Time of fdatasync (s)
Optimized Yes Yes 38,963.27 15.113 10.775
Baseline 1 No Yes 33,057.41 22.759 21.66
Baseline 2 Yes No 32,701.11 25.82 19.43
    Optimized vs Baseline 1 17.9% -33.6% -50.3%
    Optimized vs Baseline 2 19.1% -41.5% -44.5%
  3. Test steps
  • Start Server: uftrace record -F flushAppendOnlyFile -F fdatasync src/valkey-server valkey.conf
  • Start valkey-benchmark: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 2500000 -q --threads 4
  • Analyze execution time: uftrace graph flushAppendOnlyFile.

Wenwen-Chen avatar Aug 09 '24 08:08 Wenwen-Chen

Hi @Wenwen-Chen, sorry for the late response. Per my understanding, io_uring should not bring a benefit for this kind of case; your disable-rewrite result has proven that.

io_uring doesn't bring a performance improvement in the 'disable rewrite' scenario compared with the write syscall (io_uring: 60336.46 vs write: 61722.51, https://github.com/valkey-io/valkey/pull/750#issuecomment-2220008106).
However, we should focus on the 'enable rewrite' scenario instead of 'disable rewrite'. Users usually enable the rewrite feature in production when they turn on AOF. io_uring improves performance significantly in the 'enable rewrite' scenario (io_uring: 72835.51 vs write: 59576.85, https://github.com/valkey-io/valkey/pull/750#issuecomment-2220008106). According to https://github.com/valkey-io/valkey/pull/750#issuecomment-2277412963, Valkey gets the best performance because io_uring reduces the execution time of flushAppendOnlyFile.

Wenwen-Chen avatar Aug 20 '24 02:08 Wenwen-Chen

@Wenwen-Chen

I'm no expert on io_uring internals or fsync either. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?

https://github.com/valkey-io/valkey/blob/e30ae762a8ec7f531005fab90edd275dfa98f72f/valkey.conf#L2374C1-L2388C25

# server-cpulist 0-7:2
# bio-cpulist 1,3
# aof-rewrite-cpulist 8-11
# bgsave-cpulist 1,10-11
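For example (illustrative values only, derived from the core layout described earlier in this thread: server on NUMA node0 cores 12,14,16,18 and benchmark on 20,22,24,26), the background threads and the rewrite child could be pinned to other even-numbered physical cores on the same socket:

server-cpulist 12,14,16,18
bio-cpulist 28,30
aof-rewrite-cpulist 32,34,36,38
bgsave-cpulist 40,42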

egbaydarov avatar Oct 15 '24 13:10 egbaydarov

@Wenwen-Chen do you plan to work on this?

xbasel avatar Dec 17 '24 13:12 xbasel

@Wenwen-Chen do you plan to work on this?

Hi @xbasel, I really want to move this patch forward, but I am not a Valkey expert. I have not found the root cause of why enabling io_uring + enabling rewrite reduces the time spent in fdatasync. Do you have any suggestions?

Wenwen-Chen avatar Dec 25 '24 13:12 Wenwen-Chen

@Wenwen-Chen

I'm no expert on io_uring internals or fsync either. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?

Hi @egbaydarov, thank you very much for your suggestion. It got the same boost with correctly configured affinity.

Wenwen-Chen avatar Dec 26 '24 01:12 Wenwen-Chen