
Persist AOF file by io_uring

Open Wenwen-Chen opened this issue 1 year ago • 12 comments

Description
Persisting write commands to the AOF file is one of Valkey's mechanisms for ensuring high reliability. When the user turns on AOF and sets appendfsync to always, the speed of writing data to disk is critical, because the write operation is synchronous and the Valkey server cannot respond to other client requests until it completes. io_uring is a powerful asynchronous I/O API for Linux. This patch improves Valkey's performance by replacing the traditional write interface with io_uring when persisting the AOF file to disk.
We tested the performance with the valkey-benchmark tool. The patch improves performance by 29.24%. Baseline: 48,847.20 Qps -> Optimized: 63,130.57 Qps
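For illustration, here is a minimal sketch of an io_uring-based AOF write path using liburing. It is not the patch's actual code (the thread later refers to the patch's wrapper as aofWriteByIOUring); all names, the ring size, and the file name are illustrative. It submits one write and waits for its completion, mimicking the synchronous semantics of the existing AOF write path while replacing the write() syscall with an io_uring submission.

```c
/* Sketch only, not the patch's implementation.
 * Build with: gcc aof_uring_demo.c -o aof_uring_demo -luring */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static struct io_uring ring; /* one ring, initialized once at startup */

static int aof_uring_init(void) {
    /* A few SQ entries suffice: we submit one write at a time. */
    return io_uring_queue_init(8, &ring, 0);
}

/* Submit one write and block until it completes. */
static ssize_t aof_uring_write(int fd, const char *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) return -1;
    io_uring_prep_write(sqe, fd, buf, len, -1); /* -1: use current file offset */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) < 0) return -1;
    ssize_t written = cqe->res;        /* bytes written, or -errno on failure */
    io_uring_cqe_seen(&ring, cqe);     /* mark the completion as consumed */
    return written;
}

int main(void) {
    if (aof_uring_init() < 0) return 1;
    int fd = open("appendonly.aof", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;
    const char *cmd = "*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n";
    printf("wrote %zd bytes\n", aof_uring_write(fd, cmd, strlen(cmd)));
    close(fd);
    return 0;
}
```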

Test Environment
OPERATING SYSTEM: Ubuntu
Kernel: 6.5.0
DISK: SATA SSD
PROCESSOR: Intel(R) Xeon(R) Gold 6152 CPU (88 threads total, 2 sockets, 22 cores per socket, 2 threads per core)
NUMA info of the processor:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,...,86
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,...,87
Base: #741
Server and valkey-benchmark run on the same socket.

Server config
port 9876
bind 127.0.0.1
appendonly yes
appendfsync always
no-appendfsync-on-rewrite no
aof-use-rdb-preamble no
daemonize no
protected-mode no
databases 16
latency-monitor-threshold 1
repl-diskless-sync-delay 0
save
io-uring-enabled yes

Test step

  1. Start server with taskset -c 12,14,16,18 src/valkey-server valkey.conf
  2. Start benchmark using single thread: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q
  3. Start benchmark using multiple threads: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4
For both the single-thread and multi-thread runs, I tested each case 3 times. The average performance is summarized in the following table:
Mode Baseline Optimized Performance Improvement
Single Thread 48847.2 63130.57 29.24%
Multiple Threads 59992.36 72723.67 21.22%

Wenwen-Chen avatar Jul 05 '24 07:07 Wenwen-Chen

Codecov Report

Attention: Patch coverage is 19.04762% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.34%. Comparing base (b728e41) to head (c87f7de).

Files Patch % Lines
src/io_uring.c 0.00% 11 Missing :warning:
src/server.c 20.00% 4 Missing :warning:
src/aof.c 60.00% 2 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #750      +/-   ##
============================================
- Coverage     70.40%   70.34%   -0.06%     
============================================
  Files           112      113       +1     
  Lines         61467    61487      +20     
============================================
- Hits          43275    43253      -22     
- Misses        18192    18234      +42     
Files Coverage Δ
src/config.c 78.69% <ø> (ø)
src/server.h 100.00% <ø> (ø)
src/aof.c 79.97% <60.00%> (-0.17%) :arrow_down:
src/server.c 88.45% <20.00%> (-0.11%) :arrow_down:
src/io_uring.c 0.00% <0.00%> (ø)

... and 10 files with indirect coverage changes

codecov[bot] avatar Jul 05 '24 07:07 codecov[bot]

Hello. Are you working with @lipzhu? If we do the write with io_uring, we could also do the fsync in the same ring without an extra syscall?

No, I am not working with @lipzhu, but I have been following #599 for a long time.
In my opinion, the AOF write and fsync can share the same io_uring instance in a time-multiplexed manner.
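If it helps the discussion, here is a hedged sketch (assuming liburing; the function name and structure are illustrative, not taken from this patch) of chaining the AOF write and an fdatasync-equivalent in one submission with IOSQE_IO_LINK, so both operations are issued with a single io_uring_submit call:

```c
/* Sketch only: chain write + fdatasync in a single io_uring submission.
 * IOSQE_IO_LINK makes the fsync run only after the write succeeds. */
#include <liburing.h>

static int aof_write_and_fsync(struct io_uring *ring, int fd,
                               const void *buf, unsigned len) {
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);       /* assumes the SQ has room */
    io_uring_prep_write(sqe, fd, buf, len, -1);
    sqe->flags |= IOSQE_IO_LINK;        /* link to the next SQE */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC); /* fdatasync-like */

    io_uring_submit(ring);              /* one syscall submits both ops */

    /* Reap both completions; cqe->res < 0 is a -errno failure. */
    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0) return -1;
        int res = cqe->res;
        io_uring_cqe_seen(ring, cqe);
        if (res < 0) return res;
    }
    return 0;
}
```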

29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing fewer syscalls? Without io_uring we do the write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?

Yes, the performance improvement comes from io_uring issuing fewer syscalls. I ran an extra experiment for the scenario without io_uring. To count how many times write is called in each aofWrite invocation, I added some logging to aofWrite and ran the same test case. I found that each aofWrite calls write only once, so I didn't replace write with writev.
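For context, a simplified sketch of such a short-write retry loop (illustrative only, not Valkey's actual aofWrite) shows why write and writev behave the same here: with these payload sizes the loop body runs exactly once, so the syscall count does not change.

```c
/* Simplified short-write retry loop (not Valkey's aofWrite). */
#include <unistd.h>
#include <errno.h>

static ssize_t write_fully(int fd, const char *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = write(fd, buf + total, len - total);
        if (n < 0) {
            if (errno == EINTR) continue;       /* retry on signal interruption */
            return total ? (ssize_t)total : -1; /* report what was written */
        }
        total += (size_t)n;                     /* short write: loop again */
    }
    return (ssize_t)total;
}
```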

Wenwen-Chen avatar Jul 09 '24 08:07 Wenwen-Chen

29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing fewer syscalls?

Without io_uring we do the write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?

Echoing @zuiderkwast, I am also curious why io_uring could boost performance in this kind of case. @Wenwen-Chen, do you mind taking a look at the before/after CPU utilization and IPC? To keep it simpler, let's also disable the rewrite process while AOF is enabled.

zhulipeng avatar Jul 10 '24 00:07 zhulipeng

@Wenwen-Chen, do you mind taking a look at the before/after CPU utilization and IPC?

OK, I will do these tests ASAP.

To keep it simpler, let's also disable the rewrite process while AOF is enabled.

I am sorry, I don't know how to disable the rewrite process. Should I set some config item, or change the source code?

Wenwen-Chen avatar Jul 10 '24 05:07 Wenwen-Chen

To keep it simpler, let's also disable the rewrite process while AOF is enabled.

I am sorry, I don't know how to disable the rewrite process. Should I set some config item, or change the source code?

Through config auto-aof-rewrite-min-size 64gb.

zhulipeng avatar Jul 10 '24 07:07 zhulipeng

Through config auto-aof-rewrite-min-size 64gb.

Thank you very much.

I did some extra experiments

  • Persisting the AOF file with io_uring brings a bit of CPU overhead compared with the traditional write system call.
  • Why does performance improve when using io_uring? It is related to the rewrite feature, but I don't know the root cause. The detailed test results are shown below.

1. Performance comparison
I compared the performance with the rewrite feature enabled and disabled. Test command: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4
It shows that io_uring enabled + rewrite enabled gives the best performance.

Rewrite Baseline (use write SYSCALL) Optimized (use io_uring) Performance Improvement
Disable 61722.51 60336.46 -2.25%
Enable 59576.85 72835.51 22.25%

2. CPU utilization comparison
perf stat -p 'pid of valkey-server' sleep 10

  • Disable Rewrite
Index Baseline (use write SYSCALL) Optimized (use io_uring) Optimized/Baseline - 1
Cycles 21,496,813,799 22,242,600,805 3.47%
Instructions 21,470,082,059 24,364,695,157 13.48%
Insn Per Cycle 1 1.1 -17.29%
CPU utilized 0.653 0.683 4.59%
  • Enable Rewrite
Index Baseline (use write SYSCALL) Optimized (use io_uring) Optimized/Baseline - 1
Cycles 24,055,924,761 27,149,142,327 12.86%
Instructions 23,769,267,973 30,960,818,308 30.26%
Insn Per Cycle 0.99 1.14 15.42%
CPU utilized 0.732 0.859 17.35%

Wenwen-Chen avatar Jul 10 '24 09:07 Wenwen-Chen

With io_uring, the kernel can use kernel threads? Maybe that's why it's faster but uses more CPU?

Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?

With higher throughput, we can handle more traffic. It's OK to use more CPU for more traffic, but for the same traffic I hope we don't use very much more CPU.

zuiderkwast avatar Jul 10 '24 11:07 zuiderkwast

With io_uring, the kernel can use kernel threads?

It depends on the I/O traffic. If the application's I/O traffic is low, the kernel will not use kernel threads. Otherwise, the kernel obtains I/O worker threads (kernel threads) from io_uring's worker pool to process the I/O.
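If it helps to isolate the kernel-thread effect, liburing provides io_uring_register_iowq_max_workers() (liburing >= 2.1, kernel >= 5.15) to query or cap the io-wq worker pool for a ring; a hedged sketch with illustrative limits:

```c
/* Sketch: cap the io-wq kernel worker threads for a ring, so their CPU
 * contribution can be isolated in before/after measurements. */
#include <liburing.h>
#include <stdio.h>

static int limit_iowq_workers(struct io_uring *ring) {
    /* values[0] = bounded workers (regular file I/O),
     * values[1] = unbounded workers; 0 means "leave unchanged". */
    unsigned int values[2] = {1, 0};
    int ret = io_uring_register_iowq_max_workers(ring, values);
    if (ret < 0) return ret;
    /* On return, the array holds the previous limits. */
    printf("previous limits: bounded=%u unbounded=%u\n", values[0], values[1]);
    return 0;
}
```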

Maybe that's why it's faster but uses more CPU?

Yes.
However, why does enabling io_uring + enabling rewrite give the best performance? I don't know the root cause.
I will dig into Valkey's rewrite feature to figure it out. I would be very grateful if someone could suggest a way to analyze the problem.

Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?

For a fixed duration (10 s). They were measured with the command perf stat -p 'pid of valkey-server' sleep 10.

Wenwen-Chen avatar Jul 11 '24 07:07 Wenwen-Chen

I am also curious why io_uring could boost performance in this kind of case. @Wenwen-Chen, do you mind taking a look at the before/after CPU utilization and IPC? To keep it simpler, let's also disable the rewrite process while AOF is enabled.

@lipzhu I have posted the test results. Do you have any comments on them? I am wondering why enabling io_uring + enabling rewrite gives the best performance. According to my understanding, the rewrite feature forks a child process which stores the latest KV data to disk. Is there anything special about the rewrite feature that correlates strongly with io_uring?

Wenwen-Chen avatar Jul 19 '24 09:07 Wenwen-Chen

Hi @Wenwen-Chen, sorry for the late response. Per my understanding, io_uring should not bring a benefit for this kind of case; your disable-rewrite result has proven that.

I am wondering why enabling io_uring + enabling rewrite gives the best performance.

This is also my question. Can you help do some analysis of why enabling rewrite could help performance and what the cost is, or just assign a single CPU to the server?

zhulipeng avatar Jul 31 '24 01:07 zhulipeng

@lipzhu

This is also my question. Can you help do some analysis of why enabling rewrite could help performance and what the cost is, or just assign a single CPU to the server?

  1. Function flushAppendOnlyFile() persists aof_buf to disk using either the write system call or aofWriteByIOUring (the io_uring wrapper introduced in this patch). I analyzed the execution time of flushAppendOnlyFile() under 3 scenarios using the uftrace tool https://github.com/namhyung/uftrace:
  • Optimized: enable io_uring + enable rewrite,
  • Baseline 1: disable io_uring + enable rewrite,
  • Baseline 2: enable io_uring + disable rewrite.
  2. The optimized scenario gets the best performance because it spends the least time in flushAppendOnlyFile(). The execution time of flushAppendOnlyFile() mainly comes from fdatasync(). The optimized scenario reduces the execution time of fdatasync() by 50.3%/44.5% compared with Baseline 1 and Baseline 2.
Type io-uring-enabled Rewrite enabled Performance (Qps) Time of flushAppendOnlyFile (s) Time of fdatasync (s)
Optimized Yes Yes 38,963.27 15.113 10.775
Baseline 1 No Yes 33,057.41 22.759 21.66
Baseline 2 Yes No 32,701.11 25.82 19.43
    Optimized vs Baseline 1 17.9% -33.6% -50.3%
    Optimized vs Baseline 2 19.1% -41.5% -44.5%
  3. Test steps
  • Start Server: uftrace record -F flushAppendOnlyFile -F fdatasync src/valkey-server valkey.conf
  • Start valkey-benchmark: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 2500000 -q --threads 4
  • Analyze execution time: uftrace graph flushAppendOnlyFile.

Wenwen-Chen avatar Aug 09 '24 08:08 Wenwen-Chen

Hi @Wenwen-Chen, sorry for the late response. Per my understanding, io_uring should not bring a benefit for this kind of case; your disable-rewrite result has proven that.

io_uring doesn't bring a performance improvement in the 'disable rewrite' scenario compared with the write syscall (io_uring: 60336.46 vs write: 61722.51, https://github.com/valkey-io/valkey/pull/750#issuecomment-2220008106).
However, we should focus on the 'enable rewrite' scenario instead of 'disable rewrite'. Users usually enable the rewrite feature in production when they turn on AOF. io_uring improves performance significantly in the 'enable rewrite' scenario (io_uring: 72835.51 vs write: 59576.85, https://github.com/valkey-io/valkey/pull/750#issuecomment-2220008106). According to https://github.com/valkey-io/valkey/pull/750#issuecomment-2277412963, Valkey gets the best performance because io_uring reduces the execution time of flushAppendOnlyFile.

Wenwen-Chen avatar Aug 20 '24 02:08 Wenwen-Chen

@Wenwen-Chen

I'm no expert on io_uring internals or fsync either. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?

https://github.com/valkey-io/valkey/blob/e30ae762a8ec7f531005fab90edd275dfa98f72f/valkey.conf#L2374C1-L2388C25

# server-cpulist 0-7:2
# bio-cpulist 1,3
# aof-rewrite-cpulist 8-11
# bgsave-cpulist 1,10-11
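For example (illustrative values only, derived from the core layout described earlier in this thread: server on NUMA node0 cores 12,14,16,18 and benchmark on 20,22,24,26), the background threads and the rewrite child could be pinned to other even-numbered physical cores on the same socket:

server-cpulist 12,14,16,18
bio-cpulist 28,30
aof-rewrite-cpulist 32,34,36,38
bgsave-cpulist 40,42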

egbaydarov avatar Oct 15 '24 13:10 egbaydarov

@Wenwen-Chen do you plan to work on this?

xbasel avatar Dec 17 '24 13:12 xbasel

@Wenwen-Chen do you plan to work on this?

Hi @xbasel, I really want to move this patch forward, but I am not a Valkey expert. I have not found the root cause of why enabling io_uring + enabling rewrite reduces the time spent in fdatasync. Do you have any suggestions?

Wenwen-Chen avatar Dec 25 '24 13:12 Wenwen-Chen

@Wenwen-Chen

I'm no expert on io_uring internals or fsync either. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?

Hi @egbaydarov, thank you very much for your suggestion. It got the same boost with correctly configured affinity.

Wenwen-Chen avatar Dec 26 '24 01:12 Wenwen-Chen