Persist AOF file by io_uring
Description
Persisting write commands to the AOF file is one of the ways Valkey ensures durability. When the user turns on AOF and sets appendfsync to always, the speed of writing data to disk is critical, because the write is synchronous and the Valkey server does not respond to other client requests until it completes.
io_uring is a powerful asynchronous I/O API for Linux. This patch improves Valkey's performance by replacing the traditional write interface with io_uring when persisting the AOF file to disk.
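For context, here is a minimal sketch of what a liburing-based AOF write path can look like. The names (aofRingInit, aofRingWrite, aof_ring) are illustrative assumptions, not the patch's actual aofWriteByIOUring implementation.

```c
/*
 * Minimal sketch (assumptions, not the actual patch) of writing an AOF
 * buffer through io_uring with liburing instead of a write(2) loop.
 */
#include <errno.h>
#include <liburing.h>
#include <unistd.h>

static struct io_uring aof_ring; /* assumed to be set up once at startup */

int aofRingInit(void) {
    /* A small queue depth is enough: the AOF flush submits one write at a time. */
    return io_uring_queue_init(8, &aof_ring, 0);
}

ssize_t aofRingWrite(int fd, const char *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&aof_ring);
    if (!sqe) return -1;

    /* Offset -1: use and advance the current file position, like write(2).
     * The AOF fd is opened with O_APPEND anyway, so writes go to the end. */
    io_uring_prep_write(sqe, fd, buf, len, (__u64)-1);
    if (io_uring_submit(&aof_ring) < 0) return -1;

    /* appendfsync always is synchronous, so block until the completion arrives. */
    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&aof_ring, &cqe) < 0) return -1;

    ssize_t nwritten = cqe->res;
    if (nwritten < 0) errno = -cqe->res; /* CQE carries -errno on failure */
    io_uring_cqe_seen(&aof_ring, cqe);
    return nwritten < 0 ? -1 : nwritten;
}
```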
We tested the performance with the valkey-benchmark tool. The patch improves performance by 29.24%.
Baseline: 48,847.20 Qps -> Optimized: 63,130.57 Qps
Test environment:
- Operating system: Ubuntu
- Kernel: 6.5.0
- Disk: SATA SSD
- Processor: Intel(R) Xeon(R) Gold 6152 CPU (88 threads total: 2 sockets, 22 cores per socket, 2 threads per core)
- NUMA node(s): 2
  - NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,...,86
  - NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,...,87
- Base: #741
- Server and valkey-benchmark on the same socket
Server config:
```
port 9876
bind 127.0.0.1
appendonly yes
appendfsync always
no-appendfsync-on-rewrite no
aof-use-rdb-preamble no
daemonize no
protected-mode no
databases 16
latency-monitor-threshold 1
repl-diskless-sync-delay 0
save
io-uring-enalbed yes
```
Test steps
- Start the server: taskset -c 12,14,16,18 src/valkey-server valkey.conf
- Start benchmark using single thread: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q
- Start benchmark using multiple threads: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4

For both single-thread and multi-thread runs, I tested each case 3 times. The average performance is summarized in the following table:
| Mode | Baseline (QPS) | Optimized (QPS) | Performance Improvement |
|---|---|---|---|
| Single Thread | 48847.2 | 63130.57 | 29.24% |
| Multiple Threads | 59992.36 | 72723.67 | 21.22% |
Codecov Report
Attention: Patch coverage is 19.04762% with 17 lines in your changes missing coverage. Please review.
Project coverage is 70.34%. Comparing base (b728e41) to head (c87f7de).
| Files | Patch % | Lines |
|---|---|---|
| src/io_uring.c | 0.00% | 11 Missing :warning: |
| src/server.c | 20.00% | 4 Missing :warning: |
| src/aof.c | 60.00% | 2 Missing :warning: |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           unstable     #750    +/-  ##
============================================
- Coverage     70.40%   70.34%   -0.06%
============================================
  Files           112      113       +1
  Lines         61467    61487      +20
============================================
- Hits          43275    43253      -22
- Misses        18192    18234      +42
```
| Files | Coverage Δ | |
|---|---|---|
| src/config.c | 78.69% <ø> (ø) | |
| src/server.h | 100.00% <ø> (ø) | |
| src/aof.c | 79.97% <60.00%> (-0.17%) | :arrow_down: |
| src/server.c | 88.45% <20.00%> (-0.11%) | :arrow_down: |
| src/io_uring.c | 0.00% <0.00%> (ø) | |
Hello. Are you working with @lipzhu? If we do the write with io_uring, we could also do the fsync in the same ring without an extra syscall?
No, I am not working with @lipzhu, but I have been following #599 for a long time.
In my opinion, the AOF write and fsync can share the same io_uring instance, multiplexed over time.
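For illustration, with liburing the write and the fsync could even be submitted as a linked pair (IOSQE_IO_LINK), so both complete behind a single io_uring_enter. A hedged sketch, not part of this patch; the function name and ring handling are assumptions.

```c
/*
 * Sketch (not part of this patch) of chaining the AOF write and its fsync in
 * one io_uring submission with IOSQE_IO_LINK, so no separate fdatasync(2)
 * syscall is needed. Assumes a ring with queue depth >= 2.
 */
#include <liburing.h>
#include <unistd.h>

int aofWriteAndFsyncLinked(struct io_uring *ring, int fd, const char *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, len, (__u64)-1);
    sqe->flags |= IOSQE_IO_LINK; /* the fsync below runs only if the write succeeds */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC); /* fdatasync semantics */

    if (io_uring_submit(ring) < 0) return -1;

    /* Reap both completions; a failed or short write cancels the linked fsync. */
    int err = 0;
    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0) return -1;
        if (cqe->res < 0) err = cqe->res;
        io_uring_cqe_seen(ring, cqe);
    }
    return err < 0 ? -1 : 0;
}
```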
29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing fewer syscalls?

Without io_uring we do `write` in a while loop. I wonder if the same improved performance could be achieved with `writev` instead of the loop. Have you tried that?
Yes, the performance improvement comes from io_uring issuing fewer syscalls. I ran an extra experiment for the no-io_uring scenario. To count the number of write calls issued by each aofWrite invocation, I added some logging to aofWrite and ran the same test case. I found that each aofWrite calls write only once. Therefore, I didn't replace write with writev.
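For reference, the counting described above can be done with a counter around the write(2) call in an aofWrite-style retry loop. A self-contained sketch (illustrative only; the actual experiment used Valkey's own logger rather than fprintf):

```c
/* Illustrative sketch of the counting experiment: wrap the write(2) retry
 * loop (as aofWrite does) and log how many write calls one flush issues.
 * In Valkey itself the log line would go through serverLog(). */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

ssize_t countedAofWrite(int fd, const char *buf, size_t len) {
    ssize_t nwritten = 0, totwritten = 0;
    int write_calls = 0;

    while (len) {
        nwritten = write(fd, buf, len);
        write_calls++;

        if (nwritten < 0) {
            if (errno == EINTR) continue;
            break;
        }
        len -= nwritten;
        buf += nwritten;
        totwritten += nwritten;
    }

    fprintf(stderr, "aofWrite-style flush issued %d write(2) call(s)\n", write_calls);
    return totwritten ? totwritten : nwritten;
}
```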
> 29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing fewer syscalls?
>
> Without io_uring we do `write` in a while loop. I wonder if the same improved performance could be achieved with `writev` instead of the loop. Have you tried that?
Echoing @zuiderkwast, I am also curious why io_uring could help performance in this kind of case. @Wenwen-Chen, do you mind taking a look at the before/after CPU utilization and IPC? To keep it simpler, let's also disable the rewrite process when AOF is enabled.
> @Wenwen-Chen, do you mind taking a look at the before/after CPU utilization and IPC?
OK, I will do these tests ASAP.
> To keep it simpler, let's also disable the rewrite process when AOF is enabled.

I am sorry, I don't know how to disable the rewrite process. Is it a config item to set, or a source-code change?
> To keep it simpler, let's also disable the rewrite process when AOF is enabled.
>
> I am sorry, I don't know how to disable the rewrite process. Is it a config item to set, or a source-code change?
Through config auto-aof-rewrite-min-size 64gb.
> Through config auto-aof-rewrite-min-size 64gb.
Thank you very much.
I did some extra experiments:
- Persisting the AOF file with io_uring adds a bit of CPU overhead compared with the traditional write system call.
- Why does performance improve with io_uring? It is related to the rewrite feature, but I don't know the root cause. The detailed test results are shown below.
1. Performance comparison. I compared the performance with the rewrite feature enabled and disabled. Test command: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4. Enabling io_uring together with rewrite gave the best performance.
| Rewrite | Baseline (use write SYSCALL) | Optimized (use io_uring) | Performance Improvement |
|---|---|---|---|
| Disable | 61722.51 | 60336.46 | -2.25% |
| Enable | 59576.85 | 72835.51 | 22.25% |
2. CPU utilization comparison. Measured with perf stat -p 'pid of valkey-server' sleep 10.
- Disable Rewrite
| Metric | Baseline (write syscall) | Optimized (io_uring) | Optimized / Baseline - 1 |
|---|---|---|---|
| Cycles | 21,496,813,799 | 22,242,600,805 | 3.47% |
| Instructions | 21,470,082,059 | 24,364,695,157 | 13.48% |
| Insn Per Cycle | 1 | 1.1 | -17.29% |
| CPU utilized | 0.653 | 0.683 | 4.59% |
- Enable Rewrite
| Metric | Baseline (write syscall) | Optimized (io_uring) | Optimized / Baseline - 1 |
|---|---|---|---|
| Cycles | 24,055,924,761 | 27,149,142,327 | 12.86% |
| Instructions | 23,769,267,973 | 30,960,818,308 | 30.26% |
| Insn Per Cycle | 0.99 | 1.14 | 15.42% |
| CPU utilized | 0.732 | 0.859 | 17.35% |
With io_uring, the kernel can use kernel threads? Maybe that's why it's faster but uses more CPU?
Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?
With higher throughput, we can handle more traffic. It's OK to use more CPU for more traffic, but for the same traffic I hope we don't use very much more CPU.
> With io_uring, the kernel can use kernel threads?
It depends on the I/O traffic. If the application's I/O traffic is low, the kernel will not use kernel threads. Otherwise, the kernel obtains io-wq worker threads (kernel threads) from io_uring's worker pool to process the I/O.
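As a side note, if the extra CPU from these io-wq kernel workers is a concern, a recent liburing (kernel 5.15+) can cap the worker pool. A hedged sketch, not part of this patch; the function name and chosen limits are assumptions:

```c
/*
 * Hedged sketch: capping io_uring's io-wq worker pool so kernel worker
 * threads cannot grow unbounded. Needs a recent liburing and kernel >= 5.15
 * (IORING_REGISTER_IOWQ_MAX_WORKERS).
 */
#include <liburing.h>
#include <stdio.h>

int capAofRingWorkers(struct io_uring *ring) {
    /* values[0]: max "bounded" workers (regular file I/O, e.g. the AOF fd),
     * values[1]: max "unbounded" workers. A value of 0 leaves that limit
     * unchanged; on success the array is filled with the previous limits. */
    unsigned int values[2] = {2, 0};
    int ret = io_uring_register_iowq_max_workers(ring, values);
    if (ret < 0) {
        fprintf(stderr, "io_uring_register_iowq_max_workers failed: %d\n", ret);
        return ret;
    }
    return 0;
}
```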
> Maybe that's why it's faster but uses more CPU?
Yes.
However, why does enabling io_uring together with rewrite give the best performance? I don't know the root cause.
I will dig deeper into Valkey's rewrite feature and try to resolve this.
I would be very grateful if someone could provide a method to resolve the problem.
> Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?
For a fixed duration (10 s). They were measured with perf stat -p 'pid of valkey-server' sleep 10.
> I am also curious why io_uring could help performance in this kind of case. @Wenwen-Chen, do you mind taking a look at the before/after CPU utilization and IPC? To keep it simpler, let's also disable the rewrite process when AOF is enabled.
@lipzhu I have posted the test results. Do you have any comments on them? I am wondering why enabling io_uring together with rewrite gives the best performance. According to my understanding, the rewrite feature forks a child process that stores the latest KV data to disk. Is there anything special about the rewrite feature that correlates strongly with io_uring?
Hi @Wenwen-Chen, sorry for the late response. Per my understanding, io_uring should not bring a benefit in this kind of case; your disable-rewrite result has proven that.
> I am wondering why enabling io_uring together with rewrite gives the best performance.
This is also my question. Can you help do some analysis of why enabling rewrite could help performance, and what the cost is? Or just assign a single CPU to the server?
@lipzhu
> This is also my question. Can you help do some analysis of why enabling rewrite could help performance, and what the cost is? Or just assign a single CPU to the server?
- The function flushAppendOnlyFile() persists aof_buf to disk using either the write system call or aofWriteByIOUring (the io_uring wrapper added in this patch). I analyzed the execution time of flushAppendOnlyFile under 3 scenarios with the uftrace tool (https://github.com/namhyung/uftrace):
  - Optimized: io_uring enabled + rewrite enabled
  - Baseline 1: io_uring disabled + rewrite enabled
  - Baseline 2: io_uring enabled + rewrite disabled
- The optimized scenario gives the best performance because it spends the least time in flushAppendOnlyFile(). The execution time of flushAppendOnlyFile() mainly comes from fdatasync(). The optimized scenario reduces fdatasync() execution time by 50.3%/44.5% compared with Baseline 1 and Baseline 2, respectively.
| Type | io-uring-enalbed | Rewrite enabled | Performance (QPS) | Time in flushAppendOnlyFile (s) | Time in fdatasync (s) |
|---|---|---|---|---|---|
| Optimized | Yes | Yes | 38,963.27 | 15.113 | 10.775 |
| Baseline 1 | No | Yes | 33,057.41 | 22.759 | 21.66 |
| Baseline 2 | Yes | No | 32,701.11 | 25.82 | 19.43 |
| Optimized vs Baseline 1 | | | 17.9% | -33.6% | -50.3% |
| Optimized vs Baseline 2 | | | 19.1% | -41.5% | -44.5% |
- Test steps
- Start Server: uftrace record -F flushAppendOnlyFile -F fdatasync src/valkey-server valkey.conf
- Start valkey-benchmark: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 2500000 -q --threads 4
- Analyze the execution time: uftrace graph flushAppendOnlyFile
> Hi @Wenwen-Chen, sorry for the late response. Per my understanding, io_uring should not bring a benefit in this kind of case; your disable-rewrite result has proven that.
io_uring doesn't bring a performance improvement in the 'disable rewrite' scenario compared with the write syscall (io_uring: 60336.46 vs write: 61722.51, https://github.com/valkey-io/valkey/pull/750#issuecomment-2220008106).
However, we should focus on the 'enable rewrite' scenario instead of 'disable rewrite'. Users usually enable the rewrite feature in production when they turn on AOF.
io_uring improves performance significantly in the 'enable rewrite' scenario (io_uring: 72835.51 vs write: 59576.85, https://github.com/valkey-io/valkey/pull/750#issuecomment-2220008106). According to https://github.com/valkey-io/valkey/pull/750#issuecomment-2277412963, Valkey gets the best performance because io_uring reduces the execution time of flushAppendOnlyFile.
@Wenwen-Chen
I'm completely mediocre on the io_uring internals and fsync as well. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?
https://github.com/valkey-io/valkey/blob/e30ae762a8ec7f531005fab90edd275dfa98f72f/valkey.conf#L2374C1-L2388C25
```
# server-cpulist 0-7:2
# bio-cpulist 1,3
# aof-rewrite-cpulist 8-11
# bgsave-cpulist 1,10-11
```
@Wenwen-Chen do you plan to work on this?
> @Wenwen-Chen do you plan to work on this?
Hi @xbasel, I really want to push this patch forward. However, I am not an expert on Valkey, and I have not found the root cause of why io_uring enabled + rewrite enabled reduces the time spent in fdatasync. Do you have any suggestions?
> @Wenwen-Chen
>
> I'm completely mediocre on the io_uring internals and fsync as well. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?
Hi @egbaydarov, thank you very much for your suggestion. It got the same boost with correctly configured affinity.