Use io_uring to batch handle clients' pending writes to reduce syscall count.
Description
This patch tries to leverage the io_uring batching feature to reduce the syscall count for Valkey in handleClientsWithPendingWrites.
With this patch, we can observe more than 6% perf gain for SET/GET.
This patch was implemented based on below discussion during the review:
- Introduce an io_uring.h to handle the io_uring related API and split it from the server logic.
- Make io_uring.h independent of server.h.
- Only use io_uring to gain performance when writing the client static buffer (see the sketch below).
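To make the batching idea concrete, here is a minimal sketch, not the patch's exact code: the pending_client struct, write_pending_clients(), and the ring setup are hypothetical stand-ins for Valkey's list of clients with pending writes, and the ring depth is assumed to be at least nclients.

```c
#include <liburing.h>
#include <sys/types.h>

struct pending_client {
    int fd;
    const char *buf;  /* client static output buffer */
    size_t len;
    ssize_t nwritten; /* filled in from the completion */
};

static struct io_uring ring; /* initialized once with io_uring_queue_init() */

void write_pending_clients(struct pending_client *clients, int nclients) {
    /* Queue one send per client; no syscall happens in this loop. */
    for (int i = 0; i < nclients; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_send(sqe, clients[i].fd, clients[i].buf, clients[i].len, 0);
        io_uring_sqe_set_data(sqe, &clients[i]);
    }

    /* A single io_uring_enter() submits everything and waits for the
     * completions, instead of one write()/send() syscall per client. */
    io_uring_submit_and_wait(&ring, nclients);

    /* Reap the completions and record how much was written per client. */
    struct io_uring_cqe *cqe;
    unsigned head, seen = 0;
    io_uring_for_each_cqe(&ring, head, cqe) {
        struct pending_client *c = io_uring_cqe_get_data(cqe);
        c->nwritten = cqe->res; /* bytes written, or -errno on error */
        seen++;
    }
    io_uring_cq_advance(&ring, seen);
}
```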
Benchmark Result
Test Env
- OPERATING SYSTEM: Ubuntu 22.04.4 LTS
- Kernel: 5.15.0-116-generic
- PROCESSOR: Intel Xeon Platinum 8380
- Base: 5b9fc465239036b378a39cbebf3706a2db74dd32
- Server and client run on the same CPU socket.
Test Steps
- Start valkey-server with the config below.
taskset -c 0-3 ~/src/valkey-server /tmp/valkey_1.conf
port 9001
bind * -::*
daemonize yes
protected-mode no
save ""
- Start valkey-benchmark to ensure valkey-server CPU utilization is 1 (fully utilized).
taskset -c 16-19 ~/src/valkey-benchmark -p 9001 -t set -d 100 -r 1000000 -n 5000000 -c 50 --threads 4
Test Result
QPS of SET and GET increases by 6.5% and 6.6% respectively.
Perf Stat
The perf stat output shows that only 1 CPU was used during the test and IPC also increased by 6%, so the gain does not come from more CPU resources.
perf stat -p `pidof valkey-server` sleep 10
# w/o io_uring
Performance counter stats for process id '2267781':
9,993.95 msec task-clock # 0.999 CPUs utilized
625 context-switches # 62.538 /sec
0 cpu-migrations # 0.000 /sec
94,933 page-faults # 9.499 K/sec
33,894,880,825 cycles # 3.392 GHz
39,284,579,699 instructions # 1.16 insn per cycle
7,750,350,988 branches # 775.504 M/sec
73,791,242 branch-misses # 0.95% of all branches
169,474,584,465 slots # 16.958 G/sec
39,212,071,735 topdown-retiring # 23.1% retiring
11,962,902,869 topdown-bad-spec # 7.1% bad speculation
43,199,367,984 topdown-fe-bound # 25.5% frontend bound
75,159,711,305 topdown-be-bound # 44.3% backend bound
10.001262795 seconds time elapsed
# w/ io_uring
Performance counter stats for process id '2273716':
9,970.38 msec task-clock # 0.997 CPUs utilized
1,077 context-switches # 108.020 /sec
1 cpu-migrations # 0.100 /sec
124,080 page-faults # 12.445 K/sec
33,813,062,268 cycles # 3.391 GHz
41,455,816,158 instructions # 1.23 insn per cycle
8,063,017,730 branches # 808.697 M/sec
68,008,453 branch-misses # 0.84% of all branches
169,066,451,360 slots # 16.957 G/sec
38,077,547,648 topdown-retiring # 22.0% retiring
28,509,121,765 topdown-bad-spec # 16.5% bad speculation
41,083,738,441 topdown-fe-bound # 23.8% frontend bound
65,062,545,805 topdown-be-bound # 37.7% backend bound
10.001785198 seconds time elapsed
NOTE
- Since io_uring has only been available since kernel 5.1, if the kernel doesn't support io_uring, the original implementation is used (see the sketch below).
- This patch introduces a liburing dependency. It is installed in my local environment; to keep it simple, I didn't include the liburing dependency in this patch, so the CI build may fail.
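For reference, a minimal sketch of how that fallback could be probed at startup; the names here are illustrative assumptions, not the patch's actual API.

```c
/* Hypothetical startup probe: io_uring_queue_init() fails (e.g. -ENOSYS)
 * on kernels without io_uring, in which case the server keeps using the
 * original write() implementation. */
#include <liburing.h>
#include <stdbool.h>

#define IO_URING_QUEUE_DEPTH 256

static struct io_uring ring;
static bool io_uring_available = false;

void initIOUring(void) {
    if (io_uring_queue_init(IO_URING_QUEUE_DEPTH, &ring, 0) == 0) {
        io_uring_available = true;
    }
    /* On failure, io_uring_available stays false and callers fall back. */
}
```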
If you merge latest unstable, the spellcheck is fixed.
Can you add a check for <liburing.h> in Makefile, something like this:
HAS_LIBURING := $(shell sh -c 'echo "$(NUMBER_SIGN_CHAR)include <liburing.h>" > foo.c; \
$(CC) -E foo.c > /dev/null 2>&1 && echo yes; \
rm foo.c')
ifeq ($(HAS_LIBURING),yes)
...
else
...
endif
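On the C side, one plausible way to consume such a check is a compile-time guard, so builds without liburing always take the plain write() path; HAVE_LIBURING and the helper below are assumed names, not necessarily what the patch uses.

```c
/* Hypothetical compile-time guard driven by the Makefile check above. */
#ifdef HAVE_LIBURING
#include <liburing.h>
static int buildSupportsIOUring(void) { return 1; }
#else
static int buildSupportsIOUring(void) { return 0; } /* no liburing: always fall back */
#endif
```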
I have not taken a closer look at the PR, but at the minimum I think we need a config to opt in to io_uring.
I also suspect that the gain is likely coming from tapping into additional cores that couldn't be utilized efficiently by the current io-threads. If so, this would lead to two questions:
- On lower-spec machines with fewer cores, say 2 or 4, will we see similar improvements?
- Where do we see io_uring's place in light of the planned multi-threading improvements? This improvement would potentially allow Valkey to use cores beyond 4 or 8 a lot more efficiently, hence leaving not much room for io_uring?
Lastly, I haven't checked recently, but in the past io_uring has seen quite a number of vulnerabilities. So in addition to the design and PR review, we should also take a hard look at the security implications.
- On lower-spec machines with fewer cores, say 2 or 4, will we see similar improvements?
The number of cores will not affect the perf gain. You can refer to the server config I posted in the top comment: io-threads is disabled, the maximum CPU utilization is 1, and the benchmark clients make sure the server CPU is fully utilized. As I described in the top comment, the gain comes from the reduction in syscalls.
- Where do we see io_uring's place in light of the planned multi-threading improvements? This improvement would potentially allow Valkey to use cores beyond 4 or 8 a lot more efficiently, hence leaving not much room for io_uring?
Could you share more context about the planned multi-threading improvements from the community?
Lastly, I haven't checked recently, but in the past io_uring has seen quite a number of vulnerabilities. So in addition to the design and PR review, we should also take a hard look at the security implications.
I am not a security expert; can you give more details about your concern? Will this be a blocker for the community to adopt io_uring?
The number of cores will not affect the perf gain. You can refer to the server config I posted in the top comment: io-threads is disabled, the maximum CPU utilization is 1, and the benchmark clients make sure the server CPU is fully utilized. As I described in the top comment, the gain comes from the reduction in syscalls.
Start valkey-benchmark taskset -c 16-19 ~/src/valkey-benchmark -p 9001 -t set -d 100 -r 1000000 -n 5000000 -c 50 --threads 4 to ensure valkey-server CPU utilization is 1 (fully utilized).
io-uring comes with busy polling outside of the Valkey (io/main) threads. Does this CPU usage include that or just the CPU cycles accumulated by the Valkey threads? Going back to your original post, it seems to indicate this is just the Valkey CPU usage? I think a more deterministic setup would be using a 2/4 core machine.
Could you share more context about the planned multi-threading improvements from the community?
#22
I am not a security expert; can you give more details about your concern?
https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
Will this be a blocker for the community to adopt io_uring?
I would say this would be a serious concern for me. There are two possible outcomes
- the vulns are mostly in the kernel and there isn't much application developers (us) can do.
- the vulns can be mitigated by changes in the application, which is Valkey here
- would reduce the reach of this feature
- would add more work on the Valkey team
io-uring comes with busy polling outside of the Valkey (io/main) threads. Does this CPU usage include that or just the CPU cycles accumulated by the Valkey threads?
Actually, I didn't use the busy polling mode of io_uring in this patch; no io_uring background threads are started, and all the cycles generated by io_uring are attributed to the Valkey server. As I posted in the top comment, the perf gain comes from the batching feature described in https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023#batching
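To illustrate the polling point (this is not code from the patch), the difference is in the setup flags: the default setup does all submission and completion work on the calling thread, while busy polling would have to be requested explicitly.

```c
#include <liburing.h>

/* Default setup: no kernel polling thread is created, so every cycle spent
 * on io_uring work is attributed to the Valkey process in perf stat. */
void setup_without_polling(struct io_uring *ring) {
    io_uring_queue_init(256, ring, 0);
}

/* Busy polling (SQPOLL) would have to be opted into explicitly and starts a
 * kernel-side submission-queue polling thread; this patch does not use it. */
void setup_with_sqpoll(struct io_uring *ring) {
    struct io_uring_params params = {0};
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000; /* ms of idleness before the poller sleeps */
    io_uring_queue_init_params(256, ring, &params);
}
```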
Going back to your original post, it seems to indicate this is just the Valkey CPU usage? I think a more deterministic setup would be using a 2/4 core machine.
I can get a similar result with 2-4 CPUs allocated.
I would say this would be a serious concern for me. There are two possible outcomes
the vulns are mostly in the kernel and there isn't much application developers (us) can do.
the vulns can be mitigated by changes in the application, which is Valkey here
would reduce the reach of this feature
would add more work on the Valkey team
I just glanced at the vulns list: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring. It seems most of the fixes happened on the kernel side; there isn't much application developers can do. I think we can keep supporting io_uring, and users can disable it if they want.
I think this is really good stuff. Performance is one of the areas we should prioritize IMO.
I think this is really good stuff. Performance is one of the areas we should prioritize IMO.
Thanks @zuiderkwast. The question is: how can we push this patch forward?
@lipzhu We are a new team, new project, first release and we are still busy with rebranding from redis to valkey and getting a website up. I think you just need some patience to let other team members have some time to look and think about it.
I'll add it to the backlog for Valkey 8. It will not be forgotten.
Yeah this kind of changes requires the reviewers to block off some decent amount of time and to really think through it holistically. This week is really busy for the team as many of us are having in person engagement with the OSS community. We really appreciate your patience.
@lipzhu
Description
This patch tries to leverage the io_uring batching feature to reduce the syscall count for Valkey in handleClientsWithPendingWrites. With this patch, we can observe more than 4% perf gain for SET/GET, and we didn't see an obvious performance regression.
As far as I know, io_uring is a highly efficient IO engine. Do you have any plan to optimize Valkey's other modules using io_uring technology? For example, the ae framework or snapshot operations.
As far as I know, io_uring is a highly efficient IO engine. Do you have any plan to optimize Valkey's other modules using io_uring technology? For example, the ae framework or snapshot operations.
Sure, but at the beginning, when we decided to introduce io_uring, we wanted to find the scenarios where io_uring really helps with perf, and this patch is straightforward. Another scenario I came up with is disk-related operations; I opened https://github.com/valkey-io/valkey/issues/255 to work out the details. For the ae framework part, it needs a lot of work to replace epoll and the sync workflow, per my understanding. I had a POC before and could observe a performance gain, but the cost is more CPU resources allocated. So I want to integrate io_uring incrementally, and I also need help from the community; as you know, they are currently busy rebranding :)
Thanks @lipzhu!
I am generally aligned with the high level idea (and good to know that you don't use polling).
I do have some high-level feedback around the code structure, and I will list it here too:
- we should go with an opt-in approach and keep io-uring off by default
- I think we should avoid mixing sync read()/write() calls with io_uring. Let's explore a way to have a cleaner separation.
- the io_uring support seems incomplete - we are missing the support for scatter/gather IOs; also not sure about the rationale behind excluding the replication stream
BTW, I don't have all the details on #22 at the moment so there is a chance that we might have to revisit/rethink this PR, depending on the relative pace of the two. That said, let's continue collaborating on this PR, assuming we would like to incorporate io_uring in Valkey.
@lipzhu, looking at your results above, the number of read calls jumps out too. It would be great if you could apply io_uring to the query path as well.
@PingXie Thanks for your comments :)
@lipzhu, looking at your results above, the number of read calls jumps out too.
The counter data is based on the time duration (10s), and each query is paired with a readQueryFromClient, so I think the increase in the read syscall count is expected because the QPS increased too.
It would be great if you could apply io_uring to the query path as well.
@PingXie I have done this before, but some issues I found are:
- I didn't find a batch read scenario in the read query path. If io_uring_prep_read and the following io_uring_submit_and_wait are used simply to simulate read, the syscall count doesn't decrease, and io_uring_enter is more expensive than read (see the sketch below).
- I prefer a small PR; each PR should focus on only one thing.
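For reference, a sketch of the "simulate read()" experiment described above; uring_read_once() and its use of the ring are illustrative, not the actual experiment code. Preparing a single read and immediately waiting still costs one io_uring_enter() per query, so nothing is batched.

```c
#include <liburing.h>
#include <sys/types.h>

/* One prepared read followed by an immediate submit-and-wait: still one
 * syscall (io_uring_enter) per query, and heavier than a plain read(). */
ssize_t uring_read_once(struct io_uring *ring, int fd, void *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);

    io_uring_submit_and_wait(ring, 1);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe); /* the completion is already available here */
    ssize_t n = cqe->res;          /* bytes read, or -errno */
    io_uring_cqe_seen(ring, cqe);
    return n;
}
```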
Thanks @lipzhu!
I am generally aligned with the high level idea (and good to know that you don't use polling).
I do have some high-level feedback around the code structure, and I will list it here too:
- we should go with an opt-in approach and keep io-uring off by default
Ok, I will introduce a new config like io-uring (default off) in valkey.conf.
- I think we should avoid mixing sync read()/write() calls with io_uring. Let's explore a way to have a cleaner separation.
I will refactor the code to explore a cleaner separation.
- the io_uring support seems incomplete - we are missing the support for scatter/gather IOs; also not sure about the rationale behind excluding the replication stream
The reason I didn't handle scatter/gather IOs is that I remember I didn't observe a perf gain with io_uring there; I will double-check this later. For the replication stream, thanks for pointing it out, I will check it later.
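For completeness, a rough illustration (not from the patch) of what queueing a scatter/gather reply could look like with io_uring, mirroring the writev() path; the function name and cookie parameter are hypothetical.

```c
#include <liburing.h>
#include <sys/uio.h>

/* Illustrative only: queue one writev-equivalent SQE for a client whose
 * reply is spread across several buffers; it would be submitted together
 * with the other pending writes in the same batch. */
void queue_client_reply_iov(struct io_uring *ring, int fd,
                            const struct iovec *iov, unsigned iovcnt,
                            void *client_cookie) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_writev(sqe, fd, iov, iovcnt, 0);
    io_uring_sqe_set_data(sqe, client_cookie);
}
```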
BTW, I don't have all the details on #22 at the moment so there is a chance that we might have to revisit/rethink this PR, depending on the relative pace of the two. That said, let's continue collaborating on this PR, assuming we would like to incorporate io_uring in Valkey.
Sure, thanks for your guidance and patience.
