
Use io_uring to batch handle clients pending writes to reduce SYSCALL count.

Open lipzhu opened this issue 1 year ago • 31 comments

Description

This patch tries to leverage the io_uring batching feature to reduce the syscall count for valkey in handleClientsWithPendingWrites. With this patch, we can observe a more than 6% perf gain for SET/GET. This patch was implemented based on the discussion below during the review (a rough sketch of the batching idea follows the list):

  1. Introduce an io_uring.h to hold the io_uring related API and split it from the server logic.
  2. Make io_uring.h independent of server.h.
  3. Only use io_uring when writing the client static buffer, where it gives a performance gain.
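
For illustration only, here is a minimal sketch of the batching idea using liburing. It is not the code in this patch: the pending_write struct and the function names are hypothetical, and error handling is simplified.

#include <liburing.h>
#include <stdio.h>

/* Hypothetical stand-in for a client with data queued in its static buffer. */
struct pending_write {
    int fd;
    const char *buf;
    size_t len;
};

static struct io_uring ring;

/* One ring for the main thread; flags = 0 means no SQPOLL busy polling. */
int init_write_ring(void) {
    return io_uring_queue_init(1024, &ring, 0);
}

/* Queue one send per client, then issue a single io_uring_enter() for the
 * whole batch instead of one write() syscall per client. */
int flush_pending_writes(struct pending_write *w, unsigned n) {
    unsigned queued = 0;
    for (unsigned i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        if (!sqe) break; /* SQ full; a real implementation would submit and retry */
        io_uring_prep_send(sqe, w[i].fd, w[i].buf, w[i].len, 0);
        io_uring_sqe_set_data(sqe, &w[i]);
        queued++;
    }
    if (io_uring_submit_and_wait(&ring, queued) < 0) return -1;

    /* Reap one completion per queued send. */
    for (unsigned i = 0; i < queued; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
        struct pending_write *done = io_uring_cqe_get_data(cqe);
        if (cqe->res < 0)
            fprintf(stderr, "send to fd %d failed: %d\n", done->fd, cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }
    return 0;
}

The point of the sketch is only that each client with a pending reply contributes one SQE, so the per-iteration syscall count drops from one write() per client to roughly one io_uring_enter() per batch.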

Benchmark Result

Test Env

  • OPERATING SYSTEM: Ubuntu 22.04.4 LTS
  • Kernel: 5.15.0-116-generic
  • PROCESSOR: Intel Xeon Platinum 8380
  • Base: 5b9fc465239036b378a39cbebf3706a2db74dd32
  • Server and client run on the same CPU socket.

Test Steps

  1. Start valkey-server with the config below.
taskset -c 0-3 ~/src/valkey-server /tmp/valkey_1.conf

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""
  2. Start valkey-benchmark with the command below to ensure valkey-server CPU utilization is 1 (fully utilized).
taskset -c 16-19 ~/src/valkey-benchmark -p 9001 -t set -d 100 -r 1000000 -n 5000000 -c 50 --threads 4

Test Result

QPS of SET and GET increases by 6.5% and 6.6%, respectively.

Perf Stat

The perf stat output shows that only one CPU was used during the test and that IPC also increased by about 6%, so the gain does not come from using more CPU resources.

perf stat -p `pidof valkey-server` sleep 10

# w/o io_uring
 Performance counter stats for process id '2267781':

          9,993.95 msec task-clock                #    0.999 CPUs utilized
               625      context-switches          #   62.538 /sec
                 0      cpu-migrations            #    0.000 /sec
            94,933      page-faults               #    9.499 K/sec
    33,894,880,825      cycles                    #    3.392 GHz
    39,284,579,699      instructions              #    1.16  insn per cycle
     7,750,350,988      branches                  #  775.504 M/sec
        73,791,242      branch-misses             #    0.95% of all branches
   169,474,584,465      slots                     #   16.958 G/sec
    39,212,071,735      topdown-retiring          #     23.1% retiring
    11,962,902,869      topdown-bad-spec          #      7.1% bad speculation
    43,199,367,984      topdown-fe-bound          #     25.5% frontend bound
    75,159,711,305      topdown-be-bound          #     44.3% backend bound

      10.001262795 seconds time elapsed

# w/ io_uring
 Performance counter stats for process id '2273716':

          9,970.38 msec task-clock                #    0.997 CPUs utilized
             1,077      context-switches          #  108.020 /sec
                 1      cpu-migrations            #    0.100 /sec
           124,080      page-faults               #   12.445 K/sec
    33,813,062,268      cycles                    #    3.391 GHz
    41,455,816,158      instructions              #    1.23  insn per cycle
     8,063,017,730      branches                  #  808.697 M/sec
        68,008,453      branch-misses             #    0.84% of all branches
   169,066,451,360      slots                     #   16.957 G/sec
    38,077,547,648      topdown-retiring          #     22.0% retiring
    28,509,121,765      topdown-bad-spec          #     16.5% bad speculation
    41,083,738,441      topdown-fe-bound          #     23.8% frontend bound
    65,062,545,805      topdown-be-bound          #     37.7% backend bound

      10.001785198 seconds time elapsed

NOTE

  • io_uring was introduced in kernel 5.1; if the kernel doesn't support io_uring, the original implementation is used as a fallback (a possible detection sketch follows this note).
  • This patch introduces a dependency on liburing, which is installed in my local environment. To keep the patch simple, I didn't include the liburing dependency in this patch, so the CI build may fail.
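
For context, a possible shape of that fallback (hypothetical names, not the patch's actual code): if creating the ring fails, for example with ENOSYS on an old kernel, the server just keeps using the plain write() path.

#include <liburing.h>

static struct io_uring ring;
static int use_io_uring = 0; /* illustrative flag, not the patch's actual name */

void init_io_backend(void) {
    /* io_uring_queue_init() returns 0 on success and a negative errno on
     * failure, e.g. -ENOSYS when the kernel predates io_uring (< 5.1). */
    if (io_uring_queue_init(1024, &ring, 0) == 0)
        use_io_uring = 1;
    else
        use_io_uring = 0; /* fall back to the original write() implementation */
}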

lipzhu avatar Apr 01 '24 05:04 lipzhu

If you merge the latest unstable, the spellcheck issue is fixed.

Can you add a check for <liburing.h> in the Makefile, something like this:

HAS_LIBURING := $(shell sh -c 'echo "$(NUMBER_SIGN_CHAR)include <liburing.h>" > foo.c; \
	$(CC) -E foo.c > /dev/null 2>&1 && echo yes; \
	rm foo.c')
ifeq ($(HAS_LIBURING),yes)
	...
else
	...
endif

zuiderkwast avatar Apr 02 '24 19:04 zuiderkwast

I have not taken a closer look at the PR, but at a minimum I think we need a config to opt in to io_uring.

I also suspect that the gain likely comes from tapping into additional cores that couldn't be utilized efficiently by the current io-threads. If so, this leads to two questions:

  1. On lower-spec machines with fewer cores, say 2 or 4, will we see similar improvements?

  2. Where do we see io_uring's place in light of the planned multi-threading improvements? Those improvements would potentially allow Valkey to use cores beyond 4 or 8 a lot more efficiently, leaving not much room for io_uring?

Lastly, I haven't checked recently, but in the past io_uring has seen quite a number of vulnerabilities. So in addition to the design and PR review, we should also take a hard look at the security implications.

PingXie avatar Apr 15 '24 04:04 PingXie

  1. On lower-spec machines with fewer cores, say 2 or 4, will we see similar improvements?

The number of cores will not affect the perf gain; you can refer to the server config I posted in the top comment. io-threads is disabled, so the maximum CPU utilization is 1, and the benchmark clients make sure the server CPU is fully utilized. As I described in the top comment, the gain comes from the reduction in syscalls.

  2. Where do we see io_uring's place in light of the planned multi-threading improvements? Those improvements would potentially allow Valkey to use cores beyond 4 or 8 a lot more efficiently, leaving not much room for io_uring?

Could you share more context on the planned multi-threading improvements from the community?

Lastly, I haven't checked recently, but in the past io_uring has seen quite a number of vulnerabilities. So in addition to the design and PR review, we should also take a hard look at the security implications.

I am not a security expert; can you give more details about your concern? Will this be a blocker for the community to adopt io_uring?

lipzhu avatar Apr 15 '24 12:04 lipzhu

The number of cores will not affect the perf gain; you can refer to the server config I posted in the top comment. io-threads is disabled, so the maximum CPU utilization is 1, and the benchmark clients make sure the server CPU is fully utilized. As I described in the top comment, the gain comes from the reduction in syscalls.

Start valkey-benchmark (taskset -c 16-19 ~/src/valkey-benchmark -p 9001 -t set -d 100 -r 1000000 -n 5000000 -c 50 --threads 4) to ensure valkey-server CPU utilization is 1 (fully utilized).

io-uring comes with busy polling outside of the Valkey (io/main) threads. Does this CPU usage include that or just the CPU cycles accumulated by the Valkey threads? Going back to your original post, it seems to indicate this is just the Valkey CPU usage? I think a more deterministic setup would be using a 2/4 core machine.

Could you share more context on the planned multi-threading improvements from the community?

#22

I am not a security expert; can you give more details about your concern?

https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html

Will this be a blocker for the community to adopt io_uring?

I would say this would be a serious concern for me. There are two possible outcomes:

  1. the vulns are mostly in the kernel and there isn't much application developers (us) can do.
  2. the vulns can be mitigated by changes in the application, which is Valkey here.

Outcome 1 would reduce the reach of this feature; outcome 2 would add more work for the Valkey team.

PingXie avatar Apr 15 '24 21:04 PingXie

io-uring comes with busy polling outside of the Valkey (io/main) threads. Does this CPU usage include that or just the CPU cycles accumulated by the Valkey threads?

Actually, I didn't use the busy-polling mode of io_uring in this patch: no io_uring background threads are started, and all cycles generated by io_uring are attributed to the Valkey server process. As I posted in the top comment, the perf gain comes from the batching feature described at https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023#batching (see the sketch below for the difference).
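
To make the distinction concrete, here is an illustrative comparison (the function names are made up); the form without SQPOLL is what is described above, so no extra kernel polling thread is created.

#include <liburing.h>

static struct io_uring ring;

/* No SQPOLL: each batch still costs one io_uring_enter(), but no kernel
 * polling thread is spawned, so all CPU time is accounted to the server. */
int setup_ring_without_sqpoll(void) {
    return io_uring_queue_init(1024, &ring, 0);
}

/* With SQPOLL the kernel creates a submission-polling thread that burns CPU
 * on its own; this busy-polling mode is not what is used here. */
int setup_ring_with_sqpoll(void) {
    struct io_uring_params p = {0};
    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = 2000; /* ms before the poll thread goes idle */
    return io_uring_queue_init_params(1024, &ring, &p);
}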

Going back to your original post, it seems to indicate this is just the Valkey CPU usage? I think a more deterministic setup would be using a 2/4 core machine.

I can get a similar result with 2-4 CPUs allocated.

I would say this would be a serious concern for me. There are two possible outcomes:

  1. the vulns are mostly in the kernel and there isn't much application developers (us) can do.

  2. the vulns can be mitigated by changes in the application, which is Valkey here.

Outcome 1 would reduce the reach of this feature; outcome 2 would add more work for the Valkey team.

I just glanced through the vulnerability list: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring. It seems most of the fixes happened on the kernel side, and there isn't much application developers can do. I think we can keep supporting io_uring; users can disable it if they want.

lipzhu avatar Apr 16 '24 06:04 lipzhu

I think this is really good stuff. Performance is one of the areas we should prioritize IMO.

zuiderkwast avatar Apr 16 '24 10:04 zuiderkwast

I think this is really good stuff. Performance is one of the areas we should prioritize IMO.

Thanks @zuiderkwast. The question is: how can we push this patch forward?

lipzhu avatar Apr 17 '24 00:04 lipzhu

@lipzhu We are a new team and a new project heading toward a first release, and we are still busy with the rebranding from Redis to Valkey and getting a website up. I think you just need some patience to let other team members have time to look at it and think about it.

I'll add it to the backlog for Valkey 8. It will not be forgotten.

zuiderkwast avatar Apr 17 '24 20:04 zuiderkwast

Yeah, this kind of change requires the reviewers to block off a decent amount of time and really think it through holistically. This week is really busy for the team, as many of us have in-person engagements with the OSS community. We really appreciate your patience.

PingXie avatar Apr 18 '24 13:04 PingXie

@lipzhu

Description

This patch tries to leverage the io_uring batching feature to reduce the syscall count for Valkey in handleClientsWithPendingWrites. With this patch, we can observe a more than 4% perf gain for SET/GET and didn't see an obvious performance regression.

As far as I know, io_uring is a highly efficient I/O engine. Do you have any plans to optimize Valkey's other modules using io_uring, for example the ae framework or snapshot operations?

Wenwen-Chen avatar Apr 25 '24 07:04 Wenwen-Chen

As far as I know, io_uring is a highly efficient I/O engine. Do you have any plans to optimize Valkey's other modules using io_uring, for example the ae framework or snapshot operations?

Sure, but at the beginning, when we decided to introduce io_uring, we wanted to find the scenarios where io_uring really helps with performance, and this patch is the most straightforward one. Another scenario I came up with is disk-related operations; I opened https://github.com/valkey-io/valkey/issues/255 to work out the details. For the ae framework part, it needs a lot of work to replace the epoll and sync workflows, per my understanding. I had a POC before: I could observe a performance gain, but the cost was more CPU resources allocated. So I want to integrate io_uring incrementally, and I also need help from the community; as you know, they are currently busy rebranding :)

lipzhu avatar Apr 25 '24 10:04 lipzhu

Thanks @lipzhu!

I am generally aligned with the high level idea (and good to know that you don't use polling).

I do have some high level feedback around the code structure and I will list them here too

  1. we should go with an opt-in approach and keep io-uring off by default
  2. I think we should avoid mixing sync read()/write() calls with io_uring. Let's explore a way to have a cleaner separation
  3. the io_uring support seems incomplete - we are missing the support for scatter/gather IOs; also not sure about the rationale behind excluding the replication stream

BTW, I don't have all the details on #22 at the moment so there is a chance that we might have to revisit/rethink this PR, depending on the relative pace of the two. That said, let's continue collaborating on this PR, assuming we would like to incorporate io_uring in Valkey.

PingXie avatar Apr 26 '24 21:04 PingXie

@lipzhu, looking at your results above, the number of read calls jumps out too. It would be great if you could apply io_uring to the query path as well.

image

PingXie avatar Apr 27 '24 02:04 PingXie

@PingXie Thanks for your comments :)

@lipzhu, looking at your results above, the number of read calls jumps out too.

image

The counter data is collected over a fixed duration (10s), and each query is paired with one readQueryFromClient call, so I think the increased read syscall count is reasonable because the QPS increased too.

It would be great if you could apply io_uring to the query path as well.

@PingXie I have done this before, but here are some issues I found:

  1. I didn't find a batching opportunity in the read query path: if we simply use io_uring_prep_read followed by io_uring_submit_and_wait to replace the read, the syscall count doesn't decrease, and io_uring_enter is more expensive than read (see the sketch after this list).
  2. I prefer small PRs, with each PR focused on only one thing.
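
To illustrate point 1, a hedged sketch of what that "simulated read" looks like (hypothetical helper, not code from the PR): there is still exactly one syscall per query, just a more expensive one.

#include <liburing.h>
#include <sys/types.h>

/* Replacing one read() with a single-SQE round trip: still one syscall
 * (io_uring_enter, issued by io_uring_submit_and_wait), so nothing is batched. */
ssize_t read_via_io_uring(struct io_uring *ring, int fd, void *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) return -1;
    io_uring_prep_read(sqe, fd, buf, len, 0);
    if (io_uring_submit_and_wait(ring, 1) < 0) return -1;

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(ring, &cqe) < 0) return -1;
    ssize_t n = cqe->res; /* bytes read, or a negative errno */
    io_uring_cqe_seen(ring, cqe);
    return n;
}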

lipzhu avatar Apr 28 '24 02:04 lipzhu

Thanks @lipzhu!

I am generally aligned with the high level idea (and good to know that you don't use polling).

I do have some high level feedback around the code structure and I will list them here too

  1. we should go with an opt-in approach and keep io-uring off by default

Ok, I will introduce a new config like io-uring (default off) in valkey.conf.

  2. I think we should avoid mixing sync read()/write() calls with io_uring. Let's explore a way to have a cleaner separation

I will refactor the code to explore a cleaner separation.

  3. the io_uring support seems incomplete - we are missing the support for scatter/gather IOs; also not sure about the rationale behind excluding the replication stream

The reason I didn't handle scatter/gather I/O is that, as I remember, I didn't observe a perf gain with io_uring there; I will double-check this later (a rough sketch of what scatter/gather submission could look like is below). For the replication stream, thanks for pointing it out; I will check it later.
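
For reference, a rough sketch of what a scatter/gather submission could look like with liburing; the reply_iov/reply_cnt names are illustrative, and this is not code from the patch.

#include <liburing.h>
#include <sys/uio.h>

/* One SQE carries a whole iovec array (e.g. several reply buffers), the
 * io_uring equivalent of a single writev() call. */
int queue_reply_writev(struct io_uring *ring, int fd,
                       struct iovec *reply_iov, unsigned reply_cnt) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) return -1;              /* submission queue full */
    /* offset 0 for a socket (non-seekable fd); all iovecs gathered in one op */
    io_uring_prep_writev(sqe, fd, reply_iov, reply_cnt, 0);
    return 0;                         /* actually submitted later, in a batch */
}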

BTW, I don't have all the details on #22 at the moment so there is a chance that we might have to revisit/rethink this PR, depending on the relative pace of the two. That said, let's continue collaborating on this PR, assuming we would like to incorporate io_uring in Valkey.

Sure, thanks for your guidance and patience.

lipzhu avatar Apr 28 '24 02:04 lipzhu