liburing
Investigate WSL performance?
I saw this comment showing worse than epoll performance: https://github.com/chriskohlhoff/asio/issues/401#issuecomment-1126778607
Well, 5.10 is not interesting at all; lots has happened since then on the optimisation side. And performance will depend on how io_uring is used (operation types, file types, which features are utilised, what the userspace architecture is, etc.). Does WSL have newer kernels?
I don't think this is just a case of older kernels; my guess would be suboptimal code that results in io-wq usage.
Right, e.g. in at least one of the tests they were using pipes (which io_uring does not support well, for various reasons). But the point is, it's not productive to even start discussing io_uring performance on a two-year-old kernel.
I don't know about the Windows status, but the git master branch shows Linux 5.15 for WSL: https://github.com/microsoft/WSL2-Linux-Kernel/commits/linux-msft-wsl-5.15.y
Here are my results:

OS: Windows 10 21H2
CPU: AMD Ryzen 7 5800X

WSL Linux 5.10
epoll 140,000 /sec
uring 133,000 /sec

WSL Linux 6.0.0
epoll 149,000 /sec
uring 170,000 /sec

Testing:
- Each thread is pinned (1 server thread, 3 client threads)
- Each client thread is epoll backed, with 2048 connections each (6144 total)
- No fixed files or buffers used for io_uring
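
The pinning code used for this benchmark is not shown in the thread. As a rough sketch only, assuming Linux and `sched_setaffinity` (the helper names `pin_current_thread` and `first_allowed_cpu` are hypothetical), pinning a thread to one CPU might look like this:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success or a negative errno value on failure.
 * (Hypothetical helper; not taken from the benchmark.) */
int pin_current_thread(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* pid 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -errno;
    return 0;
}

/* Return the lowest-numbered CPU the calling thread may run on,
 * or -1 if the affinity mask cannot be read.
 * Useful inside containers where CPU 0 may not be available. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Each server or client thread would call `pin_current_thread` with a different CPU number at startup, so the scheduler does not migrate threads between cores during the run.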

Notes: It's worth mentioning that the 6.0.0 uring results are odd: epoll maintains a consistent 149-151k /sec, while uring averages around 167-173k /sec and sometimes peaks at 184k.
On native Linux 5.19, performance overall is a lot better, at a consistent 230-234k /sec for epoll and a consistent 236-242k /sec for io_uring. However, the performance difference between the two is much larger on Windows.
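
To put the gaps on a common scale, here is a small sketch that computes the relative gain of io_uring over epoll from requests per second. The function name `relative_gain_pct` is hypothetical, and using the midpoints of the native 5.19 ranges (232k and 239k) is an assumption made only for illustration.

```c
#include <assert.h>

/* Relative gain of io_uring over epoll, in percent.
 * Both arguments are throughput in requests per second.
 * A negative result means io_uring was slower. */
double relative_gain_pct(double epoll_rps, double uring_rps)
{
    assert(epoll_rps > 0.0);
    return (uring_rps - epoll_rps) / epoll_rps * 100.0;
}
```

With the figures above, `relative_gain_pct(140000, 133000)` gives -5% for WSL 5.10, `relative_gain_pct(149000, 170000)` gives roughly +14% for WSL 6.0.0, and `relative_gain_pct(232000, 239000)` gives roughly +3% for native 5.19.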
Another thing: performance is not much worse when using IORING_OP_POLL_ADD and doing recv/send on the user side. If I were to guess, I would assume most of the gains come from the reduced number of system calls. When I took a quick look through the kernel code, io_uring poll events have to go through task work (and are placed in a linked list) before the poll events are processed, while epoll seems to handle the event more directly (I believe, may be wrong). Despite this, uring seems to consistently perform equal to epoll or outperform it by 1-5% on native Linux 5.19 with: 1 client thread (iirc), 3 client threads, 1024 connections per thread, 2048 connections per thread, and using rust_echo_bench, which is 1 thread per socket.
In analysis of perf top, most (80% or higher) of the CPU time is dedicated to the send call on a socket, which explains why the reduced system call numbers only result in tiny gains.