liburing
Socket prep functions needed

To use sockets asynchronously, these functions are needed:
- `io_uring_prep_socket`
- `io_uring_prep_getpeername`
- `io_uring_prep_setsockopt`
- `io_uring_prep_getsockopt`
- `io_uring_prep_getaddrinfo` (to properly use `io_uring_prep_connect`)
These two if you are managing multiple sockets in one event loop and you need to restart one without blocking the others:

- `io_uring_prep_bind`
- `io_uring_prep_listen`
Ideally `io_uring_prep_socket`, `io_uring_prep_bind` and `io_uring_prep_listen` would be linked using `IOSQE_IO_LINK`.
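For context, here is the plain synchronous sequence that such a linked chain would replace; a minimal Python sketch (stdlib only, not liburing) of the three steps:

```python
import socket

# The three syscalls that io_uring_prep_socket, io_uring_prep_bind and
# io_uring_prep_listen would fuse into one linked submission.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket()
srv.bind(("127.0.0.1", 0))   # bind(); port 0 = pick any free port
srv.listen(8)                # listen()
host, port = srv.getsockname()
srv.close()
```

Each of those calls is a separate kernel round trip today; the chain would submit all three in one go.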
Uh, getaddrinfo is implemented in userland, is completely synchronous, reads various configuration files and environment variables, and just sends UDP queries to registered name servers and parses the replies. And the code isn't small.
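That userland nature is also why event loops typically offload name resolution to a thread rather than waiting on one async syscall; a stdlib-only Python sketch (much like what asyncio's default resolver does). A numeric host is used so the example needs no real DNS:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# getaddrinfo can block on config files, NSS and DNS queries, so run
# it off-thread; the event loop's thread stays free meanwhile.
with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(socket.getaddrinfo, "127.0.0.1", 80,
                      type=socket.SOCK_STREAM)
    results = fut.result()  # a real loop would poll/await this instead

family, socktype, proto, canonname, sockaddr = results[0]
```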
Right, getaddrinfo can't be implemented in the kernel, nor in liburing, as it doesn't have infrastructure around linked requests (that's left to the users).
For the others, they're quite fast when done synchronously. I don't think all that is needed at the moment, so I'd add an "enhancement" label and leave it for now.
If we ignore `getaddrinfo` for now, all these prep functions would be considered basic for sockets (other than what's already implemented).
Let's take `io_uring_prep_getpeername` for example. You could say it does not raise `EAGAIN`, sure. That you can wrap it directly to `getpeername()`, sure, easy. It's only a syscall away, right.

Now here is where it would make a difference (if implemented):
```
sqe = io_uring_get_sqe(ring)
io_uring_prep_accept(sqe, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE  # <- does this exist? No, but it would be cool if it did!
sqe.user_data = 1
fd = cqe.res  # return from the first request; of course it wouldn't actually use cqe.res but something internal to io_uring

sqe = io_uring_get_sqe(ring)
io_uring_prep_getpeername(sqe, fd, ...)  # the fd returned by the previous request is passed to the next one on the link
sqe.user_data = 2
```
If something like this can be achieved, you can get both tasks done in a single syscall. It might not seem like much, but when you are talking about tons of connections, it makes a huge difference.
So many possibilities...
I hope you get where I am going with this! We are talking about procedural tasks with a single syscall. In this example we could be making a DNS lookup over UDP, to showcase `io_uring_prep_setsockopt` usage.
```
WHERE_TO_REPLACE = -123  # placeholder, populated with the first connect's returned fd

io_uring_prep_connect(sqe, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE
io_uring_prep_setsockopt(sqe, WHERE_TO_REPLACE, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE
io_uring_prep_send(sqe, WHERE_TO_REPLACE, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE
io_uring_prep_recv(sqe, WHERE_TO_REPLACE, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE
io_uring_prep_close(sqe, WHERE_TO_REPLACE, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE
```
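For comparison, the same sequence done synchronously today costs at least five syscalls plus the round trips in between; a stdlib-only Python sketch (the peer socket is only there so send/recv have something to talk to):

```python
import socket

# A local UDP peer standing in for the DNS server in the example.
peer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
peer.bind(("127.0.0.1", 0))

# The five steps the linked chain would fuse into one submission:
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(peer.getsockname())                             # prep_connect
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 65536)  # prep_setsockopt
s.send(b"query")                                          # prep_send
data, addr = peer.recvfrom(32)   # the peer echoes a reply back
peer.sendto(b"reply", addr)
reply = s.recv(32)                                        # prep_recv
s.close()                                                 # prep_close
peer.close()
```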
This type of single-shot multitasking can't be achieved with normal wrappers around socket functions.
It sounds curious, but that is not implemented and that's the elephant in the room. And it actually reminded me of an old idea raised long ago; I just sent out patches implementing it: https://lore.kernel.org/io-uring/[email protected]/T/#t
You may also be interested in trying out BPF, though it won't land upstream for a while. https://lore.kernel.org/io-uring/[email protected]/
> It sounds curious, but that is not implemented and that's the elephant in the room.
Of course, originally I was waiting for basic file/socket functions to be implemented before creating a feature request for `IOSQE_IO_LINK_VALUE`. Since it was confusing why I was pushing for function implementations that weren't asynchronous, I had to mention it like so.
I am counting on you and @axboe seeing the benefit of this feature. It won't be limited to just `fd`; if it's the return value of the first function, we could be doing multiple reads/writes, even database get/add-record calls, in a single syscall.
Not only does it avoid going from kernel space to user space and back (in a new event); in Python, for example, each result requires a conversion from a C int to a Python int, and this process is very slow, since even a simple int is an object (it literally carries a class with all its methods).
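The "even a simple int is an object" cost is easy to see in CPython itself:

```python
import sys

# In CPython every int is a full heap object with a refcount and a
# type pointer, so each cqe.res value handed to Python must be boxed,
# unlike a raw 4-byte C int inside a compiled wrapper.
n = 1
print(sys.getsizeof(n))  # ~28 bytes on 64-bit CPython, vs 4 for a C int
print(n.bit_length())    # even the value 1 carries the full method set
```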
I can't fully grasp this "old idea". Is it anything like: if you `accept`/`open` under a ring initialized by `io_uring_queue_init` with the `IORING_SETUP_SQPOLL` flag, that `fd` should be auto-registered, like in `io_uring_register_files`? That I would actually welcome; right now I am having to create user functions for `register_file` and `unregister_file`, and the fact that it's fixed length at initialization time is a pain, as is updating `fd` on close! Manually registering an `fd` for `accept` in user space is yet another step and wastes time, since you want to process incoming data ASAP. Also, isn't `register` synchronous, since it's dealing directly with the ring? Talking in terms of `SQPOLL`.
I can't really comment on BPF since I don't know how it will work within io_uring.
> I can't fully grasp this "old idea". Is it anything like if you `accept`/`open` under `io_uring_queue_init` initialized with the `IORING_SETUP_SQPOLL` flag, that `fd` should be auto registered, like in `io_uring_register_files`? That I would actually
Not particularly SQPOLL related. Essentially, it doesn't return an fd, but internally calls `io_uring_register_files`. In effect, more like the snippet below, but in a single SQE.

```
fd = open();
io_uring_register_file(fd, index);
close(fd);
```
> welcome, right now I am having to create user function for `register_file` and `unregister_file` and the fact that its fixed length at initialization time is a pain,
What is fixed length? You mean `io_uring_register_files`? You can do unregister-register to expand it if needed.
> also updating `fd` on close!
Do you mean that if you want to close a file you should also deregister it from io_uring?
> Manually registering `fd` for `accept` at user-space is yet another step and wastes time since you want to process incoming data asap. Also isn't `register` synchronous since its dealing directly with `ring`? Talking in term of `SQPOLL`.
Didn't get that. What do you mean? There is a syscall interface for registering, and also a request type allowing you to update files. Updates are fast and don't quiesce the full thing. Unregister, on the contrary, does, but you hopefully almost never need it.
> Not particularly SQPOLL related. Essentially, it doesn't return fd, but internally calls `io_uring_register_files`. In effect, more like the snippet below but in a single SQE.
Since you implemented it, I am sure there is a good reason for it. Though I don't get how you can do read/write tasks without an `fd`! I will wait till it's in liburing to see it in action ;)
> What is fixed length? You mean `io_uring_register_files`? You can do unregister-register to expand it if needed.
Yes; also the unregister process being slower.
> Do you mean that if you want to close a file you should also deregister it from io_uring?
Yes
> Didn't get that. What do you mean? There is a syscall interface for registering and also a request type allowing to update files. Updates are fast and don't quiesce the full thing.
Actually, this is my bad. In my software I only have `register_fd` and `unregister_fd`, so both of them use update; `io_uring_register_files` is set at startup. I got myself mixed up between `register_fd` and `io_uring_register_files`.
> Unregister is on contrary, but you hopefully almost never need it.
aww you would be surprised what people will do! :D
Is there a limit to how many `fd`s can be registered? Is it `IOV_MAX`?
> > Not particularly SQPOLL related. Essentially, it doesn't return fd, but internally calls `io_uring_register_files`. In effect, more like the snippet below but in a single SQE.
>
> Since you implemented it I am sure there is a good reason for it. Though I don't get how you can do read/write tasks without `fd`! I will wait till it's in Liburing to see it in action ;)
Ok, in case there are misconceptions:

- io_uring doesn't need `io_uring_register_files` or any other file registration to work. E.g. all read/write/etc. should work fine without it; just pass a valid fd in the SQE.
- All that file registration is mostly for optimisation purposes, because without it io_uring does two extra atomics inside. To use it, a program first needs to register the file and remember the index where it was registered. And then instead of

  ```
  io_prep_*opcode*(sqe, fd, ...);
  ```

  it does:

  ```
  io_prep_*opcode*(sqe, index, ...);
  sqe->flags |= IOSQE_FIXED_FILE;
  ```

  And the file will be grabbed not from the normal file table of the task (the one where all fds passed to other syscalls are stored), but from an internal io_uring table.
- File updates can be done both through the syscall and via submitting requests. Both are fast. Unregister is syscall-only and slow. Btw, if you need to resize it, we can actually make it work fast, but I need to check details to be sure.
> > Unregister is on contrary, but you hopefully almost never need it.
>
> aww you would be surprised what people will do! :D
Haha :)
> Is there a limit to how many `fd` can be registered? Is it `IOV_MAX`?
For x64 it's 2^15 at the moment, and has been for long enough.
@isilence Thanks for your input. Based on that, I created a basic test: https://github.com/YoSTEALTH/Liburing/blob/master/test/register_file_test.py
P.S. With `io_uring_unregister_files`: 1.20 seconds, vs. without `io_uring_unregister_files`: 0.15 seconds.
@isilence Just added another `test_register_fd_close` into the link above. It mimics your open + register + close fd + write/read. I must say, I had that wow moment; I didn't know you could open/close an `fd` and still be able to read/write to the file/socket! That is just amazing!!!

It will enable you to bypass OS fd limits, since you are opening/closing. I take it that's how the epoll fake-fd feature worked as well?
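That behaviour falls out of the kernel refcounting open file descriptions rather than fd numbers: io_uring's registered-file table holds a reference of its own, so the numeric fd can go away. The same effect can be reproduced in plain user space with `dup()` (a small POSIX-only Python sketch):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
kept = os.dup(fd)     # second reference, playing the registered-file slot
os.close(fd)          # the original fd number is gone...
os.write(kept, b"still writable")   # ...but the file is still open
os.lseek(kept, 0, os.SEEK_SET)
data = os.read(kept, 64)
os.close(kept)        # last reference dropped; now the file really closes
os.unlink(path)
```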
I totally want to try out that patch, did it land in Linux 5.14?
Opening and closing is very expensive (slow); having this done at the kernel level while you register that `fd` is just brilliant.
> I totally want to try out that patch, did it land in Linux 5.14?
I'll be sending it for 5.15.
> Opening and closing is very expensive (slow); having this done at the kernel level while you register that `fd` is just brilliant.
Depends, but the open part itself should be as expensive as before. We just save on returning back to the userspace for it to register it, so a couple of CQEs and extra requests. However, the semantic is definitely nicer.
fwiw, close requests don't close registered files, but IORING_OP_FILES_UPDATE with fd=-1 does.
> Depends, but the open part itself should be as expensive as before. We just save on returning back to the userspace for it to register it, so a couple of CQEs and extra requests. However, the semantic is definitely nicer.
Nice, saving multiple back-and-forth trips is a big deal; also, with the slow open/close taken care of in the background, the userspace event manager can focus on other tasks.
How will it look code-wise?

```
io_uring_prep_openat(...)
sqe.flags |= IOSQE_OPEN_CLOSE_INDEX
io_uring_prep_accept(...)
sqe.flags |= IOSQE_OPEN_CLOSE_INDEX
cqe.res  # returns `index`?
```
> fwiw, close requests don't close registered files, but IORING_OP_FILES_UPDATE with fd=-1 does.
Yes
> How will it look code-wise?
>
> ```
> io_uring_prep_openat(...)
> sqe.flags |= IOSQE_OPEN_CLOSE_INDEX
> io_uring_prep_accept(...)
> sqe.flags |= IOSQE_OPEN_CLOSE_INDEX
> cqe.res  # returns `index`?
> ```
Definitely don't want to waste an `sqe.flags` bit. In the current version, if `sqe->buf_index == 0` it works as before; if not, then `(sqe->buf_index - 1)` is the index where it will be placed.
> Definitely don't want to waste sqe.flags bit, for the current version if sqe->buf_index == 0, it works as before, if not then (sqe->buf_index - 1) is the index where it will be placed.
That would be awesome.
Btw, is `socket()` properly supported to be used with `io_uring_register_files`? If so, can I set `sockfd` to the registered `index` value in:

```
int setsockopt(int sockfd, int level, int optname, const void *optval, socklen_t optlen);
```
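For reference, this is what the synchronous pair looks like against a real fd; note that on Linux the kernel adjusts (doubles) the stored `SO_RCVBUF` value for bookkeeping, so the effective size should be read back with getsockopt rather than assumed:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)       # setsockopt(2)
effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)  # getsockopt(2)
s.close()
```

A registered-file index only has meaning inside io_uring requests; plain syscalls like these resolve `sockfd` through the task's normal fd table, so an index can't be substituted here.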
I wrapped cffi directly to the raw socket (no more Python socket).

Good news:
- now getting roughly a 20% increase in speed!
Bad news:
- Can not use the registered `index` for `socket`, `bind`, `listen`, ..., having to pass the `fd` instead (not a major problem).
- Can pass `index` into `io_uring_prep_accept`, `*_send`, `*_recv` (which is cool).
- Can't pass `index` into `getpeername()` to get the client ip, port, ...
- Can not `setsockopt`, `getsockopt`, ... using `index`: `OSError: [Errno 88] Socket operation on non-socket`
So this pretty much puts an end to being able to use registered index for socket.
With register enabled (not using getpeername, ... functions):
- Transaction rate: 3690.04 trans/sec
- Transaction rate: 3636.36 trans/sec
- Transaction rate: 3663.00 trans/sec

Without:
- Transaction rate: 4629.63 trans/sec
- Transaction rate: 4484.30 trans/sec
- Transaction rate: 4608.29 trans/sec
So using the registered index is actually slower, most likely because I am manually closing the fd after accept, and having to register/unregister adds extra back and forth.
Maybe it will be a different case using that new open/register/close patch!
Can `setsockopt` and `getsockopt` be implemented at the very least? So they can be used with `io_uring_prep_accept_direct`?
This thread explained why setting the socket buffer size didn't work for me. Is someone currently working on implementing `setsockopt` and `getsockopt`?
It is being considered, yes. Question is if it should be done separately, or be part of the passthrough work that is currently underway.
> passthrough work

?
> > passthrough work
>
> ?
@YoSTEALTH It's probably about the recent io_uring passthrough patchset. See the full messages here: https://lore.kernel.org/io-uring/[email protected]/T/
Right - the point here is that we could easily do set/getsockopt through that, and it would be wasteful to add them as separate opcodes given that.
@ammarfaizi2 thanks.
> Right - the point here is that we could easily do set/getsockopt through that, and would be wasteful to add them as separate opcodes given that.
Sounds like a good idea.
@axboe any idea of the timeline for when pass-through will be added to io_uring + liburing?
Most likely 5.19.
Is it currently impossible to `bind` and `listen` on a socket created with `io_uring_prep_socket_direct()` because of this?
Closing this one as it's a mix of a bunch of things, some of which are available at this point. Feel free to open a new issue, but please keep each issue specific to a single feature.