liburing

Socket prep function needed

Open YoSTEALTH opened this issue 4 years ago • 26 comments

To use sockets asynchronously, these functions are needed:

  • io_uring_prep_socket
  • io_uring_prep_getpeername
  • io_uring_prep_setsockopt
  • io_uring_prep_getsockopt
  • io_uring_prep_getaddrinfo - to properly use io_uring_prep_connect

These two are needed if you are managing multiple sockets in one event loop and need to restart one without blocking the others:

  • io_uring_prep_bind
  • io_uring_prep_listen

Ideally, io_uring_prep_socket, io_uring_prep_bind and io_uring_prep_listen would be linked using IOSQE_IO_LINK.
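
For reference, this is the blocking sequence that the three requested prep functions would mirror. It is only a sketch using Python's standard socket module, not liburing; the helper name open_listener and its defaults are made up for illustration.

```python
import socket


def open_listener(host="127.0.0.1", port=0, backlog=16):
    """Create a listening TCP socket using three separate blocking calls."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)    # socket(2)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))                                   # bind(2)
        sock.listen(backlog)                                      # listen(2)
    except OSError:
        sock.close()
        raise
    return sock


listener = open_listener()
print(listener.getsockname())   # port 0 lets the kernel pick a free port
listener.close()
```

Each step depends on the previous one succeeding, which is why a linked chain of requests would be a natural fit.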

YoSTEALTH avatar Nov 08 '20 10:11 YoSTEALTH

Uh, getaddrinfo is implemented in userland, is completely synchronous, reads various configuration files and environment variables, and just sends UDP queries to the registered name servers and parses the responses. And the code isn't small.
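
Since the lookup is synchronous userland code, the usual workaround is to run it on a worker thread. A minimal sketch with Python's asyncio, whose loop.getaddrinfo() runs the blocking call in the loop's default executor; the resolve helper is a hypothetical name, and a numeric address is used so no name server is contacted:

```python
import asyncio
import socket


async def resolve(host, port):
    """Resolve host/port without blocking the event loop."""
    loop = asyncio.get_running_loop()
    # The blocking socket.getaddrinfo() call runs on a thread-pool worker.
    infos = await loop.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [sockaddr for _family, _type, _proto, _canon, sockaddr in infos]


addresses = asyncio.run(resolve("127.0.0.1", 80))
print(addresses)
```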

anon17 avatar Nov 27 '20 13:11 anon17

Right, getaddrinfo can't be implemented in the kernel, nor in liburing, as liburing doesn't have infrastructure around linked requests (that's left to the users).

For the others, they're quite fast when done synchronously. I don't think all of that is needed at the moment, so I'd add an "enhancement" label and leave it for now.

isilence avatar Feb 01 '21 10:02 isilence

If we ignore getaddrinfo for now:

All these prep functions would be considered basic for sockets (other than what's already implemented).

Let's take io_uring_prep_getpeername as an example. You could say it does not raise EAGAIN, sure. You could say it can be wrapped directly around getpeername(), sure, easy. It's only a syscall away, right?

Now here is where it would make a difference (if implemented).


    sqe = io_uring_get_sqe(ring)
    io_uring_prep_accept(sqe, ...)
    sqe.flags |= IOSQE_IO_LINK_VALUE    # <- does this exist? No, but it would be cool if it did!
    sqe.user_data = 1

    # Result of the first request. In practice this would not go through
    # cqe.res but through something internal to io_uring.
    fd = cqe.res

    sqe = io_uring_get_sqe(ring)
    io_uring_prep_getpeername(sqe, fd, ...)  # the fd returned by accept is passed along the link
    sqe.user_data = 2

If something like this can be achieved, you can get both tasks done in a single syscall or less. It might not seem like much, but when you are talking about tons of connections, it makes a huge difference.
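
For comparison, the synchronous baseline looks roughly like this: accept and getpeername are two separate calls into the kernel. This is a loopback sketch with Python's standard socket module, not liburing.

```python
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))      # port 0: let the kernel pick a port
listener.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(listener.getsockname())

conn, _addr = listener.accept()      # first syscall: accept(2)
peer = conn.getpeername()            # second syscall: getpeername(2)

client_addr = client.getsockname()
print(peer, client_addr)

for s in (conn, client, listener):
    s.close()
```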

So many possibilities...

YoSTEALTH avatar Jun 21 '21 20:06 YoSTEALTH

I hope you get where I am going with this! We are talking about procedural tasks done in a single syscall. In this example we could be making a DNS lookup using a UDP call, which also showcases io_uring_prep_setsockopt usage.


WHERE_TO_REPLACE = -123  # should be populated with the first connect's return (fd) value

io_uring_prep_connect(sqe, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE

io_uring_prep_setsockopt(sqe, WHERE_TO_REPLACE, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE

io_uring_prep_send(sqe, WHERE_TO_REPLACE, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE

io_uring_prep_recv(sqe, WHERE_TO_REPLACE, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE

io_uring_prep_close(sqe, WHERE_TO_REPLACE, ...)
sqe.flags |= IOSQE_IO_LINK_VALUE

This type of single-shot multitasking can't be achieved with a normal wrapper around the socket functions.
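
To make the cost concrete, here is the same five-step sequence done synchronously, where every step is its own syscall. This is a sketch with Python's standard socket module; a loopback UDP socket stands in for the name server, and the payload and buffer size are arbitrary.

```python
import socket

# Stand-in for a name server: a UDP socket on loopback that answers once.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server.settimeout(2.0)

query = b"example query"

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(2.0)
client.connect(server.getsockname())                               # connect(2)
client.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 64 * 1024)  # setsockopt(2)
client.send(query)                                                 # send(2)

# The fake server upper-cases the payload and sends it back.
data, addr = server.recvfrom(512)
server.sendto(data.upper(), addr)

reply = client.recv(512)                                           # recv(2)
client.close()                                                     # close(2)
server.close()
print(reply)
```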

YoSTEALTH avatar Jun 21 '21 21:06 YoSTEALTH

It sounds curious, but that is not implemented, and that's the elephant in the room. It actually reminded me of an old idea raised long ago, and I just sent out patches implementing that: https://lore.kernel.org/io-uring/[email protected]/T/#t

You may also be interested in trying out BPF, though it won't land upstream for a while. https://lore.kernel.org/io-uring/[email protected]/

isilence avatar Jul 07 '21 14:07 isilence

It sounds curious, but that is not implemented and that's the elephant in the room.

Of course. Originally I was waiting for the basic file/socket functions to be implemented before creating a feature request for IOSQE_IO_LINK_VALUE. Since it was confusing why I was pushing for implementations of functions that aren't asynchronous, I had to mention it like this.

I am counting on you and @axboe seeing the benefit of this feature. It won't be limited to just fds: if it's the return value of the first function, we could be doing multiple reads/writes, even database get/add record calls, in a single syscall.

Not only does it avoid the trip from kernel space to user space and back (in a new event), it also helps bindings. Python, for example, requires a conversion from a C int to a Python int, and this process is very slow, since even a simple int is an object (thus literally creates a class with all its methods).

I can't fully grasp this "old idea". Is it something like: if you accept/open under an io_uring_queue_init initialized with the IORING_SETUP_SQPOLL flag, that fd should be auto-registered, as with io_uring_register_files? That I would actually welcome. Right now I have to create user functions for register_file and unregister_file, and the fact that the table is fixed-length at initialization time is a pain, as is updating the fd on close! Manually registering the fd for accept in user space is yet another step and wastes time, since you want to process incoming data ASAP. Also, isn't *register* synchronous, since it's dealing directly with the ring? I'm talking in terms of *SQPOLL.

I can't really comment on BPF since I don't know how it will work within io_uring.

YoSTEALTH avatar Jul 08 '21 04:07 YoSTEALTH

I can't fully grasp this "old idea". Is it anything like if you accept/open under io_uring_queue_init initialized with IORING_SETUP_SQPOLL flag, that fd should be auto registered, like in io_uring_register_files? That I would actually

Not particularly SQPOLL related. Essentially, it doesn't return fd, but internally calls io_uring_register_files. In effect, more like the snippet below but in a single SQE.

fd = open()
io_uring_register_file(fd, index);
close(fd);

welcome, right now I am having to create user function for register_file and unregister_file and the fact that its fixed length at initialization time is a pain,

What is fixed length? You mean io_uring_register_files? You can do unregister-register to expand it if needed.

also updating fd on close!

Do you mean that if you want to close a file you should also deregister it from io_uring?

Manually registering fd for accept at user-space is yet another step and wastes time since you want to process incoming data asap. Also isn't *register* synchronous since its dealing directly with ring? Talking in term of *SQPOLL.

Didn't get that. What do you mean? There is a syscall interface for registering and also a request type that allows updating files. Updates are fast and don't quiesce the full thing. Unregister is the opposite, but hopefully you almost never need it.

isilence avatar Jul 08 '21 09:07 isilence

Not particularly SQPOLL related. Essentially, it doesn't return fd, but internally calls io_uring_register_files. In effect, more like the snippet below but in a single SQE.

Since you implemented it, I am sure there is a good reason for it. Though I don't get how you can do read/write tasks without an fd! I will wait till it's in Liburing to see it in action ;)

What is fixed length? You mean io_uring_register_files? You can do unregister-register to expand it if needed.

Yes, and also the unregister process being slower.

Do you mean that if you want to close a file you should also deregister it from io_uring?

Yes

Didn't get that. What do you mean? There is a syscall interface for registering and also a request type allowing to update files. Updates are fast and don't quiesce the full thing.

Actually, this is my bad. In my software I only have register_fd and unregister_fd, and both use update; io_uring_register_files is called once on startup. I got register_fd mixed up with io_uring_register_files.

Unregister is the opposite, but hopefully you almost never need it.

aww you would be surprised what people will do! :D

Is there a limit to how many fds can be registered? Is it *IOV_MAX?

YoSTEALTH avatar Jul 08 '21 10:07 YoSTEALTH

Not particularly SQPOLL related. Essentially, it doesn't return fd, but internally calls io_uring_register_files. In effect, more like the snippet below but in a single SQE.

Since you implemented it I am sure there is a good reason for it. Though I don't get how you can do read/write task without fd! I will wait till its in Liburing to see it in action ;)

Ok, in case there are misconceptions:

  1. io_uring doesn't need io_uring_register_files or any other file registration to work. E.g. all read/write/etc. should work fine without it, just pass a valid fd in SQE.
  2. All that file registration is mostly for optimisation purposes, because without it each request does two extra atomics inside. To use it, a program first needs to register a file and remember the index where it was registered. And then instead of
io_prep_*opcode*(sqe, fd, ...);

It does:

io_prep_*opcode*(sqe, index, ...);
sqe->flags |= IOSQE_FIXED_FILE;

And the file will be grabbed not from a normal file table of the task, the one where all fds passed to other syscalls are stored, but an internal io_uring's table.

  3. File updates can be done both through a syscall and via submitting requests. Both are fast. Unregister is syscall-only and slow. Btw, if you need to resize it, we can actually make it work fast, but need to check details to be sure.
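
The index-instead-of-fd idea from the points above can be pictured with a toy model. This is plain Python, not the liburing API: the class name FixedFileTable and its methods are invented for illustration, and the table stores plain integers where the kernel would hold file references.

```python
class FixedFileTable:
    """Toy model of a fixed-size table of registered files (not liburing)."""

    def __init__(self, size):
        self.slots = [None] * size      # None marks an empty slot

    def update(self, index, fd):
        """Place fd at index; fd == -1 clears the slot instead."""
        self.slots[index] = None if fd == -1 else fd

    def lookup(self, index):
        """Return what is registered at index, as a fixed-file request would."""
        fd = self.slots[index]
        if fd is None:
            raise LookupError(f"slot {index} is empty")
        return fd


table = FixedFileTable(4)
table.update(2, 7)          # register descriptor 7 at index 2
print(table.lookup(2))      # requests now refer to index 2, not to 7
table.update(2, -1)         # an update with -1 empties the slot again
```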

Unregister is the opposite, but hopefully you almost never need it.

aww you would be surprised what people will do! :D

Haha :)

Is there a limit to how many fd can be registered? Is it *IOV_MAX?

For x64 it's 2^15 at the moment, and was so for long enough

isilence avatar Jul 08 '21 12:07 isilence

@isilence Thanks for your input. Based on that, I created a basic test: https://github.com/YoSTEALTH/Liburing/blob/master/test/register_file_test.py

P.S. With io_uring_unregister_files the time is 1.20 seconds, vs. 0.15 seconds without io_uring_unregister_files.

YoSTEALTH avatar Jul 08 '21 18:07 YoSTEALTH

@isilence I just added another test, test_register_fd_close, to the link above. It mimics your open + register + close fd + write/read. I must say, I had that wow moment: I didn't know you could open and close an fd and then still be able to read/write to the file/socket! That is just amazing!!!
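
The underlying behaviour can be seen without io_uring: an open file stays alive as long as something still references it, even after the descriptor that opened it is closed. The sketch below is only an analogy using os.dup from Python's standard library. Unlike a registered file, the duplicate here still occupies a slot in the process's normal fd table.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    kept = os.dup(fd)                 # second reference to the same open file
    os.close(fd)                      # the original descriptor is gone
    os.write(kept, b"still open")     # the open file itself is not
    os.lseek(kept, 0, os.SEEK_SET)
    content = os.read(kept, 64)
    os.close(kept)

print(content)
```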

It will enable you to bypass OS fd limits, since you are opening/closing. I take it that's how epoll's fake fd feature worked as well?

I totally want to try out that patch, did it land in Linux 5.14?

Opening and closing is very expensive (slow); having this done at the kernel level while you register that fd is just brilliant.

YoSTEALTH avatar Jul 11 '21 16:07 YoSTEALTH

I totally want to try out that patch, did it land in Linux 5.14?

I'll be sending it for 5.15.

Opening and closing is very expensive (slow); having this done at the kernel level while you register that fd is just brilliant.

Depends, but the open part itself should be as expensive as before. We just save on returning back to the userspace for it to register it, so a couple of CQEs and extra requests. However, the semantic is definitely nicer.

fwiw, close requests don't close registered files, but IORING_OP_FILES_UPDATE with fd=-1 does.

isilence avatar Jul 13 '21 08:07 isilence

Depends, but the open part itself should be as expensive as before. We just save on returning back to the userspace for it to register it, so a couple of CQEs and extra requests. However, the semantic is definitely nicer.

Nice, saving multiple back-and-forth trips is a big deal. Also, with the slow open/close taken care of in the background, the userspace event manager can focus on other tasks.

How will it look code wise?


io_uring_prep_openat(...)
sqe.flags |= IOSQE_OPEN_CLOSE_INDEX

io_uring_prep_accept(...)
sqe.flags |= IOSQE_OPEN_CLOSE_INDEX

cqe.res  # returns `index`?

fwiw, close requests don't close registered files, but IORING_OP_FILES_UPDATE with fd=-1 does.

Yes

YoSTEALTH avatar Jul 13 '21 13:07 YoSTEALTH

How will it look code wise?

io_uring_prep_openat(...)
sqe.flags |= IOSQE_OPEN_CLOSE_INDEX

io_uring_prep_accept(...)
sqe.flags |= IOSQE_OPEN_CLOSE_INDEX

cqe.res  # returns `index`?

Definitely don't want to waste sqe.flags bit, for the current version if sqe->buf_index == 0, it works as before, if not then (sqe->buf_index - 1) is the index where it will be placed.
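
The off-by-one encoding described here (0 means "behave as before", anything else means slot value minus one) can be written down as a pair of tiny helpers. The function names are hypothetical and this models only the arithmetic, not an actual SQE.

```python
def target_slot(buf_index):
    """Decode the field: None means 'return a normal fd, as before'."""
    if buf_index == 0:
        return None
    return buf_index - 1


def encode_slot(slot):
    """Encode a table slot so that slot 0 is distinguishable from 'unset'."""
    return slot + 1


print(target_slot(0))                # no slot requested
print(target_slot(encode_slot(0)))   # slot 0 round-trips through value 1
```

Shifting by one keeps zero-initialised requests working as before while still allowing slot 0 to be addressed.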

isilence avatar Jul 14 '21 13:07 isilence

Definitely don't want to waste sqe.flags bit, for the current version if sqe->buf_index == 0, it works as before, if not then (sqe->buf_index - 1) is the index where it will be placed.

That would be awesome.

Btw, is socket() properly supported for use with io_uring_register_files? If so, can I pass the registered index value as sockfd in:

int setsockopt(int sockfd, int level, int optname, const void *optval, socklen_t optlen);

YoSTEALTH avatar Jul 17 '21 02:07 YoSTEALTH

I wrapped the raw socket functions directly with cffi (no more Python socket module).

Good news: now getting roughly a 20% increase in speed!

Bad news:

  • Can not use registered index for socket, bind, listen, ..., having to pass fd (not a major problem).
  • Can pass index into io_uring_prep_accept, *_send, *_recv (which is cool)
  • Can't pass index into getpeername() to get client ip, port, ...
  • Can not setsockopt, getsockopt, ... using index
  • OSError: [Errno 88] Socket operation on non-socket

So this pretty much puts an end to being able to use registered index for socket.
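
Errno 88 on Linux is ENOTSOCK. Presumably the regular socket syscalls treat the registered index as an ordinary descriptor number, which then refers to something that is not a socket. The same error can be reproduced with any non-socket descriptor; the sketch below uses a pipe and Python's standard library, and relies on Python probing the descriptor when a socket object is built from it.

```python
import errno
import os
import socket

read_end, write_end = os.pipe()        # ordinary descriptors, not sockets
caught = None
try:
    socket.socket(fileno=read_end)     # the fd is probed with getsockname(2)
except OSError as exc:
    caught = exc.errno                 # expected to be ENOTSOCK
finally:
    os.close(read_end)
    os.close(write_end)

print(caught == errno.ENOTSOCK)
```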

With register enabled (not using getpeername, ... functions):

  • Transaction rate: 3690.04 trans/sec
  • Transaction rate: 3636.36 trans/sec
  • Transaction rate: 3663.00 trans/sec

Without:

  • Transaction rate: 4629.63 trans/sec
  • Transaction rate: 4484.30 trans/sec
  • Transaction rate: 4608.29 trans/sec

So using the registered index is actually slower, most likely because I am manually closing the fd after accept, and having to register/unregister adds extra back and forth.
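
Averaging the three runs of each configuration puts the gap at roughly 20%. A short calculation over the numbers quoted above:

```python
from statistics import mean

# Transaction rates in trans/sec, copied from the runs above.
with_register = [3690.04, 3636.36, 3663.00]
without_register = [4629.63, 4484.30, 4608.29]

avg_with = mean(with_register)
avg_without = mean(without_register)
slowdown = 1 - avg_with / avg_without   # fraction lost with registration

print(f"{avg_with:.2f} vs {avg_without:.2f} trans/sec, {slowdown:.1%} slower")
```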

Maybe it will be a different case using that new open/register/close patch!

YoSTEALTH avatar Jul 20 '21 22:07 YoSTEALTH

Can *setsockopt* and *getsockopt* be implemented at the very least? So it can be used with io_uring_prep_accept_direct?

YoSTEALTH avatar Oct 21 '21 22:10 YoSTEALTH

This thread explained why setting socket buffer size didn't work for me. Is there someone currently working on implementing setsockopt and getsockopt?

tofes avatar Mar 24 '22 19:03 tofes

It is being considered, yes. Question is if it should be done separately, or be part of the passthrough work that is currently underway.

axboe avatar Mar 24 '22 19:03 axboe

passthrough work

?

YoSTEALTH avatar Mar 28 '22 05:03 YoSTEALTH

passthrough work

?

@YoSTEALTH It's probably about the recent io_uring passthrough patchset. See the full messages here: https://lore.kernel.org/io-uring/[email protected]/T/

ammarfaizi2 avatar Mar 28 '22 06:03 ammarfaizi2

Right - the point here is that we could easily do set/getsockopt through that, and would be wasteful to add them as separate opcodes given that.

axboe avatar Mar 28 '22 12:03 axboe

@ammarfaizi2 thanks.

Right - the point here is that we could easily do set/getsockopt through that, and would be wasteful to add them as separate opcodes given that.

Sounds like a good idea.

YoSTEALTH avatar Mar 29 '22 00:03 YoSTEALTH

@axboe any idea of timeline when pass-through will be added into io_uring+liburing?

YoSTEALTH avatar Apr 08 '22 20:04 YoSTEALTH

Most likely 5.19.

axboe avatar Apr 08 '22 20:04 axboe

Is it currently impossible to bind and listen on a socket created with io_uring_prep_socket_direct() because of this?

overmighty avatar Sep 09 '22 11:09 overmighty

Closing this one as it's a mix of a bunch of things, some of which are available at this point. Feel free to open a new issue, but please keep each issue specific to a single feature.

axboe avatar Oct 20 '22 21:10 axboe