liburing
Keeping TCP read latency down when everything is a short read
On TCP receive, pretty much all of my reads are going to be short, consuming small messages from the network. Under epoll that isn't an issue, because the data is delivered after the notice of data is posted. During the period in between, the kernel keeps reading and appending to the kernel socket buffer, and then the userland code sucks it all up at once (most of the time). My userland buffers are sized for this, and it keeps the latency to the most recently received packet the lowest.
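The edge-triggered epoll pattern described above might be sketched roughly like this (names and buffer size are illustrative; `fd` is assumed to be a non-blocking socket):

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep reading into an oversized userland buffer until the kernel
 * socket buffer is drained (short read or EAGAIN), so one wakeup
 * collects everything that arrived since the notification. */
static ssize_t drain_socket(int fd, char *buf, size_t cap)
{
    size_t total = 0;
    while (total < cap) {
        size_t want = cap - total;
        ssize_t n = read(fd, buf + total, want);
        if (n > 0) {
            total += (size_t)n;
            if ((size_t)n < want)
                break;          /* short read: kernel buffer drained */
        } else if (n == 0) {
            break;              /* peer closed */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;              /* nothing left to read right now */
        } else {
            return -1;          /* real error */
        }
    }
    return (ssize_t)total;
}
```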
On io_uring, the notice of data and the posting of the data happen at the same time, so with uneven packet flow I am constantly having to throttle delivery in various ways to try to read more, but that is a guessing game that will never be good enough. I can't link reads, because a short read will cancel the chain. I don't want to keep creating read ops, because that just fragments the data, making parsing more difficult, and wastes cycles. I'm getting a lot of small reads when I know the kernel has more but is waiting to process the read that will drain the kernel buffer. And unlike epoll, there is no good way of knowing whether there is more data sitting on the machine waiting to be posted.
I'm having a very difficult time getting my latency numbers down to epoll levels. Most of the benchmarks are done in a way that they get full reads or know the exact amount to expect (my messages are tiny compared to their amounts, so I just want to read everything every time). I don't understand how people are getting latency improvements (or even throughput improvements) with io_uring with large TCP buffers (my smallest is 1 MB and my largest was approximately 64 MB on one system).
How do I keep these reads going full?
I don't fully understand your workload, but maybe MSG_WAITALL is what you want to pass to RECV or RECVMSG, it's also available for SEND/SENDMSG[_ZC] in order to avoid short writes.
The workload is a large number of relatively small messages (tens to a couple hundred bytes each), and the source is very bursty. For epoll I oversize my read buffer so I know when I have drained it and the edge trigger will re-fire. I don't know what to do with io_uring to make it as latency efficient.
When I service the events on a socket, I want all the data the kernel has so I can bulk process the events (much better for me), but I don't want to wait on anything more. I would really like it to just keep adding to the same buffer until it is exhausted, then grab another and do it again. Periodically a thread would come and process up to the write pointer from the last processed message. This keeps my data contiguous, which is also helpful.
Setting up the ring with IORING_SETUP_DEFER_TASKRUN might help quite a bit for this case.
Do you see the difference I'm getting between epoll (where I'll suck in everything the kernel has) and io_uring (where I'm lucky to get more than one packet)?
IORING_SETUP_DEFER_TASKRUN
Does this matter if I am running thread-per-core, with a queue per core and the TX queue pinned too?
I think so, it'll avoid an immediate interruption when data notification arrives with io_uring, and instead defer processing of the receive until you run some variant of io_uring_wait_cqe().
I think so
I'll try that too, but it still doesn't fix the main problem: short reads are causing me a lot of pain, since epoll kind of naturally coalesced them in the time between the notification and recv being called.
But DEFER will kind of do the same: it'll put distance between the receive and when the notification happened.
If I want to just put in distance, I can turn interrupt delivery back on and use the NIC's coalescing for that (it's all polled right now; I haven't tried the NAPI support yet).
OT: IORING_FEAT_RSRC_TAGS finally lets me update the buffer ring without waiting for everything to clear. Nice.
Is IORING_CQE_F_SOCK_NONEMPTY continuously updated, or only set when a read is posted? E.g., can I tell just by reading that flag whether the recv buffer is non-empty? Is multishot recv done, and does it work with streams?
@jnordwick, as someone with a similar use case, I am interested in understanding the performance implications of using io_uring.
In the OP you describe the delay between notification and read as something you look for:
Under epoll that isn't an issue because the data is delivered after the notice of data is posted. During that period between, the kernel keeps reading and appending to the kernel socket buffer, and then the userland code sucks it all up at once (most of the time)
But when commenting on IORING_SETUP_DEFER_TASKRUN you say that delay (distance in time) is not enough:
If I want to just put distance i can turn interrupt delivery back on and use the nic coalescing for that
To me both cases sound very similar. What does epoll + read have that IORING_SETUP_DEFER_TASKRUN doesn't?