Keep reading when there's space in the buffer for tokio::io::copy
This is the current poll method for `tokio::io::copy` and `tokio::io::copy_bidirectional`.
https://github.com/tokio-rs/tokio/blob/adad8fc3cdeffb6449edac55558510f4d12c95be/tokio/src/io/util/copy.rs#L28-L76
However, as long as `self.cap < 2048`, it would be possible to attempt to read more data into the buffer. We should do this. One might also consider copying data from the end of the buffer to the start to make space for further reading.
See also #3572.
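For illustration, here is a minimal standalone sketch of the "copy data to the start of the buffer" part of the idea. The field names mirror the linked copy.rs, but this is not a patch against it:

```rust
struct Buf {
    buf: Box<[u8]>,
    pos: usize, // start of the not-yet-written data
    cap: usize, // end of the not-yet-written data
}

impl Buf {
    /// Shift the leftover bytes to the front so that `cap..buf.len()`
    /// becomes contiguous free space for further reads.
    fn compact(&mut self) {
        if self.pos > 0 {
            self.buf.copy_within(self.pos..self.cap, 0);
            self.cap -= self.pos;
            self.pos = 0;
        }
    }

    /// Free space available for another read after compaction.
    fn free_space(&self) -> usize {
        self.buf.len() - self.cap
    }
}

fn main() {
    let mut b = Buf {
        buf: vec![0u8; 2048].into_boxed_slice(),
        pos: 1500,
        cap: 2000,
    };
    b.compact();
    assert_eq!((b.pos, b.cap), (0, 500));
    println!("free space for further reads: {}", b.free_space());
}
```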
I would like to try this.
Go ahead!
@cssivision ~~Can't you see I'm trying to fix this?~~
~~Shouldn't you ask before you do it? Don't waste my time right?~~
It doesn't matter. Whatever.
Please be more careful when you take on PRs that someone has already claimed. If you are concerned that it is stalled, you should ask whether they are still working on it.
Yes, you are right.
@cssivision You don't have to close your PR; I'm just saying you should have asked me in advance instead of just submitting a PR, which would have wasted everyone's time. Please feel free to reopen your PR. (If you don't want to, that's fine.) Thanks!
Is this a real problem that needs to be handled? I find it hard to believe that, with modern drives and network speeds, the 2k buffer isn't filled all the time. I started a conversation that resulted in #3722 because I found reading and writing 2k at a time severely limited throughput.
I've found that reducing the calls into the kernel can increase performance, because context switches are a fairly expensive operation. Checking to see if a socket is readable or writable probably results in a kernel call (via `ioctl`), and then being added to the underlying `poll` is a call, albeit one made for all the handles the scheduler blocks on. Making it do this multiple times for a 2k buffer seems like the wrong direction.
Unless you're targeting tiny machines (I don't know if Rust compiles code for embedded micros), 2k is a tiny amount of RAM (barely bigger than an MTU and much smaller than a GigE jumbo packet!).
I'm asking that, if this is a feature that is to be added, please benchmark it to see how it performs in reading or writing to a gigabit Ethernet link.
> we should not poll that resource again during this poll
That's good to know. However, I believe my other concerns are still valid. `io::copy` is a valuable building block, but if it doesn't transfer data quickly, everyone will be writing their own performant version.
Yeah, we should probably change the default buffer size too. The one in std uses 8k, so I guess that would be reasonable?
8k should definitely be the minimum, since jumbo frames may become more common as GigE becomes the "slowest" network interface. It'll cut the number of kernel calls to a quarter of what they are now, at least.
Another thing to consider is TCP's NO_DELAY option. Some network traffic is chit-chatty (e.g. `ssh` and `telnet`). By default, TCP tries to buffer up more data to fill the MTU. But for login sessions or protocols with short request/reply messages, you don't want to wait for a full buffer to send, because more data may be held off until a reply is received. So some applications turn on the NO_DELAY option for the socket so that small packets are sent immediately.
Trying to fill the buffer defeats that option. So I'd avoid trying to fill the 8k buffer; just send what was received. This covers short exchanges that don't fill the buffer and large data transfers that completely fill it.
(In other words, short messages won't see a benefit from 8k, but large data transfers will.)
When it comes to NO_DELAY, that's not something `io::copy` controls. You can set it on the socket when you create it.
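For example, a hypothetical helper (`set_nodelay` is the existing method on `tokio::net::TcpStream`):

```rust
use tokio::net::TcpStream;

async fn connect_nodelay(addr: &str) -> std::io::Result<TcpStream> {
    let stream = TcpStream::connect(addr).await?;
    // Disable Nagle's algorithm so small writes go out immediately.
    stream.set_nodelay(true)?;
    Ok(stream)
}
```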
As for filling the buffer, the design I had in mind would only try to read more data into the buffer if the `AsyncWrite` is not able to send all of the data immediately. So it seems to me that this would not affect the case where the IO resources are extremely fast. It's still worth testing, of course.
That is, the idea is that if we are not able to write anything right now, we might as well read some more data from the socket if we have space for it. It would not involve postponing writes for the sake of batching many smaller reads together in cases where we are able to write data.
> When it comes to NO_DELAY, that's not something io::copy controls. You can set it on the socket when you create it.
Sorry, I wasn't being clear. If I set NO_DELAY on the socket, but you're waiting to fill the 2k buffer before sending it on, then you've defeated the purpose of NO_DELAY.
But your later response indicates you're only filling when you can't write, which I'm more in favor of. Especially if you increase the buffer size.
Right, I understand now what you meant with NO_DELAY. To confirm, I am not suggesting that we wait for the buffer to be filled before we try to send; rather, I am suggesting that if the send returns `WouldBlock`, then we try to do another read if there is space in the buffer.
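To illustrate what that could look like, here is a hypothetical, simplified sketch (not the actual tokio patch; `State` stands in for the buffer fields in copy.rs): when the write returns `Pending`, attempt one more read into the free part of the buffer before yielding.

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::{AsyncRead, AsyncWrite, ReadBuf};

struct State {
    buf: Box<[u8]>,
    pos: usize,      // start of unwritten data
    cap: usize,      // end of unwritten data
    read_done: bool, // reader reached EOF
}

fn poll_write_some<R, W>(
    state: &mut State,
    cx: &mut Context<'_>,
    mut reader: Pin<&mut R>,
    mut writer: Pin<&mut W>,
) -> Poll<io::Result<usize>>
where
    R: AsyncRead,
    W: AsyncWrite,
{
    match writer.as_mut().poll_write(cx, &state.buf[state.pos..state.cap]) {
        // (The real copy also treats Ok(0) as a WriteZero error; omitted here.)
        Poll::Ready(Ok(n)) => {
            state.pos += n;
            Poll::Ready(Ok(n))
        }
        Poll::Ready(Err(e)) => Poll::Ready(Err(e)),
        Poll::Pending => {
            // The writer can't make progress right now. If the buffer still
            // has free space and the reader isn't exhausted, opportunistically
            // try one more read so the buffer is fuller on the next wakeup.
            if !state.read_done && state.cap < state.buf.len() {
                let mut rb = ReadBuf::new(&mut state.buf[state.cap..]);
                if let Poll::Ready(res) = reader.as_mut().poll_read(cx, &mut rb) {
                    res?;
                    let filled = rb.filled().len();
                    if filled == 0 {
                        state.read_done = true; // EOF
                    } else {
                        state.cap += filled;
                    }
                }
            }
            Poll::Pending
        }
    }
}
```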
> the design I had in mind would only try to read more data into the buffer if the AsyncWrite is not able to send all of the data immediately
@Darksonn This sounds to me like the condition `i < self.cap`, where `i = ready!(writer.as_mut().poll_write(cx, &me.buf[me.pos..me.cap]))?`, but in the current implementation we already get back into the reading part if that condition holds. Or do you mean to try to read more data if `poll_write` returns `Pending`? Can you clarify what you mean by 'not able to send all of the data'?
Hey! So it does seem like the issue got a little stale, do you think I could offer my take on this?
@CthulhuDen go ahead!
Is anyone currently actively working on this (@CthulhuDen perhaps?) If not, I'd like to tackle it.
@farnz Please go ahead! Sorry, I don't have numbers on hand to back me up, but I found this fix changed nothing, because we're always bottlenecked on either input or output, and either way the change did not affect transfer throughput for me. But maybe you will arrive at positive results. Good luck!
Definitely do some benchmarks before and after to make sure your changes improve things. I know it's not easy writing test benchmarks, since they don't necessarily reflect the real world, but throughput can be affected in unexpected ways.
In a (C++) project, we wrapped a helper function around getting the system time. It was so easy to use, it got called in several places every time the program handled an I/O event. Getting the system time requires entering the kernel, which is pretty expensive. We changed the program to get the system time when it woke up and saved it. The helper function simply returned the cached value for the current I/O event. That easily doubled our throughput.
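The same idea, sketched in Rust purely for illustration (this is a paraphrase of the pattern, not code from that project): read the clock once per wakeup and let everything handling that event reuse the cached value.

```rust
use std::time::Instant;

struct EventContext {
    now: Instant,
}

impl EventContext {
    fn new() -> Self {
        // One clock read per I/O event...
        Self { now: Instant::now() }
    }

    // ...and every caller during that event reuses it.
    fn now(&self) -> Instant {
        self.now
    }
}

fn main() {
    let ctx = EventContext::new();
    // Handlers call ctx.now() instead of Instant::now(), avoiding repeated
    // clock reads (which on some platforms mean repeated system calls).
    println!("event timestamp: {:?}", ctx.now());
}
```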
In another instance, I called an `ioctl` to see how much space was in the socket buffer to decide whether to do a `send`. This meant every successful `send` required two entries into the kernel. Instead, the socket was set to return an error if there wasn't enough space. That change gave us a substantial bump in throughput.
I know these approaches don't translate well to `async` coding, because futures need to be cancelable, so sometimes you need to make the extra system calls to meet `async` requirements. But if you can make the code "correct" while reducing system calls, it'll be a win.
The changes to actually do extra reads while waiting for the write to become possible aren't that complex; I can put them up for review, but I'd prefer to finish the benchmarks to show that they're worthwhile first. I'm working now on benchmarks showing that:
- When doing memory-to-memory "I/O", there's no regression caused - this is the best case for the existing code, since neither reader nor writer ever blocks, and thus any slowdown here indicates that we're slowing things down in the fast case. (A minimal sketch of this case follows the list.)
- That there is a case where the change helps - this is likely to need a reader supplying data in tiny blocks (on the order of 2 KiB, so that we can get multiple blocks into the buffer), and a writer that is IOPS-bound, not throughput-bound (so it can handle very large writes, but only lets you make a small number of writes per second).
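For the memory-to-memory case, a minimal, hypothetical harness could look like the following (this is not the benchmark suite on my branch, just an illustration of the setup):

```rust
use std::time::Instant;
use tokio::io::{copy, sink};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // 64 MiB of in-memory data as the reader side.
    let data = vec![0u8; 64 * 1024 * 1024];
    // `&[u8]` implements AsyncRead and `sink()` discards all writes, so this
    // exercises copy's buffering logic with no real I/O.
    let mut reader = data.as_slice();
    let mut writer = sink();

    let start = Instant::now();
    let copied = copy(&mut reader, &mut writer).await?;
    let elapsed = start.elapsed();

    println!(
        "copied {} bytes in {:?} ({:.1} MiB/s)",
        copied,
        elapsed,
        copied as f64 / (1024.0 * 1024.0) / elapsed.as_secs_f64()
    );
    Ok(())
}
```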
So, benchmarks are done, and I've been able to create one case in which the change speeds things up, and four in which it doesn't affect things.
The no-change cases are:
- Straight fast-as-possible memory to memory copy
- Memory to HDD simulation copy
- Slow network simulation to memory copy
- Slow network simulation to HDD simulation copy
And the one that benefits (significantly):
- Slow network simulation to IOPS throttled unbuffered device copy.
However, I cannot see a case in which an IOPS-throttled unbuffered device is a realistic target. If you're sending over a network, the network stack adds buffering; if it's local, the kernel will normally add buffering. The only time this could matter is if you've implemented `AsyncWrite` yourself for a custom device, it's badly implemented, and you can't afford the memory cost of adding a `BufWriter` with a small buffer around your `AsyncWrite`.
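For reference, that mitigation is just a small `tokio::io::BufWriter` around the writer, e.g. (a file used here purely as a stand-in for a custom `AsyncWrite`):

```rust
use tokio::fs::File;
use tokio::io::BufWriter;

async fn open_buffered(path: &str) -> std::io::Result<BufWriter<File>> {
    // 8 KiB of buffering in front of an otherwise unbuffered AsyncWrite.
    Ok(BufWriter::with_capacity(8 * 1024, File::create(path).await?))
}
```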
My changes (with benchmark numbers on my aging Xeon E3-1245v2) are at https://github.com/tokio-rs/tokio/compare/master...farnz:tokio:copy_buffer - I'm happy to turn them into a proper pull request if there's still interest in this, because they do at least make it slightly harder to shoot yourself in the foot.
I've come up with a realistic benchmark that shows improvements, and I'm going to force-push my branch ready for a pull request. https://github.com/farnz/tokio/commit/382c0bc1e099aae3c558f69495d1f866d9e54735 has the old benchmarks from before the force-push, for historic reference (my previous comment).
The "trick" is to make the HDD stall randomly as its buffer fills up. I believe this is realistic behaviour, since it's what a real NAS would do under an overload scenario, if you were unlucky enough to come in just after the NAS assigned buffer space to another writer.
The missing part in my thinking was that concurrent users introduce a random chance of blocking because all buffers are full; it's reasonable to model this as a chance of blocking that's proportional to how full your buffer is (since your buffer empties regularly, this represents a growing chance of blocking as the underlying HDD buffer fills up).
With this model in the benchmark, I see wins copying from a slow source to an overloaded HDD simulation.
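For illustration only, here is a hypothetical sketch of that model (the names, numbers, and use of the `rand` crate are made up; the real benchmarks live on the linked branch): an `AsyncWrite` whose chance of returning `Pending` grows with how full its internal buffer is.

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::AsyncWrite;

struct OverloadedHdd {
    buffered: usize,       // bytes currently sitting in the device buffer
    capacity: usize,       // total device buffer size
    drain_per_poll: usize, // bytes the device retires between polls
}

impl AsyncWrite for OverloadedHdd {
    fn poll_write(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        let this = self.get_mut();

        // Simulate the device catching up a little between polls.
        this.buffered = this.buffered.saturating_sub(this.drain_per_poll);

        // Chance of stalling is proportional to how full the buffer is.
        let fullness = this.buffered as f64 / this.capacity as f64;
        if rand::random::<f64>() < fullness {
            // Stalled: ask to be polled again and report no progress.
            // (Self-waking is fine for a benchmark simulation.)
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }

        // Accept as much as fits in the remaining buffer space.
        let n = buf.len().min(this.capacity - this.buffered);
        this.buffered += n;
        Poll::Ready(Ok(n))
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }
}
```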