
Keep reading when there's space in the buffer for tokio::io::copy

Open Darksonn opened this issue 3 years ago • 16 comments

This is the current poll method for tokio::io::copy and tokio::io::copy_bidirectional.

https://github.com/tokio-rs/tokio/blob/adad8fc3cdeffb6449edac55558510f4d12c95be/tokio/src/io/util/copy.rs#L28-L76

However, the current implementation only refills the buffer once it has been completely drained. As long as self.cap < 2048 (the buffer size), it would be possible to attempt to read more data into the buffer, and we should do this. One might also consider copying data from the end of the buffer to the start to make space for further reading.
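Roughly, the idea would look like this. This is only a sketch, not the actual patch: the field names (buf, pos, cap, amt, read_done) follow the CopyBuffer struct linked above, and poll_fill_buf stands in for a hypothetical helper that performs one poll_read into buf[cap..] and updates cap and read_done.

```rust
// Sketch of the proposed behaviour inside CopyBuffer::poll_copy (fragment).
loop {
    // As today: refill only when the buffer has been fully drained.
    if self.pos == self.cap && !self.read_done {
        ready!(self.poll_fill_buf(cx, reader.as_mut()))?;
    }

    // Try to flush what is buffered.
    while self.pos < self.cap {
        match writer.as_mut().poll_write(cx, &self.buf[self.pos..self.cap]) {
            Poll::Ready(Ok(0)) => {
                return Poll::Ready(Err(io::ErrorKind::WriteZero.into()));
            }
            Poll::Ready(Ok(n)) => {
                self.pos += n;
                self.amt += n as u64;
            }
            Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
            Poll::Pending => {
                // The new part: the writer is blocked, so use the wait to top
                // up the buffer if there is room, compacting the unwritten
                // bytes to the front first so the free space is contiguous.
                if self.cap < self.buf.len() && !self.read_done {
                    if self.pos > 0 {
                        self.buf.copy_within(self.pos..self.cap, 0);
                        self.cap -= self.pos;
                        self.pos = 0;
                    }
                    // A Pending read is fine here; the writer has already
                    // registered its waker, so returning Pending is safe.
                    if let Poll::Ready(res) = self.poll_fill_buf(cx, reader.as_mut()) {
                        res?;
                    }
                }
                return Poll::Pending;
            }
        }
    }

    if self.read_done {
        ready!(writer.as_mut().poll_flush(cx))?;
        return Poll::Ready(Ok(self.amt));
    }
}
```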

See also #3572.

Darksonn avatar Apr 12 '21 12:04 Darksonn

I would like to try this.

hi-rustin avatar Apr 20 '21 05:04 hi-rustin

Go ahead!

Darksonn avatar Apr 20 '21 07:04 Darksonn

@cssivision ~~Can't you see I'm trying to fix this?~~

~~Shouldn't you ask before you do it? Don't waste my time right?~~

It doesn't matter. Whatever.

hi-rustin avatar Apr 22 '21 07:04 hi-rustin

Please be more careful when you take on PRs that someone has already claimed. If you are concerned that it is stalled, you should ask whether they are still working on it.

Darksonn avatar Apr 22 '21 10:04 Darksonn

Yes, you are right.

@cssivision You don't have to close your PR. I'm just saying you should have asked me in advance instead of just submitting a PR, since that ends up wasting everyone's time. Please feel free to reopen your PR. (If you don't want to, that's fine.) Thanks!

hi-rustin avatar Apr 22 '21 14:04 hi-rustin

Is this a real problem that needs to be handled? I find it hard to believe that, with modern drives and network speeds, the 2k buffer isn't filled all the time. I started a conversation that resulted in #3722 because I found reading and writing 2k at a time severely limited throughput.

I've found that reducing the calls into the kernel can increase performance, since context switches are a fairly expensive operation. Checking whether a socket is readable or writable probably results in a kernel call (via ioctl), and being added to the underlying poll is another call, albeit one shared by all handles the scheduler blocks on. Making it do this multiple times per 2k buffer seems like the wrong direction.

Unless you're targeting tiny machines (I don't know if Rust compiles code for embedded micros), 2k is a tiny amount of RAM (barely bigger than an MTU and much smaller than a GigE jumbo packet!).

I'm asking that, if this is a feature that is to be added, please benchmark it to see how it performs in reading or writing to a gigabit Ethernet link.

rneswold avatar Apr 23 '21 15:04 rneswold

we should not poll that resource again during this poll

That's good to know. However, I believe my other concerns are still valid. io::copy is a valuable building block but, if it doesn't transfer data quickly, everyone will be writing their own performant version.

rneswold avatar Apr 26 '21 15:04 rneswold

Yeah, we should probably change the default buffer size too. The one in std uses 8k, so I guess that would be reasonable?
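For reference, a caller who wants a larger buffer today can already get one by going through copy_buf with a BufReader of the chosen capacity. A sketch, assuming a tokio version that provides copy_buf and BufReader::with_capacity:

```rust
use tokio::io::{AsyncRead, AsyncWrite, BufReader};

// copy_buf reads through the BufReader's own buffer instead of copy's
// internal 2k buffer, so the capacity chosen here is what actually gets used.
async fn copy_with_8k_buffer<R, W>(reader: R, writer: &mut W) -> std::io::Result<u64>
where
    R: AsyncRead + Unpin,
    W: AsyncWrite + Unpin,
{
    let mut reader = BufReader::with_capacity(8 * 1024, reader);
    tokio::io::copy_buf(&mut reader, writer).await
}
```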

Darksonn avatar Apr 26 '21 16:04 Darksonn

8k should definitely be the minimum since jumbo frames may become more common as GigE becomes the "slowest" network interface. It'll cut the number of kernel calls to a quarter of what they are now, at least.

Another thing to consider is TCP's NO_DELAY option. Some network traffic is chit-chatty (e.g. ssh and telnet). By default, TCP tries to buffer up more data to fill the MTU. But for login sessions or protocols with short request/reply messages, you don't want to wait for a full buffer to send, because more data may be held off until a reply is received. So some applications turn on the NO_DELAY option for the socket so that small packets are sent immediately.

Trying to fill the buffer defeats that option. So I'd avoid trying to fill the 8k buffer; just send what was received. This covers short exchanges that don't fill the buffer and large data transfers that completely fill it.

(in other words, short messages won't see a benefit of 8k but large data transfers will.)

rneswold avatar Apr 26 '21 18:04 rneswold

When it comes to NO_DELAY, that's not something io::copy controls. You can set it on the socket when you create it.
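For example, this is done on the stream itself when connecting; the address here is just a placeholder:

```rust
use tokio::net::TcpStream;

async fn connect_no_delay() -> std::io::Result<TcpStream> {
    let stream = TcpStream::connect("example.com:80").await?;
    // Disable Nagle's algorithm so small writes go out immediately.
    stream.set_nodelay(true)?;
    Ok(stream)
}
```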

As for filling the buffer, the design I had in mind would only try to read more data into the buffer if the AsyncWrite is not able to send all of the data immediately. So it seems to me that this would not affect the case where the IO resources are extremely fast. It's still worth testing it of course.

Darksonn avatar Apr 26 '21 20:04 Darksonn

That is, the idea is that if we are not able to write anything right now, we might as well read some more data from the socket instead if we have space for it. It would not involve postponing writing any data in the cases where we are able to write data for the sake of buffering many smaller reads together.

Darksonn avatar Apr 26 '21 20:04 Darksonn

When it comes to NO_DELAY, that's not something io::copy controls. You can set it on the socket when you create it.

Sorry, I wasn't being clear. If I set NO_DELAY on the socket, but you're waiting to fill the 2k buffer before sending it on, then you've defeated the purpose of NO_DELAY.

But your later response indicates you're only filling when you can't write, which I'm more in favor of. Especially if you increase the buffer size.

rneswold avatar Apr 26 '21 20:04 rneswold

Right, I understand now what you meant with NO_DELAY. To confirm, I am not suggesting that we wait for the buffer to be filled before we try to send, rather I am suggesting that if the send returns WouldBlock, then we try to do another read if there is space in the buffer.

Darksonn avatar Apr 26 '21 20:04 Darksonn

the design I had in mind would only try to read more data into the buffer if the AsyncWrite is not able to send all of the data immediately

@Darksonn This sounds to me like the condition i < self.cap, where i = ready!(writer.as_mut().poll_write(cx, &me.buf[me.pos..me.cap]))?, but in the current implementation we already get back into the reading part if that condition holds. Or do you mean to try to read more data if poll_write returns Pending? Can you clarify what you mean by 'not able to send all of the data'?

b-naber avatar Jan 10 '22 17:01 b-naber

Hey! So it does seem like the issue got a little stale; do you think I could offer my take on this?

CthulhuDen avatar Aug 21 '22 19:08 CthulhuDen

@CthulhuDen go ahead!

Noah-Kennedy avatar Aug 21 '22 19:08 Noah-Kennedy

Is anyone currently actively working on this (@CthulhuDen perhaps)? If not, I'd like to tackle it.

farnz avatar Sep 23 '22 13:09 farnz

@farnz Please go ahead! Sorry, I don't have numbers on hand to back me up, but I found this fix changes nothing, because we're always bottlenecked on either input or output, so either way the change did not affect transfer throughput for me. But maybe you will arrive at positive results. Good luck!

CthulhuDen avatar Sep 23 '22 14:09 CthulhuDen

Definitely do some benchmarks before and after to make sure your changes improve things. I know it's not easy writing benchmarks since they don't necessarily reflect the real world, but throughput can be affected in unexpected ways.

In a (C++) project, we wrapped a helper function around getting the system time. It was so easy to use, it got called in several places every time the program handled an I/O event. Getting the system time requires entering the kernel, which is pretty expensive. We changed the program to get the system time when it woke up and saved it. The helper function simply returned the cached value for the current I/O event. That easily doubled our throughput.
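A minimal Rust sketch of that caching idea, with a made-up CachedClock type; whether a clock read actually enters the kernel depends on the platform and vDSO, but the principle of one read per wakeup is the same:

```rust
use std::time::Instant;

/// Refreshed once per event-loop wakeup so handlers can ask for "now"
/// many times without paying for another clock read.
struct CachedClock {
    now: Instant,
}

impl CachedClock {
    fn new() -> Self {
        CachedClock { now: Instant::now() }
    }

    /// Call once at the top of each wakeup; this is the only real clock read.
    fn refresh(&mut self) {
        self.now = Instant::now();
    }

    /// Cheap accessor used everywhere else while handling this event.
    fn now(&self) -> Instant {
        self.now
    }
}
```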

In another instance, I called an ioctl to see how much space was in the socket buffer to decide whether to do a send. This meant every successful send required two entries into the kernel. Instead, the socket was set to return an error if there wasn't enough space. That change gave us a substantial bump in throughput.

I know these approaches don't translate well to async coding because futures need to be cancelable so, sometimes, you need to make the extra system calls to meet async requirements. But if you can make the code "Correct" while reducing system calls, it'll be a win.

rneswold avatar Sep 23 '22 14:09 rneswold

The changes to actually do extra reads while waiting for the write to become possible aren't that complex; I can put them up for review, but I'd prefer to finish the benchmarks to show that they're worthwhile first. I'm working now on benchmarks showing that:

  1. When doing memory-to-memory "I/O", there's no regression caused - this is the best case for the existing code, since neither reader nor writer ever block, and thus any slowdown here indicates that we're slowing things down in the fast case.
  2. That there is a case where the change helps - this is likely to need a reader supplying data in tiny blocks (on the order of 2 KiB, so that we can get multiple blocks into the buffer), and a writer that is IOPS-bound, not throughput-bound (so it can handle very large writes, but only lets you make a small number of writes per second); a sketch of what such a writer might look like follows this list.
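To make case 2 concrete, here is a rough sketch of what an IOPS-bound, throughput-unbound writer could look like. This is not the benchmark code from the branch; IopsBoundWriter and its parameters are made up for illustration:

```rust
use std::future::Future;
use std::io;
use std::pin::Pin;
use std::task::{ready, Context, Poll};

use tokio::io::AsyncWrite;
use tokio::time::{sleep, Duration, Instant, Sleep};

/// Accepts writes of any size, but only a limited number of write
/// operations per second: IOPS-bound, not throughput-bound.
struct IopsBoundWriter {
    interval: Duration,
    delay: Pin<Box<Sleep>>,
}

impl IopsBoundWriter {
    fn new(ops_per_sec: u32) -> Self {
        IopsBoundWriter {
            interval: Duration::from_secs(1) / ops_per_sec,
            delay: Box::pin(sleep(Duration::ZERO)),
        }
    }
}

impl AsyncWrite for IopsBoundWriter {
    fn poll_write(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        let this = self.get_mut();
        // Return Pending until the per-operation delay has elapsed...
        ready!(this.delay.as_mut().poll(cx));
        // ...then swallow the whole buffer in one "operation" and re-arm it.
        let next = Instant::now() + this.interval;
        this.delay.as_mut().reset(next);
        Poll::Ready(Ok(buf.len()))
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }
}
```

Copying from a slow reader into such a writer with tokio::io::copy should hit exactly the situation where poll_write returns Pending while the copy buffer still has free space.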

farnz avatar Sep 28 '22 19:09 farnz

So, benchmarks done, and I've been able to create one case in which the change speeds things up, and 4 in which it doesn't affect things.

The no change cases are:

  • Straight fast-as-possible memory to memory copy
  • Memory to HDD simulation copy
  • Slow network simulation to memory copy
  • Slow network simulation to HDD simulation copy

And the one that benefits (significantly):

  • Slow network simulation to IOPS throttled unbuffered device copy.

However, I cannot see a case in which an IOPS throttled unbuffered device is a realistic target. If you're sending over a network, the network stack adds buffering; if it's local, the kernel will normally add buffering. The only time this could matter is if you've implemented AsyncWrite yourself for a custom device, and it's badly implemented, and you can't afford the memory cost of adding a BufWriter with a small buffer around your AsyncWrite.
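For completeness, the BufWriter workaround mentioned above would look roughly like this; buffer_device and the 4 KiB capacity are arbitrary:

```rust
use tokio::io::{AsyncWrite, BufWriter};

// Wrap the custom AsyncWrite in a small buffer so io::copy sees a writer
// that accepts reasonably sized writes even if the device itself does not.
fn buffer_device<W: AsyncWrite>(device: W) -> BufWriter<W> {
    BufWriter::with_capacity(4 * 1024, device)
}
```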

My changes (with benchmark numbers on my aging Xeon E3-1245v2) are at https://github.com/tokio-rs/tokio/compare/master...farnz:tokio:copy_buffer - I'm happy to turn them into a proper pull request if there's still interest in this, because they do at least make it slightly harder to shoot yourself in the foot.

farnz avatar Sep 29 '22 15:09 farnz

I've come up with a realistic benchmark that shows improvements, and I'm going to force-push my branch ready for a pull request. https://github.com/farnz/tokio/commit/382c0bc1e099aae3c558f69495d1f866d9e54735 is the old benchmarks before the force push for historic reference (my previous comment).

The "trick" is to make the HDD stall randomly as the buffer fills up. I believe this is a realistic behaviour, since it's what a real NAS would do if it was under overload scenario, and you were unlucky enough to come in just after the NAS assigns buffer space to another writer.

farnz avatar Sep 30 '22 14:09 farnz

And with a bit more thinking, I've been able to write a realistic benchmark that does show improvement - I'm going to force push my branch and use it for the pull request, but https://github.com/farnz/tokio/commit/382c0bc1e099aae3c558f69495d1f866d9e54735 has the old benchmarks.

The missing part in my thinking was that concurrent users introduce a random chance of blocking because all buffers are full; it's reasonable to model this as a chance of blocking that's proportional to how full your buffer is (since your buffer empties regularly, this represents a growing chance of blocking as the underlying HDD buffer fills up).

With this model in the benchmark, I see wins copying from a slow source to an overloaded HDD simulation.

farnz avatar Sep 30 '22 14:09 farnz