Tracking issue for `Write::write_all_vectored`
This is a tracking issue for `io::Write::write_all_vectored`.
Feature gate: `#![feature(write_all_vectored)]`.
Steps:
- [x] Implement the RFC (#70612).
- [ ] Stabilization PR.
Unresolved questions:
- [ ] Can we improve the API? Currently the method takes the `IoSlice`s as a mutable slice and modifies them. This is a pretty unusual and potentially error-prone API. We could either find a way to not mutate the argument, or enforce (via the type system) that the argument can't be used by the caller afterwards. Or we can decide that such an unusual API is fine for this rather low-level method.
Original issue:
In the `io::Write` trait we've got the helpful `write_all` method, which calls `write` in a loop to write all bytes. However, there is no such function for `write_vectored`. I would suggest adding a function called `write_all_vectored` to perform such a task.
A possible implementation. Note that `bufs` is a mutable slice; see also the discussion in https://github.com/rust-lang/futures-rs/pull/1741/files. On the playground: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=872e9d973bd8101e7724292f87a82869.
```rust
pub trait Write {
    // ...
    fn write_all_vectored(&mut self, mut bufs: &mut [IoSlice<'_>]) -> io::Result<()> {
        while !bufs.is_empty() {
            match self.write_vectored(bufs) {
                Ok(0) => {
                    return Err(Error::new(
                        ErrorKind::WriteZero,
                        "failed to write whole buffer",
                    ));
                }
                // `IoSlice::advance` (nightly) drops the fully written slices
                // from the front and advances the cursor within the first
                // partially written one; `mem::replace` works around reborrow
                // lifetime issues when reassigning `bufs`.
                Ok(n) => bufs = IoSlice::advance(mem::replace(&mut bufs, &mut []), n),
                // Retry the write if we were interrupted.
                Err(ref e) if e.kind() == ErrorKind::Interrupted => {}
                Err(e) => return Err(e),
            }
        }
        Ok(())
    }
}
```
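For illustration, a minimal usage sketch of the method on nightly (the `header`/`body` buffers and the `Vec<u8>` writer are just for demonstration):

```rust
#![feature(write_all_vectored)]

use std::io::{IoSlice, Write};

fn main() -> std::io::Result<()> {
    let header = b"len=11\n";
    let body = b"hello world";
    // Gather both pieces into one vectored write; no copying into a
    // temporary buffer required.
    let mut bufs = [IoSlice::new(header), IoSlice::new(body)];

    let mut out = Vec::new();
    // Note: the contents of `bufs` are unspecified after this returns.
    out.write_all_vectored(&mut bufs)?;
    assert_eq!(out.len(), header.len() + body.len());
    Ok(())
}
```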
Related: https://github.com/rust-lang/futures-rs/pull/1741 /cc @cramertj
cc @sfackler, @withoutboats
The main API question I have (which is more about `IoSlice::advance` than this, I guess) is whether we're okay saying that the contents of the `bufs` slice are unspecified after this call returns. It's probably fine, but it's just one "weird" effect that you may not naively expect.
@sfackler I agree, but I don't know how an alternative that takes &[IoSlice<'_>], like you would expect, would work.
It could for example perform vectored writes when the current cursor position is at a slice boundary and switch to a normal `write_all` if it's in the middle of one of the slices.
But then you're missing out on the efficiency of vectored writes, if that were the case I wouldn't consider using it.
You are only missing out on the efficiency of vectored writes some of the time.
True, but with the implementation I provided (which I still admit isn't great) you don't miss out. I've made the point already in https://github.com/rust-lang/futures-rs/pull/1741#discussion_r398715869, but personally I don't think it's really a big deal that the provided slice is modified. In most cases I imagine the `bufs` slice being built just before this call and only used for this call.
One possibility would be to keep a fixed-size buffer of, say, `[IoSlice<'_>; 128]` and copy sections of the caller's slice into that, pairing it with an index for the first not-yet-written `IoSlice`. It's more complicated in that you have to re-load that `IoSlice` at some clever point in order to both minimize shifting but also maximize the number of `IoSlice`s being offered to the underlying writer, but it prevents mutation of the original caller's slice while also allowing some fixed amount of vectored writes.
I'm not necessarily advocating for this idea, as I think it might be fine to mutate the caller's slice (they can always make a clone if they don't want to deal with that), but it's an option.
@LukasKalbertodt I've changed the issue to be the tracking issue; could you apply the correct labels?
I'm very happy to see this method added, thank you!
I regularly find myself wanting this method, when writing various different pieces of data into a buffer and not wanting to copy them around excessively.
Regarding the API:
Normally, you can't split a writev call without creating a potential semantic difference. However, if writev returns early and hasn't written all the data, then the write already won't be atomic, so there's no requirement to keep the remaining iovecs together.
With that in mind, why don't we implement write_all_vectored to take a non-mutable &[IoSlice], and then if write_vectored does a partial write, we use write_all to write the rest of the current partial IoSlice if any, then resume calling write_vectored with the remaining subslice of the caller's &[IoSlice] values? That doesn't require any allocation or copying at all.
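A rough sketch of that proposal, as a hypothetical free function (not the std method; the error handling mirrors `write_all`):

```rust
use std::io::{self, ErrorKind, IoSlice, Write};

// Sketch: take `&[IoSlice]` immutably; after a partial write, finish the
// partially written slice with `write_all`, then resume vectored writes at
// the next slice boundary. No allocation, no mutation of the caller's slice.
fn write_all_vectored_by_ref<W: Write>(w: &mut W, mut bufs: &[IoSlice<'_>]) -> io::Result<()> {
    while !bufs.is_empty() {
        let mut n = match w.write_vectored(bufs) {
            Ok(0) => {
                return Err(io::Error::new(
                    ErrorKind::WriteZero,
                    "failed to write whole buffer",
                ))
            }
            Ok(n) => n,
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // Step over the slices that were written completely.
        while let Some((first, rest)) = bufs.split_first() {
            if n < first.len() {
                break;
            }
            n -= first.len();
            bufs = rest;
        }
        // If we stopped mid-slice, finish that slice with a plain write_all
        // and continue the vectored writes from the next boundary.
        if n > 0 {
            let (first, rest) = bufs.split_first().expect("wrote past the end");
            w.write_all(&first[n..])?;
            bufs = rest;
        }
    }
    Ok(())
}
```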
@joshtriplett - yep, that approach was discussed up above. The thought for the current implementation is that the caller isn't going to care about the contents of the buffers slice after the call completes, and mutating it minimizes the number of write calls.
If the calls are returning early, there may not be as much value in minimizing the number of calls.
Also, could we restore the slice afterwards, even if we mutate it as we go? For each call to write_vectored, at most one change to the slice needs to occur, so we could easily change it back after that call. That way, we're requiring exclusive access to the slice but giving it back.
> If the calls are returning early, there may not be as much value in minimizing the number of calls.
I don't quite understand this statement; the whole point of using vectored I/O is minimizing the number of system calls.
> Also, could we restore the slice afterwards, even if we mutate it as we go? For each call to `write_vectored`, at most one change to the slice needs to occur, so we could easily change it back after that call. That way, we're requiring exclusive access to the slice but giving it back.
I guess we could but are there many cases in which you need to use the buffers after having written them all? I've personally never encountered that.
We could also create an IoSlices type that wraps &mut [IoSlice<'_>] and take ownership of that to make the ownership part of the API clearer.
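A minimal sketch of what such a wrapper could look like (the `IoSlices` name and shape here are hypothetical, not a concrete proposal):

```rust
use std::io::IoSlice;

// Hypothetical wrapper: constructing one hands the mutable slice over, so a
// `write_all_vectored(bufs: IoSlices<'_, '_>)` taking it by value makes the
// "contents are unspecified afterwards" contract visible in the signature.
pub struct IoSlices<'a, 'b>(&'b mut [IoSlice<'a>]);

impl<'a, 'b> IoSlices<'a, 'b> {
    pub fn new(bufs: &'b mut [IoSlice<'a>]) -> Self {
        IoSlices(bufs)
    }
}
```

The point is only that a by-value parameter signals consumption: the caller can see from the signature that the slices are handed to the call.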
On Fri, Apr 17, 2020 at 01:12:55AM -0700, Thomas de Zeeuw wrote:
> > If the calls are returning early, there may not be as much value in minimizing the number of calls.
>
> I don't quite understand this statement; the whole point of using vectored I/O is minimizing the number of system calls.
Under normal circumstances, I'd expect the kernel to process as much of a `writev` call as possible. What I'm suggesting is that if `writev` returns early, such that you have to re-call it in order to finish writing, then making an additional `write` call before doing so does not seem likely to cause a problem.
If someone is trying to optimize the number of syscalls there, they would want to start by figuring out why the kernel returned early from `writev` in the first place.
> > Also, could we restore the slice afterwards, even if we mutate it as we go? For each call to `write_vectored`, at most one change to the slice needs to occur, so we could easily change it back after that call. That way, we're requiring exclusive access to the slice but giving it back.
>
> I guess we could but are there many cases in which you need to use the buffers after having written them all? I've personally never encountered that.
If you want to reuse the same buffers, it may also make sense to reuse the [IoSlice].
> We could also create an `IoSlices` type that wraps `&mut [IoSlice<'_>]` and take ownership of that to make the ownership part of the API clearer.
That would improve on the current API. Or we could accept a `Cow`, and only make a copy if we need to mutate it.
> Under normal circumstances, I'd expect the kernel to process as much of a `writev` call as possible. What I'm suggesting is that if `writev` returns early, such that you have to re-call it in order to finish writing, then making an additional `write` call before doing so does not seem likely to cause a problem.
>
> If someone is trying to optimize the number of syscalls there, they would want to start by figuring out why the kernel returned early from `writev` in the first place.
That can have many causes that change from OS to OS, and even from OS version to OS version; I don't think it's worth it to determine the cause for each. But we should still keep the goal of minimising the number of system calls; using `write` followed by more `writev` calls goes against this goal.
> If you want to reuse the same buffers, it may also make sense to reuse the `[IoSlice]`.
The underlying buffers can be reused, so all you'll need to do is call `IoSlice::new`, which is effectively a copy of the pointer and length. I don't think that is too much overhead.
> That would improve on the current API. Or we could accept a `Cow`, and only make a copy if we need to mutate it.
Maybe, but what would the `ToOwned` impl for `IoSlice` be? For a slice it's converted into a `Vec`, which allocates, something we don't want for this API.
On Tue, Apr 21, 2020 at 02:22:57AM -0700, Thomas de Zeeuw wrote:
> > If you want to reuse the same buffers, it may also make sense to reuse the `[IoSlice]`.
> The underlying buffers can be reused, so all you'll need to do is call `IoSlice::new`, which is effectively a copy of the pointer and length. I don't think that is too much overhead.
Consider the case where you have a long [IoSlice] because you're stitching together a buffer from many disparate pieces.
> > That would improve on the current API. Or we could accept a `Cow`, and only make a copy if we need to mutate it.
>
> Maybe, but what would the `ToOwned` impl for `IoSlice` be? For a slice it's converted into a `Vec`, which allocates, something we don't want for this API.
Good point. Adding `IoSlices` to represent an owned `[IoSlice]` seems like a good step, then, if we decide to consume the `IoSlice`.
Hey y'all: Just noticed an issue with `write_all_vectored` as I was playing around with it. The documentation says:

> If the buffer contains no data, this will never call `write_vectored`.

Looking at the implementation, that's not what happens. In particular, if `bufs` is nonempty but only contains empty `IoSlice`s, `write_all_vectored` not only will call `write_vectored`, but it will return an error even when the call succeeded:
```rust
fn write_all_vectored(&mut self, mut bufs: &mut [IoSlice<'_>]) -> Result<()> {
    while !bufs.is_empty() {
        match self.write_vectored(bufs) {
            Ok(0) => {
                return Err(Error::new(ErrorKind::WriteZero, "failed to write whole buffer"));
            }
            // ...
```
@ezrosent @tmiasko fixed it in #74088.
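For reference, the current std implementation guards against this case by advancing by zero before entering the loop, which strips any leading empty slices (shown here with the modern `IoSlice::advance_slices` API):

```rust
// Guarantee that bufs is empty if it contains no data, so the loop (and
// therefore write_vectored) is never entered when there is nothing to write.
IoSlice::advance_slices(&mut bufs, 0);
while !bufs.is_empty() {
    // ... as above
}
```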
@Thomasdezeeuw is it possible to also implement a similar feature for reading?
Like `read_exact_vectored()`, which would try to fill `&mut [IoSliceMut]` until it encounters an EOF. And if there is not enough data, it would return `UnexpectedEof`.
Similar to this:
```rust
pub trait Read {
    // ...
    fn read_exact_vectored(&mut self, mut bufs: &mut [IoSliceMut]) -> std::io::Result<()> {
        // Guarantee that bufs is empty if it contains no data,
        // to avoid calling read_vectored if there is no data to be read.
        bufs = IoSliceMut::advance(bufs, 0);
        while !bufs.is_empty() {
            match self.read_vectored(bufs) {
                Ok(0) => {
                    return Err(std::io::Error::new(
                        std::io::ErrorKind::UnexpectedEof,
                        "failed to fill whole buffer",
                    ));
                }
                Ok(n) => bufs = IoSliceMut::advance(bufs, n),
                Err(ref e) if e.kind() == std::io::ErrorKind::Interrupted => {}
                Err(e) => return Err(e),
            }
        }
        Ok(())
    }
}
```
@q2p I guess so, but let's not track that in this issue. You can create a PR for it.
It seems like the biggest blocker for this is the open question about the API. The current API requires the caller to pass a &mut, so that write_all_vectored can modify it in place if it needs to make an additional writev call. This seems to have been proposed for performance reasons, to minimize the work required to make subsequent calls.
I would like to propose changing this, for a couple of reasons:
- In practice, based on many codebases I've seen, I expect the common case to have 2-3 iovecs, not hundreds or thousands. People primarily seem to use this for cases like writing from a ringbuffer and handling wraparound (using two iovecs), or writing a header and body.
- In the slow path of having to make an additional syscall due to a partial initial write, the cost of a syscall will vastly eclipse the cost of copying a few iovecs.
Meanwhile, requiring a writable slice of iovecs forces the caller to reconstruct their iovecs for every call, rather than reusing them for multiple calls (possibly with some modifications, but not rewriting them from scratch). This pessimizes the fast path in favor of optimizing the slower path.
I would propose changing this to an immutable reference, instead. If the initial write comes back partial, write_all_vectored can either copy the iovecs into a small stack buffer, or (in the uncommon case of a huge slice of iovecs) it can do a separate write syscall for the initial buffer.
This would both optimize the common case, and make all cases easier to invoke.
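A sketch of the stack-buffer variant of this proposal (the helper name and buffer size are illustrative choices; for simplicity it caps each `writev` at a fixed number of iovecs rather than special-casing huge slices with a separate `write`):

```rust
use std::io::{self, ErrorKind, IoSlice, Write};

// Sketch: never mutate the caller's slice; instead copy up to MAX_IOVECS
// IoSlices (pointer/length pairs, since IoSlice is Copy) into a local
// scratch array and advance only those private copies on partial writes.
fn write_all_vectored_copying<W: Write>(w: &mut W, bufs: &[IoSlice<'_>]) -> io::Result<()> {
    const MAX_IOVECS: usize = 8; // arbitrary illustrative size
    for chunk in bufs.chunks(MAX_IOVECS) {
        let mut scratch = [IoSlice::new(&[]); MAX_IOVECS];
        scratch[..chunk.len()].copy_from_slice(chunk);
        let mut slices = &mut scratch[..chunk.len()];
        while !slices.is_empty() {
            match w.write_vectored(slices) {
                Ok(0) => {
                    return Err(io::Error::new(
                        ErrorKind::WriteZero,
                        "failed to write whole buffer",
                    ))
                }
                // Advance our private copies, not the caller's slice.
                Ok(n) => IoSlice::advance_slices(&mut slices, n),
                Err(ref e) if e.kind() == ErrorKind::Interrupted => {}
                Err(e) => return Err(e),
            }
        }
    }
    Ok(())
}
```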
We discussed this in the @rust-lang/libs-api meeting today. Some others felt that this was a reasonable approach to simplify the API; others didn't feel strongly either way; nobody felt that we should keep the current &mut API.
> In practice, based on many codebases I've seen, I expect the common case to have 2-3 iovecs, not hundreds or thousands. People primarily seem to use this for cases like writing from a ringbuffer and handling wraparound (using two iovecs), or writing a header and body.
I am not saying that it is a common pattern, but in an ecological simulation I am using a single call to `write_all_vectored` with up to 10^5 `IoSlice` items to write out the simulation state as a single copy into the page buffer, i.e. without assembling a separate serialized representation first. (Since the indirections I need to follow for this might have been re-allocated when the next state dump comes, I cannot reuse the content of the `Vec<IoSlice>`.) (This is almost surely UB, writing padding to disk and such, so probably another argument for not supporting such patterns.)
> I would propose changing this to an immutable reference, instead. If the initial write comes back partial, `write_all_vectored` can either copy the iovecs into a small stack buffer, or (in the uncommon case of a huge slice of iovecs) it can do a separate `write` syscall for the initial buffer.
Agreeing that the above situation is not common, I still wonder if, with the platform-specific knowledge embedded in libstd, that temporary buffer could be made sufficiently large to "saturate" the system call interface, e.g. contain 1024 `IoSlice` elements on Linux, so that the number of system calls would not increase in the slow path even though the small overhead of copying the `IoSlice`s is taken. But I am not sure at which point the possibly repeated copying would be worse than doing an additional system call.
> It seems like the biggest blocker for this is the open question about the API. The current API requires the caller to pass a `&mut`, so that `write_all_vectored` can modify it in place if it needs to make an additional `writev` call. This seems to have been proposed for performance reasons, to minimize the work required to make subsequent calls.
Note that I'm the one who added that question, and after working with it for a while it's not that big a deal. It's a little different than what you might expect, e.g. compared to the `write_vectored` call, but it's very usable.
> I would like to propose changing this, for a couple of reasons:
>
> * In practice, based on many codebases I've seen, I expect the common case to have 2-3 iovecs, not hundreds or thousands. People primarily seem to use this for cases like writing from a ringbuffer and handling wraparound (using two iovecs), or writing a header and body.
I don't think that is true. If I remember correctly the limit is 1024, and I've been near it before, so I don't think we should optimise for the 2-3 iovecs use case too much. Of course, all this is just anecdotal.
> * In the slow path of having to make an additional syscall due to a partial initial write, the cost of a syscall will vastly eclipse the cost of copying a few iovecs.
>
> Meanwhile, requiring a writable slice of iovecs forces the caller to reconstruct their iovecs for every call, rather than reusing them for multiple calls (possibly with some modifications, but not rewriting them from scratch). This pessimizes the fast path in favor of optimizing the slower path.
> I would propose changing this to an immutable reference, instead. If the initial write comes back partial, `write_all_vectored` can either copy the iovecs into a small stack buffer, or (in the uncommon case of a huge slice of iovecs) it can do a separate `write` syscall for the initial buffer.
I have to say I disagree with this approach. Do you want to add a stack of 1024 (`IOV_MAX`) iovecs to the function for copying the buffers?
I know the current API is a little odd at first, but when you assume that the buffer (slices) you passed in are not usable after the call, it's an OK API to use. Furthermore, as it's currently implemented it's zero overhead, while your suggestion of copying is not. If you're doing vectored I/O in the first place, it's likely you care about this.
> This would both optimize the common case, and make all cases easier to invoke.
> We discussed this in the @rust-lang/libs-api meeting today. Some others felt that this was a reasonable approach to simplify the API; others didn't feel strongly either way; nobody felt that we should keep the current `&mut` API.
I didn't attend this meeting, nor am I part of the libs team, but I do feel strongly against this proposed change. May I ask how many of the people involved actually use the API? Because that might influence their vote. (I hope I don't come off too strong with the last sentence; I'm just curious whether the people deciding on this are actually using the API.)
@Thomasdezeeuw I'm not suggesting that we optimize for the 2-3 case; I'm suggesting that we don't need to optimize heavily for an already-less-optimized subcase (partial writes) of the 1024 case, at the expense of both usability (not having to recreate buffers) and other optimization (not having to recreate buffers).
> I have to say I disagree with this approach. Do you want to add a stack of 1024 (`IOV_MAX`) iovecs to the function for copying the buffers?
No, I personally would add a local array of 8, and if larger than that, call write on any partial iovec before calling writev on the remainder.
Also, we have helpers that'd make it easy to write a mutate-in-place write_all_vectored using write_vectored.
> as it's currently implemented it's zero overhead, while your suggestion of copying is not.
It's not zero-overhead if you count the cost of recreating a new set of iovecs for each call.
> If you're doing vectored I/O in the first place, it's likely you care about this.
In the optimal case, your writev call completes in a single syscall; in any other case, syscall overhead will by far exceed any other overhead. If you're doing vectored I/O for efficiency, you should try to ensure your operation completes in one writev call.
> @Thomasdezeeuw I'm not suggesting that we optimize for the 2-3 case; I'm suggesting that we don't need to optimize heavily for an already-less-optimized subcase (partial writes) of the 1024 case, at the expense of both usability (not having to recreate buffers) and other optimization (not having to recreate buffers).
I agree that the usability improvement is nice, but I've found I always have to recreate my buffers anyway (see below). If it's such a big problem, maybe this needs a new type that wraps `&mut [IoSlice]` to show that ownership is passed to the function?
> > I have to say I disagree with this approach. Do you want to add a stack of 1024 (`IOV_MAX`) iovecs to the function for copying the buffers?
>
> No, I personally would add a local array of 8, and if larger than that, call `write` on any partial iovec before calling `writev` on the remainder.
That would make this function unusable for my case, which would be a real shame.
> Also, we have helpers that'd make it easy to write a mutate-in-place `write_all_vectored` using `write_vectored`.
>
> > as it's currently implemented it's zero overhead, while your suggestion of copying is not.
> It's not zero-overhead if you count the cost of recreating a new set of iovecs for each call.
But you'll almost always have to recreate the iovecs before the write call. Take your example of a ring buffer; how do you write into it while maintaining your iovecs? I don't think you can, as you need both a mutable (for writing) and an immutable (for the `writev` call) reference into the data, which Rust of course doesn't allow. So I don't really think this argument holds water.
Could you show an example where it is possible to keep your iovecs around while modifying the buffer?
> > If you're doing vectored I/O in the first place, it's likely you care about this.
>
> In the optimal case, your `writev` call completes in a single syscall; in any other case, syscall overhead will by far exceed any other overhead. If you're doing vectored I/O for efficiency, you should try to ensure your operation completes in one `writev` call.
I agree, but it's not possible for a program to ensure the operation completes in one `writev` call. The number of bytes the OS can write in a single `writev` call differs per OS, and even per target, i.e. different filesystem, socket, etc.
@Thomasdezeeuw
> If it's such a big problem, maybe this needs a new type that wraps `&mut [IoSlice]` to show that ownership is passed to the function?
The issue isn't just that callers will forget this (&mut is a big hint that you can't count on it being unmodified), though that's also a problem. The primary issue seems like having to deal with it being overwritten by the function in the first place.
> Could you show an example where it is possible to keep your iovecs around while modifying the buffer?
```rust
// Assuming the proposed immutable `&[IoSlice]` signature:
let mut iovecs = [
    IoSlice::new(header),
    IoSlice::new(x),
];
w.write_all_vectored(&iovecs)?;
iovecs[1] = IoSlice::new(y);
w.write_all_vectored(&iovecs)?;
```
Also, there are several functions we could add to make this better. For instance, we could add a function from &mut [IoSliceMut] to &[IoSlice], so that people can mutate a buffer through the IoSliceMut.
FWIW, I could also work with a write_all_vectored that requires a &mut but guarantees to change the slices back when done. "This uses your IoSlices as scratch space, but will always restore them to an unmodified state afterwards." But we can't make that guarantee about user impls of Write, only about our standard implementation; we could only do that if we made this a method that you can't implement yourself (e.g. a sealed WriteExt with a blanket impl). I don't think that's worth it.
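For the record, a rough sketch of what that restoring variant could look like, written against today's in-place `IoSlice::advance` (the helper name is hypothetical; this is only a sketch of the idea, not a std guarantee):

```rust
use std::io::{self, ErrorKind, IoSlice, Write};

// Sketch of "use the slices as scratch space, but restore them afterwards":
// track progress with an index, truncate at most one slice in place at a
// time, and put the saved original back on every exit path.
fn write_all_vectored_restoring<W: Write>(
    w: &mut W,
    bufs: &mut [IoSlice<'_>],
) -> io::Result<()> {
    let mut start = 0; // index of the first slice not yet fully written
    let mut saved: Option<(usize, IoSlice<'_>)> = None; // original of the truncated slice

    let result = loop {
        if start == bufs.len() {
            break Ok(());
        }
        match w.write_vectored(&bufs[start..]) {
            Ok(0) => {
                break Err(io::Error::new(
                    ErrorKind::WriteZero,
                    "failed to write whole buffer",
                ))
            }
            Ok(mut n) => {
                // Step over the slices that are now fully written, restoring
                // the truncated one as soon as we move past it.
                while start < bufs.len() && n >= bufs[start].len() {
                    n -= bufs[start].len();
                    if let Some((i, original)) = saved.take() {
                        if i == start {
                            bufs[i] = original;
                        } else {
                            saved = Some((i, original));
                        }
                    }
                    start += 1;
                }
                // Partial write into bufs[start]: save it once, then truncate.
                if n > 0 {
                    if saved.is_none() {
                        saved = Some((start, bufs[start]));
                    }
                    bufs[start].advance(n);
                }
            }
            Err(ref e) if e.kind() == ErrorKind::Interrupted => {}
            Err(e) => break Err(e),
        }
    };
    // Give the caller back an unmodified slice, even on error.
    if let Some((i, original)) = saved {
        bufs[i] = original;
    }
    result
}
```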
> @Thomasdezeeuw
>
> > If it's such a big problem, maybe this needs a new type that wraps `&mut [IoSlice]` to show that ownership is passed to the function?
>
> The issue isn't just that callers will forget this (`&mut` is a big hint that you can't count on it being unmodified), though that's also a problem. The primary issue seems like having to deal with it being overwritten by the function in the first place.
To be fair, that's why the documentation says:

> Once this function returns, the contents of `bufs` are unspecified, as this depends on how many calls to `write_vectored` were necessary.
If you program with that in mind, essentially passing ownership of `bufs` to the function (this is why I think a type might help), I don't think it's a problem. But I think I'm just rehashing points I've made before.
> > Could you show an example where it is possible to keep your iovecs around while modifying the buffer?
>
> ```rust
> let mut iovecs = [IoSlice::new(header), IoSlice::new(x)];
> w.write_all_vectored(&iovecs)?;
> iovecs[1] = IoSlice::new(y);
> w.write_all_vectored(&iovecs)?;
> ```
What I meant was: do you have an example where you actually modify the underlying buffers? If you have the headers prepared (especially if it's just a single buffer), I don't think the overhead of recreating it is too much.
> Also, there are several functions we could add to make this better. For instance, we could add a function from `&mut [IoSliceMut]` to `&[IoSlice]`, so that people can mutate a buffer through the `IoSliceMut`.
>
> FWIW, I could also work with a `write_all_vectored` that requires a `&mut` but guarantees to change the slices back when done. "This uses your IoSlices as scratch space, but will always restore them to an unmodified state afterwards." But we can't make that guarantee about user impls of `Write`, only about our standard implementation; we could only do that if we made this a method that you can't implement yourself (e.g. a sealed `WriteExt` with a blanket impl). I don't think that's worth it.
That could work, especially if the restoring can be optimised away when the compiler can see the buffers aren't used any more. 👍
> People primarily seem to use this for cases like writing from a ringbuffer and handling wraparound (using two iovecs), or writing a header and body.
For what it's worth, I was planning to use it for the case of assembling a text document from potentially thousands of pieces. For my use case, making a big Vec<IoSlice<_>> that I only use for a single call and then drop is exactly what I want.
Accepting a Vec directly, so that it is dropped afterwards, would fit my needs and would eliminate the "wait heck some of my IoSlices got altered?!" problem. It wouldn't be optimal for other cases, though, including the "adding a header" case.
You know, now that I think about it, I'm totally free to just implement that function myself in terms of write_vectored...
How about an owned type that wraps `T: AsRef<[IoSlice<'_>]>` (or `Deref`) to indicate that the write call "owns" the buffers and can modify them, but allows the buffers' wrapper type to be reused (i.e. the `Vec` in `Vec<IoSlice<'_>>`)? It makes the API a bit more complicated, but perhaps that will resolve @joshtriplett's concerns.
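A minimal sketch of that idea (all names hypothetical; `AsMut` is used here rather than the `AsRef` mentioned above, on the assumption that the method needs to advance the slices):

```rust
use std::io::{self, IoSlice, Write};

// Hypothetical owned wrapper: the caller hands over the whole container
// (e.g. a `Vec<IoSlice<'_>>`), the write may scribble over the IoSlices,
// and `into_inner` returns the container (contents unspecified) so its
// allocation can be refilled and reused for the next write.
pub struct OwnedIoSlices<T>(T);

impl<T> OwnedIoSlices<T> {
    pub fn new(bufs: T) -> Self {
        OwnedIoSlices(bufs)
    }

    /// Write all the data, treating the wrapped slices as scratch space.
    pub fn write_all_to<'a, W: Write>(&mut self, w: &mut W) -> io::Result<()>
    where
        T: AsMut<[IoSlice<'a>]>,
    {
        let mut bufs = self.0.as_mut();
        while !bufs.is_empty() {
            match w.write_vectored(bufs) {
                Ok(0) => {
                    return Err(io::Error::new(
                        io::ErrorKind::WriteZero,
                        "failed to write whole buffer",
                    ));
                }
                Ok(n) => IoSlice::advance_slices(&mut bufs, n),
                Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {}
                Err(e) => return Err(e),
            }
        }
        Ok(())
    }

    /// Get the container back (contents unspecified) to refill and reuse.
    pub fn into_inner(self) -> T {
        self.0
    }
}
```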