liburing
Add fsync to examples/io_uring-cp.c
First of all, thanks for your work on io_uring and liburing!
I tried examples/io_uring-cp.c and found sometimes the size of the destination file is zero. This pull request fixes this problem.
Signed-off-by: Hiroaki Nakamura [email protected]
Thanks - it's a shame that you put the good explanation in this pull request, though, and not in the actual commit itself, as the latter is what people will see when they browse the repo, whereas this one is just gone. Can you amend your commit and put the explanation in the commit message itself?
Also, one thing that might be nifty is to use 'write_left' to link the fsync to the last write instead. That's more in tune with what you'd do with io_uring, instead of using an extra system call for the fsync.
Just to be clear, I'm fine with the existing patch, just thought it'd be cool if we linked the final write with the fsync, but it's certainly a bit trickier to get right.
Thanks for your feedback! I've updated the commit message and linked the final write to the fsync. Could you take another look?
Can I ask you a question or two? I think this modification assumes that the other writes will complete before the last write completes. Is this assumption correct? In other words, are write operations for the same file descriptor processed in the submitted order?
How about read operations? When we submit multiple reads for the same file descriptor, will they be processed in the submitted order, with cqes arriving in that order as well? Or are the read operations processed in parallel, so the cqe order may differ from the submit order?
Order of completions isn't given, for either writes or reads. But we know it's the last write, so what you want to do is mark that as a barrier as well. Then it won't be started until previous writes have completed, and hence we'll know that the fsync is the last request issued.
Thanks for your comment. I'd like to check that I understand correctly.
Though we know it's the last write we submitted, other write operations may complete after the completion of that last submitted write. So we must wait for all write operations to complete before the kernel processes an fsync operation.
I don't think setting IOSQE_IO_LINK on all write operations works, because read and write operations are mixed in one queue in this example.
Then I found you described this exact situation in "5.1 SQE ORDERING" in https://kernel.dk/io_uring.pdf.
So the best way for integrity writes may be to use a dedicated queue for writes, submit all write operations with the IOSQE_IO_DRAIN sqe flag, and then submit an fdatasync or fsync operation last.
But what are we going to do if we get -EAGAIN in cqe->res, or short writes, for write operations? Adjusting and requeuing write operations is not ideal, since the previously submitted fsync operation may be processed before the resubmitted write operations, so we would need to submit another fsync operation.
So I think we need to track the total number of completed bytes of write operations, and when it becomes equal to insize, we can submit an fsync operation.
Is that correct?
You don't need to mark all of them, just the one you care about. So for your final write, you would set IOSQE_IO_DRAIN | IOSQE_IO_LINK, and have the fsync after that. This will guarantee that:
- Previously submitted requests are finished before this one is
- The linked fsync isn't started until that write is done
and provide the ordering we need.
Thanks for your detailed explanation. I think I understand now, and I've updated the commit.
I think we should call queue_fsync() from the queue_write() handler, and probably pass it in as 'bool link_fsync' or something like that. Then set the correct flags in queue_write() for queue_prepped(), if link_fsync is true.
The reason why I think it should be done there is that we need an sqe to do it. What happens if the io_uring_get_sqe() fails in queue_fsync()? I'd put that logic in queue_write(), and only set DRAIN|LINK if we manage to get an sqe for fsync. If we fail, we should have a global "do_manual_fsync" variable that we then set to true, and do a regular fsync() call for that case.
Normally I would not care so much about getting it absolutely right as it's just a demo case, but people tend to copy paste this kind of code, or use it as a reference. Hence it's important we don't have any weird cases where things just don't work properly.
I agree that getting this example absolutely right is important, because I would like to know how to use liburing correctly by looking at API docs, tests, and examples.
If I understood correctly, io_uring_get_sqe() fails only when the submission queue is full. If we get NULL from io_uring_get_sqe(), retrying io_uring_get_sqe() after calling io_uring_submit() never fails, like io_uring_get_sqe_safe in liburing4cpp.
But I don't understand why io_uring_cq_advance is called there. Since the submission queue and the completion queue are two separate things, I think there is no need to call io_uring_cq_advance in io_uring_get_sqe_safe.
But in this example, we call io_uring_submit for each sqe instead of deferring io_uring_submit and batching multiple sqes to be submitted, like in liburing4cpp. So I suppose calling sqe = io_uring_get_sqe(ring) and assert(sqe) is enough.
And I don't quite understand why setting DRAIN|LINK for the final write would suffice. Since a short write may happen for the final write, I think we need to submit an fsync after we make sure all the bytes have been written.
I think it would be nice if we could have a flag or something to avoid short reads/writes. In other words, with this flag set, when a short read/write happens, the buffer would be automatically adjusted and the operation retried on the kernel side. I have no idea whether this can be implemented practically, though.
@CarterLi Thanks for the pointer. Now I understand why io_uring_cq_advance is called in io_uring_get_sqe_safe.
I encountered this exact same issue and was also wondering if fsync could be the issue. Luckily I stumbled upon this conversation, and this PR works perfectly on my machine. Thanks for the great work!
I still have a question, though. Shouldn't write requests be eventually fulfilled, even if they get delayed/buffered?! Does the exit of a process cancel its pending write requests?!
Actually I'm not so certain about short writes. But the code below does handle short writes, so I suppose it is needed. https://github.com/axboe/liburing/blob/fe500488190ff39716346ee1c0fe66bde300d0fb/examples/io_uring-cp.c#L197-L204
I noticed examples/link-cp.c does not seem to handle short reads/writes, and it does not call fsync. However, it seems that it copies small files correctly. Why the difference? Does that mean short writes for files never happen?
Then I noticed the detailed explanation in "5.2 LINKED SQES" in http://kernel.dk/io_uring.pdf. If a short read happens for a read operation with IOSQE_IO_LINK, the chained write operation is canceled with -ECANCELED. And examples/link-cp.c does handle that.
For examples/link-cp.c it would be easy to set IOSQE_IO_LINK on the final write and add an fsync operation.
FYI, I'm writing an easy-to-use wrapper library of liburing for my own use cases. I rewrote this copy example and link-cp.c using my library.
- iorn/base-cp.c at master · hnakamur/iorn - read/write version (handling short reads/writes)
- iorn/vecs-cp.c at master · hnakamur/iorn - readv/writev version (handling short reads/writes)
- iorn/link-cp.c at master · hnakamur/iorn - IOSQE_IO_LINK version
All three examples use liburing for system calls except ioctl.
@hnakamur In my other experiment code, I found that if I use O_SYNC in the open() call, then I don't need to submit the final IORING_OP_FSYNC request at all, and the code actually runs faster. But for some reason, this trick doesn't work for this link-cp.c example. Just FYI.