liburing Add fsync to examples/io

Add fsync to examples/io_uring-cp.c

Open hnakamur opened this issue 4 years ago • 19 comments

First of all, thanks for your work on io_uring and liburing!

I tried examples/io_uring-cp.c and found sometimes the size of the destination file is zero. This pull request fixes this problem.

Signed-off-by: Hiroaki Nakamura [email protected]

May 03 '20 05:05 hnakamur

Thanks - it's s shame that you put the good explanation in this pull request, though, and not in the actual commit itself, as the latter is what people will see when they browse the repo whereas this one is just gone. Can you amend your commit and put the explanation in the commit message itself?

May 03 '20 16:05 axboe

Also, one thing that might be nifty is to use 'write_left' to link the fsync to the last write instead. That's more in tune with what you'd do with io_uring, instead of using an extra system call for the fsync.

May 03 '20 16:05 axboe

Just to be clear, I'm fine with the existing patch, just thought it'd be cool if we linked the final write with the fsync, but it's certainly a bit trickier to get right.

May 03 '20 16:05 axboe

Thanks for your feedback! I've updated the commit message and linked the final write to the fsync. Could you take another look?

May 03 '20 23:05 hnakamur

Can I ask you a question or two? I think this modification assume that the other writes will be completed before the last write will be completed. Is this assumption correct? In other words, the write operations for the same file descriptor will be processed in the submitted order?

How about read operations? When we submit multiple reads for the same file descriptor, will they processed in the submit order and cqes are also in that order? Or the read operations are processed in parallel, and cqes order may be different from the submit order?

May 04 '20 00:05 hnakamur

Order or completions aren’t given, for either writes or reads. But we know it’s the last write, so what you want to do is mark that as a barrier as well. Then it won’t be started until previous writes have completed, and hence we’ll know that the fsync is the last request issued.

On May 3, 2020, at 6:33 PM, Hiroaki Nakamura [email protected] wrote:

Can I ask you a question or two? I think this modification assume that the other writes will be completed before the last write will be completed. Is this assumption correct? In other words, the write operations for the same file descriptor will be processed in the submitted order?

How about read operations? When we submit multiple reads for the same file descriptor, will they processed in the submit order and cqes are also in that order? Or the read operations are processed in parallel, and cqes order may be different from the submit order?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

May 04 '20 02:05 axboe

Thanks for your comment. I'd like to check to make sure I understand correctly.

Though we know it's the last write we submitted, other write operations may be completed after the completion of the last submitted write. So we must wait for all write operations to be completed before kernel processes a fsync operation.

I don't think setting IOSQE_IO_LINK to all write operations works because read and write operations are mixed in one queue in this example.

Then I found you described the exact situation at 5.1 SQE ORDERING in https://kernel.dk/io_uring.pdf.

So the best way to for integrity writes may be using a dedicated queue for writes, submitting all write operations with IOSQE_IO_DRAIN sqe flags, and then submitting a fdatasync or fsync operation at the last.

But what we are going to do if we get -EAGAIN incqe->res or short writes for write operations? I don't think adjusting and requeuing write operations is not ideal since the previously submitted fsync operation may be processed before resubmitted write operations, so we need to submit another fsync operation.

So I think we need to track the total number of completed bytes of write operations and when it becomes the same as insize, then we can submit a fsync operation.

It that correct?

May 04 '20 06:05 hnakamur

You don't need to mark all of them, just the one you care about. So for your final write, you would set IOSQE_IO_DRAIN | IOSQE_IO_LINK, and have the fsync after that.This will guarantee that:

Previously submitted requests are finished before this one is
The linked fsync isn't started until that write is done

and provide the ordering we need.

May 04 '20 13:05 axboe

Thanks for your detail explanation. I think now I understood and I've updated the commit.

May 04 '20 15:05 hnakamur

I think we call the queue_fsync() from the queue_write() handler, and probably pass it in as 'bool link_fsync' or something like that. Then set the correct flags in queue_write() for queue_prepped(), if link_fsync is true.

The reason why I think it should be done there is that we need an sqe to do it. What happens if the io_uring_get_sqe() fails in queue_fsync()? I'd put that logic in queue_write(), and only set DRAIN|LINK if we manage to get an sqe for fsync. If we fail, we should have a global "do_manual_fsync" variable that we then set to true, and do a regular fsync() call for that case.

Normally I would not care so much about getting it absolutely right as it's just a demo case, but people tend to copy paste this kind of code, or use it as a reference. Hence it's important we don't have any weird cases where things just don't work properly.

May 04 '20 16:05 axboe

I agree that getting this example absolutely right is important, because I would like to know how to use liburing correctly by looking at API docs, tests, and examples.

If I understood correctly, io_uring_get_sqe() fails only when the submission queue is full. If we get NULL from io_uring_get_sqe(), retrying io_uring_get_sqe() after calling io_uring_submit never fails like io_uring_get_sqe_safe in liburing4cpp. But I don't understand why io_uring_cq_advance is called here. Since the submission queue and the completion queue is two separate things, I think there is no need to call io_uring_cq_advance in io_uring_get_sqe_safe.

But in this example, we call io_uring_submit for each sqe instead of deferring io_uring_submit and batching multiple sqes to be submitted like in liburing4cpp. So I suppose calling sqe = io_uring_get_sqe(ring) and assert(sqe) is enough.

And I don't quite understand setting DRAIN|LINK for the final write would suffice. Since a short write may happen for the final write, I think we need to submit a fsync after we make sure all the bytes are completed to be written.

May 05 '20 00:05 hnakamur

I think it would be nice if we can have a flag or something to avoid short reads/writes. In other words, with this flag being set, when a short read/write happen, the buffer will be automatically adjusted and retried in the kernel side. I have no idea about whether this can be implemented practically or not though.

May 05 '20 09:05 hnakamur

io_uring_submit will fail if the cq is full

May 05 '20 11:05 CarterLi

@CarterLi Thanks for the pointer. Now I understand why io_uring_cq_advance is called in io_uring_get_sqe_safe.

May 05 '20 12:05 hnakamur

I encountered this exact same issue and was also wondering if fsync could be the issue. Luckily I stumbled upon this conversation, and this PR works perfectly on my machine. Thanks for the great work!

I still have a question, though. Shouldn't write requests be eventually fulfilled, even if they get delayed/buffered?! Does the exit of a process cancel its pending write requests?!

May 06 '20 23:05 mengyouyang

Actually I'm not so certain about short writes. But the code below does handling short writes, I suppose it is needed. https://github.com/axboe/liburing/blob/fe500488190ff39716346ee1c0fe66bde300d0fb/examples/io_uring-cp.c#L197-L204

May 06 '20 23:05 hnakamur

I noticed examples/link-cp.c does not seem to handle short reads/writes and it does not call fsync. However it seems that is copies small files correctly. Why the difference? Does that mean short writes for files never happen?

Then I noticed the detailed explanation at "5.2 LINKED SQES" in http://kernel.dk/io_uring.pdf If short read happen for a read operation with IOSQE_IO_LINK, the chained write operation is canceled with -ECANCELED. And examples/link-cp. does handle that.

For examples/link-cp.c it is easy to set IOSQE_IO_LINK to the final write and add a fsync operation.

May 07 '20 22:05 hnakamur

FYI, I'm writing an easy-to-use wrapper library of liburing for my own usescases. I rewrote this copy example and link-cp.c using my library.

iorn/base-cp.c at master · hnakamur/iorn - read/write version (handling short read/writes)
iorn/vecs-cp.c at master · hnakamur/iorn - readv/writev version (handling short read/writes)
iorn/link-cp.c at master · hnakamur/iorn - IOSQE_IO_LINK version

All three examples use liburing for system calls except ioctl.

May 10 '20 14:05 hnakamur

@hnakamur In my other experiment code, I found if I use O_SYNC in the open() call, then I don't need to submit the final IORING_OP_FSYNC request at all, and the code actually runs faster. But for some reason, this trick doesn't work for this link-cp.c example. Just FYI.

Jun 07 '20 04:06 mengyouyang

liburing liburing copied to clipboard

Add fsync to examples/io_uring-cp.c

liburing
liburing copied to clipboard