liburing icon indicating copy to clipboard operation
liburing copied to clipboard

Add fsync to examples/io_uring-cp.c

Open hnakamur opened this issue 4 years ago • 19 comments

First of all, thanks for your work on io_uring and liburing!

I tried examples/io_uring-cp.c and found sometimes the size of the destination file is zero. This pull request fixes this problem.

Signed-off-by: Hiroaki Nakamura [email protected]

hnakamur avatar May 03 '20 05:05 hnakamur

Thanks - it's s shame that you put the good explanation in this pull request, though, and not in the actual commit itself, as the latter is what people will see when they browse the repo whereas this one is just gone. Can you amend your commit and put the explanation in the commit message itself?

axboe avatar May 03 '20 16:05 axboe

Also, one thing that might be nifty is to use 'write_left' to link the fsync to the last write instead. That's more in tune with what you'd do with io_uring, instead of using an extra system call for the fsync.

axboe avatar May 03 '20 16:05 axboe

Just to be clear, I'm fine with the existing patch, just thought it'd be cool if we linked the final write with the fsync, but it's certainly a bit trickier to get right.

axboe avatar May 03 '20 16:05 axboe

Thanks for your feedback! I've updated the commit message and linked the final write to the fsync. Could you take another look?

hnakamur avatar May 03 '20 23:05 hnakamur

Can I ask you a question or two? I think this modification assume that the other writes will be completed before the last write will be completed. Is this assumption correct? In other words, the write operations for the same file descriptor will be processed in the submitted order?

How about read operations? When we submit multiple reads for the same file descriptor, will they processed in the submit order and cqes are also in that order? Or the read operations are processed in parallel, and cqes order may be different from the submit order?

hnakamur avatar May 04 '20 00:05 hnakamur

Order or completions aren’t given, for either writes or reads. But we know it’s the last write, so what you want to do is mark that as a barrier as well. Then it won’t be started until previous writes have completed, and hence we’ll know that the fsync is the last request issued.

On May 3, 2020, at 6:33 PM, Hiroaki Nakamura [email protected] wrote:

 Can I ask you a question or two? I think this modification assume that the other writes will be completed before the last write will be completed. Is this assumption correct? In other words, the write operations for the same file descriptor will be processed in the submitted order?

How about read operations? When we submit multiple reads for the same file descriptor, will they processed in the submit order and cqes are also in that order? Or the read operations are processed in parallel, and cqes order may be different from the submit order?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

axboe avatar May 04 '20 02:05 axboe

Thanks for your comment. I'd like to check to make sure I understand correctly.

Though we know it's the last write we submitted, other write operations may be completed after the completion of the last submitted write. So we must wait for all write operations to be completed before kernel processes a fsync operation.

I don't think setting IOSQE_IO_LINK to all write operations works because read and write operations are mixed in one queue in this example.

Then I found you described the exact situation at 5.1 SQE ORDERING in https://kernel.dk/io_uring.pdf.

So the best way to for integrity writes may be using a dedicated queue for writes, submitting all write operations with IOSQE_IO_DRAIN sqe flags, and then submitting a fdatasync or fsync operation at the last.

But what we are going to do if we get -EAGAIN incqe->res or short writes for write operations? I don't think adjusting and requeuing write operations is not ideal since the previously submitted fsync operation may be processed before resubmitted write operations, so we need to submit another fsync operation.

So I think we need to track the total number of completed bytes of write operations and when it becomes the same as insize, then we can submit a fsync operation.

It that correct?

hnakamur avatar May 04 '20 06:05 hnakamur

You don't need to mark all of them, just the one you care about. So for your final write, you would set IOSQE_IO_DRAIN | IOSQE_IO_LINK, and have the fsync after that.This will guarantee that:

  1. Previously submitted requests are finished before this one is
  2. The linked fsync isn't started until that write is done

and provide the ordering we need.

axboe avatar May 04 '20 13:05 axboe

Thanks for your detail explanation. I think now I understood and I've updated the commit.

hnakamur avatar May 04 '20 15:05 hnakamur

I think we call the queue_fsync() from the queue_write() handler, and probably pass it in as 'bool link_fsync' or something like that. Then set the correct flags in queue_write() for queue_prepped(), if link_fsync is true.

The reason why I think it should be done there is that we need an sqe to do it. What happens if the io_uring_get_sqe() fails in queue_fsync()? I'd put that logic in queue_write(), and only set DRAIN|LINK if we manage to get an sqe for fsync. If we fail, we should have a global "do_manual_fsync" variable that we then set to true, and do a regular fsync() call for that case.

Normally I would not care so much about getting it absolutely right as it's just a demo case, but people tend to copy paste this kind of code, or use it as a reference. Hence it's important we don't have any weird cases where things just don't work properly.

axboe avatar May 04 '20 16:05 axboe

I agree that getting this example absolutely right is important, because I would like to know how to use liburing correctly by looking at API docs, tests, and examples.

If I understood correctly, io_uring_get_sqe() fails only when the submission queue is full. If we get NULL from io_uring_get_sqe(), retrying io_uring_get_sqe() after calling io_uring_submit never fails like io_uring_get_sqe_safe in liburing4cpp. But I don't understand why io_uring_cq_advance is called here. Since the submission queue and the completion queue is two separate things, I think there is no need to call io_uring_cq_advance in io_uring_get_sqe_safe.

But in this example, we call io_uring_submit for each sqe instead of deferring io_uring_submit and batching multiple sqes to be submitted like in liburing4cpp. So I suppose calling sqe = io_uring_get_sqe(ring) and assert(sqe) is enough.

And I don't quite understand setting DRAIN|LINK for the final write would suffice. Since a short write may happen for the final write, I think we need to submit a fsync after we make sure all the bytes are completed to be written.

hnakamur avatar May 05 '20 00:05 hnakamur

I think it would be nice if we can have a flag or something to avoid short reads/writes. In other words, with this flag being set, when a short read/write happen, the buffer will be automatically adjusted and retried in the kernel side. I have no idea about whether this can be implemented practically or not though.

hnakamur avatar May 05 '20 09:05 hnakamur

@CarterLi Thanks for the pointer. Now I understand why io_uring_cq_advance is called in io_uring_get_sqe_safe.

hnakamur avatar May 05 '20 12:05 hnakamur

I encountered this exact same issue and was also wondering if fsync could be the issue. Luckily I stumbled upon this conversation, and this PR works perfectly on my machine. Thanks for the great work!

I still have a question, though. Shouldn't write requests be eventually fulfilled, even if they get delayed/buffered?! Does the exit of a process cancel its pending write requests?!

mengyouyang avatar May 06 '20 23:05 mengyouyang

Actually I'm not so certain about short writes. But the code below does handling short writes, I suppose it is needed. https://github.com/axboe/liburing/blob/fe500488190ff39716346ee1c0fe66bde300d0fb/examples/io_uring-cp.c#L197-L204

hnakamur avatar May 06 '20 23:05 hnakamur

I noticed examples/link-cp.c does not seem to handle short reads/writes and it does not call fsync. However it seems that is copies small files correctly. Why the difference? Does that mean short writes for files never happen?

Then I noticed the detailed explanation at "5.2 LINKED SQES" in http://kernel.dk/io_uring.pdf If short read happen for a read operation with IOSQE_IO_LINK, the chained write operation is canceled with -ECANCELED. And examples/link-cp. does handle that.

For examples/link-cp.c it is easy to set IOSQE_IO_LINK to the final write and add a fsync operation.

hnakamur avatar May 07 '20 22:05 hnakamur

FYI, I'm writing an easy-to-use wrapper library of liburing for my own usescases. I rewrote this copy example and link-cp.c using my library.

All three examples use liburing for system calls except ioctl.

hnakamur avatar May 10 '20 14:05 hnakamur

@hnakamur In my other experiment code, I found if I use O_SYNC in the open() call, then I don't need to submit the final IORING_OP_FSYNC request at all, and the code actually runs faster. But for some reason, this trick doesn't work for this link-cp.c example. Just FYI.

mengyouyang avatar Jun 07 '20 04:06 mengyouyang