liburing icon indicating copy to clipboard operation
liburing copied to clipboard

zero-copy send with IOSQE_CQE_SKIP_SUCCESS is broken

Open illiliti opened this issue 1 year ago • 10 comments

I noticed that zero-copy send can't be used with IOSQE_CQE_SKIP_SUCCESS. It took quite a while to find that out(i thought the problem was on my end), so I think it should be documented at least.

illiliti avatar Mar 13 '23 05:03 illiliti

It's rather not yet supported, I need to add and agree that mans should be adjusted in the meanwhile.

Btw, let me ask what kind of behaviour you was expecting from it considering that there are two CQEs including notifications?

isilence avatar Mar 13 '23 17:03 isilence

I was expecting no CQEs at all. I'm sending static buffer, I don't need notifications.

illiliti avatar Mar 14 '23 01:03 illiliti

Agreed with illiliti, if I used IOSQE_CQE_SKIP_SUCCESS with sendzc, I imagine it would only be for buffers that remain "alive" for the entire duration of the program i.e. so I would not be freeing them upon notif CQEs.

More precisely, if IOSQE_CQE_SKIP_SUCCESS is set: (first CQE) of sendzc should only show up if it failed or if it was a short send (second CQE i.e. notif CQE) should never show up with IOSQE_CQE_SKIP_SUCCESS.

bnbajwa avatar Mar 14 '23 12:03 bnbajwa

It doesn't seem viable apart from benchmarks and highly specific cases. The user should not be rewriting buffers before getting a notification, so it needs a way to know that the buffer is freed. My plan is to:

  1. Make IOSQE_CQE_SKIP_SUCCESS to affect only the first CQE according to the generic rules. That gives consistency.
  2. Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, you would get it roughly with an ACK for TCP.

And allow to freely mix them. How does it sound?

isilence avatar Mar 16 '23 23:03 isilence

The buffer is constant. An example of such buffer is a file #embeded at compile time or mmaped at program startup. It never changes, it needs no way to know if it's freed, thus it need no (from application/user perspective)useless notifications about its state.

@bnbajwa proposal is exactly what is needed, basically make IOSQE_CQE_SKIP_SUCCESS work as it work for normal non-zerocopy send + discard second CQE. Your proposal void the whole point of IOSQE_CQE_SKIP_SUCCESS because user would still have to handle CQE.

illiliti avatar Mar 17 '23 06:03 illiliti

  1. Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, you would get it roughly with an ACK for TCP.

And allow to freely mix them. How does it sound?

If IOSQE_CQE_SKIP_SUCCESS and the send-zc flag can be freely mixed together, why does the send-zc flag need to implicitly skip the completion (1st) CQE? send-zc flag should just copy the return code from completion CQE into the notification CQE — it should not automatically skip the first CQE. If the user wants to to skip the completion (1st) CQE, then they can just add the IOSQE_CQE_SKIP_SUCCESS flag too (i.e. IOSQE_CQE_SKIP_SUCCESS + send-zc).

send-zc flag would be useful, it would simplify my TCP sendzc implementation. This is because — currently — when we get a notif CQE, we can't be sure if we should free the buffer or not as we can get notif CQEs even on failures and short sends — in which case I will want to retry sendzc request with same buffer. So knowing the return code in the notif CQE is important. I've implemented this in userspace by simply copying the return code from 1st CQE into allocated memory which the notif CQE then reads from. So, its simple to implement in userspace, thus, the flag itself is not essential but would be nice.

bnbajwa avatar Mar 17 '23 06:03 bnbajwa

The user should not be rewriting buffers before getting a notification, so it needs a way to know that the buffer is freed.

As with @illiliti, my imagined use case would also be for constant buffers that are never modified e.g. a loaded file in memory.

  1. Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, you would get it roughly with an ACK for TCP.

I've been thinking about this further, is your proposal that the completion (1st) CQE would always be skipped with this flag even for failures/short sends? So essentially, it would turn sendzc from a 2 CQE request into an always 1 CQE request? If so, seems to be reasonable but not sure if it would be widely applicable.

It doesn't seem viable apart from benchmarks and highly specific cases.

I don't think any of the proposals in this thread for IOSQE_CQE_SKIP_SUCCESS are general enough to be honest. They are maybe applicable to quite specific cases only.

Make IOSQE_CQE_SKIP_SUCCESS to affect only the first CQE according to the generic rules. That gives consistency.

I think IOSQE_CQE_SKIP_SUCCESS affecting only the first CQE seems fine/reasonable. However, I think we do need a flag to allow the 2nd notif CQE to be skipped (always?) as well.

Being able to control exactly which CQEs to skip seems like a more general proposal. i.e. we should be able to skip any of:

  • first CQE only
  • second CQE only
  • both CQEs

For example, suppose I had a constant buffer loaded in memory that I wanted to send in chunks. I would want to skip second CQE because I won't be modifying the buffer. However, I would want to receive the first CQE so I immediately know when I can send the next chunk (high throughput).

bnbajwa avatar Mar 17 '23 07:03 bnbajwa

The buffer is constant. An example of such buffer is a file #embeded at compile time or mmaped at program startup. It never changes, it needs no way to know if it's freed, thus it need no (from application/user perspective)useless notifications about its state.

Again, that's too limited of a use case and more seems like a benchmark to me. We're not going to cater it to one very specific case.

@bnbajwa proposal is exactly what is needed, basically make IOSQE_CQE_SKIP_SUCCESS work as it work for normal non-zerocopy send + discard second CQE. Your proposal void the whole point of IOSQE_CQE_SKIP_SUCCESS because user would still have to handle CQE.

How so? Would you elaborate? Set two of them and you get a usual IOSQE_CQE_SKIP_SUCCESS behaviour, one CQE on failure or none if succeed, which is, to be honest, pretty useless unless linked with other requests.

isilence avatar Mar 17 '23 12:03 isilence

  1. Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, you would get it roughly with an ACK for TCP.

And allow to freely mix them. How does it sound?

If IOSQE_CQE_SKIP_SUCCESS and the send-zc flag can be freely mixed together, why does the send-zc flag need to implicitly skip the completion (1st) CQE? send-zc flag should just copy the return code from completion CQE into the notification CQE — it should not automatically skip the first CQE. If the user wants to to skip the completion (1st) CQE, then they can just add the IOSQE_CQE_SKIP_SUCCESS flag too (i.e. IOSQE_CQE_SKIP_SUCCESS + send-zc).

It seems you don't understand the idea. You can also think of it as combining two CQEs into one, which will be posted at a time when previously the notification would've been posted.

send-zc flag would be useful, it would simplify my TCP sendzc implementation. This is because — currently — when we get a notif CQE, we can't be sure if we should free the buffer or not as we can get notif CQEs even on failures and short sends — in which case I will want to retry sendzc request with same buffer. So knowing the return code in the notif CQE is important. I've implemented this in userspace by simply copying the return code from 1st CQE into allocated memory which the notif CQE then reads from. So, its simple to implement in userspace, thus, the flag itself is not essential but would be nice.

A notification CQE is posted strictly after the corresponding completion CQEs. If it's not the case, please report a bug.

isilence avatar Mar 17 '23 12:03 isilence

The user should not be rewriting buffers before getting a notification, so it needs a way to know that the buffer is freed.

As with @illiliti, my imagined use case would also be for constant buffers that are never modified e.g. a loaded file in memory.

  1. Add a send-zc flag that would copy the return code into the notification CQE and skip the completion (1st) CQE. IOW, you would get it roughly with an ACK for TCP.

I've been thinking about this further, is your proposal that the completion (1st) CQE would always be skipped with this flag even for failures/short sends? So essentially, it would turn sendzc from a 2 CQE request into an always 1 CQE request? If so, seems to be reasonable but not sure if it would be widely applicable.

No, the idea is to combine two into one, and then you can apply generic IOSQE_CQE_SKIP_SUCCESS rules on top of that one, i.e. CQE on error or none otherwise.

It doesn't seem viable apart from benchmarks and highly specific cases.

I don't think any of the proposals in this thread for IOSQE_CQE_SKIP_SUCCESS are general enough to be honest. They are maybe applicable to quite specific cases only.

Make IOSQE_CQE_SKIP_SUCCESS to affect only the first CQE according to the generic rules. That gives consistency.

I think IOSQE_CQE_SKIP_SUCCESS affecting only the first CQE seems fine/reasonable. However, I think we do need a flag to allow the 2nd notif CQE to be skipped (always?) as well.

Being able to control exactly which CQEs to skip seems like a more general proposal. i.e. we should be able to skip any of:

* first CQE only

* second CQE only

* both CQEs

For example, suppose I had a constant buffer loaded in memory that I wanted to send in chunks. I would want to skip second CQE because I won't be modifying the buffer. However, I would want to receive the first CQE so I immediately know when I can send the next chunk (high throughput).

In general case the user need to do sth with the buffer afterward (free / reuse) and so should have a way to infer the state of the buffer. otherwise it can't be reused or even normally freed but only munmap()'ed at the end. Just skipping is not going to work.

isilence avatar Mar 17 '23 13:03 isilence