tokio-uring
possibly deferring work in the creation of an operation
Currently, some actual uring work is done when an Op&lt;T&gt; is created in the submit_with function, as seen here: https://github.com/tokio-rs/tokio-uring/blob/913aa14c567cfae026b7d4f88a5155921cfb9575/src/driver/op.rs#L61-L91
So a uring submission queue entry is created and added to the actual uring shared with the kernel when submit_with is called. Only waiting for the response is deferred, accomplished by an await on the returned result.
Given the discussion in #115, I wanted to ask if it is really desirable to do this work at creation time?
An alternative would be to return a Future that would first perform the submission and then await the response.
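To make the alternative concrete, here is a minimal, self-contained sketch of a future that builds its entry at construction but only pushes it to the submission queue on first poll. Everything here (`SQUEUE`, `LazyOp`, `noop_waker`) is a mock invented for illustration, not the real tokio-uring driver:

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Mock submission queue; a stand-in for the real shared ring.
thread_local! {
    static SQUEUE: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

// States of a lazily submitted operation.
enum LazyOp {
    // Entry built at construction, but not yet pushed to the squeue.
    Unsubmitted { entry: &'static str },
    // Entry pushed; a real driver would now wait for the CQE.
    InFlight,
    Done,
}

impl Future for LazyOp {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        let me = self.get_mut();
        match std::mem::replace(me, LazyOp::Done) {
            LazyOp::Unsubmitted { entry } => {
                // Submission happens on first poll, not at creation.
                SQUEUE.with(|q| q.borrow_mut().push(entry));
                *me = LazyOp::InFlight;
                Poll::Pending
            }
            // Completion is simulated: the second poll just resolves.
            LazyOp::InFlight => Poll::Ready(()),
            LazyOp::Done => panic!("polled after completion"),
        }
    }
}

// Minimal no-op waker so the example can poll by hand without an executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    // Creating and dropping the future never touches the queue.
    let op = LazyOp::Unsubmitted { entry: "read" };
    drop(op);
    assert_eq!(SQUEUE.with(|q| q.borrow().len()), 0);

    // Only the first poll pushes the entry.
    let mut op = LazyOp::Unsubmitted { entry: "read" };
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    assert!(Pin::new(&mut op).poll(&mut cx).is_pending());
    assert_eq!(SQUEUE.with(|q| q.borrow().len()), 1);
    assert!(Pin::new(&mut op).poll(&mut cx).is_ready());
}
```

One consequence worth noting: in this shape a future dropped before its first poll is free, but the fire-and-forget pattern (e.g. issuing a close from a drop impl) then needs an explicit "submit now" escape hatch.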
I ask because I can see the design working either way, and this seems as good a place as any to document why the crate design in this regard is what it is.
Having read through what the try_join! and select! macros are doing, and seeing how a good crate documents what is cancel safe, this seemed like the right time to ask and get the reason down in writing.
I don't see how we can make the uring operations cancel safe anyway; operations where the kernel has begun reading data seem inherently not cancel safe if waiting for the result is canceled (in select! terminology). So I think I like the design the way it is, but I'm wondering what others see as the tradeoffs.
@FrankReh I've also been thinking about this. I think deferred work is the way to go, partly because this is how a Future is described, and we don't currently match that behavior:

> Futures alone are inert; they must be actively polled to make progress, meaning that each time the current task is woken up, it should actively re-poll pending futures that it still has an interest in.
@ollie-etl I'm not disagreeing with you but asking for a clarification. Are you saying you think the function that returns the future for getting a response shouldn't be putting an entry in the squeue yet? We have both seen that adding an sqe to the squeue can run into the problem of the squeue needing to be submitted because it is full and that means a syscall which has to block long enough to drain the squeue. (I don't mean block in a bad way here, the current thread is going to have to make the syscall, whether it is when the operation is created or when it is first waited on, anyway.)
I just want to make sure you are leaning towards more work being done the first time the future is polled, rather than having some of the work done before the poll.
I'm almost on the fence on this but do like that the future could be dropped and nothing would have been put into the submission queue. Except ... I also like the ability at the moment to get an operation sent to the uring interface without having to wait for it - that comes in handy in drop functions when we want to use the uring interface to close or shutdown something.
Operations are likely to get more options, perhaps with a builder api of their own. Perhaps that would enable the option of directly adding to the submission queue in some cases but not in others.
@FrankReh yes, I think the work should only be put on the squeue on poll - I think that's more in keeping with the spirit of futures. I think that is perhaps least surprising for users too.
I briefly started implementing this to see what it'd look like, but haven't got very far yet
@ollie-etl Perhaps orthogonal to what you're doing, but I wanted to make sure you were aware that it is possible the kernel is not being informed of entries added to the ring in a timely manner, because the submission sync isn't called for every entry, or even for every 'n' entries. But I'm thinking of it like a flush, and I don't know how long a cache line is allowed to differ on two CPUs without the stale one being replaced.
So, what I'd like from an API perspective is to have oneshot futures first submit an op and then return a future referencing the in-flight op which provides support for operations like cancellation.
I'd also like to have something like a builder API for constructing an op and operating on it pre-submission. I've been meaning to get around to implementing both.
@Noah-Kennedy: The behaviour of your benchmark in #151 is exactly what I meant by trying to keep to the principle of "least surprise". Your benchmark submits almost all operations prior to the benchmark measurement.
I think a builder API is very useful, not least because I'd like to be able to link submissions, but I maintain that "futures do nothing until polled" is the only sane behavior. It allows us, both in benchmarks and in reality, to define operations outside of the main execution loop.
I'm going to be opening a PR hopefully tomorrow to restructure how we do op submissions. I'm actually thinking that it would be better to make the transition to in-flight explicit rather than implicit, as is proposed here. It turns out that the change proposed here actually makes reasoning about or implementing linked SQEs kinda hard.
I'm not sure I agree. I too have been thinking about this (#165), and although I haven't got round to linking SQEs yet, I have given it some consideration and don't see the blockers. Maybe they're the unknown unknowns.
I'm toying around with things now, and this is considerably more nuanced than I had thought. Let's postpone any work on this until after SQE linking is supported.
I'm working on this right now, and hoping to get this in soon:tm:.
@Noah-Kennedy It's maybe not fundamental, but it's quite central to the implementation approach I'm toying with for linking.
@oliverbunting I think we should discuss the SQE linking then, not this.
I think deferred Ops are actually more important. That is because I think the behavior of the current API is pretty close to being a bug: it's certainly not intuitive, and goes against the documented behavior for implementations of the Future trait, whereas linking SQEs is a feature.
@Noah-Kennedy See #169
@FrankReh @oliverbunting I'd actually like us to go another route: towards more granular control of when an op is in-flight. I think this approach here has a couple of issues that more granular control solves better.
First, deferring work to time of poll produces inconsistent cancellation behavior. Say that a user creates a future to send data over a socket. If you submit the op when constructing the future, you have predictable drop/cancellation behavior: you know that the write will either occur or error if you drop the future. If you defer that work, you have no predictable behavior here. This produces a footgun and a source of confusion.
Second, being able to control when an Op hits the queue is super useful for optimization. I'd really like to be able to submit an op and get it in the queue before I spawn off a task to handle the response and do further work, for instance. This allows you to save round trips through the scheduler and can actually be a pretty significant optimization under certain benchmarks.
If we give the user more granular control, we can solve these issues and open up a lot of interesting patterns without really sacrificing much:
```rust
/// Constructed but unsubmitted op
pub struct Unsubmitted<T> { ... }

/// Operation which has been pushed to the squeue
pub struct InFlightSingleShot<T> { ... }

/// Example read implementation.
///
/// This constructs an unsubmitted read op. The user explicitly controls when
/// this gets submitted, which is useful for op linking and other
/// pre-submission operations.
///
/// Note: for simplicity's sake I'm ignoring IoBuf
pub fn read(&self, buf: Vec<u8>) -> Unsubmitted<Vec<u8>> {
    // Build the op, providing a custom vtable, either via a 'static trait
    // object or a custom vtable like in Waker. This vtable basically just
    // specifies what post-processing to do on the returned data at
    // completion.
    build_op(&READ_VTABLE, buf)
}

impl<T> Unsubmitted<T> {
    // There would be other fns that let you do things like build chains;
    // unfortunately, I'm still working out the precise details here.

    // TODO: does this return io::Result? This should handle any reasonable
    // transient failure, or flush the squeue and receive completions if
    // needed. There are other, more permanent failure causes, but I'm not
    // sure if those should be a panic or an error.
    fn submit(self) -> InFlightSingleShot<T>;
}

impl<T> Future for InFlightSingleShot<T> {
    type Output = io::Result<T>;
    ...
}

// Still not sure if I like having IntoFuture here.
impl<T> IntoFuture for Unsubmitted<T> {
    type Output = io::Result<T>;
    type IntoFuture = InFlightSingleShot<T>;
    ...
}
```
I have yet to fully get this implemented in my PR unfortunately but I have largely solidified an API. I think that this is probably the middle ground that we want. It removes the footgun by making things much more explicit, and is kinda required to do SQE linking.
@Noah-Kennedy wrt SQE linking, I think that's fundamentally wrong. I'll finish a draft PR tonight which proves this. It won't be pretty, but it will show my point.
@Noah-Kennedy you'll note I have provided an override for immediate submission, although that is explicit
@oliverbunting can you give me a short explanation of how this is wrong? I know that there are ways to do this entirely in-driver, but everything that I've messed with there led to a huge increase in complexity and a reduction in performance.
@Noah-Kennedy Sure, let's keep things abstract to start with, although #165 is a concrete example. First let's consider deferred work only, not linking SQEs.
Implementing this is quite straightforward. At Op creation, we do work to create an sqe. We do this regardless of whether we defer work or not. To defer work until poll, we must simply store it as pending until polled, at which point it gets entered into the queue. As we were doing this work anyway, the cost is approximately identical; indeed, I have benchmarks in #165 to prove this. For use cases which are approximately

```rust
Op::create_an_op().await
```

I strongly suspect the compiler does basically the same thing in both cases.
On to #165 in particular: the above is what we do, with the addition of enqueue(), which is the manual override for "I want to submit this right away".
We can now front-load computations like your benchmark in #151, giving us the option to move work out of the critical section, giving a performance boost in the region of interest:
```rust
// pre-compute work
let mut js = JoinSet::new();
for _ in 0..opts.iterations {
    js.spawn_local(tokio_uring::no_op());
}

// timing-critical region
while let Some(res) = js.join_next().await {
    res.unwrap().unwrap();
}
```
Linking SQEs is orthogonal to the problem; I think with a builder API you could do it with or without deferred work. I'm going to explain with deferred work, because that is what I have considered.
My approach basically boils down to having a new Op type, `Op<Link<A, B>, LinkMarker>`, which we can create from any Op builder (`Send`, `Accept`, `Connect`, etc.). When this is polled, we enqueue both ops and poll the first. When it completes, we return `(<A as Future>::Output, Op<B, _>)`.
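For illustration only, that linked-op shape can be sketched with plain futures. The `Link` struct, its field names, and `noop_waker` are invented here; a real implementation would also push both SQEs up front, with io_uring's link flag set on the first:

```rust
use std::future::{ready, Future};
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Invented sketch: polling drives the first op, and its output is returned
// together with the second, still-pending op for the caller to await.
struct Link<A, B> {
    first: A,
    second: Option<B>,
}

impl<A, B> Future for Link<A, B>
where
    A: Future + Unpin,
    B: Unpin,
{
    type Output = (A::Output, B);

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let me = self.get_mut();
        match Pin::new(&mut me.first).poll(cx) {
            Poll::Ready(a) => {
                let b = me.second.take().expect("polled after completion");
                Poll::Ready((a, b))
            }
            Poll::Pending => Poll::Pending,
        }
    }
}

// Minimal no-op waker for polling by hand without an executor.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    // Both "ops" are trivially ready futures standing in for real Ops.
    let mut link = Link { first: ready(1u32), second: Some(ready(2u32)) };
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    match Pin::new(&mut link).poll(&mut cx) {
        Poll::Ready((a, second)) => {
            assert_eq!(a, 1);
            // The caller is handed `second` to await on its own.
            drop(second);
        }
        Poll::Pending => panic!("first op was ready"),
    }
}
```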
Let's slow down a bit. What does this change make easier or less dangerous? It still isn't clear to me that this actually improves anything for the complexity that it adds to future work. I'd like us to try to model the semantics of a uring operation as closely as we can, and this really just seems to obfuscate the underlying semantics of uring.
I really can't see any cases that this helps, other than the very contrived situation with that benchmark. If you can show me situations where this improves things, I'd be a lot more onboard here. At present, this really just seems to add some indirection to things. What does this improve or allow us to improve, either for us as maintainers, or for users?
@FrankReh I'm less concerned about matching io_uring, and more concerned that our futures violate the published behaviour of Futures. As I've at least partially demonstrated, it doesn't hurt performance
This is the behavior of futures, but it isn't a general invariant of async APIs, and in fact it's actually pretty common not to do this. Futures themselves do no work unless polled, largely because of the readiness model they follow.
In fact, it's actually super common, for reasons of efficiency or API requirements, to do work up-front and return a future that tracks the progress of an in-flight operation. For example, IIRC reqwest does this by putting a request in flight and returning a future which tracks its progress: https://docs.rs/reqwest/latest/src/reqwest/async_impl/client.rs.html#1453
Completion-based stuff seems to be a common case where people don't defer work until polling.
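As a contrast with the lazy-future shape, here is a hypothetical stand-in for that "submit first, track later" pattern, using a thread and a channel in place of an event loop. `start_request` is an invented name, not a real reqwest API:

```rust
use std::sync::mpsc;
use std::thread;

// The work starts eagerly when the call is made; the returned handle only
// observes its progress, much like a future tracking an in-flight request.
fn start_request(payload: &'static str) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    // The "request" is in flight before the caller ever looks at the handle.
    thread::spawn(move || {
        let _ = tx.send(format!("response to {payload}"));
    });
    rx
}

fn main() {
    let pending = start_request("GET /");
    // Blocking on the receiver is analogous to awaiting the tracking future;
    // the work proceeded regardless of whether we ever did this.
    let response = pending.recv().unwrap();
    assert_eq!(response, "response to GET /");
}
```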
@Noah-Kennedy I think with a bit of reinterpretation, your and my approaches actually have a lot in common
My approach rather has 3 states, not 2:

Builder -> Deferred -> In flight
The way I'm approaching Builder in the first instance is to use the Op struct as the builder. By that, I mean the `Op<THIS_OP_TYPE, _>`. Those structs already have all the info required to build an operation.
I'll give an example for fsync, although I'll have a PR ready shortly also. I thought I'd do this first, as links are a short step from this
```rust
pub trait Buildable
where
    Self: 'static + Sized + Completable,
{
    /// The CqeType which results from submission
    type CqeType; // = SingleCQE; // If default associated types were stable

    /// The operation to consume the builder and create the sqe
    fn create_sqe(&mut self) -> squeue::Entry;

    /// Was submit_with in `Op`
    fn submit(mut self) -> io::Result<Op<Self, <Self as Buildable>::CqeType>> {
        ...
    }
}

pub(crate) struct Fsync {
    fd: SharedFd,
    flags: Option<types::FsyncFlags>,
}

impl Op<Fsync> {
    pub(crate) fn fsync(fd: &SharedFd) -> io::Result<Op<Fsync>> {
        Fsync { fd: fd.clone(), flags: None }.submit()
    }

    pub(crate) fn datasync(fd: &SharedFd) -> io::Result<Op<Fsync>> {
        Fsync { fd: fd.clone(), flags: Some(types::FsyncFlags::DATASYNC) }.submit()
    }
}

impl Buildable for Fsync
where
    Self: 'static + Sized,
{
    type CqeType = op::SingleCQE;

    fn create_sqe(&mut self) -> io_uring::squeue::Entry {
        let mut opcode = opcode::Fsync::new(types::Fd(self.fd.raw_fd()));
        if let Some(flags) = self.flags {
            opcode = opcode.flags(flags);
        }
        opcode.build()
    }
}
```
@ollie-etl I'd agree, it seems like we were talking past each other in a lot of ways. I don't think we necessarily disagree on the core issues with the current API and on what largely needs to be done. I'll see if I can get up my draft PR with my proposed API changes tonight (my time, not yours), so that you can take a look. If I trim out anything related to SQE linking or async cancellation, I should be able to land it pretty quickly.
Likewise, I'll get my stuff up. I'd be very surprised if either approach is right first time out. Let's get our PRs up and take some time to review. I suspect we'll come to a hybrid, which is usually the way of things.
Yeah, my thoughts as well. I think a builder is actually a logical second step once the "constructed and linkable" and "in flight" stages are in place. The "Unsubmitted/Prepared" could be obtained via the builder or via a call to a function like read, and I think it's easier to omit the builder initially and add it in a follow-up.
It will be interesting to compare the two approaches. I suspect we'll pull my phases in with your builder.
I, as usual, have gone the other way. I was thinking that if I had a builder, linking to the next op is a logical way of consuming a builder