Allow pipeline offloading to disk
The problem
If a client dispatches a lot of pipeline commands before it attempts to read any of the responses back, it might deadlock itself. Why? The AsyncFb will eventually block on a command response because the socket's send buffers are saturated (the client is not reading). This halts command execution on the AsyncFb because the fiber is blocked on I/O. The ConnFb will keep reading from the socket and dispatching requests asynchronously to the AsyncFb until it eventually saturates as well. At this point, neither the ConnFb nor the AsyncFb is making progress, and the client is stuck sending more pipeline commands. The client will not start reading until it has sent all of its pipeline requests, and Dragonfly will not read any of them because its dispatch_q_ is saturated and its ConnFb is blocked.
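To make the scenario concrete, here is a minimal reproduction sketch. Assumptions that are mine, not the issue's: a Dragonfly/Redis-compatible server on 127.0.0.1:6379, plain POSIX sockets, and arbitrary command/iteration counts.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <string>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(6379);
  inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) return 1;

  // Phase 1: dispatch a huge pipeline without reading a single reply. Eventually this
  // write() blocks: the server has stopped reading because its dispatch_q_ is full and
  // its AsyncFb is blocked on replies we never consume.
  const std::string cmd = "*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$3\r\nval\r\n";
  for (int i = 0; i < 1'000'000; ++i) {
    if (write(fd, cmd.data(), cmd.size()) < 0) return 1;
  }

  // Phase 2: only now would the client switch to reading replies -- but with the
  // deadlock above, execution never reaches this point.
  char buf[4096];
  while (read(fd, buf, sizeof(buf)) > 0) {
  }
  close(fd);
  return 0;
}
```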
The solution
Unblock the ConnFb by offloading socket backpressure to disk. That way client requests won't stall, and the ConnFb will keep serving until the client switches to reader mode (at which point the AsyncFb will eventually unblock as well).
First approach: offload parsed commands to disk directly
As seen in the original prototype in https://github.com/dragonflydb/dragonfly/pull/6011. To summarize the flow here:
a. ConnFb -> b. ReadFromSocket -> c. ParseRequest -> d. Offload to disk or SendAsync
The problem with this approach is that the utility class that writes to and reads from disk in step (d) needs to serialize the command again before offloading it. The reason is that our in-memory structure for a parsed command can't be written to disk as-is and deserialized back, because its binary strings lack separation characters. We would like to avoid this ping-pong. Note that RESP3 is already a serialization protocol, so parsing and then re-serializing a given command is redundant and inefficient.
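To illustrate why step (d) is forced to serialize again, here is a sketch of the kind of re-encoding the offload utility would have to do. The helper name is hypothetical, not an existing Dragonfly function:

```cpp
#include <string>
#include <vector>

// Re-encode a parsed command (argument vector) back into a RESP array so the disk
// utility can recover argument boundaries later. The length prefixes restore the
// separation that raw binary arguments lack once they sit next to each other on disk.
std::string EncodeAsResp(const std::vector<std::string>& args) {
  std::string out = "*" + std::to_string(args.size()) + "\r\n";
  for (const std::string& a : args) {
    out += "$" + std::to_string(a.size()) + "\r\n";
    out += a;
    out += "\r\n";
  }
  return out;  // effectively the same bytes we already parsed -- the ping-pong we want to avoid
}
```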
Second approach: offload socket data to disk as-is. This way we don't need to re-serialize anything.
This removes the redundancy discussed in the previous section, but it comes with its own set of side effects.
Notice that the flow within the ConnFb is synchronous: the fiber first reads from the socket and then dispatches. What happens if the ConnFb is stuck on a client read, the AsyncFb must make progress, and the only data available is on disk (because the in-memory backpressure has been drained)? Naturally the answer should be that the AsyncFb reads from disk. Great, but who does the parsing if the ConnFb is blocked waiting for client input?
The problem here is that parsing is the responsibility of the ConnFb -- the AsyncFb is just there for execution. Without this separation of concerns, the producer/consumer behaviour of the two falls apart. And it's not only the logical semantics that matter -- had we delegated some of the parsing responsibility to the AsyncFb, we would have ended up with a much more complicated state machine shared between the two. This is not a good direction.
https://github.com/dragonflydb/dragonfly/issues/6006 discusses a redesign of pipelining.
Specifically, I quote:
- Stop using FiberRead and switch to OnRecv interface (https://github.com/romange/helio/pull/480). As OnRecv separates read notification from reading data, we can reduce reliance on per connection io_buf. OnRecv can be called at any time when a fiber is preempted. Meaning that it will be able to read data during its other preemption points like cmd.dispatch.
By switching the ConnFb socket read path to use OnRecv, we can unblock it from waiting for data to become available on the socket. That way, the ConnFb can keep parsing data, which solves the "parsing in separate fibers" issue.
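A rough sketch of that direction. The callback shape, buffer, and synchronization below are stand-ins, not the actual OnRecv interface from helio PR #480:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <vector>

class Connection {
 public:
  // Invoked by the proactor whenever bytes arrive, even while the ConnFb is preempted
  // elsewhere (e.g. inside a command dispatch). It only appends and signals; no parsing.
  void OnRecv(const uint8_t* data, size_t len) {
    {
      std::lock_guard lk(mu_);
      io_buf_.insert(io_buf_.end(), data, data + len);
    }
    cv_.notify_one();
  }

  // ConnFb loop: it no longer blocks inside a socket read, only on "no data yet", so
  // parsing remains a ConnFb responsibility even when input arrives out of band.
  void ConnFbLoop() {
    std::unique_lock lk(mu_);
    while (!closing_) {
      cv_.wait(lk, [this] { return !io_buf_.empty() || closing_; });
      ParseAndDispatch(io_buf_);  // consume whatever is buffered, then wait again
      io_buf_.clear();
    }
  }

 private:
  void ParseAndDispatch(std::vector<uint8_t>&) { /* RESP parsing + SendAsync/offload */ }

  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<uint8_t> io_buf_;
  bool closing_ = false;
};
```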
Bonus points: we now have the first step of #6006 in check:
- [ ] Integrate and test OnRecv in ConnFb #6028
- [ ] Implement a utility class that writes or reads a stream of data to disk #6029 (a rough sketch of one possible shape follows after this list)
- [ ] TlsSocket and OnRecv in helio https://github.com/romange/helio/pull/480/files#diff-e23cdb5eee2e5d12ad3b9853e48dcfdc99ddbb3670ee960206c491dbcbe79c5aR472
- [ ] Offload connection backpressure to disk + test #6030
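For the #6029 item above, here is a rough sketch of one possible shape for such a utility. The name, the framing format, and the stdio-based implementation are mine, purely illustrative:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>

class DiskBackedStream {
 public:
  explicit DiskBackedStream(const char* path) : f_(std::fopen(path, "w+b")) {}
  ~DiskBackedStream() { if (f_) std::fclose(f_); }

  // Producer side (ConnFb): append one chunk of raw socket data, length-prefixed so
  // chunk boundaries survive the round trip through the file.
  bool Write(std::string_view chunk) {
    uint32_t len = static_cast<uint32_t>(chunk.size());
    std::fseek(f_, 0, SEEK_END);
    return std::fwrite(&len, sizeof(len), 1, f_) == 1 &&
           std::fwrite(chunk.data(), 1, len, f_) == len;
  }

  // Consumer side: read back the next chunk in order, or return false once the
  // on-disk backlog has been fully drained.
  bool Read(std::string* out) {
    uint32_t len = 0;
    std::fseek(f_, read_off_, SEEK_SET);
    if (std::fread(&len, sizeof(len), 1, f_) != 1) return false;
    out->resize(len);
    if (std::fread(out->data(), 1, len, f_) != len) return false;
    read_off_ += static_cast<long>(sizeof(len) + len);
    return true;
  }

 private:
  std::FILE* f_;
  long read_off_ = 0;
};
```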
We can also use https://github.com/dragonflydb/dragonfly/pull/6011 to experiment and prototype.
The problem is that with a synchronous interface this is not possible to recognize: before you call Send() you cannot know whether it will get stuck or not. With SendAsync it is possible, but this is not how we designed our ReplyBuilder.
@romange I do understand the general direction and the root cause but I am still not sure about a specific implementation detail.
Regardless of how we dispatch at the connection level, eventually a write will get stuck in flight. What that means is that the submitted SQ entry will not complete and reach the completion queue. In other words, the Write() call will get fiber-blocked in the FiberCall::Get() method.
One way to solve this is, as you wrote, to replace blocking writes with asynchronous ones. I am not confident about this as it implies a core logic change in how we write to the socket within the ReplyBuilder.
So what I was really looking for here is:
a) For a write W of B bytes, is there a smart way to know, before I prepare/submit the write to the SQ, whether it would block and stay in flight? In other words, is there a way to monitor the send buffers so that we know beforehand that we are in this situation? Based on what I read online this can't really be done; we have no control over that. What there is, is IORING_RECVSEND_POLL_FIRST, which is a pessimistic heuristic -- it assumes the socket is not readable/writable. But again, this does not solve the problem: even if a socket is writable, that does not mean a write of B bytes will not fill/overflow the send buffers. (A minimal liburing snippet showing how this flag is armed follows after this comment.)
b) Maybe we could take a similar approach to the one we use for breaking stalled connections? We could cancel the in-flight write. The only problem here is that io_uring does not guarantee atomicity of that write in case of a cancel (I could be wrong, correct me if that's not the case) -- it could be that we are left in some partial state.
In short, the idea is to avoid an async interface on the ReplyBuilder to keep the scope of changes small and contained. If we can either figure it out early (before we fiber-block) or cancel the in-flight write, then the rest is just handling backpressure through some disk-backed fd :)
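For reference on (a): this is roughly how the IORING_RECVSEND_POLL_FIRST flag is armed with liburing (kernel 5.19+). As noted above, it only tells io_uring to poll for writability before attempting the send; it cannot tell us whether this particular write of B bytes will fit into the send buffers. The wrapper function name is hypothetical:

```cpp
#include <liburing.h>
#include <sys/socket.h>

void PrepareSend(io_uring* ring, int fd, const void* buf, size_t len) {
  io_uring_sqe* sqe = io_uring_get_sqe(ring);
  io_uring_prep_send(sqe, fd, buf, len, MSG_NOSIGNAL);
  // Assume the socket is not writable yet and arm a poll before issuing the send.
  sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;
}
```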
I think the misconception is that I asked to accurately recognise a blocked socket and do the offloading after the fact.
This is not what I am suggesting at all. I suggest having a static condition, say: when the queue grows beyond K items, or its size grows above N bytes, offload the subsequent items.
The deadlock you described happens due to the backpressure we apply - we stop reading from the socket, and the client is stuck at pipeline.execute() and cannot proceed with reading.
If we continue reading data (and offloading it) instead of stopping the reads, the deadlock will disappear.
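A minimal sketch of that static condition. The thresholds, member names, and queue type here are hypothetical, not Dragonfly's actual dispatch code; the point is that the ConnFb keeps reading and spills past the threshold instead of stalling:

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <string_view>

// Stand-in for the #6029 disk-backed stream; only the interface matters here.
struct DiskSpill {
  void Write(std::string_view) { /* append to the spill file */ }
};

class Connection {
 public:
  // Called by the ConnFb for every chunk of pipelined input it reads.
  void DispatchRead(std::string_view raw) {
    if (dispatch_q_.size() >= kMaxQueueItems || queue_bytes_ >= kMaxQueueBytes) {
      // Over the static threshold: keep reading from the socket, but spill to disk
      // instead of growing the queue. The backlog is drained once the queue catches up.
      spill_.Write(raw);
      return;
    }
    queue_bytes_ += raw.size();
    dispatch_q_.emplace_back(raw);  // normal in-memory path
  }

 private:
  static constexpr size_t kMaxQueueItems = 1024;       // "K items"
  static constexpr size_t kMaxQueueBytes = 16u << 20;  // "N bytes" (16 MiB)

  std::deque<std::string> dispatch_q_;
  size_t queue_bytes_ = 0;
  DiskSpill spill_;
};
```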
Gotcha -- you want to offload pipeline backpressure directly at the queue level.
Makes sense and I will follow up