libvfio-user
vfu_dma_read breaks if client sends message in between
There are cases where a libvfio-user public function (e.g. `vfu_dma_read`) sends a message to the client and then synchronously waits for the response. The problem is that right after the server has sent the message, the client might send its own message, which will be received by the server's blocking call even though the server is expecting a completely different message. `vfu_dma_read` is one of the functions that has this problem; effectively every public function that uses `vfu_msg` has the same problem, except the version negotiation ones.
`vfu_irq_trigger` has the same problem: the server might send a trigger IRQ message and immediately receive a new command from the client, while it's expecting to receive the response to the IRQ message it just sent.
A similar situation can occur when the client is waiting for a response from the server and the server happens to send it a `VFIO_USER_VM_INTERRUPT`. This is a valid use case, though; the client should be expecting such messages.
There are currently only two paths that have this issue, `vfu_dma_read()` and `vfu_dma_write()` via `tran->send_msg()`.
`vfu_irq_message()` (what Thanos was referring to above) has been removed for now.
is this issue still not resolved?
Still not resolved.
This issue has bitten me, here are some thoughts.
For context, I've been using libvfio-user to hook up PCI device models to hardware PCIe root port implementations, basically by bridging the PCIe transaction layer to VFIO-user. You can think of this as plugging a software-emulated PCIe card into a PCIe hardware design, where the hardware side is augmented with a VFIO-user client that talks to qemu as the vfio-user server.
In this scenario, I can't use memory-mapped DMA, because I need to inject DMA transactions back into PCIe hardware as memory read and write transaction layer packets. So I rely on VFIO-user's message-based DMA via `VFIO_USER_DMA_{READ,WRITE}` for my DMA transactions (side note: I have a qemu patch to add support for message-based DMA, happy to contribute that). However, with message-based DMA I immediately run into the problem described here: it's pretty common that MMIO traffic arrives while qemu is waiting for the reply to a DMA operation it has requested, so the synchronous wait-for-DMA-reply approach doesn't work. So, I'm interested in fixing this problem.
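The interleaving problem can be reproduced in miniature. The sketch below uses a Unix socket pair and made-up string messages (not the actual vfio-user wire format) to show that the first thing the "server" reads while blocked waiting for its DMA reply can be an unrelated command from the "client":

```python
import socket
import threading

def reproduce_interleave():
    """Return the first message the server reads while blocked waiting
    for its DMA reply. Framing and message names are made up for the
    demo; this is not the vfio-user wire format."""
    server_sock, client_sock = socket.socketpair()

    def client():
        # The client sends its own command (say, triggered by guest
        # MMIO) before it gets around to servicing the DMA read.
        client_sock.sendall(b"CMD mmio-write\n")
        assert client_sock.recv(64).startswith(b"CMD dma-read")
        client_sock.sendall(b"REPLY dma-data\n")

    t = threading.Thread(target=client)
    t.start()
    server_sock.sendall(b"CMD dma-read\n")
    first = server_sock.recv(64)   # server expects "REPLY dma-data"...
    t.join()
    return first                   # ...but a command arrives first
```

The client's command was written to the stream before its reply, so the server's blocking read is guaranteed to see the unexpected command first.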
Here are a couple of avenues I can think of:
1. While waiting for a DMA reply message, don't bail if a command arrives, but queue the command message up. Once the DMA reply comes in, process any queued commands. I have a very crude implementation of this to unblock my work; it certainly feels like a hack (see also below).
2. Rewrite libvfio-user to work fully asynchronously, allowing commands and replies to be received in arbitrary order. This would also require API changes: `vfu_sgl_read` and `vfu_sgl_write` would have to become asynchronous and report their result via a callback.
3. Tackle the root cause of the issue: don't allow command messages in both directions over the socket. AFAICT, `VFIO_USER_DMA_{READ,WRITE}` are the only instances of server -> client commands (note that interrupts, which are the other server -> client interaction, already use event fds). So, it seems realistic to deprecate server -> client commands on the socket entirely (they've never worked properly in the libvfio-user implementation in the light of this bug anyway) and do message-based DMA in a different way: introduce a third flavor of `VFIO_USER_DMA_MAP` that passes a separate socket file descriptor to the server, which the server then uses to send `VFIO_USER_DMA_{READ,WRITE}` commands to the client. Note that this DMA socket would only carry commands in the server -> client direction. This would be a protocol-level change, but one that fits cleanly into the existing protocol architecture.
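To make option (1) concrete, here's a minimal sketch of queuing commands that arrive while synchronously waiting for a reply. The message representation and names are illustrative, not the actual libvfio-user structures:

```python
import collections

# Hypothetical message representation; field names are illustrative.
Msg = collections.namedtuple("Msg", ["msg_id", "is_reply", "payload"])

class Channel:
    """Sketch of approach (1): while synchronously waiting for the
    reply to a command we sent, queue up any commands the peer sends
    in the meantime and process them after the reply arrives."""

    def __init__(self, recv_fn):
        self.recv_fn = recv_fn              # blocking receive of one message
        self.pending = collections.deque()  # commands queued during the wait

    def wait_for_reply(self, msg_id):
        while True:
            msg = self.recv_fn()
            if msg.is_reply and msg.msg_id == msg_id:
                return msg                  # the reply we were waiting for
            # Not our reply: a command from the peer arrived in between.
            # Queue it instead of failing; note the queue is unbounded.
            self.pending.append(msg)

    def drain(self, handler):
        # Process commands that were queued while we were blocked.
        while self.pending:
            handler(self.pending.popleft())
```

This also illustrates the two costs mentioned below: the queue is unbounded in principle, and the poll fd must now signal readiness for queued messages as well as socket data.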
Assessing these options:
- (1) is nice in that it can be done entirely inside libvfio-user, without any API changes. (For completeness, note that `vfu_get_poll_fd` can no longer return just the socket file descriptor, since it must also indicate readiness for queued messages, but there's a workaround using eventfd + epoll.) (2) and (3) imply API changes, and (3) even requires a protocol-level change.
- (2) is fundamentally the hardest to implement for clients and servers, since synchronous request-reply code can't easily be fixed. Even if we were to make libvfio-user's message handling completely asynchronous, switching DMA API functions to become asynchronous will have ripple effects. For example, qemu's DMA handling is currently synchronous and probably isn't practical to change to asynchronous, thus forcing qemu to implement a variant of (1) to hide the asynchronicity. The implementation of (1) is a bit messy since we need to buffer a potentially unlimited number of messages in `vfu_ctx`, but perhaps that can be limited to a reasonable value in practice? (3) is pretty simple to implement since it is independent of the main socket.
- Whatever solution we pick, there's a potential for deadlocks if both sides choose to use synchronous I/O. For example, the server might wait for a DMA operation while the client waits for an MMIO access to complete. At least one side will have to process messages "out of band", and thus implement some form of asynchronicity. IMO, (3) provides the most flexibility in that the client can choose either to be fully asynchronous and listen on the main socket as well as any DMA sockets, or to service the DMA socket(s) on a different thread. Implementations that don't require message-based DMA can choose simpler synchronous implementations. qemu can keep its synchronous DMA implementation (at the cost of the client needing to be able to handle asynchronous DMA requests, but it seems fair to pay that cost in the client if the client wants message-based DMA).
Overall, I like (3) best. I'm interested to hear other thoughts / opinions. Assuming we can reach consensus on how to move forward, I'll be willing to implement a solution.
I do like idea (3) the most. I even think we entertained this idea of having a separate socket for server -> client messages in the very early days, when we were implementing the protocol specification, but we didn't have a solid use case at that point and decided not to bake it into the protocol. I don't think modifying the protocol is an issue; it's still experimental in QEMU (and we're still going to modify the protocol when we adopt migration v2). Let's hear from @jlevon and @jraman567.
Thanks @tmakatos. If anyone sees issues with approach (3) please let me know - I hope to have some cycles next week to start working on this.
plugging a software-emulated PCIe card into a PCIe hardware design
Cool!
note that interrupts, which are the other server -> client interaction, already use event fds
While that's true today, that wouldn't be true if we ever wanted to do other implementations, over TCP/IP to a remote host, for example.
So I don't think it would make sense to build something specific to the DMA path.
This problem is one of the reasons we didn't want to stabilize the library yet. I agree with the enumeration of the three possible approaches. I don't really like 1).
I'd sort of be OK with 3), though I can't speak for the qemu or vfio-user client maintainers. It feels like a bit of a workaround for inadequate libvfio-user design, though, to be honest. It would also mean that async consumers of the API still need to block during device->guest DMA transactions, which is an issue in itself. The separate socket of 3) doesn't help at all there. That is, the DMA read/write APIs are fundamentally broken anyway.
2) is admittedly a lot more work, but it would then mirror what the client side does, which can handle this problem. So ideally we'd do 2). Any thoughts on implementing that, Mattias? I haven't really thought through the difficulties of implementation in terms of libvfio-user itself, but I think it shouldn't be too difficult:
- `->send_msg()` no longer takes or waits for a reply. In fact, we somehow make the support for synchronous request-then-reply usable only by samples/client.c
- it takes a callback and a private data pointer; we store these somewhere: we probably need a list/hash for multiple outstanding requests
- `is_valid_header()` changes to allow replies just for DMA
- `handle_request()` takes the reply and uses the given callback
- change quiesce to ensure no DMA op is ongoing somehow
- same thing for migration state
- TBD if we need to allow for cancellation: I'd guess not, since we can't control whether the client side is going to write into the provided buffer.
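The outstanding-request bookkeeping described above might look roughly like this. All names and signatures here are illustrative, not the actual libvfio-user internals:

```python
import itertools

class AsyncChannel:
    """Sketch of the fully-async direction (2): send_msg() no longer
    blocks for a reply; instead the caller registers a callback that is
    invoked when the matching reply arrives."""

    def __init__(self, send_fn):
        self.send_fn = send_fn
        self.next_id = itertools.count(1)
        self.outstanding = {}   # msg_id -> (callback, private data)

    def send_msg(self, payload, callback, pvt=None):
        # Record the callback before sending, keyed by message id, so
        # multiple requests can be outstanding at once.
        msg_id = next(self.next_id)
        self.outstanding[msg_id] = (callback, pvt)
        self.send_fn(msg_id, payload)
        return msg_id

    def handle_incoming(self, msg_id, is_reply, payload):
        if is_reply:
            # Match the reply to an outstanding request and complete it.
            callback, pvt = self.outstanding.pop(msg_id)
            callback(payload, pvt)
        else:
            self.process_command(msg_id, payload)

    def process_command(self, msg_id, payload):
        pass  # dispatch to region access / IRQ / ... handlers
```

With this shape, commands and replies can arrive in any order, which is exactly what removes the original constraint.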
It feels like a bit of a workaround for inadequate libvfio-user design, though, to be honest. It would also mean that async consumers of the API still need to block during device->guest DMA transactions, which is an issue in itself. The separate socket of 3) doesn't help at all there. That is, the DMA read/write APIs are fundamentally broken anyway.
I see what you're saying, but for clarity we should separate the discussion into two concerns:
- What kind of DMA APIs we provide: Synchronous, asynchronous or both. The decision should be informed by the needs of the consumers of the API. See below.
- What the right protocol-level implementation is. The decision here is mostly determined by whether the protocol can support the APIs we need to support in a reasonable way.
So ideally we'd do 2) - Any thoughts on implementing that Mattias?
The libvfio-user changes to go fully async are relatively straightforward, I agree. The challenge is for code that consumes the API, because consuming async APIs causes ripple effects. If you look at qemu, for example, it relies on the synchronous `pci_dma_read` and `pci_dma_write` APIs. Switching these APIs to async would be a major effort in qemu, and it would mean touching all the hardware model implementations to make their DMA request code paths async. IMO, that's in the ain't-gonna-happen bucket, at least unless qemu has appetite for changing its DMA APIs for additional reasons.
If we assume that qemu hardware models will stick with a synchronous DMA API, then there are two paths:
- libvfio-user offers both sync and async DMA APIs and lets the consumer choose what works for them. That's pretty simple to support if we have separate sockets.
- libvfio-user's DMA API is async only, and qemu needs to adapt that to a synchronous API somehow. It's questionable whether this is at all possible. The only thing I can think of would probably look similar to the queuing approach I described as (1) above, but at the libvfio-user API surface level - ugh.
Bottom line: I don't really see how a libvfio-user that doesn't provide a synchronous DMA API can be consumed by qemu. This arguably causes some design warts, but a middle ground design that allows both sync/async DMA doesn't seem such a bad way forward, and it also gives API consumers an upgrade path to async.
into two concerns
Sure, but one informs the other: right now, we can't provide an async API, because the underlying API is sync. While it's easier to build a sync API on top of an async implementation.
Switching these APIs to async
Yes, I would expect there to be both sync and async APIs, precisely for the sorts of reasons you are mentioning. Of course, the former would basically still need to insist on there being no interleaving, so it would still have the issue. That's true of form 3) as well, since we'd have to return to `vfu_run_ctx()` for the reply anyway.
It's possible the sync version could read then re-queue any intermediate messages, perhaps, but there's risks there in the cases of quiesce, migration, unmap, and so on.
It's possible the sync version could read then re-queue any intermediate messages, perhaps, but there's risks there in the cases of quiesce, migration, unmap, and so on.
Plus, it's pretty heavy-weight, since at least in theory it needs to queue up an unbounded number of messages of arbitrary sizes. It works, though; it's approach (1) in my post above, and I have a hack that I'm currently running with.
It seems like we're all in agreement that we should have async DMA (in addition to sync) at the libvfio-user API level. Then the open question is whether we do the queuing approach, or we introduce separate socket(s) for the server->client direction.
Trying to summarize pros/cons (please call out anything I have missed):
queuing:
- (+) no protocol changes required
- (+) conceptually cleaner: single communication channel
- (-) need to allocate unpredictable amounts of memory to hold queued messages
- (-) vfio-user implementations are forcibly async with the need to match replies to commands
separate socket(s):
- (+) can process any message as it's read from the socket; while handling sync DMA, other commands stay on the main socket, giving natural back pressure to the client.
- (+) simplified implementations are possible that handle the sockets using send-request-read-reply loops (and can achieve async by using threads). Judging from what's out there, this style is popular; see for example https://github.com/rust-vmm/vfio-user/blob/main/src/lib.rs
- (-) protocol change
As is probably obvious by now, I'm leaning towards separate socket(s), mostly because I think it's simpler and makes vfio-user easier to implement, but I can totally relate to the single-socket solution feeling conceptually cleaner. In the end, I don't feel strongly, though; I'm just hoping to reach closure so we can move forward.
I'm not clear on how a separate socket helps at all with providing a sync API; since the reply arrives on the original socket still, you still need to somehow process other incoming messages.
Ah, so we've been talking past each other: I'm not intending to send the DMA reply on the main socket.
Let me recap the separate socket idea again to clarify:
- The client obtains a pair of sockets by calling `socketpair()`; let's call them `dma_client_fd` and `dma_server_fd`, respectively.
- When issuing the `VFIO_USER_DMA_MAP` command, the client passes a flag to indicate a new mode of performing DMA operations (let's call it `VFIO_USER_F_DMA_REGION_SOCKET_IO`), as well as `dma_server_fd`.
- When the server wants to perform DMA for the mapping in question, it puts a `VFIO_USER_DMA_READ` command on `dma_server_fd`.
- The client monitors and reads `dma_client_fd`. When it receives the `VFIO_USER_DMA_READ` command, it performs the DMA operation and sends back the reply on `dma_client_fd`.
This enables the DMA operation to be handled "out of band" of the main socket. Thus, a synchronous write-DMA-command-then-read-back-reply on the sockets is unproblematic (since commands on the main socket are only ever sent by the client, and commands on the DMA socket are only ever sent by the server, the condition that is the root cause of the bug never occurs). Furthermore, the server can choose to synchronously wait for the DMA operation to complete and ignore all other I/O on the main socket during that time (this is what I was referring to when I said this is simpler to implement).
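The flow above can be sketched end to end with a real socket pair. Plain strings stand in for vfio-user messages here (the real protocol uses binary headers, and the fd would be passed to the server via SCM_RIGHTS alongside `VFIO_USER_DMA_MAP`):

```python
import socket
import threading

def make_dma_channel():
    # Client creates the pair and would hand dma_server_fd to the server.
    dma_client_fd, dma_server_fd = socket.socketpair()
    return dma_client_fd, dma_server_fd

def client_dma_service(dma_client_fd, memory):
    # Client side: service one DMA command on its end of the pair.
    # Commands only ever flow server -> client here, so a simple
    # read-command, send-reply loop cannot see an unexpected message.
    cmd = dma_client_fd.recv(4096).decode()
    op, addr, length = cmd.split()
    assert op == "DMA_READ"
    data = memory[int(addr):int(addr) + int(length)]
    dma_client_fd.sendall(data)

def server_dma_read(dma_server_fd, addr, length):
    # Server side: synchronous command-then-reply on the DMA socket,
    # without touching the main socket at all.
    dma_server_fd.sendall(f"DMA_READ {addr} {length}".encode())
    return dma_server_fd.recv(length)
```

The client can run `client_dma_service` on a dedicated thread (or in its event loop) while the main socket continues to carry only client-initiated commands.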
Note that this DMA socket would only carry commands in the server -> client direction.
I think I read "messages" here instead of "commands", hence the confusion. It's bi-directional, in fact.
So, yeah, that sounds reasonable.
oops
@mnissler-rivos I asked Jag about this on slack if you'd like to join
I've just started reading the rest of the discussion, but the following just occurred to me:
There are cases where a libvfio-user public function (e.g. vfu_dma_read) sends a message to the client and then synchronously waits for the response.
What's the point of the vfio-user server requiring a response from the client for a DMA command? If the client doesn't like the command it received, the server doesn't need to know about that specific fact, the client might deal with it any way it sees fit (e.g. reset the device). So if we remove this behavior there's no problem in the first place, right?
DMA read requires a response with the data
What's the point of the vfio-user server requiring a response from the client for a DMA command?
For a DMA read, it wants the data from the client. There is no conceptual reason why that couldn't be delivered asynchronously, though. In the case of qemu, the device models often have code that performs DMA reads to get the next network packet and then processes it, for example, and that code is mostly synchronous right now. So, IMHO we should definitely provide an async API, but keep a synchronous version to avoid breaking more simplistic server designs.
DMA read requires a response with the data
D'oh of course.
Furthermore, the server can choose to synchronously wait for the DMA operation to complete and ignore all other I/O on the main socket during that time (this is what I was referring to when I said this is simpler to implement).
What if the client never responds to the DMA read? What if it decides to reset the device, so the server should actually be paying attention to the main socket? I guess the client will have to start making assumptions that the server won't be paying attention to the main socket so it first has to fail the DMA read?
Yep - we'll need to clearly define exact behaviour in these sorts of cases
I guess it boils down to what real HW would do: if it was in the process of doing DMA would it be handling MMIO in the first place?
I don't think there's a universal answer on how hardware behaves. At the PCIe TLP level, all packets are tagged in order to match up completions, so there aren't any inherent restrictions. I'm also not aware of any requirements to that effect in the PCIe spec, but I'll ask a coworker who knows it much better than me to make sure. That said, the spec does include a completion timeout mechanism (section 2.8) to deal with situations where an expected reply never arrives.
From a VFIO software perspective, robust clients and servers should ideally not make any assumptions on communication patterns in my opinion. Having implemented a fully asynchronous client now, I am painfully aware though that this can be challenging in practice. Plus, you can't just mandate full async and ignore the practical constraints in existing software such as qemu.
Note that separating the sockets offers increased flexibility in implementing a correct client: it can spawn a thread to handle DMA separately from the main socket, and unless there is problematic cross-thread synchronization, everything will be fully async safe even without switching to an event-based design.
FWIW, I have implemented a first draft of the separate socket approach here: https://github.com/mnissler-rivos/libvfio-user/compare/master...mnissler-rivos:libvfio-user:socket_dma
I've tested it with qemu's vfio-server implementation, and it seems to work fine. It's still lacking tests and the async API, though. I wanted to share what I have since I'll be out until Monday.
I don't think there's a universal answer on how hardware behaves.
Fair enough. I think the dual-socket approach makes sense.
FWIW, I have implemented a first draft of the separate socket approach
Looks sensible to me, though not sure I follow the distinction between `DMA_ACCESS_MODE_MESSAGE` and `DMA_ACCESS_MODE_SOCKET`, since they both use sockets. Are you treating `DMA_ACCESS_MODE_MESSAGE` as potentially something different, not a socket? If we didn't make this distinction then the patch would be simpler.
Regardless, we need to update the spec in qemu-devel and document corner cases, e.g. what happens if the server sends a DMA request on the main socket while there's a specific DMA socket for that region?
Also, it would be nice to have some kind of Python-based unit test, which could also act as a sample (e.g. a server with two DMA regions, one accessible via its own socket, the other not, and the client doing DMA transfers on both regions).
Looks sensible to me, though not sure I follow the distinction between DMA_ACCESS_MODE_MESSAGE and DMA_ACCESS_MODE_SOCKET, since they both use sockets. Are you treating DMA_ACCESS_MODE_MESSAGE as potentially something different, not a socket? If we didn't make this distinction then the patch would be simpler.
`DMA_ACCESS_MODE_MESSAGE` is the old behavior with the DMA command passed over the main socket. I was (perhaps incorrectly?) assuming that we'll want to keep existing behavior for backwards compatibility and enable the new way on request by the client.
Regardless, we need to update the spec in qemu-devel and document corner cases, e.g. what happens if the server sends a DMA request on the main socket while there's a specific DMA socket for that region?
Interesting point. I guess it would make sense to reject, but given that you can technically have multiple DMA mappings, each with its own mode, I'm not so sure. If we conclude that we won't support VFIO_USER_DMA_{READ,WRITE} on the main socket, then it makes sense to reject them obviously.
Also, it would be nice to have some kind of Python-based unit test, which could also act as a sample (e.g. a server with two DMA regions, one accessible via its own socket, the other not, and the client doing DMA transfers on both regions).
Yes, I will add that before sending an actual pull request.
Looks sensible to me, though not sure I follow the distinction between DMA_ACCESS_MODE_MESSAGE and DMA_ACCESS_MODE_SOCKET, since they both use sockets. Are you treating DMA_ACCESS_MODE_MESSAGE as potentially something different, not a socket? If we didn't make this distinction then the patch would be simpler.
DMA_ACCESS_MODE_MESSAGE is the old behavior with the DMA command passed over the main socket. I was (perhaps incorrectly?) assuming that we'll want to keep existing behavior for backwards compatibility and enable the new way on request by the client.
I misremembered the semantics of `VFIO_USER_DMA_MAP`: previously we could get an FD which meant we could `mmap(2)` it; now we might get an FD which is not a file but a socket, so we'd have to `send(2)` to it, therefore the client needs to tell the server what kind of FD it is. (I think this could be a useful explanation in the commit message?) So the distinction you make is necessary.
What I wanted to avoid is the special casing in `vfu_dma_transfer()`, but I see it's necessary since `vfu_ctx->tran->send_msg()` uses the default transport FD. I can't come up with a nicer way; perhaps add an extra `fd` argument to `vfu_ctx->tran->send_msg()` that, if it's not `-1`, is used instead? Not sure it's worth it TBH.
Regardless, we need to update the spec in qemu-devel and document corner cases, e.g. what happens if the server sends a DMA request on the main socket while there's a specific DMA socket for that region?
Interesting point. I guess it would make sense to reject, but given that you can technically have multiple DMA mappings, each with its own mode, I'm not so sure. If we conclude that we won't support VFIO_USER_DMA_{READ,WRITE} on the main socket, then it makes sense to reject them obviously.
makes sense
Hi, I got to think about this problem for a bit.
To step back, we have assumed that the QEMU end would be the client and the remote end would be the server. This assumption is not accurate: when the remote end initiates DMA reads, it takes the role of the client. As such, there is a possibility of a deadlock when both ends initiate requests simultaneously.
We would need a lock to manage concurrent access to the communication channel. Given that QEMU and the remote are in different processes, it may take a lot of work to implement a locking model correctly and fairly. Please advise if any C library implements locking between different processes.
Given the above, I suggest going with two socket pairs: QEMU would exclusively initiate requests on one channel, and the remote end would do the same with the other one. Please share your thoughts.
vfio-user, as a protocol, shouldn't be encoding stuff like processes, so any kind of locking is a non-starter. But, anyway, I don't accept that there is a deadlock here; that depends entirely on the implementation of both sides. You'd only have a deadlock in an ABBA sort of situation, which is not what we have here.
Practically speaking, we have to think through the reset etc. scenarios for sure, but sounds like consensus is the separate socket thing.
Thanks @jraman567 for sharing your thoughts, and I'm happy to hear we are aligned on using a separate socket for server->client commands and their replies.
@jraman567 what do you think about my specific proposal to optionally pass another socket in the `VFIO_USER_DMA_MAP` request? This essentially establishes a server->client communication channel per DMA mapping. As an alternative, we could set up a single socket for reverse communication as part of negotiation. The former fits more naturally into the current structure of the protocol IMO; the latter is probably preferable if we anticipate adding more server-initiated commands in the future.
I'm not really a big fan of tying it to DMA_MAP. I'd rather have some negotiate step provide an fd.