XDMA: timeout or signal delivery during xdma_xfer_submit breaks subsequent DMA

Open mindbeast opened this issue 5 years ago • 4 comments

I have a custom driver which pulls in libxdma. If I happen to interrupt an xdma_xfer_submit call while it's blocked on the transfer wait queue / swait queue (xfer->wq), I'm not able to submit any more DMAs until the driver is reloaded.

I'm fairly confident what the issue is, less so in the fix. After an xdma_xfer_submit call times out or gets a signal, the driver immediately calls:

(1) engine_status_read(engine, 0, 1);
(2) transfer_abort(engine, xfer);
(3) xdma_engine_stop(engine);

Step (3) above seems to stop the engine hardware asynchronously, after it finishes the descriptor it is currently working on. However, the software state that tracks whether the engine is running (engine->running) is not set to zero. This means that all future DMAs assume the engine is running and simply queue their transfers. The engine isn't running though, so they all time out until the driver is reloaded.

My quick but likely incorrect fix is to set engine->running = 0 after xdma_engine_stop(engine), but it's unclear to me if this is safe, given that the engine stop is asynchronous. Hoping that someone with more context can provide the correct resolution.
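The failure mode described above can be illustrated with a small user-space toy model. This is only a sketch: `toy_engine`, `toy_submit`, `toy_abort`, `toy_hw_tick` and `toy_scenario` are hypothetical names and are not libxdma code. The model also ignores the asynchronous nature of the real engine stop, which is exactly the part whose safety is in question.

```c
#include <stdbool.h>

/* Toy model only: hypothetical types and functions, not libxdma code. */
struct toy_engine {
    bool hw_running;  /* what the hardware is actually doing */
    bool sw_running;  /* driver's belief, analogous to engine->running */
    int  queued;      /* transfers waiting on the engine */
    int  completed;   /* transfers the hardware has finished */
};

/* Submit: queue the transfer, start the hardware only if believed idle. */
static void toy_submit(struct toy_engine *e)
{
    e->queued++;
    if (!e->sw_running) {
        e->hw_running = true;
        e->sw_running = true;
    }
}

/* Hardware progress: completes queued work only if really running. */
static void toy_hw_tick(struct toy_engine *e)
{
    if (!e->hw_running)
        return;
    e->completed += e->queued;
    e->queued = 0;
    e->hw_running = false;
    e->sw_running = false;
}

/* Timeout/signal path: drop the transfer and stop the hardware.
 * clear_flag selects whether the software flag is reset as well. */
static void toy_abort(struct toy_engine *e, bool clear_flag)
{
    e->queued = 0;
    e->hw_running = false;
    if (clear_flag)
        e->sw_running = false;
}

/* Returns how many transfers complete after one interrupted transfer. */
static int toy_scenario(bool clear_flag)
{
    struct toy_engine e = {0};

    toy_submit(&e);             /* transfer 1 starts the engine */
    toy_abort(&e, clear_flag);  /* interrupted before completion */
    toy_submit(&e);             /* transfer 2 */
    toy_hw_tick(&e);
    return e.completed;
}
```

In this model, `toy_scenario(false)` leaves the second transfer queued forever because the stale flag prevents a restart, while `toy_scenario(true)` lets it complete.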

mindbeast avatar May 01 '20 22:05 mindbeast

I have the same issue. I tried your workaround and it did not work for me. I think I've found two workarounds.

Workaround 1

Make sure your buffer size is a power of 2 (mine was 0x96000 and it failed, but once set to 0x100000 it worked flawlessly, with 0x100000 = 2²⁰).
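A minimal sketch of how a caller might round a buffer size up to the next power of 2 before allocating it. The helper names `is_pow2` and `round_up_pow2` are made up for this example and are not part of the driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if n is a non-zero power of 2. */
static bool is_pow2(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: smallest power of 2 that is >= n.
 * Returns 0 if the result would not fit in size_t. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    if (n > (SIZE_MAX >> 1) + 1)
        return 0;
    while (p < n)
        p <<= 1;
    return p;
}
```

With the sizes from this comment, `round_up_pow2(0x96000)` gives `0x100000`.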

Workaround 2

This requires a bit more work in your code. Typically when it fails, the driver times out. You can detect this from user space, since read returns a cryptic 512 error (512 is the kernel-internal ERESTARTSYS value).

So first, I've changed the timeout to something very small (via ioctl IOCTL_XDMA_TIMEOUT_SET on /dev/xdma0_c2h_0) so it's not noticeable.

It must be a bit longer than the expected transfer time (you are in guesswhatwonderland here). Then, when it fails, I issue two ioctls on /dev/xdma0_control: first XDMA_IOCOFFLINE, then XDMA_IOCONLINE.

This has the effect of resetting the driver to a known state (without resetting the PCIe device) and it's very fast. You can then retry the failed transfer and it will complete correctly, until the next failure.
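A user-space sketch of this recovery sequence, assuming a single retry is enough. The function name `read_with_recovery` is hypothetical, and the two ioctl request codes are passed in as parameters rather than hard-coded: in a real program they would be XDMA_IOCOFFLINE and XDMA_IOCONLINE taken from the driver's headers.

```c
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of the XDMA driver or its tools.
 * data_fd:     e.g. an open /dev/xdma0_c2h_0
 * ctrl_fd:     e.g. an open /dev/xdma0_control
 * req_offline: request code for XDMA_IOCOFFLINE (from the driver headers)
 * req_online:  request code for XDMA_IOCONLINE  (from the driver headers)
 * Returns the number of bytes read, or -1 on failure. */
static ssize_t read_with_recovery(int data_fd, int ctrl_fd,
                                  void *buf, size_t len,
                                  unsigned long req_offline,
                                  unsigned long req_online)
{
    ssize_t n = read(data_fd, buf, len);

    if (n >= 0)
        return n;

    /* The read failed (for example after a driver timeout):
     * cycle the device offline and online to reset driver state. */
    if (ioctl(ctrl_fd, req_offline) < 0)
        return -1;
    if (ioctl(ctrl_fd, req_online) < 0)
        return -1;

    /* Retry the transfer once. */
    return read(data_fd, buf, len);
}
```

A real caller would probably also inspect errno to decide whether the failure is the timeout case at all before cycling the device.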

Obviously, the first workaround is preferred but anyway...

X-Ryl669 avatar Jun 23 '20 16:06 X-Ryl669

Is fixed by #68

jberaud avatar Sep 17 '20 12:09 jberaud

That said, the log message could still be improved: when a transfer is interrupted, we could check whether wait_event_interruptible was interrupted by a signal before printing a trace that says we hit a timeout (which is not true in that case).
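As a sketch of that check, assuming the wait uses a timeout variant such as wait_event_interruptible_timeout: that macro returns 0 when the timeout elapsed with the condition still false, -ERESTARTSYS when a signal interrupted the wait, and a positive value when the condition became true. The snippet below is plain user-space C so it can be compiled on its own. `KERNEL_ERESTARTSYS`, `wait_outcome`, `classify_wait` and `wait_outcome_msg` are names made up for illustration, and the message strings are not the driver's actual log text.

```c
/* Value of the kernel-internal ERESTARTSYS, redefined here because it is
 * not exposed by user-space errno.h. */
#define KERNEL_ERESTARTSYS 512

enum wait_outcome { WAIT_DONE, WAIT_TIMEOUT, WAIT_SIGNAL };

/* Classify a wait_event_interruptible_timeout-style return code. */
static enum wait_outcome classify_wait(long rc)
{
    if (rc == -KERNEL_ERESTARTSYS)
        return WAIT_SIGNAL;   /* interrupted by a signal */
    if (rc == 0)
        return WAIT_TIMEOUT;  /* timeout elapsed, condition still false */
    return WAIT_DONE;         /* condition became true in time */
}

/* Pick a trace message that matches what actually happened. */
static const char *wait_outcome_msg(enum wait_outcome o)
{
    switch (o) {
    case WAIT_SIGNAL:
        return "transfer interrupted by signal";
    case WAIT_TIMEOUT:
        return "transfer timed out";
    default:
        return "transfer completed";
    }
}
```

This also explains the 512 seen from user space earlier in the thread: it is the ERESTARTSYS value from the interrupted wait rather than a timeout code.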

jberaud avatar Sep 17 '20 12:09 jberaud

Hello,

My name is Mark Harfouche. I am not affiliated with Xilinx in any way. Over the years of using QDMA, I've wanted better community organization.

I've created a fork of dma_ip_drivers which I intend to maintain and work with the community at large to improve.

The fork can be found at https://github.com/hmaarrfk/dma_ip_drivers

For now, I am stating the main goals of the repository in https://github.com/hmaarrfk/dma_ip_drivers/issues/2

If you are interested in working together, feel free to open an issue or PR to my fork.

Best,

Mark

hmaarrfk avatar Aug 22 '22 04:08 hmaarrfk