dma_ip_drivers

XDMA data corruption issue (0xFFFFFFFF every other read) was fixed yet was not mentioned anywhere

Open dmitrym1 opened this issue 11 months ago • 5 comments

TL;DR: if you have the same issue, you should upgrade to Vivado 2019.1 or newer. And if you are a Xilinx/AMD employee, you should write your documentation better.

I was using Vivado 2018.2 and the corresponding XDMA IP connected to an iMX8. I had to use @alonbl's patch set, yet I still faced issues #311 and #314. But there was one more issue that I couldn't explain and couldn't find anything about. I have AXI peripherals connected to the AXI DMA port and also some peripherals connected to the AXI Lite port for register access. After some time of my app working just fine, I start getting 0xFFFFFFFF instead of data on every other read. The kernel module gets register data corrupted the same way, which leads to #314 and some other issues, slowing everything down and eventually crashing the kernel. Unloading the kernel module does not help; the problem persists until system restart.

Debugging the kernel module led me to the ioread32 function, which already receives corrupted data, so the problem lies further down, in the XDMA IP itself. Looking at the Xilinx/AMD web site revealed that there are no support tickets, no design advisories, and that my IP core version (4.1) is the latest one. I don't know how much time I would have spent on this bug if I had not accidentally looked into the XDMA changelog from Vivado 2022.1. And there it is:

2019.1:
 * Version 4.1 (Rev. 3)
 ...
 * Bug Fix: Fixed back to back reads failure for 7series Gen2 DMA.
  ...

So first of all, the version is not 4.1, it is 4.1.X, which is not what the official publicly available documentation says. Second, I don't know whether this bug fix is for the issue I described above, because Xilinx did not share anything about it. How is a developer supposed to know that there was an issue and that it was fixed? So I'm doing Xilinx's work for them, sharing as much info as I can for those who face the same issue and google for a solution.

Example log; take a look at the fields that have 0xffffffff in them.

[  139.737742] xdma:engine_service: Engine was not running!!! Clearing status
[  150.049815] xdma:xdma_xfer_submit: xfer 0x00000000ed2983a6,576, s 0x1 timed out, ep 0xe240.
[  150.049829] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000ae7e6564) = 0x1fc10006 (id).
[  150.093194] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000003af4d3f7) = 0xffffffff (status).                       <= BUG
[  150.093204] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000069a83cc0) = 0x00000000 (control)
[  150.136564] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000ef7bce09) = 0x00f83e1f (first_desc_lo)
[  150.179928] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f04f09bc) = 0xffffffff (first_desc_hi).                       <= BUG
[  150.179939] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000003a241afc) = 0x00000000 (first_desc_adjacent).
[  150.223303] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000068406d0f) = 0xffffffff (completed_desc_count)..                       <= BUG
[  150.223315] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000002c7973e8) = 0x00f83e1e (interrupt_enable_mask)
[  150.266686] xdma:engine_status_dump: SG engine 0-C2H0-MM status: 0xffffffff: BUSY,DESC_STOPPED,DESC_COMPL,ALIGN_MISMATCH MAGIC_STOPPED INVALID_LEN IDLE_STOPPED,R:DECODE_ERR SLAVE_ERR,DESC_ERR:UNSUPP_REQ COMPL_ABORT PARITY HEADER_EP UNEXP_COMPL
[  150.266698] xdma:transfer_abort: abort transfer 0x00000000ed2983a6, desc 1, engine desc queued 0.

dmitrym1 avatar Jan 17 '25 14:01 dmitrym1

@dmitrym1 Where can I find @alonbl's patch set? Thanks.

And in my case the driver prints these error logs periodically; do you know what they mean?

Nov 30 03:00:26 ubuntu kernel: xdma:xdma_xfer_submit: xfer 0x0000000090ad39ee,4, s 0x1 timed out, ep 0xa8008080.
Nov 30 03:00:26 ubuntu kernel: xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000009f88c41) = 0xffffffff (id).
Nov 30 03:00:26 ubuntu kernel: xdma:engine_reg_dump: 0-C2H0-MM: engine id missing, 0xfff00000 exp. & 0xfff00000 = 0x1fc00000
Nov 30 03:00:26 ubuntu kernel: xdma:engine_status_read: Failed to dump register
Nov 30 03:00:26 ubuntu kernel: xdma:xdma_xfer_submit: Failed to read engine status

jason77-wang avatar Jan 19 '25 07:01 jason77-wang

@dmitrym1 Where can I find @alonbl's patch set? Thanks.

https://github.com/Xilinx/dma_ip_drivers/pull/240

alonbl avatar Jan 19 '25 10:01 alonbl

Hi @jason77-wang. Your log shows a failed transaction. The driver says it's a timeout, but since you got 0xffffffff from ioread32, I can say it's a communication problem. There could be various reasons for this result. I've seen the same log and had the same issue in my application, and it turned out to be an XDMA IP bug. In my case, a few dozen restarts of my software triggered the issue quickly and reliably; otherwise it would reproduce by itself after a few days of continuous operation. Once it gets into that state, it stays there until I restart the whole system. I updated Vivado to 2020.1, upgraded the IP cores, and that fixed the problem. The XDMA changelog says the problem should be fixed since 2019.1, so you could try updating too and see if that fixes the issue. If it does not, then unfortunately I won't be able to help you any further.

dmitrym1 avatar Jan 20 '25 13:01 dmitrym1

@dmitrym1 Where can I find @alonbl's patch set? Thanks.

#240

Thanks.

jason77-wang avatar Jan 23 '25 01:01 jason77-wang

Hi @jason77-wang. Your log shows a failed transaction. The driver says it's a timeout, but since you got 0xffffffff from ioread32, I can say it's a communication problem. There could be various reasons for this result. I've seen the same log and had the same issue in my application, and it turned out to be an XDMA IP bug. In my case, a few dozen restarts of my software triggered the issue quickly and reliably; otherwise it would reproduce by itself after a few days of continuous operation. Once it gets into that state, it stays there until I restart the whole system. I updated Vivado to 2020.1, upgraded the IP cores, and that fixed the problem. The XDMA changelog says the problem should be fixed since 2019.1, so you could try updating too and see if that fixes the issue. If it does not, then unfortunately I won't be able to help you any further.

Okay, got it. Thanks.

jason77-wang avatar Jan 23 '25 01:01 jason77-wang