DMA validation fails

Open vkomenda opened this issue 3 years ago • 17 comments

The DMA validation test fails as follows.

$ xbutil validate -d 0000:42:00.1 --verbose -r "DMA"
Verbose: Enabling Verbosity
Starting validation for 1 devices

Validate Device           : [0000:42:00.1]
    Platform              : xilinx_u55n_gen3x4_xdma_base_1
    SC Version            : 7.1.12
    Platform ID           : 9964C19C-DB53-EB6C-69B6-FD2072401583
-------------------------------------------------------------------------------
Test 1 [0000:42:00.1]     : DMA 
    Description           : Run dma test
    Error(s)              : DMA failed: Input/output error
    Test Status           : [FAILED]
-------------------------------------------------------------------------------
Validation failed

dmesg output

[11453.057972] [drm:xocl_userptr_bo_ioctl [xocl]] *ERROR* object creation failed user_flags 0, size 0x1000000
[11453.062161] xocl 0000:42:00.1: icap.u.22020096 ffff9282419fe810 icap_cached_ocl_frequency: no cached data for 3
[11463.302051] xocl:xdma_xfer_fastpath: Wait for request timed out
[11463.302053] xocl:xdma_xfer_fastpath: Wait for request timed out
[11463.302061] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000a3497aa0) = 0x1fc00006 (id).
[11463.302061] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x000000003a66a58c) = 0x1fc00106 (id).
[11463.302071] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000579efa37) = 0x00000001 (status).
[11463.302071] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x00000000381a3ef9) = 0x00000001 (status).
[11463.302077] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000f15cda8f) = 0x00f83e1f (control)
[11463.302077] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x0000000075ab0183) = 0x00f83e1f (control)
[11463.302080] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x0000000021389e02) = 0xfffc0000 (first_desc_lo)
[11463.302082] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x00000000ff4018b4) = 0xfff80000 (first_desc_lo)
[11463.302085] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x000000001d75168c) = 0x00000000 (first_desc_hi)
[11463.302086] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x000000008ec5bede) = 0x00000000 (first_desc_hi)
[11463.302089] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000e9a4d6fb) = 0x0000001f (first_desc_adjacent).
[11463.302090] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x0000000065410bfd) = 0x0000001f (first_desc_adjacent).
[11463.302093] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000eae91d1a) = 0x00000021 (completed_desc_count).
[11463.302095] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x00000000922ca738) = 0x00000020 (completed_desc_count).
[11463.302097] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000e019732e) = 0x00f83e1e (interrupt_enable_mask)
[11463.302099] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x000000001d4e7180) = 0x00f83e1e (interrupt_enable_mask)
[11463.302102] xocl:check_nonzero_interrupt_status: 0000:42:00.1 xdma0 user_int_enable = 0x00001e00
[11463.302103] xocl:check_nonzero_interrupt_status: 0000:42:00.1 xdma0 user_int_enable = 0x00001e00
[11463.302106] xocl:check_nonzero_interrupt_status: 0000:42:00.1 xdma0 channel_int_enable = 0x0000000f
[11463.302109] xocl:check_nonzero_interrupt_status: 0000:42:00.1 xdma0 channel_int_enable = 0x0000000f
[11463.302118] xocl 0000:42:00.1: dma.xdma.u.5242880 ffff928280067010 xdma_migrate_bo: DMA failed, Dumping SG Page Table, ep addr 0
[11463.302121] xocl 0000:42:00.1: dma.xdma.u.5242880 ffff928280067010 xdma_migrate_bo: DMA failed, Dumping SG Page Table, ep addr 8000000
[11463.302125] xocl 0000:42:00.1: dma.xdma.u.5242880 ffff928280067010 xdma_migrate_bo: 0, 0x3112b9000
[11463.302129] xocl 0000:42:00.1: dma.xdma.u.5242880 ffff928280067010 xdma_migrate_bo: 0, 0x1c36a9000
[11463.302129] xocl 0000:42:00.1: dma.xdma.u.5242880 ffff928280067010 xdma_migrate_bo: 1, 0x279ac9000
[11463.302133] xocl 0000:42:00.1: dma.xdma.u.5242880 ffff928280067010 xdma_migrate_bo: 2, 0x3f1a7d000
[11463.302135] xocl 0000:42:00.1: dma.xdma.u.5242880 ffff928280067010 xdma_migrate_bo: 3, 0x21aaf4000
...

vkomenda avatar Jan 03 '22 17:01 vkomenda

The same DMA error blocks execution of the Vitis tutorial examples, which effectively makes XRT useless (on kernel 5.15). Can you suggest how to debug this?

Edit: Tested on kernel 5.10; same error.

vkomenda avatar Jan 04 '22 14:01 vkomenda

This issue arises in the current master. Same output on 5.10, 5.11, and 5.15. The latter requires patch #6092 for kernels 5.14+. It's possibly a regression because 2.12.427 works fine.

vkomenda avatar Jan 06 '22 17:01 vkomenda

I have also experienced this on Alveo u50. Just recompiled master, xrt 2.13.0. Running kernel 5.13.0 on Ubuntu 20.04.03 LTS.

$ xbutil validate --device 0000:4b:00.1 --verbose -r "DMA"
Verbose: Enabling Verbosity
Starting validation for 1 devices

Validate Device           : [0000:4b:00.1]
    Platform              : xilinx_u50_gen3x16_xdma_201920_3
    SC Version            : 5.2.6
    Platform ID           : F465B0A3-AE8C-64F6-19BC-150384ACE69B
-------------------------------------------------------------------------------
Test 1 [0000:4b:00.1]     : dma
    Description           : Run dma test
    Error(s)              : DMA failed: Input/output error
    Test Status           : [FAILED]

dmesg shows

[  825.366780] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 _xocl_drvinst_open: OPEN 1
[  825.366834] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 xocl_create_client: created KDS client for pid(33515), ret: 0
[  825.368064] xocl 0000:4b:00.1: xmc.u.18874368 ffff9d8c61460c10 xmc_read_from_peer: reading from peer
[  825.368074] xocl 0000:4b:00.1: mailbox.u.9437184 ffff9d8c61467810 _mailbox_request: sending request: 10 via HW
[  825.443880] xclmgmt 0000:4b:00.0: mailbox.m.9437184 ffff9d8da556f410 process_request: received request from peer: 10, passed on
[  825.443886] xclmgmt 0000:4b:00.0: xclmgmt_read_subdev_req: req kind 0
[  825.443957] xclmgmt 0000:4b:00.0: mailbox.m.9437184 ffff9d8da556f410 mailbox_post_response: posting response for: 10 via HW
[  825.463336] xocl 0000:4b:00.1: icap.u.22020096 ffff9d8c61465c10 icap_read_from_peer: reading from peer
[  825.463344] xocl 0000:4b:00.1: mailbox.u.9437184 ffff9d8c61467810 _mailbox_request: sending request: 10 via HW
[  825.463630] xclmgmt 0000:4b:00.0: mailbox.m.9437184 ffff9d8da556f410 process_request: received request from peer: 10, passed on
[  825.463634] xclmgmt 0000:4b:00.0: xclmgmt_read_subdev_req: req kind 1
[  825.463645] xclmgmt 0000:4b:00.0: clock_wizard.m.27262976 ffff9d8c63a25810 clock_wiz_get_freq_by_id: freq = 300
[  825.463652] xclmgmt 0000:4b:00.0: clock_wizard.m.27262976 ffff9d8c63a25810 clock_wiz_get_freq_by_id: freq = 500
[  825.463659] xclmgmt 0000:4b:00.0: clock_wizard.m.27262976 ffff9d8c63a25810 clock_wiz_get_freq_by_id: freq = 450
[  825.465655] xclmgmt 0000:4b:00.0: clock_freq_counter.m.28311552 ffff9d8c63a20410 clock_counter_get_freq: khz: 300000
[  825.467650] xclmgmt 0000:4b:00.0: clock_freq_counter.m.28311552 ffff9d8c63a20410 clock_counter_get_freq: khz: 500000
[  825.469645] xclmgmt 0000:4b:00.0: clock_freq_counter.m.28311552 ffff9d8c63a20410 clock_counter_get_freq: khz: 450000
[  825.469649] xclmgmt 0000:4b:00.0: mailbox.m.9437184 ffff9d8da556f410 mailbox_post_response: posting response for: 10 via HW
[  825.470061] xocl 0000:4b:00.1: icap.u.22020096 ffff9d8c61465c10 icap_cached_ocl_frequency: no cached data for 3
[  825.523704] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 xocl_destroy_client: client exits pid(33515)
[  825.523708] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 xocl_drvinst_close: CLOSE 2
[  825.523710] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 xocl_drvinst_close: NOTIFY 0000000047877a5f
[  825.523729] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 _xocl_drvinst_open: OPEN 1
[  825.523793] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 xocl_create_client: created KDS client for pid(33515), ret: 0
[  825.528101] xocl 0000:4b:00.1: icap.u.22020096 ffff9d8c61465c10 icap_cached_ocl_frequency: no cached data for 3
[  825.528598] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 _xocl_drvinst_open: OPEN 2
[  825.528651] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 xocl_create_client: created KDS client for pid(33515), ret: 0
[  825.532775] xocl 0000:4b:00.1: icap.u.22020096 ffff9d8c61465c10 icap_cached_ocl_frequency: no cached data for 3
[  825.535549] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 xocl_destroy_client: client exits pid(33515)
[  825.535552] xocl 0000:4b:00.1:  ffff9d8c42c0f0c8 xocl_drvinst_close: CLOSE 3
[  825.578882] [drm:xocl_userptr_bo_ioctl [xocl]] *ERROR* object creation failed user_flags 0, size 0x1000000
[  825.583321] xocl 0000:4b:00.1: icap.u.22020096 ffff9d8c61465c10 icap_cached_ocl_frequency: no cached data for 3
[  835.675029] xocl:xdma_xfer_fastpath: Wait for request timed out
[  835.675034] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x0000000069d4d79f) = 0x1fc00106 (id).
[  835.675042] xocl:xdma_xfer_fastpath: Wait for request timed out
[  835.675233] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x00000000069e2858) = 0x00000001 (status).
[  835.675233] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000484f43bd) = 0x1fc00006 (id).
[  835.675422] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x0000000050a09caa) = 0x00f83e1f (control)
[  835.675423] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x0000000053dbbb01) = 0x00000001 (status).
[  835.675425] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x00000000d405ff68) = 0xfed80000 (first_desc_lo)
[  835.675425] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x000000001fa25302) = 0x00f83e1f (control)
[  835.675427] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x000000003fd66380) = 0x00000000 (first_desc_hi)
[  835.675428] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000f95b413a) = 0xfedc0000 (first_desc_lo)
[  835.675430] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x0000000054666220) = 0x00000009 (first_desc_adjacent).
[  835.675431] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x000000007a2ee338) = 0x00000000 (first_desc_hi)
[  835.675433] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x000000007bef9f07) = 0x00000001 (completed_desc_count).
[  835.675434] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000b1237d10) = 0x0000001f (first_desc_adjacent).
[  835.675436] xocl:engine_reg_dump: 0-H2C1-MM: ioread32(0x00000000f46c0d2f) = 0x00f83e1e (interrupt_enable_mask)
[  835.675436] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000250db681) = 0x00000102 (completed_desc_count).
[  835.675439] xocl:engine_reg_dump: 0-H2C0-MM: ioread32(0x00000000144b176d) = 0x00f83e1e (interrupt_enable_mask)
[  835.675439] xocl:check_nonzero_interrupt_status: 0000:4b:00.1 xdma0 channel_int_enable = 0x0000000f
[  835.675443] xocl:check_nonzero_interrupt_status: 0000:4b:00.1 xdma0 channel_int_enable = 0x0000000f
[  835.675493] xocl 0000:4b:00.1: dma.xdma.u.5242880 ffff9d8c61467010 xdma_migrate_bo: DMA failed, Dumping SG Page Table, ep addr 8000000
[  835.675497] xocl 0000:4b:00.1: dma.xdma.u.5242880 ffff9d8c61467010 xdma_migrate_bo: DMA failed, Dumping SG Page Table, ep addr 0
[  835.675928] xocl 0000:4b:00.1: dma.xdma.u.5242880 ffff9d8c61467010 xdma_migrate_bo: 0, 0x2b8cb3000
[  835.676417] xocl 0000:4b:00.1: dma.xdma.u.5242880 ffff9d8c61467010 xdma_migrate_bo: 0, 0x22ff92000
[  835.676664] xocl 0000:4b:00.1: dma.xdma.u.5242880 ffff9d8c61467010 xdma_migrate_bo: 1, 0x2b7300000
[  835.676911] xocl 0000:4b:00.1: dma.xdma.u.5242880 ffff9d8c61467010 xdma_migrate_bo: 1, 0x19cf49000
[  835.677157] xocl 0000:4b:00.1: dma.xdma.u.5242880 ffff9d8c61467010 xdma_migrate_bo: 2, 0x2bbb00000

HFTrader avatar Mar 08 '22 17:03 HFTrader

Check the card is seated correctly in the PCIe slot. I had to file the screw hole in the bracket for the card to stop being lifted by the retaining screw.

vkomenda avatar Mar 08 '22 18:03 vkomenda

@HFTrader, we are not seeing this in our daily tests. Is this issue reproducible on your machine? If yes, could you post the exact steps to reproduce it after a fresh cold reboot? Please also provide the entire dmesg output from when the error happens.

houlz0507 avatar Mar 08 '22 18:03 houlz0507

Yes, I can reproduce. A little bit of history here: since Ubuntu upgraded to 20.04.03 LTS, the kernel went to 5.13.0 and DKMS broke on XRT 2.8. So I rolled back the kernel to 5.8 and I have been living like this for the past 6 months. All the validation tests passed as configured.

Now I upgraded the kernel to 5.13.0 because many packages were being held back, some of them security updates, and I had to upgrade XRT as well. As XRT is not supported, I had to recompile it from XRT/master on GitHub. Here's what I did:

cd src/XRT
git checkout master
source /opt/Xilinx/Vivado/2021.1/settings64.sh
export PATH=/usr/bin:/bin:$PATH
cd build
./build.sh clean
./build.sh

Then I installed the Debian packages:

sudo dpkg -i Release/xrt_202210.2.13.0_20.04-amd64-xrt.deb Release/xrt_202210.2.13.0_20.04-amd64-xbflash.deb Release/xrt_202210.2.13.0_20.04-amd64-container.deb 

Then I installed the Alveo U50 files:

cd ~/Downloads
sudo dpkg -i xilinx-u50-gen3x16-xdma-201920.3-2784799_all.deb xilinx-sc-fw-u50_5.2.6-2.eef518f_all.deb xilinx-cmc-u50_1.0.30-2.3143895_all.deb xilinx-u50-gen3x16-xdma-dev-201920.3-2784799_all.deb

Run tests:

$ sudo /opt/xilinx/xrt/bin/xbutil validate --device 0000:4b:00.1 --verbose 
Verbose: Enabling Verbosity
Starting validation for 1 devices

Validate Device           : [0000:4b:00.1]
    Platform              : xilinx_u50_gen3x16_xdma_201920_3
    SC Version            : 5.2.6
    Platform ID           : F465B0A3-AE8C-64F6-19BC-150384ACE69B
-------------------------------------------------------------------------------
Test 1 [0000:4b:00.1]     : aux-connection 

    Description           : Check if auxiliary power is connected
    Details               : Aux power connector is not available on this board
    Test Status           : [SKIPPED]
-------------------------------------------------------------------------------
Test 2 [0000:4b:00.1]     : pcie-link 
    Description           : Check if PCIE link is active
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:4b:00.1]     : sc-version 
    Description           : Check if SC firmware is up-to-date
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 4 [0000:4b:00.1]     : verify 
    Description           : Run 'Hello World' kernel test
    Xclbin                : /opt/xilinx/firmware/u50/gen3x16-xdma/blp/test/verify.xclbin
    Testcase              : /opt/xilinx/xrt/test/22_verify.py
    Test Status           : [PASSED]
-------------------------------------------------------------------------------
Test 5 [0000:4b:00.1]     : dma 
    Description           : Run dma test
    Error(s)              : DMA failed: Input/output error
    Test Status           : [FAILED]
-------------------------------------------------------------------------------
Validation failed

I'm not sure if the firmware files are the correct ones?

The full dmesg is attached.

dmesg.error.txt xbutil_query.txt xbmgmt_scan.txt

HFTrader avatar Mar 09 '22 04:03 HFTrader

OK, so it is passing now. I changed a bunch of parameters in the BIOS and rebooted. I'm not sure which of them made the difference; I believe IOMMU or ACPI SRAT? Regardless, I'm saving the BIOS config to a pen drive and forgetting about it.

UPDATE: It looks like DMA failing due to IOMMU turned on is a known issue. Perhaps this should be noted in the installation guide? https://support.xilinx.com/s/article/71962?language=en_US @vkomenda are you sure it wasn't your case as well?

Now WHY this passed with older XRT/kernels is a mystery to me.

HFTrader avatar Mar 09 '22 16:03 HFTrader

@HFTrader Glad to know it works now. I noticed that you are using an AMD server. Yes, we have hit the DMA issue with AMD servers as well. We suspect a Linux kernel issue, but we are not sure what it is. Turning the IOMMU off will resolve it. By default, the IOMMU is on with AMD servers. You can also turn it off with the kernel argument 'amd_iommu=off'.
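A minimal sketch of how one might confirm that the 'amd_iommu=off' argument is actually in effect on the running kernel. The helper name `iommu_arg_state` is hypothetical, and the persistence steps in the comments assume an Ubuntu system booting through GRUB; other distributions and bootloaders differ.

```shell
#!/bin/sh
# Hypothetical helper: report whether a kernel command line contains
# the amd_iommu=off argument mentioned above.
# Usage: iommu_arg_state "$(cat /proc/cmdline)"
iommu_arg_state() {
    # Pad with spaces so the argument matches only as a whole word.
    case " $1 " in
        *" amd_iommu=off "*) echo "off" ;;
        *)                   echo "not-disabled" ;;
    esac
}

# Check the running kernel, if its command line is readable.
if [ -r /proc/cmdline ]; then
    echo "amd_iommu: $(iommu_arg_state "$(cat /proc/cmdline)")"
fi

# To make the setting persistent (assumption: Ubuntu with GRUB):
#   1. Append amd_iommu=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub
#   2. sudo update-grub
#   3. Reboot, then run this script again to confirm.
```

This only checks the boot argument; an IOMMU disabled in the BIOS instead will not show up here.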

houlz0507 avatar Mar 10 '22 05:03 houlz0507

@houlz0507 now that AMD bought Xilinx, I guess it increases the priority to have this working with AMD IOMMU. ;-)

keryell avatar Mar 11 '22 03:03 keryell

@keryell I hope so. :) Reported this to management.

houlz0507 avatar Mar 11 '22 03:03 houlz0507

@keryell @houlz0507 It is not that simple apparently. Today I had serious stability issues with having the u50 and a SF x2522 on the same box. I had to shuffle them through several slots until I found an arrangement that worked. Even with IOMMU turned off sometimes the validation failed with DMA error.

HFTrader avatar Mar 11 '22 06:03 HFTrader

Have you tried updating the BIOS? There are some bugs in the BIOS too, and finding the right version might help. :-(

keryell avatar Mar 11 '22 08:03 keryell

Yes, I did. I'm running Gigabyte's TRX40 AORUS MASTER rev 1.1 with a Threadripper 3960X. I upgraded to version "FC" but things got much worse, so I had to downgrade to "FB". It would initialize up to opcode '15' (pre-Northbridge initialization) and then fail. It would do this a few times, then give up and reset the BIOS. I found that Solarflare also uses DMA for card statistics, so that might play a role here. Removing the SF card makes the system more stable. Removing both cards (U50 and SF) makes it even more stable. At this point I have the system stable enough to carry on my development, but it is also a butt-clencher to reboot.

HFTrader avatar Mar 11 '22 12:03 HFTrader

@HFTrader , Could you attach a full dmesg when error happens with IOMMU off?

houlz0507 avatar Mar 11 '22 15:03 houlz0507

I was rebooting so often that the relevant log rolled over. I don't have it anymore. I will keep an eye out. If you tell me it's very important I can reshuffle the cards again so it breaks, but I don't want to go there, as the kernel lockup that ensued almost ditched my hard drive.
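One way to avoid losing the log next time is to save it right after a failed validation. The sketch below is illustrative only: the helper name `xrt_log_filter` and the output file name are hypothetical, and the `journalctl` line assumes a systemd journal with persistent storage enabled.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that mention the two
# driver names appearing in the logs in this thread (xocl and xclmgmt).
xrt_log_filter() {
    grep -E 'xocl|xclmgmt'
}

# Right after a failed validation, save the full kernel log under a
# timestamped name so that later reboots cannot overwrite it:
#   sudo dmesg > "dmesg-$(date +%Y%m%d-%H%M%S).txt"
#
# For a quick look at only the driver messages:
#   sudo dmesg | xrt_log_filter
#
# After a reboot, the previous boot's kernel log may still be available
# (assumption: persistent systemd journal):
#   sudo journalctl -k -b -1 | xrt_log_filter
```

The full, unfiltered file is the one to attach to the issue; the filter is only a convenience for reading.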

HFTrader avatar Mar 12 '22 00:03 HFTrader

@HFTrader, thanks for your reply. Please keep an eye on it. We will also try some AMD servers internally. Another thing: your server seems to be PCIe Gen5? You could try lowering your slot speed to Gen3 and see if that helps.
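Before and after changing the slot speed, the negotiated link speed can be read from `lspci -vv`. The following is a rough sketch with hypothetical helper names (`pcie_link_speed`, `pcie_gen_from_speed`); it assumes the usual `LnkSta: Speed ...` line format printed by pciutils, and that the slot speed itself is changed in the BIOS setup.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output on stdin and print the
# negotiated link speed from the LnkSta line (for example "8GT/s").
pcie_link_speed() {
    sed -n -E 's#.*LnkSta:[[:space:]]*Speed ([0-9.]+GT/s).*#\1#p'
}

# Hypothetical helper: map a per-lane transfer rate to its PCIe generation.
pcie_gen_from_speed() {
    case "$1" in
        2.5GT/s) echo Gen1 ;;
        5GT/s)   echo Gen2 ;;
        8GT/s)   echo Gen3 ;;
        16GT/s)  echo Gen4 ;;
        32GT/s)  echo Gen5 ;;
        *)       echo unknown ;;
    esac
}

# Example, using the device address from this thread:
#   speed=$(sudo lspci -s 4b:00.1 -vv | pcie_link_speed)
#   echo "$speed -> $(pcie_gen_from_speed "$speed")"
```

Root privileges are usually needed for `lspci -vv` to show the link capability and status fields.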

houlz0507 avatar Mar 12 '22 00:03 houlz0507

I have noticed on my U50 that I cannot pass the test with a block size of more than 4 KiB. Can you try adding --run dma --param dma:block-size:4096?
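Put together with the validate command used earlier in the thread, the suggestion might look like the sketch below. The wrapper name `dma_validate` and the `XBUTIL` override variable are hypothetical; the `--run` and `--param` options are copied verbatim from the comment above and have not been checked against any particular XRT release.

```shell
#!/bin/sh
# Path to xbutil; can be overridden, for example with
# /opt/xilinx/xrt/bin/xbutil as used earlier in this thread.
XBUTIL="${XBUTIL:-xbutil}"

# Hypothetical wrapper: run only the DMA validation test with a given
# block size.
#   $1 = device BDF (for example 0000:4b:00.1)
#   $2 = block size in bytes (for example 4096)
dma_validate() {
    "$XBUTIL" validate --device "$1" --run dma --param "dma:block-size:$2"
}

# Example:
#   dma_validate 0000:4b:00.1 4096
```

If 4096 passes, repeating the call with larger sizes would show where the failure starts on a given machine.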

keryell avatar Jun 21 '22 01:06 keryell