edgetpu
Installation failing on Raspberry Pi CM4 for PCI-E driver
Following the installation guide for the M.2 module, I get several compilation errors when it's trying to install gasket. Here is the log of the make process: gasket-make.log
It seems it's mostly the same four errors:
invalid use of undefined type ‘struct msix_entry’
implicit declaration of function ‘writeq_relaxed’; did you mean ‘writel_relaxed’
implicit declaration of function ‘readq_relaxed’; did you mean ‘readw_relaxed’
implicit declaration of function ‘pci_disable_msix’; did you mean ‘pci_disable_sriov’
This is with gcc version 8.3.0 on the latest Raspbian with kernel 5.4.51-v7l+.
I'm unsure whether this is a compiler, kernel header, or code issue.
Hello @timonsku, we have investigated the CM4 previously and unfortunately determined that it won't work with our PCIe modules, as the CPU doesn't have the MSI-X support that our modules require.
Hey Namburger, the pi engineers have worked on this and have added support for MSI-X in the latest kernel. See this forum discussion: https://www.raspberrypi.org/forums/viewtopic.php?p=1772216&sid=fa34ae6597591c1f80cb68c8138c6a67#p1772216
As I mentioned, we have explored this path and there is still a little ongoing effort, but I don't believe it is something we can promise. @mbrooksx might be able to give you more info on this.
Oh I see. If it doesn't turn out to be a true hw limitation I would be very interested in seeing this getting supported. I currently have hardware in development that would see good use of the M.2 modules.
@timonsku Unfortunately this ARM hardware does not support MSI-X. The raspberry pi discussion you referenced raised my hopes that limited performance with emulated interrupts might work. Although it still does not work, the on-going work is encouraging, and might lead to performance nearly as good as if the original MSI-X hardware interrupts were on the ARM silicon. Stay tuned!
@timonsku : Yes, I'm actively working with the people in the Pi forum discussion. While MSI-X isn't technically supported by the BCM2711, as you saw from that patch if SW indicates it works then the PCIe hardware is actually able to map some MSI-X interrupts correctly.
We've validated further than you have (including MSI-X). Your errors occur because you're building for the 32-bit kernel, but the driver expects 64-bit reads/writes (which is why writeq/readq don't exist). My plan is to customize the driver for the Pi (including 32-bit workarounds) and likely submit it to the Pi kernel rather than trying to update our DKMS package. I will keep you informed of the status.
Awesome that is great to hear :)
Great to hear that somebody is working on this issue! Already received my RPI CM4 + IO Board + PCIe Coral acc. Any news? Maybe I can help?
Has anyone had a go at this? I've done a bit of debugging and hacking myself and got the kernel module to load and libedgetpu to start an inference (although it never finishes, some event is missing, and there is an HIB error?).
There are some changes needed in both the kernel module and the user-space drivers, so far primarily replacing 64bit memory accesses with two 32bit ones. My progress is here for the module which I have updated to the latest version from the dkms package and here for libedgetpu, but these changes are of course nowhere near merge-quality.
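As a rough illustration of the "two 32-bit accesses instead of one 64-bit access" idea, here is a minimal Python model. It is not the driver code from either repo: `FakeBar` is a hypothetical stand-in for a register window that only accepts 32-bit accesses, and the low-word-first ordering is an assumption. The `w0`/`w1` naming follows the read lines in the log below.

```python
import struct


class FakeBar:
    """Hypothetical stand-in for a register window limited to 32-bit accesses."""

    def __init__(self, size):
        self._mem = bytearray(size)

    def read32(self, offset):
        # Little-endian 32-bit load
        return struct.unpack_from("<I", self._mem, offset)[0]

    def write32(self, offset, value):
        # Little-endian 32-bit store; anything above bit 31 is dropped
        struct.pack_into("<I", self._mem, offset, value & 0xFFFFFFFF)


def write64_split(bar, offset, value):
    """Emulate a 64-bit write with two 32-bit writes (low word first, by assumption)."""
    bar.write32(offset, value & 0xFFFFFFFF)
    bar.write32(offset + 4, (value >> 32) & 0xFFFFFFFF)


def read64_split(bar, offset):
    """Emulate a 64-bit read with two 32-bit reads and recombine the halves."""
    w0 = bar.read32(offset)      # lower 32 bits
    w1 = bar.read32(offset + 4)  # upper 32 bits
    return (w1 << 32) | w0


bar = FakeBar(0x1000)
write64_split(bar, 0x590, 0x8000000000400000)
value = read64_split(bar, 0x590)
```

Note that two separate accesses are not atomic, so on real hardware the order of the two halves can matter for registers with side effects.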
This is what libedgetpu logs:
I :273] Starting in normal mode
I :83] Opening /dev/apex_0. read_only=0
I :97] mmap_offset=0x0000000000040000, mmap_size=4096
I :108] Got map addr at 0x0xb6fde000
I :97] mmap_offset=0x0000000000044000, mmap_size=4096
I :108] Got map addr at 0x0xb6fdd000
I :97] mmap_offset=0x0000000000048000, mmap_size=4096
I :108] Got map addr at 0x0xb6fdc000
I :229] Read: offset = 0x00000000000486f0, value: = 0x0000000000000000, w0=0x00000000, w1=0x00000000
I :191] Write: offset = 0x00000000000487a8, value = 0x0000000000000000
I :229] Read: offset = 0x0000000000048578, value: = 0x0000000000000010, w0=0x00000010, w1=0x00000000
I :136] MmuMapper#Map() : 00000000b6627000 -> 0000000001000000 (1 pages) flags=00000000.
I :55] MapMemory() page-aligned : device_address = 0x0000000001000000
I :169] Queue base : 0xb6627000 -> 0x0000000001000000 [4096 bytes]
I :136] MmuMapper#Map() : 00000000b6628000 -> 0000000001001000 (1 pages) flags=00000000.
I :55] MapMemory() page-aligned : device_address = 0x0000000001001000
I :179] Queue status block : 0xb6628000 -> 0x0000000001001000 [16 bytes]
I :191] Write: offset = 0x0000000000048590, value = 0x0000000001000000
I :191] Write: offset = 0x0000000000048598, value = 0x0000000001001000
I :191] Write: offset = 0x00000000000485a0, value = 0x0000000000000100
I :191] Write: offset = 0x0000000000048568, value = 0x0000000000000005
I :229] Read: offset = 0x0000000000048570, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :229] Read: offset = 0x00000000000486d0, value: = 0x0000000000000000, w0=0x00000000, w1=0x00000000
I :191] Write: offset = 0x0000000000044018, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000044158, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000044198, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000441d8, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000044218, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000048788, value = 0x000000000000007f
I :229] Read: offset = 0x0000000000048788, value: = 0x000000000000007f, w0=0x0000007f, w1=0x00000000
I :191] Write: offset = 0x00000000000400c0, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040150, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040110, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040250, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040298, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000402e0, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040328, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040190, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000401d0, value = 0x0000000000000001
I :191] Write: offset = 0x0000000000040210, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000486e8, value = 0x0000000000000000
I :45] Set event fd : event_id:0 -> event_fd:7,
I :45] Set event fd : event_id:4 -> event_fd:11,
I :62] event_fd=7. Monitor thread begin.
I :45] Set event fd : event_id:5 -> event_fd:12,
I :45] Set event fd : event_id:6 -> event_fd:13,
I :62] event_fd=12. Monitor thread begin.
I :62] event_fd=11. Monitor thread begin.
I :45] Set event fd : event_id:7 -> event_fd:14,
I :62] event_fd=13. Monitor thread begin.
I :45] Set event fd : event_id:8 -> event_fd:15,
I :62] event_fd=14. Monitor thread begin.
I :45] Set event fd : event_id:9 -> event_fd:16,
I :45] Set event fd : event_id:10 -> event_fd:17,
I :62] event_fd=15. Monitor thread begin.
I :45] Set event fd : event_id:11 -> event_fd:18,
I :62] event_fd=16. Monitor thread begin.
I :62] event_fd=17. Monitor thread begin.
I :45] Set event fd : event_id:12 -> event_fd:19,
I :62] event_fd=18. Monitor thread begin.
I :191] Write: offset = 0x00000000000486a0, value = 0x000000000000000f
I :191] Write: offset = 0x00000000000485c0, value = 0x0000000000000001
I :191] Write: offset = 0x00000000000486c0, value = 0x0000000000000001
I :172] Opening device at /dev/apex_0
I :62] event_fd=19. Monitor thread begin.
I :75] event_fd=19. Monitor thread got num_events=1.
I :191] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :191] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :229] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :229] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
I :75] event_fd=19. Monitor thread got num_events=1.
I :191] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :191] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :229] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :229] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
I :47] Adding input "map/TensorArrayStack/TensorArrayGatherV3" with 150528 bytes.
I :58] Adding output "prediction" with 965 bytes.
I :167] Request prepared, total batch size: 1, total TPU requests required: 1.
I :310] Request [0]: Submitting P0 request immediately.
I :373] Request [0]: Need to map parameters.
I :136] MmuMapper#Map() : 00000000ad93d000 -> 8000000000000000 (953 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000000000
I :252] Mapped params : Buffer(ptr=0xad93d000) -> 0x8000000000000000, 3900864 bytes.
I :252] Mapped params : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :387] Request [0]: Need to do parameter-caching.
I :80] [0] Request constructed.
I :46] InstructionBuffers created.
I :653] Created new instruction buffers.
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :368] MapDataBuffers() done.
I :187] Linking Parameter: 0x8000000000000000
I :136] MmuMapper#Map() : 0000000001266000 -> 8000000000400000 (3 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000400000
I :223] Mapped "instructions" : Buffer(ptr=0x1266000) -> 0x8000000000400000, 9680 bytes. Direction=1
I :384] MapInstructionBuffers() done.
I :481] [0] SetState old=0, new=1.
I :393] [0] NotifyRequestSubmitted()
I :481] [0] SetState old=1, new=2.
I :83] Request[0]: Submitted
I :401] [0] NotifyRequestActive()
I :481] [0] SetState old=2, new=3.
I :133] Request[0]: Scheduling DMA[0]
I :394] Adding an element to the host queue.
I :191] Write: offset = 0x00000000000485a8, value = 0x0000000000000001
I :80] [1] Request constructed.
I :113] Adding input "map/TensorArrayStack/TensorArrayGatherV3" with 150528 bytes.
I :188] Adding output "prediction" with 965 bytes.
I :46] InstructionBuffers created.
I :653] Created new instruction buffers.
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :136] MmuMapper#Map() : 0000000001226000 -> 8000000000440000 (38 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000440000
I :223] Mapped "map/TensorArrayStack/TensorArrayGatherV3" : Buffer(ptr=0x1226440) -> 0x8000000000440440, 150528 bytes. Direction=1
I :136] MmuMapper#Map() : 0000000001276000 -> 8000000000404000 (1 pages) flags=00000004.
I :55] MapMemory() page-aligned : device_address = 0x8000000000404000
I :223] Mapped "prediction" : Buffer(ptr=0x1276000) -> 0x8000000000404000, 968 bytes. Direction=2
I :368] MapDataBuffers() done.
I :93] Linking map/TensorArrayStack/TensorArrayGatherV3[0]: 0x8000000000440440
I :93] Linking prediction[0]: 0x8000000000404000
I :136] MmuMapper#Map() : 00000000012b9000 -> 8000000000420000 (32 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x8000000000420000
I :223] Mapped "instructions" : Buffer(ptr=0x12b9000) -> 0x8000000000420000, 129536 bytes. Direction=1
I :384] MapInstructionBuffers() done.
I :481] [1] SetState old=0, new=1.
I :393] [1] NotifyRequestSubmitted()
I :481] [1] SetState old=1, new=2.
I :83] Request[1]: Submitted
I :401] [1] NotifyRequestActive()
I :481] [1] SetState old=2, new=3.
I :133] Request[1]: Scheduling DMA[0]
I :394] Adding an element to the host queue.
I :191] Write: offset = 0x00000000000485a8, value = 0x0000000000000002
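The page counts in the `MmuMapper#Map()` lines above can be cross-checked against the buffer sizes. The sketch below assumes 4096-byte pages (consistent with the `mmap_size=4096` lines) and uses a hypothetical helper name, `pages_spanned`; it is only a sanity check on the log, not code from libedgetpu.

```python
PAGE_SIZE = 4096  # assumption, consistent with the 4096-byte mappings in the log


def pages_spanned(host_addr, num_bytes):
    """Number of pages touched by a buffer starting at host_addr."""
    if num_bytes == 0:
        return 0
    first_page = host_addr // PAGE_SIZE
    last_page = (host_addr + num_bytes - 1) // PAGE_SIZE
    return last_page - first_page + 1


# Values taken from the log above
params_pages = pages_spanned(0xAD93D000, 3900864)  # log reports 953 pages
input_pages = pages_spanned(0x01226440, 150528)    # log reports 38 pages
instr_pages = pages_spanned(0x012B9000, 129536)    # log reports 32 pages
```

The input buffer starts 0x440 bytes into its first page, which is why it needs 38 pages rather than the 37 that its size alone would suggest.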
Also the only interrupt firing seems to be the fatal error one:
cat /sys/class/apex/apex_0/interrupt_counts
0x00: 0
0x01: 0
0x02: 0
0x03: 0
0x04: 0
0x05: 0
0x06: 0
0x07: 0
0x08: 0
0x09: 0
0x0a: 0
0x0b: 0
0x0c: 2
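For anyone repeating this check, a small sketch that picks out the interrupts that actually fired from text in the format shown above. The function names are hypothetical; on a real system the text would come from reading `/sys/class/apex/apex_0/interrupt_counts`.

```python
def parse_interrupt_counts(text):
    """Parse lines like '0x0c: 2' into {interrupt_index: count}."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        index, count = line.split(":", 1)
        counts[int(index, 16)] = int(count)
    return counts


def fired(counts):
    """Keep only interrupts with a non-zero count."""
    return {index: n for index, n in counts.items() if n > 0}


sample = "0x00: 0\n0x01: 0\n0x0b: 0\n0x0c: 2\n"
active = fired(parse_interrupt_counts(sample))
```

In the output above, only index 0x0c (12) is non-zero, which lines up with `event_id:12 -> event_fd:19` being the only monitor thread that reports events in the libedgetpu log.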
@markus-k whoa, thanks for sharing that. @mbrooksx for awareness.
@markus-k thank you for sharing. I added othbootargs=gasket.dma_bit_mask=32 to avoid the HIB error, but after running the sample program I still get the following errors. Do you have any ideas? (Raspbian OS is 32-bit; all the code was downloaded from markus-k's repo.) Thank you -Jack
@hiwudery That's weird. Your upper and lower 32 bits are cloned when reading from the device (see the line with I :229), which my patch should fix. Maybe the compiler optimized the two reads into one ldrd? But since that still performs two 32-bit accesses, I don't really understand why that happens.
I just tried setting dma_bit_mask but still get HIB errors, in addition to out-of-memory errors when mapping buffers. Also from dmesg:
[ 971.201472] apex 0000:01:00.0: gasket_perform_mapping i 0
[ 971.201480] apex 0000:01:00.0: gasket_page_table_map done: ha b657c000 daddr 1000000 num 1, flags 0 ret 0
[ 971.201552] apex 0000:01:00.0: gasket_perform_mapping i 0
[ 971.201558] apex 0000:01:00.0: gasket_page_table_map done: ha b657d000 daddr 1001000 num 1, flags 0 ret 0
[ 971.271839] apex 0000:01:00.0: gasket_alloc_extended_subtable -> fail to map page ffffffffffffffff [pfn 6d9fed66 phys 732d8923]
[ 971.271854] apex 0000:01:00.0: no memory for extended addr subtable
[ 971.271861] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[ 971.271868] apex 0000:01:00.0: gasket_page_table_map done: ha ad63c000 daddr 8000000000000000 num 953, flags 2 ret -12
[ 971.271907] apex 0000:01:00.0: gasket_alloc_extended_subtable -> fail to map page ffffffffffffffff [pfn 6d9fed66 phys 732d8923]
[ 971.271915] apex 0000:01:00.0: no memory for extended addr subtable
[ 971.271921] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[ 971.271928] apex 0000:01:00.0: gasket_page_table_map done: ha ad63c000 daddr 8000000000000000 num 953, flags 0 ret -12
I'm also not sure if dma_bit_mask is right here. The comment says it's used for PCIe controllers which can't do 64-bit addressing, but the Raspberry Pi's PCIe controller can do 64-bit addressing, just only with 32-bit wide accesses (as noted by PhilE here).
Yes, what you've done is essentially everything I've done for debug. The only additional change you alluded to is correct: the compiler is too smart for libedgetpu and expects a competent system that would be able to perform 64-bit wide accesses. I fixed this by using volatile variables to skip caching. My repos of progress are: https://github.com/mbrooksx/libedgetpu (Userspace) https://github.com/mbrooksx/pi-cm4-gasket-hacks (Kernel)
Note that I added an additional print: the host-side page address for the failed DMA transaction (it reports 0x100004000000000, which is outside of the Pi RAM). The hope is that dma_bit_mask and the command-line option swiotlb=65536 would create shadow registers in the 32-bit space, but the Pi PCIe restrictions are very challenging. It is likely that the coherent memory (set up in libedgetpu) is corrupted and thus the shared memory between the two is passing invalid information.
The other option that may be easier is the 32-bit kernel. It has issues with allocating enough BAR memory, but with some device tree tweaks this could likely be fixed. This paired with the 32-bit "aware" user-space may be an easier path. I've asked the Pi team to investigate this as well.
@mbrooksx - And for the benefit of anyone who hasn't touched BAR space allocations, here's a guide I wrote on it a few months back testing graphics cards on the CM4: https://gist.github.com/geerlingguy/9d78ea34cab8e18d71ee5954417429df
The latest 5.10.y kernels for Pi OS already increased the default allocation to 1 GB I think (maybe even 4 or 8 GB? I don't remember if I followed up and checked on those commits).
Alright, at least I haven't been looking in the completely wrong place. I've done most of my debugging on a 32-bit kernel so far. The default BAR space seems to be 1GB, I'm not sure if that's enough, but I'm not seeing any BAR allocation errors.
In case this helps anyone, some more debug logs. I've added your additional debug print, on a 32-bit kernel without any additional parameters:
[ 77.630936] apex 0000:01:00.0: Fault VA: 0x0
[ 77.630952] apex 0000:01:00.0: Fault VA: 0x0
[ 77.635926] apex 0000:01:00.0: Fault VA: 0x0
[ 77.635940] apex 0000:01:00.0: Fault VA: 0x0
[ 77.635953] apex 0000:01:00.0: Fault VA: 0x0
[ 77.635966] apex 0000:01:00.0: Fault VA: 0x0
[ 77.635978] apex 0000:01:00.0: Fault VA: 0x0
[ 77.635990] apex 0000:01:00.0: Fault VA: 0x0
[ 77.636002] apex 0000:01:00.0: Fault VA: 0x0
[ 77.636014] apex 0000:01:00.0: Fault VA: 0x0
[ 83.141193] apex 0000:01:00.0: Fault VA: 0x1001000
[ 83.141216] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[ 83.141237] apex 0000:01:00.0: Computed Failing Bus Addr: 0x40c800000
[ 83.141259] apex 0000:01:00.0: Fault VA: 0x1001000
[ 83.141277] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[ 83.141296] apex 0000:01:00.0: Computed Failing Bus Addr: 0x40c800000
[ 83.141320] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
[ 83.141345] apex 0000:01:00.0: Fault VA: 0xffffffff
[ 83.141362] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[ 83.141381] apex 0000:01:00.0: Computed Failing Bus Addr: 0x0
[ 83.141402] apex 0000:01:00.0: Fault VA: 0x0
[ 83.150222] apex 0000:01:00.0: Fault VA: 0x0
[ 83.150243] apex 0000:01:00.0: Fault VA: 0x0
[ 83.150263] apex 0000:01:00.0: Fault VA: 0x0
[ 83.150284] apex 0000:01:00.0: Fault VA: 0x0
[ 83.150309] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
I've also tried using gasket.dma_bit_mask=32 swiotlb=65536 on a 32-bit kernel:
[ 41.372303] apex 0000:01:00.0: Fault VA: 0x0
[ 41.372321] apex 0000:01:00.0: Fault VA: 0x0
[ 41.378062] apex 0000:01:00.0: Fault VA: 0x0
[ 41.378079] apex 0000:01:00.0: Fault VA: 0x0
[ 41.378094] apex 0000:01:00.0: Fault VA: 0x0
[ 41.378109] apex 0000:01:00.0: Fault VA: 0x0
[ 41.378124] apex 0000:01:00.0: Fault VA: 0x0
[ 41.378139] apex 0000:01:00.0: Fault VA: 0x0
[ 41.378153] apex 0000:01:00.0: Fault VA: 0x0
[ 41.378168] apex 0000:01:00.0: Fault VA: 0x0
[ 41.628343] ------------[ cut here ]------------
[ 41.628367] WARNING: CPU: 3 PID: 707 at kernel/dma/swiotlb.c:683 swiotlb_map+0x38c/0x43c
[ 41.628374] apex 0000:01:00.0: swiotlb addr 0x0000000415400000+4096 overflow (mask ffffffff, bus limit 47fffffff).
[ 41.628379] Modules linked in: sha256_generic cfg80211 rfkill 8021q garp stp llc binfmt_misc v3d raspberrypi_hwmon vc4 gpu_sched dwc2 cec roles drm_kms_helper drm bcm2835_isp(C) i2c_bcm2835 bcm2835_codec(C) bcm2835_v4l2(C) drm_panel_orientation_quirks v4l2_mem2mem bcm2835_mmal_vchiq(C) videobuf2_dma_contig videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc apex(C) snd_soc_core vc_sm_cma(C) gasket(C) snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd syscopyarea sysfillrect sysimgblt fb_sys_fops backlight rpivid_mem uio_pdrv_genirq uio i2c_dev ip_tables x_tables ipv6
[ 41.628599] CPU: 3 PID: 707 Comm: python3 Tainted: G C 5.10.6-v7l+ #6
[ 41.628602] Hardware name: BCM2711
[ 41.628605] Backtrace:
[ 41.628617] [<c0b84b94>] (dump_backtrace) from [<c0b84f24>] (show_stack+0x20/0x24)
[ 41.628621] r7:ffffffff r6:00000000 r5:60000013 r4:c12e6c98
[ 41.628626] [<c0b84f04>] (show_stack) from [<c0b892bc>] (dump_stack+0xcc/0xf8)
[ 41.628632] [<c0b891f0>] (dump_stack) from [<c02216d4>] (__warn+0xfc/0x114)
[ 41.628637] r10:00001000 r9:00000009 r8:c02a5a50 r7:000002ab r6:00000009 r5:c02a5a50
[ 41.628640] r4:c0e3cd00 r3:c1205094
[ 41.628645] [<c02215d8>] (__warn) from [<c0b856c8>] (warn_slowpath_fmt+0xa4/0xd8)
[ 41.628648] r7:000002ab r6:c0e3cd00 r5:c1205048 r4:c0e3ccbc
[ 41.628654] [<c0b85628>] (warn_slowpath_fmt) from [<c02a5a50>] (swiotlb_map+0x38c/0x43c)
[ 41.628658] r9:c1b8b070 r8:c1205048 r7:00000000 r6:ffffffff r5:00000000 r4:ffffffff
[ 41.628664] [<c02a56c4>] (swiotlb_map) from [<c02a0668>] (dma_map_page_attrs+0x254/0x394)
[ 41.628668] r10:00000001 r9:00001000 r8:c1b8b1e0 r7:00000000 r6:ffffffff r5:c1205048
[ 41.628671] r4:c1b8b070
[ 41.628690] [<c02a0414>] (dma_map_page_attrs) from [<bf115184>] (gasket_map_extended_pages+0x100/0x45c [gasket])
[ 41.628694] r10:00000000 r9:c4112000 r8:c32ab700 r7:f09dc000 r6:00000200 r5:000003b9
[ 41.628697] r4:f085d018
[ 41.628717] [<bf115084>] (gasket_map_extended_pages [gasket]) from [<bf115900>] (gasket_page_table_map+0xa8/0x100 [gasket])
[ 41.628721] r10:c32ab740 r9:ad63c000 r8:00000000 r7:80000000 r6:c2f97c00 r5:c32ab700
[ 41.628724] r4:000003b9
[ 41.628741] [<bf115858>] (gasket_page_table_map [gasket]) from [<bf112a9c>] (gasket_map_buffers_common+0x90/0xa8 [gasket])
[ 41.628745] r10:00000005 r9:00000001 r8:c30e1180 r7:4028dc0c r6:c2f97c00 r5:c2f97c00
[ 41.628748] r4:c32a5d90
[ 41.628767] [<bf112a0c>] (gasket_map_buffers_common [gasket]) from [<bf112cac>] (gasket_handle_ioctl+0x1f8/0x8e0 [gasket])
[ 41.628770] r5:beb40fa0 r4:c1205048
[ 41.628788] [<bf112ab4>] (gasket_handle_ioctl [gasket]) from [<bf1106f8>] (gasket_ioctl+0x9c/0x118 [gasket])
[ 41.628792] r9:beb40fa0 r8:c2f97c00 r7:bf09a1b0 r6:4028dc0c r5:c30e1180 r4:c1205048
[ 41.628805] [<bf11065c>] (gasket_ioctl [gasket]) from [<c0451180>] (sys_ioctl+0x1d4/0x8ec)
[ 41.628809] r9:c32a4000 r8:00000000 r7:c30e1180 r6:c30e1181 r5:c1205048 r4:4028dc0c
[ 41.628815] [<c0450fac>] (sys_ioctl) from [<c0200040>] (ret_fast_syscall+0x0/0x28)
[ 41.628818] Exception stack(0xc32a5fa8 to 0xc32a5ff0)
[ 41.628822] 5fa0: beb40f9c 00000000 00000005 4028dc0c beb40fa0 00000005
[ 41.628826] 5fc0: beb40f9c 00000000 b454da7c 00000036 00000001 01f0349c 00000000 b48a4bbc
[ 41.628829] 5fe0: b454db58 beb40f74 b443ba3f b6cd551c
[ 41.628833] r10:00000036 r9:c32a4000 r8:c0200204 r7:00000036 r6:b454da7c r5:00000000
[ 41.628836] r4:beb40f9c
[ 41.628840] ---[ end trace a2d67e6b70f87dd2 ]---
[ 41.628855] apex 0000:01:00.0: no memory for extended addr subtable
[ 41.628861] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[ 41.628911] apex 0000:01:00.0: no memory for extended addr subtable
[ 41.628917] apex 0000:01:00.0: page table slots (0,0) (@ 0x8000000000000000) to (8191,511) are not available
[ 41.646322] apex 0000:01:00.0: Fault VA: 0x1001000
[ 41.646330] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[ 41.646338] apex 0000:01:00.0: Computed Failing Bus Addr: 0xc800000
[ 41.646347] apex 0000:01:00.0: Fault VA: 0x1001000
[ 41.646352] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x8, Simple: 0x1001
[ 41.646359] apex 0000:01:00.0: Computed Failing Bus Addr: 0xc800000
[ 41.646372] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
[ 41.646384] apex 0000:01:00.0: Fault VA: 0xffffffff
[ 41.646389] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[ 41.646396] apex 0000:01:00.0: Computed Failing Bus Addr: 0xdeadbeef
[ 41.646405] apex 0000:01:00.0: Fault VA: 0x0
[ 41.648266] apex 0000:01:00.0: Fault VA: 0x0
[ 41.648275] apex 0000:01:00.0: Fault VA: 0x0
[ 41.648283] apex 0000:01:00.0: Fault VA: 0x0
[ 41.648292] apex 0000:01:00.0: Fault VA: 0x0
[ 41.648305] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
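The swiotlb warning in the log above can be read as a simple range check. The sketch below is a simplified model of that check, not the kernel's implementation: the function name `dma_reachable` is hypothetical, and the numbers are copied from the warning line (`addr 0x0000000415400000+4096`, `mask ffffffff`, `bus limit 47fffffff`).

```python
def dma_reachable(bus_addr, size, dma_mask, bus_limit):
    """True if the whole buffer lies at or below both the device mask and the bus limit."""
    last_byte = bus_addr + size - 1
    return last_byte <= min(dma_mask, bus_limit)


BUS_LIMIT = 0x47FFFFFFF

# With gasket.dma_bit_mask=32 the device mask is 0xffffffff
with_32bit_mask = dma_reachable(0x415400000, 4096, 0xFFFFFFFF, BUS_LIMIT)

# Same address against a full 64-bit mask, for comparison
with_64bit_mask = dma_reachable(0x415400000, 4096, (1 << 64) - 1, BUS_LIMIT)
```

Under this reading, the bounce buffer itself sits above 4 GB on the bus (0x415400000), so it is within the bus limit but cannot satisfy a 32-bit mask, and the mapping fails.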
In this case mapping the buffer fails in libedgetpu:
I :192] Write: offset = 0x00000000000486a0, value = 0x000000000000000f
I :62] event_fd=19. Monitor thread begin.
I :192] Write: offset = 0x00000000000485c0, value = 0x0000000000000001
I :192] Write: offset = 0x00000000000486c0, value = 0x0000000000000001
I :75] event_fd=19. Monitor thread got num_events=1.
I :192] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :192] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :231] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :172] Opening device at /dev/apex_0
I :231] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
I :75] event_fd=19. Monitor thread got num_events=1.
I :192] Write: offset = 0x00000000000486c0, value = 0x0000000000000000
I :192] Write: offset = 0x00000000000486c8, value = 0x0000000000000000
I :231] Read: offset = 0x00000000000486f0, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
I :231] Read: offset = 0x0000000000048700, value: = 0x0000000000000001, w0=0x00000001, w1=0x00000000
E :254] HIB Error. hib_error_status = 0000000000000001, hib_first_error_status = 0000000000000001
----INFERENCE TIME----
Note: The first inference on Edge TPU is slow because it includes loading the model into Edge TPU memory.
I :47] Adding input "map/TensorArrayStack/TensorArrayGatherV3" with 150528 bytes.
I :58] Adding output "prediction" with 965 bytes.
I :167] Request prepared, total batch size: 1, total TPU requests required: 1.
I :310] Request [0]: Submitting P0 request immediately.
I :373] Request [0]: Need to map parameters.
I :118] Failed to map buffer with flags, error -1
Traceback (most recent call last):
File "classify_image.py", line 126, in <module>
main()
File "classify_image.py", line 115, in main
interpreter.invoke()
File "/home/pi/venv/lib/python3.7/site-packages/tflite_runtime/interpreter.py", line 540, in invoke
self._interpreter.Invoke()
RuntimeError: Failed to execute request. Could not map pages : 5 (Cannot allocate memory)Node number 1 (EdgeTpuDelegateForCustomOp) failed to invoke.
I :226] Releasing Edge TPU device at /dev/apex_0
I :178] Closing Edge TPU device at /dev/apex_0
@markus-k in gasket_page_table.c, the page table uses a 64-bit format, not a 32-bit format. I think gasket_page_table also needs to be modified for the 32-bit kernel.
- Address format:
- Simple addresses - those whose containing pages are directly placed in the
- device's address translation registers - are laid out as:
- [ 63 - 25: 0 | 24 - 12: page index | 11 - 0: page offset ]
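To make the quoted layout concrete, here is a small decoding sketch for simple addresses, using only the bit ranges given in the comment above (bits 24-12: page index, bits 11-0: page offset). The function names are hypothetical and this is not code from the gasket driver.

```python
PAGE_SHIFT = 12          # bits 11..0 are the page offset
PAGE_INDEX_MASK = 0x1FFF  # 13 bits: bits 24..12
PAGE_OFFSET_MASK = 0xFFF


def simple_fields(addr):
    """Return (page_index, page_offset), masking off everything above bit 24."""
    return (addr >> PAGE_SHIFT) & PAGE_INDEX_MASK, addr & PAGE_OFFSET_MASK


def fits_simple_layout(addr):
    """True if bits 63..25 are all zero, as the simple layout requires."""
    return (addr >> 25) == 0


index, offset = simple_fields(0x1001000)
```

Decoding the `Fault VA: 0x1001000` seen in the dmesg logs earlier in this thread gives page index 0x1001, matching the `Simple: 0x1001` value printed next to it.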
I also wanted to note something here that may be of interest: I noticed earlier someone mentioned writeq being present on 64-bit OSes. I'll soon be testing the Coral TPU (M.2 A+E key version) on a Pi so haven't yet had first-hand experience, but with a different driver I was taking a look at, it seems that one problem may be that writeq is not supported on Pi OS / the Pi's PCI-E bus like it may be on some other 64-bit systems.
Edit: New bug reported relating to that driver issue is here: https://github.com/raspberrypi/linux/issues/4158
On 64-bit Pi OS (with latest kernel compiled at 5.10.14-v8+), I get the following kernel panic after running through the default steps in the setup guide:
(Cross-linking to https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/44#issuecomment-780912830)
You should probably read the rest of this issue, there hasn't been any development since my last comment to my knowledge. The default gasket module won't work at all, my fixed one at least loads and can read temperature, but something is still wrong with the DMA, so it won't work either. Then there's probably still a few other things broken in the user space driver as well.
I don't have the time to dig into this right now, and my knowledge with kernel dev is limited anyway. So best we can do is hope someone with deep understanding of how the DMA and TPU works can find some time and look into it.
@mbrooksx sounded like Google was working on it? Maybe he could update us. I still have very big interest in this for my product but don't have the resources or know-how to dig into this.
If someone at Google is working on it, or is going to, it would be nice to get a very rough ETA (weeks, months) on when we can expect to know whether or not the TPU will ever work over PCIe on a CM4. I'll be creating a new revision of my product's PCB in a few weeks, and if there's very little chance the PCIe TPU will work anytime soon, I'll have to switch both to USB.
Yea similar situation for me.
I unfortunately don't have an estimated date. The CM4 PCIe hardware is antiquated, and there are endless hacks required to try to have it operate competently (note that the TPU is a PCIe bus master, and I don't see any evidence of a bus master ever being tested with the CM4). We haven't been receiving the support needed from the Pi team, so for now it's continuing to try things to understand the issues with communication (at this point it seems an issue with the shared memory). It may be within the next few weeks for operation (in which case I would post the hacked up version for your evaluation while we decide the best way to release this without polluting the main Coral codebase). I will keep this thread up to date.
Depending on the board configuration, USB may be a better choice.
My latest theory isn't encouraging (note that this would be really easy to solve in a non-COVID world, where we would just plug this into a PCIe bus analyzer and see what data the CM4 is malforming):
When you run a model through the compiler it assigns virtual memory locations for the various operations, scratch memory, weights, etc. There are two mappings these addresses use to map to physical pages, what the driver calls simple and extended. The issue is that the way to differentiate simple and extended is the 63rd bit of the virtual address. So when the shared coherent memory between the CPU and TPU has been established - the TPU reads in this region to get the address of information it needs (in this case it's the location of the instruction queue). But because of the CM4's crippled PCIe bus, it is reading only 32bits of the virtual address - which means it interprets every read as a simple read.
The problem then is that it will attempt to mmap this to the system and will get wrong data (since the correct mapping was via the extended approach). Because the TPU does these reads (including checking that bit) in hardware, we have no way to change which bit indicates extended mapping. If this is indeed the primary source of failure, it would require a hacked-up version of the compiler that assigns everything into simple mapping, which would cripple the maximum size of the model, parameters, etc. that is allowed.
I'll explore that option if we can verify this is indeed the cause.
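The hypothesis in this comment can be illustrated with a few lines of arithmetic. This is only a model of the theory described above, not confirmed hardware behaviour: it assumes the extended flag is bit 63 (as in the `0x8000000000400000` device addresses from the earlier libedgetpu log) and that only the low 32 bits of the address survive the read.

```python
EXTENDED_FLAG = 1 << 63  # bit that distinguishes extended from simple addresses


def is_extended(addr):
    """True if the address has the extended-mapping flag set."""
    return bool(addr & EXTENDED_FLAG)


def truncate_to_32(addr):
    """Model a read that only delivers the low 32 bits (assumption)."""
    return addr & 0xFFFFFFFF


device_addr = 0x8000000000400000  # an "instructions" mapping from the earlier log
seen_by_tpu = truncate_to_32(device_addr)
```

Here `is_extended(device_addr)` is true but `is_extended(seen_by_tpu)` is false, so under this model an extended address would be looked up through the simple mapping and resolve to the wrong page.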
(Thank you everyone for working on this issue!) I have a new setup (Custom CM4 carrier with M.2 PCIe-EdgeTPU) and would love to help get this integration working. Are the following repos still the latest progress in userspace/kernel?
Yes, what you've done is essentially everything I've done for debug. The only additional change you alluded to is correct: the compiler is too smart for libedgetpu and expects a competent system that would be able to perform 64-bit wide accesses. I fixed this by using volatile variables to skip caching. My repos of progress are: https://github.com/mbrooksx/libedgetpu (Userspace) https://github.com/mbrooksx/pi-cm4-gasket-hacks (Kernel)
It would be so sad if it were never possible to use the Coral boards via PCIe on the CM4. The combo is the perfect high performance - low power - compact form factor - multi camera - mainline kernel supported - embedded inference platform. Please, please find a way to make it usable.
I completely agree about the potential of the combination. At this point, it looks like an irreparable hardware issue with the antiquated CM4 PCIe module. I have forced all the allocations into simple mapping (see above for more info about this) so that all the virtual addresses are 32-bit, as well as previously setting all reads/writes to 32-bit. However, the device itself (in hardware) makes reads/writes in the coherent cache, and all of these reads/writes are 64 bits wide.
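For anyone following along, the "set all reads/writes to 32-bit" part can be sketched as below. This is only an illustration in Python, not the code from the repos: `FakeBar` is a hypothetical stand-in for the mmap'd register region, and the sketch assumes a little-endian layout with the upper word at `offset + 4`.

```python
import struct


class FakeBar:
    """Hypothetical stand-in for an mmap'd register region (little-endian)."""

    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)

    def read32(self, offset: int) -> int:
        return struct.unpack_from("<I", self.mem, offset)[0]

    def write32(self, offset: int, value: int) -> None:
        struct.pack_into("<I", self.mem, offset, value & 0xFFFF_FFFF)


def read64_as_two_32(bar: FakeBar, offset: int) -> int:
    """Read a 64-bit register as two 32-bit accesses and recombine them."""
    lower = bar.read32(offset)
    upper = bar.read32(offset + 4)
    return (upper << 32) | lower


def write64_as_two_32(bar: FakeBar, offset: int, value: int) -> None:
    """Write a 64-bit register as two 32-bit accesses (low word first)."""
    bar.write32(offset, value & 0xFFFF_FFFF)
    bar.write32(offset + 4, value >> 32)


bar = FakeBar(0x100)
write64_as_two_32(bar, 0x90, 0x0000_0001_0100_0000)
print(hex(read64_as_two_32(bar, 0x90)))
```

Note that this only changes how the host CPU touches the registers. It does nothing for the accesses the device itself initiates, which is the remaining problem.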
For now, the plan is to wait until the office is open so we can use a PCIe analyzer and confirm this hypothesis. But there don't appear to be any additional changes that we can make in SW - the device's expectation that the host can perform 64-bit reads/writes is built into the hardware.
USB is still the recommendation for the CM4. USB 2.0 works out of the box, and USB 3.0 may be possible, although extra design considerations are required (more info here: https://coral.ai/products/accelerator-module/).
Choosing to believe this is still possible... here are my current DMESG and libedgetpu logs: (Kernel: 5.10.23-v8+ (aarch64) with gasket/apex modules and libedgetpu from mbrooksx's repos, custom Buildroot rootfs)
DMESG
[ 1876.006541] apex 0000:01:00.0: Fault VA: 0xffffffff
[ 1876.012884] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[ 1876.024280] apex 0000:01:00.0: Computed Failing Bus Addr: 0x0
[ 1876.031596] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.042358] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.048153] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.053923] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.059681] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.065456] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.071141] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.076769] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.082370] apex 0000:01:00.0: Fault VA: 0x0
[ 1876.089568] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f89c74000, dev_addr 0x1000000, num_pages 1
[ 1876.100752] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f89c75000, dev_addr 0x1001000, num_pages 1
[ 1876.160486] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f5f969000, dev_addr 0x0, num_pages 1603
[ 1876.171885] apex 0000:01:00.0: Map Simple Pages: host_addr 0xd9c3000, dev_addr 0x1004000, num_pages 3
[ 1876.185214] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f88350000, dev_addr 0x1080000, num_pages 66
[ 1876.196648] apex 0000:01:00.0: Map Simple Pages: host_addr 0xd9c7000, dev_addr 0x1002000, num_pages 2
[ 1876.208103] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f88272000, dev_addr 0x1040000, num_pages 44
[ 1876.219712] apex 0000:01:00.0: Map Simple Pages: host_addr 0xd9ca000, dev_addr 0x1008000, num_pages 2
[ 1876.230804] apex 0000:01:00.0: Map Simple Pages: host_addr 0x7f88231000, dev_addr 0x1100000, num_pages 63
(here the test program hangs until ctrl-c)
[ 1904.820076] apex 0000:01:00.0: Fault VA: 0xbe96c8
[ 1904.826533] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x5, Simple: 0xbe9
[ 1904.837859] apex 0000:01:00.0: Computed Failing Bus Addr: 0x100004000000000
[ 1904.846581] apex 0000:01:00.0: Fault VA: 0xbe96c8
[ 1904.853128] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x5, Simple: 0xbe9
[ 1904.864475] apex 0000:01:00.0: Computed Failing Bus Addr: 0x100004000000000
[ 1904.873204] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
[ 1904.880539] apex 0000:01:00.0: Fault VA: 0xffffffff
[ 1904.887108] apex 0000:01:00.0: Failing in first (simple) read access. Extended_level0: 0x7ff, Simple: 0x1fff
[ 1904.898652] apex 0000:01:00.0: Computed Failing Bus Addr: 0x0
[ 1904.906057] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.921784] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.927701] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.933515] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.939298] apex 0000:01:00.0: Fault VA: 0x0
[ 1904.945065] apex 0000:01:00.0: Fault VA: 0xffffffffffffffff
libedgetpu (verbosity=10)
I :944] EnumerateDevices: vendor:0x1a6e, product:0x89a
I :944] EnumerateDevices: vendor:0x18d1, product:0x9302
Test_EdgeTPU[412]: (main:70): Num EdgeTPU Devices: 1
I :453] No matching device is already opened for shared ownership.
I :944] EnumerateDevices: vendor:0x1a6e, product:0x89a
I :944] EnumerateDevices: vendor:0x18d1, product:0x9302
I :104] USB always DFU: False (default)
I :126] USB bulk-in queue capacity: default
I :65] Performance expectation: Max (default)
I :273] Hello Adam!
I :274] Starting in FUCK YEAH mode
I :83] Opening /dev/apex_0. read_only=0
I :97] mmap_offset=0x0000000000040000, mmap_size=4096
I :108] Got map addr at 0x0x7f904db000
I :97] mmap_offset=0x0000000000044000, mmap_size=4096
I :108] Got map addr at 0x0x7f89c79000
I :97] mmap_offset=0x0000000000048000, mmap_size=4096
I :108] Got map addr at 0x0x7f89c78000
I :240] Offset: 0x00000000000486f0, mmap_reg: 0x7f89c786f0, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000000, value:0x0000000000000000
I :269] Read 32 Hacks: offset = 0x00000000000486f0, lower: = 0x0000000000000000 upper: = 0x0000000000000000 value: = 0x0000000000000000 mmap: 0x7f89c786f0
I :282] Page Fault Address: 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x00000000000487a8, value = 0x0000000000000000 mmap=0x7f89c787a8
I :206] ReRead 32 Hacks: offset = 0x00000000000487a8, value: = 0x0000000000000000
I :240] Offset: 0x0000000000048578, mmap_reg: 0x7f89c78578, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000010, value:0x0000000000000010
I :269] Read 32 Hacks: offset = 0x0000000000048578, lower: = 0x0000000000000010 upper: = 0x0000000000000000 value: = 0x0000000000000010 mmap: 0x7f89c78578
I :282] Page Fault Address: 0x0000000000000000
I :136] MmuMapper#Map() : 0000007f89c74000 -> 0000000001000000 (1 pages) flags=00000000.
I :55] MapMemory() page-aligned : device_address = 0x0000000001000000
I :169] Queue base : 0x7f89c74000 -> 0x0000000001000000 [4096 bytes]
I :136] MmuMapper#Map() : 0000007f89c75000 -> 0000000001001000 (1 pages) flags=00000000.
I :55] MapMemory() page-aligned : device_address = 0x0000000001001000
I :179] Queue status block : 0x7f89c75000 -> 0x0000000001001000 [16 bytes]
I :195] Write 32 Hacks: offset = 0x0000000000048590, value = 0x0000000001000000 mmap=0x7f89c78590
I :206] ReRead 32 Hacks: offset = 0x0000000000048590, value: = 0x0000000001000000
I :195] Write 32 Hacks: offset = 0x0000000000048598, value = 0x0000000001001000 mmap=0x7f89c78598
I :206] ReRead 32 Hacks: offset = 0x0000000000048598, value: = 0x0000000001001000
I :195] Write 32 Hacks: offset = 0x00000000000485a0, value = 0x0000000000000100 mmap=0x7f89c785a0
I :206] ReRead 32 Hacks: offset = 0x00000000000485a0, value: = 0x0000000000000100
I :195] Write 32 Hacks: offset = 0x0000000000048568, value = 0x0000000000000005 mmap=0x7f89c78568
I :206] ReRead 32 Hacks: offset = 0x0000000000048568, value: = 0x0000000000000005
I :240] Offset: 0x0000000000048570, mmap_reg: 0x7f89c78570, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000001, value:0x0000000000000001
I :269] Read 32 Hacks: offset = 0x0000000000048570, lower: = 0x0000000000000001 upper: = 0x0000000000000000 value: = 0x0000000000000001 mmap: 0x7f89c78570
I :282] Page Fault Address: 0x0000000000000000
I :240] Offset: 0x00000000000486d0, mmap_reg: 0x7f89c786d0, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000000, value:0x0000000000000000
I :269] Read 32 Hacks: offset = 0x00000000000486d0, lower: = 0x0000000000000000 upper: = 0x0000000000000000 value: = 0x0000000000000000 mmap: 0x7f89c786d0
I :282] Page Fault Address: 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000044018, value = 0x0000000000000001 mmap=0x7f89c79018
I :206] ReRead 32 Hacks: offset = 0x0000000000044018, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000044158, value = 0x0000000000000001 mmap=0x7f89c79158
I :206] ReRead 32 Hacks: offset = 0x0000000000044158, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000044198, value = 0x0000000000000001 mmap=0x7f89c79198
I :206] ReRead 32 Hacks: offset = 0x0000000000044198, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x00000000000441d8, value = 0x0000000000000001 mmap=0x7f89c791d8
I :206] ReRead 32 Hacks: offset = 0x00000000000441d8, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000044218, value = 0x0000000000000001 mmap=0x7f89c79218
I :206] ReRead 32 Hacks: offset = 0x0000000000044218, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000048788, value = 0x000000000000007f mmap=0x7f89c78788
I :206] ReRead 32 Hacks: offset = 0x0000000000048788, value: = 0x000000000000007f
I :240] Offset: 0x0000000000048788, mmap_reg: 0x7f89c78788, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x000000000000007f, value:0x000000000000007f
I :269] Read 32 Hacks: offset = 0x0000000000048788, lower: = 0x000000000000007f upper: = 0x0000000000000000 value: = 0x000000000000007f mmap: 0x7f89c78788
I :282] Page Fault Address: 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x00000000000400c0, value = 0x0000000000000001 mmap=0x7f904db0c0
I :206] ReRead 32 Hacks: offset = 0x00000000000400c0, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000040150, value = 0x0000000000000001 mmap=0x7f904db150
I :206] ReRead 32 Hacks: offset = 0x0000000000040150, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000040110, value = 0x0000000000000001 mmap=0x7f904db110
I :206] ReRead 32 Hacks: offset = 0x0000000000040110, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000040250, value = 0x0000000000000001 mmap=0x7f904db250
I :206] ReRead 32 Hacks: offset = 0x0000000000040250, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000040298, value = 0x0000000000000001 mmap=0x7f904db298
I :206] ReRead 32 Hacks: offset = 0x0000000000040298, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x00000000000402e0, value = 0x0000000000000001 mmap=0x7f904db2e0
I :206] ReRead 32 Hacks: offset = 0x00000000000402e0, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000040328, value = 0x0000000000000001 mmap=0x7f904db328
I :206] ReRead 32 Hacks: offset = 0x0000000000040328, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000040190, value = 0x0000000000000001 mmap=0x7f904db190
I :206] ReRead 32 Hacks: offset = 0x0000000000040190, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x00000000000401d0, value = 0x0000000000000001 mmap=0x7f904db1d0
I :206] ReRead 32 Hacks: offset = 0x00000000000401d0, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x0000000000040210, value = 0x0000000000000001 mmap=0x7f904db210
I :206] ReRead 32 Hacks: offset = 0x0000000000040210, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x00000000000486e8, value = 0x0000000000000000 mmap=0x7f89c786e8
I :206] ReRead 32 Hacks: offset = 0x00000000000486e8, value: = 0x0000000000000000
I :45] Set event fd : event_id:0 -> event_fd:8,
I :45] Set event fd : event_id:4 -> event_fd:12,
I :62] event_fd=8. Monitor thread begin.
I :45] Set event fd : event_id:5 -> event_fd:13,
I :62] event_fd=12. Monitor thread begin.
I :45] Set event fd : event_id:6 -> event_fd:14,
I :62] event_fd=13. Monitor thread begin.
I :45] Set event fd : event_id:7 -> event_fd:15,
I :62] event_fd=14. Monitor thread begin.
I :45] Set event fd : event_id:8 -> event_fd:16,
I :62] event_fd=15. Monitor thread begin.
I :45] Set event fd : event_id:9 -> event_fd:17,
I :62] event_fd=16. Monitor thread begin.
I :45] Set event fd : event_id:10 -> event_fd:18,
I :62] event_fd=17. Monitor thread begin.
I :45] Set event fd : event_id:11 -> event_fd:19,
I :62] event_fd=18. Monitor thread begin.
I :45] Set event fd : event_id:12 -> event_fd:20,
I :62] event_fd=19. Monitor thread begin.
I :195] Write 32 Hacks: offset = 0x00000000000486a0, value = 0x000000000000000f mmap=0x7f89c786a0
I :206] ReRead 32 Hacks: offset = 0x00000000000486a0, value: = 0x000000000000000f
I :195] Write 32 Hacks: offset = 0x00000000000485c0, value = 0x0000000000000001 mmap=0x7f89c785c0
I :206] ReRead 32 Hacks: offset = 0x00000000000485c0, value: = 0x0000000000000001
I :195] Write 32 Hacks: offset = 0x00000000000486c0, value = 0x0000000000000001 mmap=0x7f89c786c0
I :206] ReRead 32 Hacks: offset = 0x00000000000486c0, value: = 0x0000000000000001
I :62] event_fd=20. Monitor thread begin.
I :172] Opening device at /dev/apex_0
Test_EdgeTPU[412]: (main:75): EdgeTPU - path:type (0=PCIe, 1=USB): /dev/apex_0:0
Test_EdgeTPU[412]: (main:80): Loading Model: /home/kampff/Voight-Kampff/objects_edgetpu.tflite
Test_EdgeTPU[412]: (main:82): Model Created
Test_EdgeTPU[412]: (main:89): Options configured: maybe
Test_EdgeTPU[412]: (main:94): Interpreter Created
Test_EdgeTPU[412]: (main:98): Tensors Allocated
Test_EdgeTPU[412]: (main:120): NPU inputs: 1 vs 1
Test_EdgeTPU[412]: (main:127): - Input 0 (normalized_input_image_tensor): Dimensionsw: 4
Test_EdgeTPU[412]: (main:132): - Dimension 0: (size: 1)
Test_EdgeTPU[412]: (main:132): - Dimension 1: (size: 300)
Test_EdgeTPU[412]: (main:132): - Dimension 2: (size: 300)
Test_EdgeTPU[412]: (main:132): - Dimension 3: (size: 3)
Test_EdgeTPU[412]: (main:138): NPU outputs: 4 vs 4
Test_EdgeTPU[412]: (main:145): - Ouput 0 (TFLite_Detection_PostProcess): Dimensions: 3
Test_EdgeTPU[412]: (main:150): - Dimension 0: 1)
Test_EdgeTPU[412]: (main:150): - Dimension 1: 20)
Test_EdgeTPU[412]: (main:150): - Dimension 2: 4)
Test_EdgeTPU[412]: (main:145): - Ouput 1 (TFLite_Detection_PostProcess:1): Dimensions: 2
Test_EdgeTPU[412]: (main:150): - Dimension 0: 1)
Test_EdgeTPU[412]: (main:150): - Dimension 1: 20)
Test_EdgeTPU[412]: (main:145): - Ouput 2 (TFLite_Detection_PostProcess:2): Dimensions: 2
Test_EdgeTPU[412]: (main:150): - Dimension 0: 1)
Test_EdgeTPU[412]: (main:150): - Dimension 1: 20)
Test_EdgeTPU[412]: (main:145): - Ouput 3 (TFLite_Detection_PostProcess:3): Dimensions: 1
Test_EdgeTPU[412]: (main:150): - Dimension 0: 1)
Test_EdgeTPU[412]: (main:167): Test Image Loaded
Test_EdgeTPU[412]: (main:185): Labels Loaded
Test_EdgeTPU[412]: (main:209): Inputs Configured
I :47] Adding input "normalized_input_image_tensor" with 270000 bytes.
I :58] Adding output "Squeeze" with 7668 bytes.
I :58] Adding output "convert_scores" with 174447 bytes.
I :167] Request prepared, total batch size: 1, total TPU requests required: 1.
I :310] Request [0]: Submitting P0 request immediately.
I :373] Request [0]: Need to map parameters.
I :136] MmuMapper#Map() : 0000007f5f969000 -> 0000000000000000 (1603 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x0000000000000000
I :252] Mapped params : Buffer(ptr=0x7f5f969000) -> 0x0000000000000000, 6564224 bytes.
I :252] Mapped params : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :387] Request [0]: Need to do parameter-caching.
I :80] [0] Request constructed.
I :46] InstructionBuffers created.
I :653] Created new instruction buffers.
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :368] MapDataBuffers() done.
I :187] Linking Parameter: 0x0000000000000000
I :136] MmuMapper#Map() : 000000000d9c3000 -> 0000000001004000 (3 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x0000000001004000
I :223] Mapped "instructions" : Buffer(ptr=0xd9c3000) -> 0x0000000001004000, 11472 bytes. Direction=1
I :384] MapInstructionBuffers() done.
I :481] [0] SetState old=0, new=1.
I :393] [0] NotifyRequestSubmitted()
I :481] [0] SetState old=1, new=2.
I :83] Request[0]: Submitted
I :401] [0] NotifyRequestActive()
I :481] [0] SetState old=2, new=3.
I :133] Request[0]: Scheduling DMA[0]
I :393] Adding an element to the host queue.
I :195] Write 32 Hacks: offset = 0x00000000000485a8, value = 0x0000000000000001 mmap=0x7f89c785a8
I :206] ReRead 32 Hacks: offset = 0x00000000000485a8, value: = 0x0000000000000001
I :75] event_fd=20. Monitor thread got num_events=1.
I :80] [1] Request constructed.
I :195] Write 32 Hacks: offset = 0x00000000000486c0, value = 0x0000000000000000 mmap=0x7f89c786c0
I :113] Adding input "normalized_input_image_tensor" with 270000 bytes.
I :206] ReRead 32 Hacks: offset = 0x00000000000486c0, value: = 0x0000000000000000
I :188] Adding output "Squeeze" with 7668 bytes.
I :195] Write 32 Hacks: offset = 0x00000000000486c8, value = 0x0000000000000000 mmap=0x7f89c786c8
I :188] Adding output "convert_scores" with 174447 bytes.
I :206] ReRead 32 Hacks: offset = 0x00000000000486c8, value: = 0x0000000000000001
I :240] Offset: 0x00000000000486f0, mmap_reg: 0x7f89c786f0, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000211, value:0x0000000000000211
I :269] Read 32 Hacks: offset = 0x00000000000486f0, lower: = 0x0000000000000211 upper: = 0x0000000000000000 value: = 0x0000000000000211 mmap: 0x7f89c786f0
I :282] Page Fault Address: 0x0000000000be96c8
I :240] Offset: 0x0000000000048700, mmap_reg: 0x7f89c78700, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000010, value:0x0000000000000010
I :269] Read 32 Hacks: offset = 0x0000000000048700, lower: = 0x0000000000000010 upper: = 0x0000000000000000 value: = 0x0000000000000010 mmap: 0x7f89c78700
I :282] Page Fault Address: 0x0000000000be96c8
I :240] Offset: 0x0000000000048700, mmap_reg: 0x7f89c78700, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000010, value:0x0000000000000010
I :269] Read 32 Hacks: offset = 0x0000000000048700, lower: = 0x0000000000000010 upper: = 0x0000000000000000 value: = 0x0000000000000010 mmap: 0x7f89c78700
I :282] Page Fault Address: 0x0000000000be96c8
E :254] HIB Error. hib_error_status = 0000000000000211, hib_first_error_status = 0000000000000010
I :75] event_fd=20. Monitor thread got num_events=1.
I :195] Write 32 Hacks: offset = 0x00000000000486c0, value = 0x0000000000000000 mmap=0x7f89c786c0
I :206] ReRead 32 Hacks: offset = 0x00000000000486c0, value: = 0x0000000000000000
I :195] Write 32 Hacks: offset = 0x00000000000486c8, value = 0x0000000000000000 mmap=0x7f89c786c8
I :206] ReRead 32 Hacks: offset = 0x00000000000486c8, value: = 0x0000000000000000
I :240] Offset: 0x00000000000486f0, mmap_reg: 0x7f89c786f0, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000211, value:0x0000000000000211
I :269] Read 32 Hacks: offset = 0x00000000000486f0, lower: = 0x0000000000000211 upper: = 0x0000000000000000 value: = 0x0000000000000211 mmap: 0x7f89c786f0
I :282] Page Fault Address: 0x0000000000be96c8
I :240] Offset: 0x0000000000048700, mmap_reg: 0x7f89c78700, Upper: 0x0000000000000000, Shifted upper: 0x0000000000000000, lower: 0x0000000000000010, value:0x0000000000000010
I :269] Read 32 Hacks: offset = 0x0000000000048700, lower: = 0x0000000000000010 upper: = 0x0000000000000000 value: = 0x0000000000000010 mmap: 0x7f89c78700
I :282] Page Fault Address: 0x0000000000be96c8
E :254] HIB Error. hib_error_status = 0000000000000211, hib_first_error_status = 0000000000000010
I :46] InstructionBuffers created.
I :653] Created new instruction buffers.
I :75] Mapped scratch : Buffer(ptr=(nil)) -> 0x0000000000000000, 0 bytes.
I :136] MmuMapper#Map() : 0000007f88350000 -> 0000000001080000 (66 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x0000000001080000
I :223] Mapped "normalized_input_image_tensor" : Buffer(ptr=0x7f88350040) -> 0x0000000001080040, 270000 bytes. Direction=1
I :136] MmuMapper#Map() : 000000000d9c7000 -> 0000000001002000 (2 pages) flags=00000004.
I :55] MapMemory() page-aligned : device_address = 0x0000000001002000
I :136] MmuMapper#Map() : 0000007f88272000 -> 0000000001040000 (44 pages) flags=00000004.
I :55] MapMemory() page-aligned : device_address = 0x0000000001040000
I :223] Mapped "convert_scores" : Buffer(ptr=0x7f88272000) -> 0x0000000001040000, 176368 bytes. Direction=2
I :223] Mapped "Squeeze" : Buffer(ptr=0xd9c7000) -> 0x0000000001002000, 7672 bytes. Direction=2
I :368] MapDataBuffers() done.
I :93] Linking normalized_input_image_tensor[0]: 0x0000000001080040
I :93] Linking Squeeze[0]: 0x0000000001002000
I :93] Linking convert_scores[0]: 0x0000000001040000
I :136] MmuMapper#Map() : 000000000d9ca000 -> 0000000001008000 (2 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x0000000001008000
I :136] MmuMapper#Map() : 0000007f88231000 -> 0000000001100000 (63 pages) flags=00000002.
I :55] MapMemory() page-aligned : device_address = 0x0000000001100000
I :223] Mapped "instructions" : Buffer(ptr=0x7f88231000) -> 0x0000000001100000, 256992 bytes. Direction=1
I :223] Mapped "instructions" : Buffer(ptr=0xd9ca000) -> 0x0000000001008000, 7632 bytes. Direction=1
I :384] MapInstructionBuffers() done.
I :481] [1] SetState old=0, new=1.
I :393] [1] NotifyRequestSubmitted()
I :481] [1] SetState old=1, new=2.
I :83] Request[1]: Submitted
I :401] [1] NotifyRequestActive()
I :481] [1] SetState old=2, new=3.
I :133] Request[1]: Scheduling DMA[0]
I :393] Adding an element to the host queue.
I :195] Write 32 Hacks: offset = 0x00000000000485a8, value = 0x0000000000000002 mmap=0x7f89c785a8
I :206] ReRead 32 Hacks: offset = 0x00000000000485a8, value: = 0x0000000000000002
I :133] Request[1]: Scheduling DMA[1]
I :393] Adding an element to the host queue.
I :195] Write 32 Hacks: offset = 0x00000000000485a8, value = 0x0000000000000003 mmap=0x7f89c785a8
I :206] ReRead 32 Hacks: offset = 0x00000000000485a8, value: = 0x0000000000000003
(program hangs until killed with ctrl-c...)
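One quick sanity check that can be run on the logs above: do the page counts in the mappings agree with the buffer sizes libedgetpu reports? A small Python sketch, assuming 4096-byte pages. The pairing of each "Mapped ..." line with its MmuMapper#Map() line is my own reading of the log (matched by host address), and `pages_spanned` is a hypothetical helper, not a driver function.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages on this kernel


def pages_spanned(addr: int, length: int) -> int:
    """Number of pages touched by a buffer of `length` bytes starting at `addr`."""
    if length == 0:
        return 0
    first = addr // PAGE_SIZE
    last = (addr + length - 1) // PAGE_SIZE
    return last - first + 1


# (name, host buffer address, byte length, pages reported in the log)
buffers = [
    ("params", 0x7F5F969000, 6564224, 1603),
    ("instructions (request 0)", 0x0D9C3000, 11472, 3),
    ("normalized_input_image_tensor", 0x7F88350040, 270000, 66),
    ("Squeeze", 0x0D9C7000, 7672, 2),
    ("convert_scores", 0x7F88272000, 176368, 44),
    ("instructions (request 1, small)", 0x0D9CA000, 7632, 2),
    ("instructions (request 1, large)", 0x7F88231000, 256992, 63),
]

for name, addr, length, reported in buffers:
    computed = pages_spanned(addr, length)
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{name}: computed {computed}, reported {reported} -> {status}")
```

The computed counts agree with the logged ones, so the mapping sizes at least look self-consistent. This says nothing about whether the device can actually read through those mappings.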
These logs look like what I see as well. The HIB error there (hib_error_status = 0000000000000211) still indicates read failures.
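When comparing runs, it can help to list which bits are set in the two status values rather than eyeballing the hex. A minimal Python sketch follows. It treats the logged values as hexadecimal and deliberately does not name the bits, since the bit definitions are not given in this thread. `set_bits` is a hypothetical helper.

```python
def set_bits(value: int) -> list:
    """Return the zero-indexed positions of all bits set in `value`."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# Values copied from the "HIB Error" log line above (printed there as hex).
hib_error_status = 0x211
hib_first_error_status = 0x010

print("hib_error_status bits:      ", set_bits(hib_error_status))
print("hib_first_error_status bits:", set_bits(hib_first_error_status))
```

The first-error value has a single bit set, and that bit is also set in the accumulated status.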
I recently became aware of a new-ish DT overlay from the Pi team for 32-bit DMA (I found it in this thread for bringing up a USB controller) - pcie-32bit-dma.dtbo. Alas, adding it has no effect (and I verified that it does apply cleanly).
I think this new overlay originated from this issue over here: https://github.com/raspberrypi/linux/issues/4197#issuecomment-794014591
Maybe you can find some ideas about the problem in there?