
TEST GPU (NVIDIA GeForce GTX 1050 Ti) - Nouveau patches req'd

Open RSC-Games opened this issue 3 months ago • 83 comments

I've been following this project since its early days with the CM4 and Jeff's first video on the long, hard fight with the Pi 4's broken PCIe bus. I only recently ordered an eGPU dock and am waiting for it to arrive. I'm well aware of the existing rm_init_adapter failed! issues on NVIDIA's proprietary ARM64 drivers, but I have an idea of my own: what if I make an x64 VM on this Pi, PCIe passthrough the GPU to the VM, and use the NVIDIA proprietary drivers from there? It doesn't have to work well; I can afford an RX 580 or just use my ancient Radeon HD 7870 on this thing. I'm legitimately just curious to see how much I can do, short of active kernel patching, to get this awful idea to work.

I've already set up the VM with VirGL rendering enabled, the NVIDIA proprietary drivers installed, and a basic DE (LXDE). Emulation is slow enough that software-rendering the desktop really hurts (KDE was not usable, though it did run with the VGA framebuffer device), and I wanted to see how viable 3D acceleration in the VM really was in the first place.
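(For reference, a VirGL-accelerated x86-64 guest invocation looks roughly like the sketch below; the disk image name is a placeholder and the exact display options depend on your QEMU build.)

# emulated x86-64 guest with VirGL (virtio-gpu with host GL) on the Pi
qemu-system-x86_64 -M q35 -m 4G -smp 4 \
  -drive file=x64-guest.qcow2,if=virtio \
  -device virtio-vga-gl \
  -display gtk,gl=on \
  -nic user,model=virtio-net-pci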

Image

The emulated x64 machine runs... not amazingly, but it's surprisingly somewhat usable. It's about on par with a stock Pi 3, with near-VC7 rendering capability. Modprobing the NVIDIA drivers works as expected: they don't detect the card, since no card is even plugged in yet.

Image

I'll update later when the dock arrives and send a screenshot of the card being recognised if it does end up working. I did realize while writing this issue that an x64 VM only solves the issue of "code optimised for x64 not working right on ARM" and not the PCIe quirks (which will still be present, because PCIe passthrough just passes through the host interface), but hopefully I can help get us closer to a possible solution with NVIDIA cards.

If this works, I'll then test using VirtualGL to render to an X server in the VM and display it via the GPU's video-out ports. I can then test some native ARM OpenGL apps and have them render on the NVIDIA card. It probably won't be amazing, and Vulkan likely won't work, but I'll at least be able to run the apps on the ARM side, so we're not slowed down by the emulated OS.
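(A rough sketch of how I'd expect that to look; the display number is a placeholder and the X server on the dGPU side would need to be running first.)

# run a native ARM OpenGL app, offloading the GL rendering to the X display given with -d
vglrun -d :1 glxgears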

RSC-Games avatar Sep 02 '25 14:09 RSC-Games

I got the card hooked up in about 5 minutes, and it's recognised by lspci. The easy part's over.

lspci output:

Image

Also here's a pic of my cursed test bench if you're interested.

Image

Now, time to figure out how to pass through this sucker to the x64 VM I made.

RSC-Games avatar Sep 02 '25 21:09 RSC-Games

(This is the kind of jank I enjoy. Good luck!)

geerlingguy avatar Sep 02 '25 21:09 geerlingguy

Okay so I'm running into iommu_group not found issues and it seems the RPi 5's IOMMU implementation is kinda weird? I'm going to try adding iommu=pt vfio-pci.ids=<vid:did> to the kernel command line and see how that goes. Is there anything I'm missing concerning the Pi's IOMMU?
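(For reference, on Raspberry Pi OS the kernel command line is the single line in /boot/firmware/cmdline.txt, so the change is roughly the sketch below; 10de:1c82 is the 1050 Ti's vendor:device ID.)

# cmdline.txt must stay a single line; this appends to the end of it
sudo sed -i '1 s/$/ iommu=pt vfio-pci.ids=10de:1c82/' /boot/firmware/cmdline.txt
sudo reboot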

Edit: I've done some device tree edits (using your guide too lol) and I can get the PCIe bus recognized by IOMMU2, but qemu still cannot find an iommu_group, despite the card (clearly?) being assigned an iommu group.

system@raspberrypi:~ $ qemu-system-aarch64 -machine virt,accel=kvm -cpu host -device vfio-pci,host=0001:01:00.0 -device vfio-pci,host=0001:01:00.1
qemu-system-aarch64: -device vfio-pci,host=0001:01:00.0: vfio 0001:01:00.0: no iommu_group found: No such file or directory
system@raspberrypi:~ $ cat /sys/kernel/iommu_groups/0/devices/1000110000.pcie/pci0001\:00/0001\:00\:00.0/0001\:01\:00.0/max_link_speed 
8.0 GT/s PCIe
system@raspberrypi:~ $ ls /sys/kernel/iommu_groups/0/devices/
1000110000.pcie  1000800000.codec  1000880000.pisp_be

I'll look into this more in a couple of days, but for now I'm gonna mess around with my AMD card, which I will probably have more luck with (it's an AMD Radeon HD 7870 GHz Edition).

RSC-Games avatar Sep 02 '25 22:09 RSC-Games

Dealing with QEMU and iommu may be a topic better answered on the Pi forums. I haven't heard of many people trying PCIe passthrough on the Pi, but it's certainly something worth supporting (IMO), not just for graphics card shenanigans!

geerlingguy avatar Sep 03 '25 03:09 geerlingguy

I'm starting a custom compile of Coreforge's patched kernel now, and I enabled an option in menuconfig called IOMMU_USERSPACE_API. I don't know if that will fix anything (I enabled some other virtualization options too), but hopefully I can kill two birds with one stone (getting the IOMMU to behave rationally and getting amdgpu compiled). We'll see how this goes, but for now I'm prioritizing the easy card.
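(For anyone following along, this is roughly the standard native Pi 5 kernel build flow; branch and config specifics aside, a sketch:)

cd linux                                  # the patched source tree
make bcm2712_defconfig
make menuconfig                           # enable the extra IOMMU/virtualization options here
make -j"$(nproc)" Image.gz modules dtbs
sudo make modules_install
sudo cp arch/arm64/boot/dts/broadcom/*.dtb /boot/firmware/
sudo cp arch/arm64/boot/dts/overlays/*.dtb* /boot/firmware/overlays/
sudo cp arch/arm64/boot/Image.gz /boot/firmware/kernel_2712.img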

I did find a couple of threads on the IOMMU in the Pi 5, and apparently there are 5 of them that can be used for various device groupings, but they don't appear to automatically enumerate external PCIe devices, only the RP1 ones. If I'm still having issues later when I circle back to the NVIDIA card, I'll definitely open a new thread on the Pi forums, since passing through something like a PCIe WiFi card could be very useful for the few brave souls who are building a homelab with Pis or just want to run VMs.

I'll open a new issue for the Radeon card, but since you already have a working Southern Islands card buried somewhere in these forums, it should be trivial to test and close the issue.

RSC-Games avatar Sep 03 '25 03:09 RSC-Games

After looking deeper into the device tree for the Pi 5, I found the entries for the 3 active IOMMUs. The naming convention seems to imply there may be at least two more unused IOMMUs on the silicon? I'll definitely be asking the RPi engineers about this on the forum, since those extra IOMMUs would probably come in handy at some point.

iommu2: iommu@5100 {
	/* IOMMU2 for PISP-BE, HEVC; and (unused) H264 accelerators */
	compatible = "brcm,bcm2712-iommu";
	reg = <0x10 0x5100  0x0 0x80>;
	cache = <&iommuc>;
	#iommu-cells = <0>;
};

iommu4: iommu@5200 {
	/* IOMMU4 for HVS, MPL/TXP; and (unused) Unicam, PISP-FE, MiniBVN */
	compatible = "brcm,bcm2712-iommu";
	reg = <0x10 0x5200  0x0 0x80>;
	cache = <&iommuc>;
	#iommu-cells = <0>;
	#interconnect-cells = <0>;
};

iommu5: iommu@5280 {
	/* IOMMU5 for PCIe2 (RP1); and (unused) BSTM */
	compatible = "brcm,bcm2712-iommu";
	reg = <0x10 0x5280  0x0 0x80>;
	cache = <&iommuc>;
	#iommu-cells = <0>;
	dma-iova-offset = <0x10 0x00000000>; // HACK for RP1 masters over PCIe
};

If I'm doing my math right, the remaining IOMMU device tree nodes would probably look like this. I'll need to confirm with the RPi engineers, however.

iommu0: iommu@5000 {
	/* IOMMU0 for ???? (figure out what you want to use it for) */
	compatible = "brcm,bcm2712-iommu";
	reg = <0x10 0x5000  0x0 0x80>;
	cache = <&iommuc>;
	#iommu-cells = <0>;
};

iommu1: iommu@5080 {
	/* Another free IOMMU. What would we use it for? */
	compatible = "brcm,bcm2712-iommu";
	reg = <0x10 0x5080  0x0 0x80>;
	cache = <&iommuc>;
	#iommu-cells = <0>;
};

iommu3: iommu@5180 {
	/* IOMMU again. What's gonna be done with this one? */
	compatible = "brcm,bcm2712-iommu";
	reg = <0x10 0x5180  0x0 0x80>;
	cache = <&iommuc>;
	#iommu-cells = <0>;
};

The thread I started was approved and can be viewed here: https://forums.raspberrypi.com/viewtopic.php?p=2335675#

Note to self: I also found this thread on enabling vfio and iommu support so a kernel rebuild is probably necessary. https://forums.raspberrypi.com/viewtopic.php?p=2268533

Also see this for getting pcie1 on the IOMMU: https://github.com/raspberrypi/linux/issues/6834

RSC-Games avatar Sep 03 '25 14:09 RSC-Games

IOMMUs are not freely assignable to peripherals.

iommu0 @ 0x5000 is in the path for the VPU (firmware)
iommu1 @ 0x5080 is in the path for the DMA controllers and EMMC2
iommu2 @ 0x5100 is in the path for the ISP and HEVC decoder
iommu3 @ 0x5300 is in the path for EMMC0 and sysctrl
iommu4 @ 0x5200 is in the path for the display (HVS)
iommu5 @ 0x5280 is in the path for PCIe (presumably all instances)

6by9 avatar Sep 03 '25 14:09 6by9

Thanks- I'll need to test that tonight after I tweak a few device tree nodes.

I am curious, however, about why IOMMU2 is quoted here (https://github.com/raspberrypi/linux/issues/6834#issuecomment-2854467821) for pcie1 (and it did show up in the iommu_group when I tested). I'll test both IOMMU5 and IOMMU2 and report what happens here.

I might also rebase Coreforge's patch set onto the latest pi kernel (so I'm not like 6 months out of date) then start working on my own patch set (mostly virtualization and device tree tweaks).
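(The rebase itself would be something like the sketch below; the remote URL and branch names here are assumptions, not necessarily the exact ones I'll end up using.)

git clone https://github.com/raspberrypi/linux.git && cd linux
git remote add coreforge https://github.com/Coreforge/linux.git   # fork URL assumed
git fetch coreforge
git checkout -b gpu-rebase coreforge/gpu-patches                  # patch branch name assumed
git rebase origin/rpi-6.12.y                                      # current rpi kernel branch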

RSC-Games avatar Sep 03 '25 15:09 RSC-Games

Okay, so it turns out my main Pi I use for testing (a 16 GB model) has the D0 stepping, but my 4 GB Pi 5 actually has the C0 stepping. I should be able to test these patches on both models and see how each fares.

RSC-Games avatar Sep 04 '25 03:09 RSC-Games

I was unable to get IOMMU8 to load on the C0 silicon, but IOMMU2 appears to work and adds the root complex to iommu_group 0. However, I've run into another issue: I'm not actually able to bind the card. I've tried disabling all other functions of the card, using a different card, etc. It just doesn't seem to work. I'll take a fresh look at it tomorrow, but for now I'm at a dead end.

Kernel command line:

[    0.000000] Kernel command line: reboot=w coherent_pool=1M 8250.nr_uarts=1 pci=pcie_bus_safe cgroup_disable=memory numa_policy=interleave  numa=fake=8 system_heap.max_order=0 smsc95xx.macaddr=2C:CF:67:21:C1:A3 vc_mem.mem_base=0x3fc00000 vc_mem.mem_size=0x40000000  console=tty1 root=PARTUUID=3c1760e9-02 rootfstype=ext4 fsck.repair=yes rootwait cfg80211.ieee80211_regdom=US vfio-pci.ids=10de:1c82

lspci output:

0001:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ZOTAC International (MCO) Ltd. GP107 [GeForce GTX 1050 Ti] [19da:a454]
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 163
	Region 0: Memory at 1b00000000 (32-bit, non-prefetchable) [virtual] [size=16M]
	Region 1: Memory at 1800000000 (64-bit, prefetchable) [disabled] [size=256M]
	Region 3: Memory at 1810000000 (64-bit, prefetchable) [disabled] [size=32M]
	Expansion ROM at 1b01000000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x1 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [250 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0

card init via pcie

[    0.496764] brcm-pcie 1000110000.pcie: bcm2712_iommu_of_xlate: MMU 1000005100.iommu
[    0.497839] brcm-pcie 1000110000.pcie: bcm2712_iommu_probe_device: MMU 1000005100.iommu
[    0.498255] brcm-pcie 1000110000.pcie: bcm2712_iommu_device_group: MMU 1000005100.iommu
[    0.498698] brcm-pcie 1000110000.pcie: Adding to iommu group 0
[    0.499157] brcm-pcie 1000110000.pcie: bcm2712_iommu_attach_dev: MMU 1000005100.iommu
[    0.499561] brcm-pcie 1000110000.pcie: host bridge /axi/pcie@1000110000 ranges:
[    0.499956] brcm-pcie 1000110000.pcie:   No bus range found for /axi/pcie@1000110000, using [bus 00-ff]
[    0.500324] brcm-pcie 1000110000.pcie:      MEM 0x1b00000000..0x1bfffffffb -> 0x0000000000
[    0.500654] brcm-pcie 1000110000.pcie:      MEM 0x1800000000..0x1affffffff -> 0x0400000000
[    0.500963] brcm-pcie 1000110000.pcie:   IB MEM 0x0000000000..0x0fffffffff -> 0x0000000000
[    0.501269] brcm-pcie 1000110000.pcie:   IB MEM 0x1000131000..0x1000131fff -> 0xfffffff000
[    0.502770] brcm-pcie 1000110000.pcie: PCI host bridge to bus 0001:00
[    0.503094] pci_bus 0001:00: root bus resource [bus 00-ff]
[    0.503408] pci_bus 0001:00: root bus resource [mem 0x1b00000000-0x1bfffffffb] (bus address [0x00000000-0xfffffffb])
[    0.503724] pci_bus 0001:00: root bus resource [mem 0x1800000000-0x1affffffff pref] (bus address [0x400000000-0x6ffffffff])
[    0.504057] pci 0001:00:00.0: [14e4:2712] type 01 class 0x060400 PCIe Root Port
[    0.504380] pci 0001:00:00.0: PCI bridge to [bus 00]
[    0.504686] pci 0001:00:00.0:   bridge window [mem 0x1b80000000-0x1bbfffffff]
[    0.505007] pci 0001:00:00.0: PME# supported from D0 D3hot
[    0.505902] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    0.537320] mmc0: SDHCI controller on 1000fff000.mmc [1000fff000.mmc] using ADMA 64-bit
[    0.609321] brcm-pcie 1000110000.pcie: clkreq-mode set to safe
[    0.609637] brcm-pcie 1000110000.pcie: link up, 8.0 GT/s PCIe x1 (!SSC)
[    0.609964] pci 0001:01:00.0: [10de:1c82] type 00 class 0x030000 PCIe Legacy Endpoint
[    0.610286] pci 0001:01:00.0: BAR 0 [mem 0x1b00000000-0x1b00ffffff]
[    0.610606] pci 0001:01:00.0: BAR 1 [mem 0x1b00000000-0x1b0fffffff 64bit pref]
[    0.610925] pci 0001:01:00.0: BAR 3 [mem 0x1b00000000-0x1b01ffffff 64bit pref]
[    0.611237] pci 0001:01:00.0: BAR 5 [io  0x0000-0x007f]
[    0.611552] pci 0001:01:00.0: ROM [mem 0x1b00000000-0x1b0007ffff pref]
[    0.611961] pci 0001:01:00.0: 7.876 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x1 link at 0001:00:00.0 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[    0.612367] pci 0001:01:00.0: vgaarb: setting as boot VGA device
[    0.612713] pci 0001:01:00.0: vgaarb: bridge control possible
[    0.613058] pci 0001:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.613437] pci 0001:01:00.1: [10de:0fb9] type 00 class 0x040300 PCIe Endpoint
[    0.613791] pci 0001:01:00.1: BAR 0 [mem 0x1b00000000-0x1b00003fff]
[    0.621331] pci_bus 0001:01: busn_res: [bus 01-ff] end is updated to 01
[    0.621687] pci 0001:00:00.0: bridge window [mem 0x1800000000-0x1817ffffff 64bit pref]: assigned
[    0.622041] pci 0001:00:00.0: bridge window [mem 0x1b00000000-0x1b017fffff]: assigned
[    0.622394] pci 0001:01:00.0: BAR 1 [mem 0x1800000000-0x180fffffff 64bit pref]: assigned
[    0.622754] pci 0001:01:00.0: BAR 3 [mem 0x1810000000-0x1811ffffff 64bit pref]: assigned
[    0.623111] pci 0001:01:00.0: BAR 0 [mem 0x1b00000000-0x1b00ffffff]: assigned
[    0.623465] pci 0001:01:00.0: ROM [mem 0x1b01000000-0x1b0107ffff pref]: assigned
[    0.623820] pci 0001:01:00.1: BAR 0 [mem 0x1b01080000-0x1b01083fff]: assigned
[    0.624182] pci 0001:01:00.0: BAR 5 [io  size 0x0080]: can't assign; no space
[    0.624541] pci 0001:01:00.0: BAR 5 [io  size 0x0080]: failed to assign
[    0.624904] pci 0001:00:00.0: PCI bridge to [bus 01]
[    0.625268] pci 0001:00:00.0:   bridge window [mem 0x1b00000000-0x1b017fffff]
[    0.625635] pci 0001:00:00.0:   bridge window [mem 0x1800000000-0x1817ffffff 64bit pref]
[    0.625997] pci_bus 0001:00: Some PCI device resources are unassigned, try booting with pci=realloc
[    0.626362] pci_bus 0001:00: resource 4 [mem 0x1b00000000-0x1bfffffffb]
[    0.626728] pci_bus 0001:00: resource 5 [mem 0x1800000000-0x1affffffff pref]
[    0.627093] pci_bus 0001:01: resource 1 [mem 0x1b00000000-0x1b017fffff]
[    0.627461] pci_bus 0001:01: resource 2 [mem 0x1800000000-0x1817ffffff 64bit pref]
[    0.627829] pci 0001:00:00.0: Max Payload Size set to  256/ 512 (was  128), Max Read Rq  512
[    0.628204] pci 0001:01:00.0: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  512
[    0.628575] pci 0001:01:00.1: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  512
[    0.629002] pcieport 0001:00:00.0: enabling device (0000 -> 0002)
[    0.629407] pcieport 0001:00:00.0: PME: Signaling with IRQ 163
[    0.629828] pcieport 0001:00:00.0: AER: enabled with IRQ 163
[    0.630241] pci 0001:01:00.1: extending delay after power-on from D3hot to 20 msec
[    0.630626] pci 0001:01:00.1: D0 power state depends on 0001:01:00.0

vfio-pci load probe failure

[    4.896649] Bluetooth: RFCOMM ver 1.11
[   40.733708] VFIO - User Level meta-driver version: 0.3
[   40.741497] vfio-pci 0001:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[   40.741526] vfio-pci 0001:01:00.0: probe with driver vfio-pci failed with error -22
[   40.741540] vfio_pci: add [10de:1c82[ffffffff:ffffffff]] class 0x000000/00000000

I'm not entirely sure what's causing the probe issue... definitely going to be spending quite a bit of time on this.

RSC-Games avatar Sep 04 '25 05:09 RSC-Games

Okay looks like the error code is -EINVAL. However, that doesn't really help us, since there's like 5 different things in this function alone that can cause that. I should probably rebase Coreforge's kernel before I go much further...

./drivers/vfio/pci/vfio_pci_core.c:

int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
{
	struct pci_dev *pdev = vdev->pdev;
	struct device *dev = &pdev->dev;
	int ret;

	/* Drivers must set the vfio_pci_core_device to their drvdata */
	if (WARN_ON(vdev != dev_get_drvdata(dev)))
		return -EINVAL;

	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
		return -EINVAL;

	if (vdev->vdev.mig_ops) {
		if (!(vdev->vdev.mig_ops->migration_get_state &&
		      vdev->vdev.mig_ops->migration_set_state &&
		      vdev->vdev.mig_ops->migration_get_data_size) ||
		    !(vdev->vdev.migration_flags & VFIO_MIGRATION_STOP_COPY))
			return -EINVAL;
	}

	if (vdev->vdev.log_ops && !(vdev->vdev.log_ops->log_start &&
	    vdev->vdev.log_ops->log_stop &&
	    vdev->vdev.log_ops->log_read_and_clear))
		return -EINVAL;

It's not even guaranteed that we're failing in here. I think it's time to get the good ol' printks out and do some in-depth debugging.

EDIT: I was wrong- we are NOT failing in here, but further down somewhere else in the function. Right now I've gotta wait for VSCode to index 22000 source files so intellisense actually works...

RSC-Games avatar Sep 04 '25 14:09 RSC-Games

Okay- I have successfully rebased Coreforge's patches onto the latest RPi kernel commit. I plan to open a PR and push to Coreforge's fork later, but for now the kernel is accessible at https://github.com/RSC-Games/linux. The default branch is the GPU one. I don't currently know if it compiles. It probably should but I'll need to iron that out before opening the PR.

As for vfio, I have thrown printks all over the vfio-pci driver (mostly in the vfio_pci_core_register_device function), so hopefully this time I can narrow down what's causing the issue.

EDIT: Okay, so it's STILL getting past where I splattered the printks. Here's the dmesg output:

[   66.764256] VFIO - User Level meta-driver version: 0.3
[   66.771884] vfio_pci: unknown parameter 'disable_vga' ignored
[   66.772044] bus reset path
[   66.772060] vfio-pci 0001:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[   66.772064] made it past all tested cases (issue is somewhere else)
[   66.796911] vfio-pci 0001:01:00.0: probe with driver vfio-pci failed with error -22
[   66.796932] vfio_pci: add [10de:1c82[ffffffff:ffffffff]] class 0x000000/00000000

Turns out the Pi can't (?) do an individual slot reset, so it takes the bus reset codepath...

} else {
	printk("bus reset path");
	/*
	 * If there is no slot reset support for this device, the whole
	 * bus needs to be grouped together to support bus-wide resets.
	 */
	ret = vfio_assign_device_set(&vdev->vdev, pdev->bus);

	if (ret)
		printk("error condition detected (vfio_assign_device_set)");
}

We're getting close... There are functions for initializing the IOMMU. I'm almost POSITIVE that's our issue here.

ASIDE: For some reason the vfio-pci driver is not in the initramfs and I'm unable to build my own (presumably because there's no folder with modules for this kernel).
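(A sketch of what should fix that, assuming the build tree is still around: modules_install populates /lib/modules/<release>, after which update-initramfs can pick up vfio-pci.)

cd linux
sudo make modules_install
sudo update-initramfs -c -k "$(make -s kernelrelease)"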

RSC-Games avatar Sep 05 '25 03:09 RSC-Games

It turns out, after quite a bit of printks and a ton of reboots, I have found the core of the issue. Unsurprisingly, no iommu_group is ever created for any PCIe device, so vfio-pci detects that and gives up. Now the main question is- where/how in the kernel are the IOMMU groups created? That's a question for tomorrow. I'm going to keep testing the old radeon card for now.

dmesg:

[   26.863805] VFIO - User Level meta-driver version: 0.3
[   26.871228] vfio_pci: unknown parameter 'disable_vga' ignored
[   26.871510] bus reset path
[   26.871532] vfio-pci 0001:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[   26.871539] made it past all tested cases (issue is somewhere else)
[   26.871549] no error testing done on pci_set_power_state
[   26.871640] proceeding with iommu support
[   26.871644] iommu group present (if zero means no allocated iommu group): 0
[   26.871646] device is not in an iommu group
[   26.871647] error: inner (failed to set group - inspect later)
[   26.871648] failed to register vfio group
[   26.895721] vfio-pci 0001:01:00.0: probe with driver vfio-pci failed with error -22
[   26.895734] vfio_pci: add [10de:1c82[ffffffff:ffffffff]] class 0x000000/00000000

As expected, the lack of an IOMMU group pisses off the driver. @6by9 does anyone on the engineering team potentially have any insights? It's okay if you don't- I'm just trying to get my bearings here.
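(For reference, the quick sysfs check I've been using to see what, if anything, lands in an IOMMU group:)

# list every device that got placed in an IOMMU group
find /sys/kernel/iommu_groups/ -type l | sort
# and check whether the GPU itself got one (the symlink only exists if it did)
readlink /sys/bus/pci/devices/0001:01:00.0/iommu_group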

RSC-Games avatar Sep 05 '25 03:09 RSC-Games

I think I might have figured out how to get PCIe passthrough working! See the thread I have open for more details (https://forums.raspberrypi.com/viewtopic.php?p=2336096), but basically we can disable the other devices using IOMMU group 0 and allow passthrough of the RC (passthrough of individual devices on the bus would be possible, but harder to implement). Hopefully the guest kernel is smart enough to recognize what the RC is. This will 100% require kernel patching, but I think it's very much possible. Whether we can merge this as a PR back into the rpi kernel is an entirely different story. If we can't, that's okay, since I'm already maintaining Coreforge's patchset, so another one wouldn't be much worse.

EDIT: You know what? I could just add a new dtoverlay that disables the H.265 codec and the ISP and adds the PCIe link to IOMMU2; that way users aren't forced to compile a custom kernel or to forego the codec and ISP on more... reasonable... configurations.

RSC-Games avatar Sep 05 '25 15:09 RSC-Games

@RSC-Games - If you can get it into a decent logical state, it might be possible to upstream to the rpi linux fork, and maybe we could even push it beyond that. It enables enough new use cases I think it'd be worth it. At least, if it works :D

geerlingguy avatar Sep 05 '25 19:09 geerlingguy

I think the biggest use case we'll see with this (especially if I can get the device tree overlay working) is people dipping their feet in the water with homelabbing and passing through something like a RAID card to a VM, or the RPi Proxmox port (or whatever they call it at this point). It could also enable more ridiculous stuff like what I have here, or help us debug some timing issues with, say, the Xe driver. The alignment issues should already be fixed by Coreforge's patches for any user-mode VMs like my x64 one (albeit with even lower performance), but would that help with anything related to vfio-pci for KVM-based VMs? It's an entirely different driver. I guess we'll see when I get something together.

RSC-Games avatar Sep 05 '25 20:09 RSC-Games

I've recently been working on support for virtualizing WiFi & Ethernet drivers. It's nice to see some more documentation on the forum.

For iommu5, it would be great to have some more documentation around dma-iova-offset. Is this something we can choose, or is it set in hardware?

I got either extremely close on August 26th or just slightly close: the drivers loaded inside of QEMU but weren't able to TX packets out, so I think something is off with my pagetable implementation.

Below are some screenshots from when I was working with iommu2, before I concluded this will need iommu5 (confirmed above).

Image

Image

lts-rad avatar Sep 06 '25 04:09 lts-rad

I could be very wrong, so you might want to ask on the forums, but I believe dma-iova-offset is an offset from the 40 GB starting address for translation (@njhollinghurst, is this correct?). All of the IOMMUs likely support custom dma-iova-offsets in the device tree. Also, for PCIe 1 (the external PCIe lane), the IOMMU responsible for managing it is IOMMU2. Just a heads up.

Also did you manage to get passthrough working? Are you using a custom kernel/patchset or are you passing in the card to qemu in a way different from -device vfio-pci <vid>:<did>?

IOMMU2 for me currently isn't doing anything due to the lack of driver support, so all I'm seeing is this (remember I do have a dGPU up and running already):

[    0.467311] brcm-pcie 1000110000.pcie: bcm2712_iommu_of_xlate: MMU 1000005100.iommu
[    0.467319] brcm-pcie 1000110000.pcie: bcm2712_iommu_probe_device: MMU 1000005100.iommu
[    0.467326] brcm-pcie 1000110000.pcie: bcm2712_iommu_device_group: MMU 1000005100.iommu
[    0.467331] brcm-pcie 1000110000.pcie: Adding to iommu group 0
[    0.467335] brcm-pcie 1000110000.pcie: bcm2712_iommu_attach_dev: MMU 1000005100.iommu
[    0.467380] brcm-pcie 1000110000.pcie: host bridge /axi/pcie@1000110000 ranges:
[    0.467388] brcm-pcie 1000110000.pcie:   No bus range found for /axi/pcie@1000110000, using [bus 00-ff]
[    0.467401] brcm-pcie 1000110000.pcie:      MEM 0x1b00000000..0x1bfffffffb -> 0x0000000000
[    0.467414] brcm-pcie 1000110000.pcie:      MEM 0x1800000000..0x1affffffff -> 0x0400000000
[    0.467423] brcm-pcie 1000110000.pcie:   IB MEM 0x0000000000..0x0fffffffff -> 0x0000000000
[    0.467432] brcm-pcie 1000110000.pcie:   IB MEM 0x1000131000..0x1000131fff -> 0xfffffff000
[    0.469561] brcm-pcie 1000110000.pcie: PCI host bridge to bus 0001:00
[    0.469573] pci_bus 0001:00: root bus resource [bus 00-ff]
[    0.469577] pci_bus 0001:00: root bus resource [mem 0x1b00000000-0x1bfffffffb] (bus address [0x00000000-0xfffffffb])
[    0.469580] pci_bus 0001:00: root bus resource [mem 0x1800000000-0x1affffffff pref] (bus address [0x400000000-0x6ffffffff])
[    0.469591] pci 0001:00:00.0: [14e4:2712] type 01 class 0x060400 PCIe Root Port
[    0.469598] pci 0001:00:00.0: PCI bridge to [bus 00]
[    0.469601] pci 0001:00:00.0:   bridge window [mem 0x1b80000000-0x1bbfffffff]
[    0.469618] pci 0001:00:00.0: PME# supported from D0 D3hot
[    0.470147] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    0.501235] mmc0: SDHCI controller on 1000fff000.mmc [1000fff000.mmc] using ADMA 64-bit
[    0.573201] brcm-pcie 1000110000.pcie: clkreq-mode set to safe
[    0.573204] brcm-pcie 1000110000.pcie: link up, 2.5 GT/s PCIe x1 (!SSC)
[    0.573228] pci 0001:01:00.0: [1002:6818] type 00 class 0x030000 PCIe Legacy Endpoint
[    0.573246] pci 0001:01:00.0: BAR 0 [mem 0x1b00000000-0x1b0fffffff 64bit pref]
[    0.573258] pci 0001:01:00.0: BAR 2 [mem 0x1b00000000-0x1b0003ffff 64bit]
[    0.573265] pci 0001:01:00.0: BAR 4 [io  0x0000-0x00ff]
[    0.573277] pci 0001:01:00.0: ROM [mem 0x1b00000000-0x1b0001ffff pref]
[    0.573285] pci 0001:01:00.0: enabling Extended Tags
[    0.573366] pci 0001:01:00.0: supports D1 D2
[    0.573368] pci 0001:01:00.0: PME# supported from D1 D2 D3hot
[    0.573412] pci 0001:01:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0001:00:00.0 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[    0.573486] pci 0001:01:00.0: vgaarb: setting as boot VGA device
[    0.573488] pci 0001:01:00.0: vgaarb: bridge control possible
[    0.573490] pci 0001:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.573513] pci 0001:01:00.1: [1002:aab0] type 00 class 0x040300 PCIe Legacy Endpoint
[    0.573530] pci 0001:01:00.1: BAR 0 [mem 0x1b00000000-0x1b00003fff 64bit]
[    0.573563] pci 0001:01:00.1: enabling Extended Tags
[    0.573618] pci 0001:01:00.1: supports D1 D2
[    0.581209] pci_bus 0001:01: busn_res: [bus 01-ff] end is updated to 01
[    0.581217] pci 0001:00:00.0: bridge window [mem 0x1800000000-0x180fffffff 64bit pref]: assigned
[    0.581221] pci 0001:00:00.0: bridge window [mem 0x1b00000000-0x1b000fffff]: assigned
[    0.581225] pci 0001:01:00.0: BAR 0 [mem 0x1800000000-0x180fffffff 64bit pref]: assigned
[    0.581233] pci 0001:01:00.0: BAR 2 [mem 0x1b00000000-0x1b0003ffff 64bit]: assigned
[    0.581240] pci 0001:01:00.0: ROM [mem 0x1b00040000-0x1b0005ffff pref]: assigned
[    0.581244] pci 0001:01:00.1: BAR 0 [mem 0x1b00060000-0x1b00063fff 64bit]: assigned
[    0.581251] pci 0001:01:00.0: BAR 4 [io  size 0x0100]: can't assign; no space
[    0.581253] pci 0001:01:00.0: BAR 4 [io  size 0x0100]: failed to assign
[    0.581256] pci 0001:00:00.0: PCI bridge to [bus 01]
[    0.581259] pci 0001:00:00.0:   bridge window [mem 0x1b00000000-0x1b000fffff]
[    0.581262] pci 0001:00:00.0:   bridge window [mem 0x1800000000-0x180fffffff 64bit pref]
[    0.581265] pci_bus 0001:00: Some PCI device resources are unassigned, try booting with pci=realloc
[    0.581268] pci_bus 0001:00: resource 4 [mem 0x1b00000000-0x1bfffffffb]
[    0.581271] pci_bus 0001:00: resource 5 [mem 0x1800000000-0x1affffffff pref]
[    0.581273] pci_bus 0001:01: resource 1 [mem 0x1b00000000-0x1b000fffff]
[    0.581276] pci_bus 0001:01: resource 2 [mem 0x1800000000-0x180fffffff 64bit pref]
[    0.581280] pci 0001:00:00.0: Max Payload Size set to  256/ 512 (was  128), Max Read Rq  512
[    0.581287] pci 0001:01:00.0: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  512
[    0.581294] pci 0001:01:00.1: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  512
[    0.581352] pcieport 0001:00:00.0: enabling device (0000 -> 0002)
[    0.581379] pcieport 0001:00:00.0: PME: Signaling with IRQ 161
[    0.581501] pcieport 0001:00:00.0: AER: enabled with IRQ 161
[    0.581553] pci 0001:01:00.1: D0 power state depends on 0001:01:00.0
 ~ snip ~
[    2.827332] [drm] amdgpu kernel modesetting enabled.
[    2.827550] amdgpu 0001:01:00.0: enabling device (0000 -> 0002)
[    2.827557] [drm] initializing kernel modesetting (PITCAIRN 0x1002:0x6818 0x1462:0x2740 0x00).
[    2.827574] [drm] register mmio base: 0x00000000
[    2.827575] [drm] register mmio size: 262144
[    2.827655] [drm] add ip block number 0 <si_common>
[    2.827658] [drm] add ip block number 1 <gmc_v6_0>
[    2.827659] [drm] add ip block number 2 <si_ih>
[    2.827660] [drm] add ip block number 3 <gfx_v6_0>
[    2.827662] [drm] add ip block number 4 <si_dma>
[    2.827664] [drm] add ip block number 5 <si_dpm>
[    2.827665] [drm] add ip block number 6 <dce_v6_0>
[    2.827667] [drm] add ip block number 7 <uvd_v3_1>
[    2.942328] amdgpu 0001:01:00.0: amdgpu: Fetched VBIOS from ROM BAR
[    2.942337] amdgpu: ATOM BIOS: 113-C4010200-X02
[    2.942353] amdgpu 0001:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[    2.942356] amdgpu 0001:01:00.0: amdgpu: PCIE atomic ops is not supported
[    2.942360] [drm] GPU posting now...
[    2.952990] [drm] PCIE gen 3 link speeds already enabled
[    2.953004] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[    2.956393] amdgpu 0001:01:00.0: BAR 0 [mem 0x1800000000-0x180fffffff 64bit pref]: releasing
[    2.956409] amdgpu 0001:01:00.0: BAR 0 [mem 0x1800000000-0x180fffffff 64bit pref]: assigned
[    2.956419] amdgpu 0001:01:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[    2.956421] amdgpu 0001:01:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[    2.956426] [drm] Detected VRAM RAM=2048M, BAR=256M
[    2.956428] [drm] RAM width 256bits GDDR5
[    2.960564] [drm] amdgpu: 2048M of VRAM memory ready
[    2.960571] [drm] amdgpu: 8001M of GTT memory ready.
[    2.960597] [drm] GART: num cpu pages 262144, num gpu pages 262144
[    2.961952] amdgpu 0001:01:00.0: amdgpu: PCIE GART of 1024M enabled (table at 0x000000F400000000).
[    2.983935] CUSE: failed to register chrdev region
[    2.983938] CUSE: failed to register chrdev region
[    2.996413] CUSE: failed to register chrdev region
[    2.996987] [drm] Internal thermal controller with fan control
[    2.997006] [drm] amdgpu: dpm initialized
[    2.997073] [drm] AMDGPU Display Connectors
[    2.997075] [drm] Connector 0:
[    2.997076] [drm]   DP-1
[    2.997077] [drm]   HPD4
[    2.997078] [drm]   DDC: 0x194c 0x194c 0x194d 0x194d 0x194e 0x194e 0x194f 0x194f
[    2.997080] [drm]   Encoders:
[    2.997082] [drm]     DFP1: INTERNAL_UNIPHY2
[    2.997084] [drm] Connector 1:
[    2.997085] [drm]   DP-2
[    2.997087] [drm]   HPD5
[    2.997088] [drm]   DDC: 0x1950 0x1950 0x1951 0x1951 0x1952 0x1952 0x1953 0x1953
[    2.997089] [drm]   Encoders:
[    2.997090] [drm]     DFP2: INTERNAL_UNIPHY2
[    2.997091] [drm] Connector 2:
[    2.997092] [drm]   HDMI-A-1
[    2.997093] [drm]   HPD1
[    2.997094] [drm]   DDC: 0x1954 0x1954 0x1955 0x1955 0x1956 0x1956 0x1957 0x1957
[    2.997095] [drm]   Encoders:
[    2.997096] [drm]     DFP3: INTERNAL_UNIPHY1
[    2.997097] [drm] Connector 3:
[    2.997098] [drm]   DVI-I-1
[    2.997099] [drm]   HPD6
[    2.997100] [drm]   DDC: 0x1960 0x1960 0x1961 0x1961 0x1962 0x1962 0x1963 0x1963
[    2.997102] [drm]   Encoders:
[    2.997103] [drm]     DFP4: INTERNAL_UNIPHY
[    2.997104] [drm]     CRT1: INTERNAL_KLDSCP_DAC1
[    3.000330] [drm] Found UVD firmware Version: 64.0 Family ID: 13

The GPU's working fine but I don't see anything related to IOMMU2 after early boot. I did also send in an earlier post what happens when I try to modprobe vfio-pci where no IOMMU group was present.

RSC-Games avatar Sep 06 '25 05:09 RSC-Games

Yes, this is passthrough with qemu. I can share some patches in a bit, but I'm away from home this week and only have limited access to my test bed hanging off my home VPN :-)

The offset is at 64 GB btw, so this ends up being in addition to the 40-42 GB window for iommu5.

Another thing to note is that MSI into the VM appears, at first glance, to be working as well, but it's possible something is failing there too.

(from the VM)

41: 710164 1 0 0 BRCM STB PCIe MSI 1572864 Edge eth1-0
42: 0 0 0 0 BRCM STB PCIe MSI 1572865 Edge eth1-1
43: 0 0 0 0 BRCM STB PCIe MSI 1572866 Edge eth1-2
44: 0 0 0 0 BRCM STB PCIe MSI 1572867 Edge eth1-3
45: 0 0 0 0 BRCM STB PCIe MSI 1572868 Edge eth1-4
46: 0 0 0 0 BRCM STB PCIe MSI 1572869 Edge eth1-5
47: 0 0 0 0 BRCM STB PCIe MSI 1572870 Edge eth1-6
48: 0 0 0 0 BRCM STB PCIe MSI 1572871 Edge eth1-7
49: 0 0 0 0 BRCM STB PCIe MSI 1572872 Edge eth1-8
50: 0 0 0 0 BRCM STB PCIe MSI 1572873 Edge eth1-9
51: 0 0 0 0 BRCM STB PCIe MSI 1572874 Edge eth1-10
52: 0 0 0 0 BRCM STB PCIe MSI 1572875 Edge eth1-11
53: 0 0 0 0 BRCM STB PCIe MSI 1572876 Edge eth1-12
54: 0 0 0 0 BRCM STB PCIe MSI 1572877 Edge eth1-13
55: 0 0 0 0 BRCM STB PCIe MSI 1572878 Edge eth1-14
56: 0 0 0 0 BRCM STB PCIe MSI 1572879 Edge eth1-15
57: 550945 0 0 0 BRCM STB PCIe MSI 1572880 Edge eth1-16
58: 0 0 0 0 BRCM STB PCIe MSI 1572881 Edge eth1-17
59: 0 0 0 0 BRCM STB PCIe MSI 1572882 Edge eth1-18
60: 0 0 0 0 BRCM STB PCIe MSI 1572883 Edge eth1-19
61: 0 0 0 0 BRCM STB PCIe MSI 1572884 Edge eth1-20
62: 1 0 0 0 BRCM STB PCIe MSI 1572885 Edge eth1-21
63: 0 0 0 0 BRCM STB PCIe MSI 1572886 Edge eth1-22
64: 0 0 0 0 BRCM STB PCIe MSI 1572887 Edge eth1-23
65: 0 0 0 0 BRCM STB PCIe MSI 1572888 Edge eth1-24
66: 0 0 0 0 BRCM STB PCIe MSI 1572889 Edge eth1-25
67: 0 0 0 0 BRCM STB PCIe MSI 1572890 Edge eth1-26
68: 0 0 0 0 BRCM STB PCIe MSI 1572891 Edge eth1-27
69: 0 0 0 0 BRCM STB PCIe MSI 1572892 Edge eth1-28
70: 0 0 0 0 BRCM STB PCIe MSI 1572893 Edge eth1-29
71: 0 0 0 0 BRCM STB PCIe MSI 1572894 Edge eth1-30
72: 0 0 0 0 BRCM STB PCIe MSI 1572895 Edge eth1-31

lts-rad avatar Sep 06 '25 05:09 lts-rad

If passthrough is working I'm not surprised that MSI is working too. I don't think the Pi 5 really has issues with MSI or most PCIe stuff except alignment afaik. (Also I edited the above comment).

If you can send that patch set or a repo so I can merge your changes onto my repo, I can clean things up a bit and implement the device tree overlay for adding the PCIe 1 bus on IOMMU2 and kicking the other two devices off of it.

Also can I see your full dmesg output for the host kernel, as well as the output for lspci? I'm interested but also confused about how you're using IOMMU5 on PCIe1...

EDIT: Funny thing- I was gonna start writing the IOMMU drivers tomorrow but since you've got them mostly written already there's no need for me to write my own lol.

RSC-Games avatar Sep 06 '25 05:09 RSC-Games

So what I've been thinking is that we need to work with iommu5 for the pcie lane (and that seems to be confirmed above).

For your DTB you will not need all of these pcie lanes. My hardware has an ASM1182e 2:1 PCIe switch first, and then an r8125 card & MediaTek WiFi at 03:/04.

Virtualization by PCI function will not be possible (as stated here https://github.com/raspberrypi/linux/issues/6834#issuecomment-2862303155). The hardware IOMMU does not support splitting based on the pcie packet which has that information. The error in that thread that they saw (-22) was related to bcm2712_iommu_capable() returning false though so vfio-pci was giving up early.

root@spr:~/dtb# cat mmu5.dts
/dts-v1/;
/plugin/;

&pcie1 {
	#address-cells = <3>;
	#size-cells = <2>;

	/delete-property/ iommus;

	iommu-map = <0x0100 &iommu5 0x0003 0x0001>,    /* 01:00.0 only */
	            <0x0218 &iommu5 0x0003 0x0001>,    /* 02:03.0 only */
	            <0x0238 &iommu5 0x0003 0x0001>,    /* 02:07.0 only */
	            <0x0300 &iommu5 0x0004 0x0001>,    /* 03:00.0 only */
	            <0x0400 &iommu5 0x0004 0x0001>;    /* 04:00.0 only */

	iommu-map-mask = <0xffff>;  /* Exact matching */
};
root@spr:~/dtb# cat build.sh
#!/bin/sh
dtc -I dts -O dtb -o mmu5.dtbo mmu5.dts

I copy this to overlays and then update config.txt to have dtoverlay=mmu5
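Concretely, something like this (paths assume current Raspberry Pi OS, where the firmware partition is mounted at /boot/firmware; older images use /boot):

sudo cp mmu5.dtbo /boot/firmware/overlays/
echo "dtoverlay=mmu5" | sudo tee -a /boot/firmware/config.txt
sudo reboot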

My mediatek driver also expects 32-bit dma support so i have that additional complication (pcie-32bit-dma-pi5) but I am working with the r8125/r8169 drivers first.

lts-rad avatar Sep 06 '25 05:09 lts-rad

I'm currently waiting on my kernel to rebuild (I switched to a 48-bit address space for testing switch emulation). But yeah, I'm aware we can't group by function; I was under the impression the IOMMUs didn't support anything related to requester IDs. If I'm wrong (hopefully I am), that means we could in fact still pass through multiple devices instead of being limited to only the RC.

Also, looking back at your logs, they show 1000005100.iommu, which is most certainly not IOMMU5 (1000005280.iommu).

I was having my own issues with the driver, which I tracked down to this:

[ 2020.119895] VFIO - User Level meta-driver version: 0.3
[ 2020.138208] vfio_pci: unknown parameter 'disable_vga' ignored
[ 2020.139044] bus reset path
[ 2020.139059] made it past all tested cases (issue is somewhere else)
[ 2020.139061] no error testing done on pci_set_power_state
[ 2020.139114] proceeding with iommu support
[ 2020.139116] iommu group present (if zero means no allocated iommu group): 0
[ 2020.139118] device is not in an iommu group
[ 2020.139119] error: inner (failed to set group - inspect later)
[ 2020.139120] failed to register vfio group
[ 2020.161341] vfio-pci 0001:01:00.1: probe with driver vfio-pci failed with error -22
[ 2020.161383] vfio_pci: add [1002:aab0[ffffffff:ffffffff]] class 0x000000/00000000

My error was being caused by this:

/**
 * iommu_group_get - Return the group for a device and increment reference
 * @dev: get the group that this device belongs to
 *
 * This function is called by iommu drivers and users to get the group
 * for the specified device.  If found, the group is returned and the group
 * reference in incremented, else NULL.
 */
struct iommu_group *iommu_group_get(struct device *dev)
{
	struct iommu_group *group = dev->iommu_group;

	printk("iommu group present (if zero means no allocated iommu group): %d", group);

	if (group)
		kobject_get(group->devices_kobj);

	return group;
}

For me, the device was never added to an IOMMU group in the first place. I'll have to dig in further later... Also I guess 6by9 was right- you can use IOMMU5 with PCIe 1, though I was told that was not the case 🤷.

I have a couple of questions though: 1) can I still see your dmesg so I can compare what's going on with mine to yours, and 2) what are address-cells and size-cells in the device tree? I'm not very experienced with DT edits yet.

RSC-Games avatar Sep 06 '25 05:09 RSC-Games

address-cells / size-cells should be there to support 64-bit values for the iommu map

I will have more time mid-next week when I'm home to prepare a patch (I'm on an Ubuntu kernel by the way, but the iommu changes should be easy to drop in). I also get the impression that qemu is somewhat demanding about mapping DMA for guests, so my pagetable code needs to support pretty much the full range for DMA, and I'm pretty sure that's where the bugs breaking TX are.

You should be able to get further with getting vfio mapped with this patch

static bool bcm2712_iommu_capable(struct device *dev, enum iommu_cap cap)
{
	return true;
}

On my testbed I reload with vfio-pci as follows

root@spr:~/qem# cat prep.sh 
#!/bin/bash
modprobe -r r8125 
modprobe -r r8169
modprobe -r mt7915e
modprobe vfio-pci

echo "10ec 8125" > /sys/bus/pci/drivers/vfio-pci/new_id
echo "14c3 7906" > /sys/bus/pci/drivers/vfio-pci/new_id
echo Y > /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts

root@spr:~/qem# cat run.sh 
#!/bin/bash
qemu-system-aarch64   -M virt   -cpu host -enable-kvm   -m 512   -drive if=pflash,format=raw,readonly=on,file=/usr/share/AAVMF/AAVMF_CODE.fd   -drive if=pflash,format=raw,file=./AAVMF_VARS.fd   -drive file=alpine-disk.qcow2,if=virtio -nographic -device vfio-pci,host=0000:03:00.0

lts-rad avatar Sep 06 '25 05:09 lts-rad

It's looking fine on my kernel- let me send a source snippet

/**
 * device_iommu_capable() - check for a general IOMMU capability
 * @dev: device to which the capability would be relevant, if available
 * @cap: IOMMU capability
 *
 * Return: true if an IOMMU is present and supports the given capability
 * for the given device, otherwise false.
 */
bool device_iommu_capable(struct device *dev, enum iommu_cap cap)
{
	const struct iommu_ops *ops;

	if (!dev_has_iommu(dev))
		return false;

	ops = dev_iommu_ops(dev);
	if (!ops->capable)
		return false;

	return ops->capable(dev, cap);
}
EXPORT_SYMBOL_GPL(device_iommu_capable);

Yeah I guess the function is stubbed out on the mainline kernel? NVM wrong function

static bool bcm2712_iommu_capable(struct device *dev, enum iommu_cap cap)
{
	return false;
}

Interesting: my issue isn't from a lack of capabilities. I don't know how it's reporting that it is cache coherent if it returns false for every capability...

EDIT: Oh, that's because it's not even getting that far. I'm not seeing any cap enumeration because it still isn't adding the PCIe device to an iommu group in the first place. How did you get it into an iommu group (and not just the AXI bus master)?

RSC-Games avatar Sep 06 '25 06:09 RSC-Games

That doesn't really mean anything on its own, though, because there are -EINVALs splattered throughout that entire function. I tracked down my error and know what's causing it: it has nothing to do with capabilities, since it's not even enumerating them yet.

Here's the code I'm hitting (corresponds to my dmesg output above).

	ret = dev_set_name(&device->device, "vfio%d", device->index);
	if (ret) {
		printk("failed to set device name");
		return ret;
	}

	ret = vfio_device_set_group(device, type);
	if (ret) {
		printk("error: inner (failed to set group - inspect later)");
		return ret; // hits this case due to no iommu group associated with device
	}

	/*
	 * VFIO always sets IOMMU_CACHE because we offer no way for userspace to
	 * restore cache coherency. It has to be checked here because it is only
	 * valid for cases where we are using iommu groups.
	 */
	if (type == VFIO_IOMMU && !vfio_device_is_noiommu(device) &&
	    !device_iommu_capable(device->dev, IOMMU_CAP_CACHE_COHERENCY)) {
		ret = -EINVAL;
		printk("iommu either doesn't exist or is not cache coherent");
		goto err_out;
	}

Dmesg output is showing exactly where I'm hitting the error

[ 2020.139114] proceeding with iommu support
[ 2020.139116] iommu group present (if zero means no allocated iommu group): 0
[ 2020.139118] device is not in an iommu group
[ 2020.139119] error: inner (failed to set group - inspect later) // <-- line I commented above- no cap bits are enumerated. it just aborts there and stops trying
[ 2020.139120] failed to register vfio group

Kernel version: Linux raspberrypi 6.12.25-v8-AMDGPU-sRGB+ #2 SMP PREEMPT Thu Sep 4 23:46:56 EDT 2025 aarch64 GNU/Linux

RSC-Games avatar Sep 06 '25 06:09 RSC-Games

Ah, so your overlay is not yet configured to put the device into a group then. I would try the mmu5.dts above; it should put the device in a group. You will then encounter the capable issue.

lts-rad avatar Sep 06 '25 06:09 lts-rad

Would do that but my kernel is still compiling- it'll be a bit

Also I found some info on dma-iova-offset

/*
 * XXX When an IOMMU is downstream of a PCIe RC or some other chip/bus
 * and serves some of the masters thereon (others using pass-through),
 * we seem to fumble and lose the "dma-ranges" address offset for
 * masters using IOMMU. This property restores it, where needed.
 */
if (!pdev->dev.of_node ||
    of_property_read_u64(pdev->dev.of_node, "dma-iova-offset",
    &mmu->dma_iova_offset))
        mmu->dma_iova_offset = 0;

Also I'm asking the RPi engineers which capabilities the bcm2712 IOMMU supports, but for now I have a skeleton function (if this is going to be upstreamed I'm going to make sure it's as clean as I can make it)

static bool bcm2712_iommu_capable(struct device *dev, enum iommu_cap cap)
{
	switch (cap) {
		case IOMMU_CAP_CACHE_COHERENCY:
			return true;
		case IOMMU_CAP_NOEXEC:
		case IOMMU_CAP_PRE_BOOT_PROTECTION:
		case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
		case IOMMU_CAP_DEFERRED_FLUSH:
		case IOMMU_CAP_DIRTY_TRACKING:
		default:
			return false;
	}
}

mmu overlay (for code clarity):

/dts-v1/;
/plugin/;

&pcie1 {
	#address-cells = <3>;
	#size-cells = <2>;

	/delete-property/ iommus;

	iommu-map = <0x0100 &iommu5 0x0003 0x0001>,    /* 01:00.0 only */
	            <0x0218 &iommu5 0x0003 0x0001>,    /* 02:03.0 only */
	            <0x0238 &iommu5 0x0003 0x0001>,    /* 02:07.0 only */
	            <0x0300 &iommu5 0x0004 0x0001>,    /* 03:00.0 only */
	            <0x0400 &iommu5 0x0004 0x0001>;    /* 04:00.0 only */

	iommu-map-mask = <0xffff>;  /* Exact matching */
};

Hard-coding this is probably counter-productive, so I'll have to figure out how to add parameters to this overlay so you can specify any bus. Also, how would I determine the bus ID (to convert it to hex)? Is it 0bBBBBBBBB DDDDDFFF? (Yes it is; that was actually quite obvious... lol)
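(A quick sanity check of that encoding against the values in the overlay above: bus in bits 15:8, device in bits 7:3, function in bits 2:0.)

printf '0x%04x\n' $(( (0x01 << 8) | (0x00 << 3) | 0x0 ))   # 01:00.0 -> 0x0100
printf '0x%04x\n' $(( (0x02 << 8) | (0x03 << 3) | 0x0 ))   # 02:03.0 -> 0x0218
printf '0x%04x\n' $(( (0x02 << 8) | (0x07 << 3) | 0x0 ))   # 02:07.0 -> 0x0238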

Edit: I think I see how the dtb entry works! Is the third parameter for the iommu-group id? And what's the fourth one for?

RSC-Games avatar Sep 06 '25 06:09 RSC-Games

Okay I made the edits and am now getting this:

[  167.231279] OF: /axi/pcie@1000110000: no iommu-map translation for id 0x100 on (null)

dt looks like this (made edits in the core device tree):

&pcie1 {
	// TODO: Move into overlay with parameters so end users can configure this.
	brcm,clkreq-mode = "safe";

	/* note: will be moved into a device tree overlay later */
	#address-cells = <3>;
	#size-cells = <2>;

	/delete-property/ iommus;

	/* Enable IOMMU accesses over PCIe 1. Older steppings used IOMMU8 for the PCIe0/1 RC */
	//iommus = <&iommu2>;

	/* Automatically create IOMMU groups for the listed PCIe devices. */
	iommu-map = <0x1000 &iommu5 0x0003 0x0001>,  /* 01:00.0 (root complex) only */
				<0x0108 &iommu5 0x0003 0x0001>;  /* 01:01.0 only (main device connected to the port) */
	iommu-map-mask = <0xffff>;  /* Match the exact address */
};

I'm too tired to keep going tonight. I'll keep looking at this tomorrow. EDIT: oops messed up the bus id.

It's now working perfectly- at least the driver init stage is:

[  100.142736] VFIO - User Level meta-driver version: 0.3
[  100.151618] vfio-pci 0001:01:00.0: bcm2712_iommu_of_xlate: MMU 1000005280.iommu
[  100.151626] vfio-pci 0001:01:00.0: bcm2712_iommu_probe_device: MMU 1000005280.iommu
[  100.151635] vfio-pci 0001:01:00.0: bcm2712_iommu_device_group: MMU 1000005280.iommu
[  100.151640] vfio-pci 0001:01:00.0: Adding to iommu group 2
[  100.151645] vfio-pci 0001:01:00.0: bcm2712_iommu_attach_dev: MMU 1000005280.iommu
[  100.151715] bus reset path
[  100.151723] vfio-pci 0001:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[  100.151727] made it past all tested cases (issue is somewhere else)
[  100.151746] no error testing done on pci_set_power_state
[  100.151817] proceeding with iommu support
[  100.151819] iommu group present (if zero means no allocated iommu group): 47098624
[  100.151821] passed testing
[  100.151929] device added to group properly - error is somewhere else
[  100.151948] vfio_pci: add [1002:6818[ffffffff:ffffffff]] class 0x000000/00000000

RSC-Games avatar Sep 06 '25 07:09 RSC-Games

As I mentioned earlier, it appears that for qemu the iommu implementation will need pretty much a full view and requires a sparsely populated pagetable.

What would be helpful from Broadcom/Raspberry Pi would be an IOMMU programmer's manual to help implement support for arbitrary remapping addresses beyond the 40-42 GB aperture. It would also be good to have some clarity on dma_iova_offset. Is this a hardware requirement? If the offset is required by hardware, is it a bitwise OR on addresses or an addition, and what is the design underlying it?

lts-rad avatar Sep 06 '25 08:09 lts-rad