Test GPU (AMD Radeon RX 6700 XT)

Open geerlingguy opened this issue 4 years ago • 226 comments

Working branches:

  • Coreforge rpi-6.6.y-gpu: https://github.com/Coreforge/linux/tree/rpi-6.6.y-gpu
  • Coreforge rpi-6.12.y-gpu: https://github.com/Coreforge/linux/tree/rpi-6.12.y-gpu

Just received an OEM AMD Radeon RX 6700 XT in the mail. I was able to get it at MSRP+Shipping, which is something of a miracle these days:

DSC02333

DSC02363

I will be interested in seeing what, if anything, the card does when powered up and plugged into the Compute Module 4 IO Board!

The following issues are closely related:

Current steps to get this card working with Pi OS Bookworm

Last updated: 2025-01-03

  1. Clone the Raspberry Pi Linux kernel source and patch the default Raspberry Pi 6.6.y kernel tree with Coreforge's GPU-enablement patch (or just check out Coreforge's branch directly).
  2. Before compiling the kernel, run make menuconfig and select the options:
       1. Kernel Features > Page Size > 4 KB (for Box86 compatibility)
       2. Kernel Features > Kernel support for 32-bit EL0 > Fix up misaligned multi-word loads and stores in user space
       3. Kernel Features > Fix up misaligned loads and stores from userspace for 64bit code
       4. Device Drivers > Graphics support > AMD GPU (optionally SI/CIK support too)
       5. Device Drivers > Graphics support > Direct Rendering Manager (XFree86 4.1.0 and higher DRI support) > Force Architecture can write-combine memory
  3. Recompile the kernel following Raspberry Pi's instructions
  4. Install the AMD firmware: sudo apt install -y firmware-amd-graphics
  5. Reboot the Pi with the card attached using an appropriate PCIe riser and external ATX power supply.
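
Before kicking off a long compile, it can help to double-check that the menuconfig choices above actually landed in the kernel's .config. Here is a rough Python sketch of that check. The symbol names are my assumption of what the menu entries map to (CONFIG_ARM64_4K_PAGES, CONFIG_COMPAT_ALIGNMENT_FIXUPS, CONFIG_DRM_AMDGPU); the two options that come from Coreforge's patch are left out because their symbol names are not listed here. The helper name check_kernel_config is made up for illustration.

```python
import re

# Assumed mapping from the menuconfig entries to .config symbols.
# Options added by Coreforge's patch are intentionally omitted because
# their symbol names are not given in the steps above.
REQUIRED = {
    "CONFIG_ARM64_4K_PAGES": {"y"},            # Page Size > 4 KB
    "CONFIG_COMPAT_ALIGNMENT_FIXUPS": {"y"},   # 32-bit EL0 misaligned fixups
    "CONFIG_DRM_AMDGPU": {"y", "m"},           # AMD GPU (built-in or module)
}


def check_kernel_config(config_text):
    """Return {symbol: current value or None} for symbols not set as required."""
    # Lines such as "# CONFIG_FOO is not set" do not match and count as unset.
    found = dict(re.findall(r"^(CONFIG_\w+)=(\S+)$", config_text, flags=re.M))
    return {
        symbol: found.get(symbol)
        for symbol, allowed in REQUIRED.items()
        if found.get(symbol) not in allowed
    }
```

From the top of the kernel tree, something like check_kernel_config(open(".config").read()) should return an empty dict when all three symbols are set.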

Confirm everything is working by plugging a monitor into the graphics card; then confirm the card's GPU is in use by running glxinfo -B (part of the mesa-utils package), for example:

$ sudo apt install -y mesa-utils
$ DISPLAY=:0 glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 6700 XT (navi22, LLVM 15.0.6, DRM 3.54, 6.6.51-v8-16k+) (0x73df)
    Version: 23.2.1
    Accelerated: yes
    Video memory: 12288MB
...

(Prepend DISPLAY=:0 if running commands over SSH.)
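
If you want to script that check instead of eyeballing it, here is a minimal Python sketch that parses glxinfo -B output like the sample above. The function names parse_glxinfo and amd_gpu_in_use are made up for illustration, and the three fields it tests (direct rendering, Vendor, Accelerated) are simply the ones visible in the sample, not an official pass/fail definition.

```python
import re


def parse_glxinfo(text):
    """Collect 'Key: value' pairs from `glxinfo -B` output into a dict."""
    info = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(.*)$", line)
        if match:
            # Keys are lower-cased so "Vendor" and "vendor" compare equal.
            info[match.group(1).strip().lower()] = match.group(2).strip()
    return info


def amd_gpu_in_use(text):
    """True if the output reports direct, accelerated rendering on an AMD device."""
    info = parse_glxinfo(text)
    return (
        info.get("direct rendering") == "Yes"
        and info.get("vendor", "").startswith("AMD")
        and info.get("accelerated") == "yes"
    )
```

One way to feed it is subprocess.run(["glxinfo", "-B"], capture_output=True, text=True).stdout, with DISPLAY set in the environment when running over SSH.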

geerlingguy avatar Sep 07 '21 20:09 geerlingguy

A few notes on drivers from the Twitterverse:

@linux4kix mentioned:

@geerlingguy You will need to use a pre 5.10 kernel for basic Navi on Aarch64. A driver rework needs to be done to fix amdgpu dcn support which was reverted for 5.10. https://lists.freedesktop.org/archives/dri-devel/2021-January/292867.html

@ric96 said:

@geerlingguy Don't forget to use upstream linux-firmware for the correct blob

So yeah... this one could be interesting, and I think my first attempts will be a bit faltering. We'll see.

geerlingguy avatar Sep 08 '21 14:09 geerlingguy

pi@cm4:~ $ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 20)
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1478 (rev c1)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73df (rev c1)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device ab28

pi@cm4:~ $ sudo lspci -vvvv
...
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1478 (rev c1) (prog-if 00 [Normal decode])
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 255
	Region 0: Memory at 618200000 (32-bit, non-prefetchable) [disabled] [size=16K]
	Bus: primary=01, secondary=02, subordinate=03, sec-latency=0
	I/O behind bridge: 0000f000-00000fff
	Memory behind bridge: d8000000-d81fffff
	Prefetchable memory behind bridge: 00000000c0000000-00000000d7ffffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Upstream Port, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed unknown, Width x16, ASPM L1, Exit Latency L0s unlimited, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
		LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [270 v1] #19
	Capabilities: [320 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [400 v1] #25
	Capabilities: [410 v1] #26
	Capabilities: [440 v1] #27

02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479 (prog-if 00 [Normal decode])
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 255
	Bus: primary=02, secondary=03, subordinate=03, sec-latency=0
	I/O behind bridge: 0000f000-00000fff
	Memory behind bridge: d8000000-d81fffff
	Prefetchable memory behind bridge: 00000000c0000000-00000000d7ffffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Downstream Port (Slot-), MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0
			ExtTag+ RBE+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM Disabled; Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported ARIFwd-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
		LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 1479
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [270 v1] #19
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [400 v1] #25
	Capabilities: [410 v1] #26
	Capabilities: [440 v1] #27

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 73df (rev c1) (prog-if 00 [VGA controller])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0e36
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 255
	Region 0: Memory at 600000000 (64-bit, prefetchable) [disabled] [size=256M]
	Region 2: Memory at 610000000 (64-bit, prefetchable) [disabled] [size=2M]
	Region 4: I/O ports at <unassigned> [disabled]
	Region 5: Memory at 618000000 (32-bit, non-prefetchable) [disabled] [size=1M]
	[virtual] Expansion ROM at 618100000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
		LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [200 v1] #15
	Capabilities: [240 v1] Power Budgeting <?>
	Capabilities: [270 v1] #19
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [320 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [410 v1] #26
	Capabilities: [440 v1] #27

03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device ab28
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device ab28
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin B routed to IRQ 255
	Region 0: Memory at 618120000 (32-bit, non-prefetchable) [disabled] [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
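
The interesting bit buried in all that output is the LnkSta line of each device: the card's upstream port (01:00.0) negotiated 5GT/s x1 with the Pi, while the links inside the card report x16. A rough Python sketch for pulling just those values out of lspci -vv text follows. The helper name link_status is hypothetical, and the regexes are written against the output format shown above, which may differ between pciutils versions.

```python
import re


def link_status(lspci_text):
    """Map each PCI address to its negotiated (speed, width) from LnkSta lines."""
    result = {}
    device = None
    for line in lspci_text.splitlines():
        # Device headers start in column 0, e.g. "03:00.0 VGA compatible ...".
        header = re.match(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ", line)
        if header:
            device = header.group(1)
            continue
        # "LnkSta2:" lines are skipped because the pattern requires "LnkSta:".
        status = re.search(r"LnkSta:\s+Speed ([^,]+), Width (x\d+)", line)
        if status and device:
            result[device] = (status.group(1), status.group(2))
    return result
```

On the dump above this would give ("5GT/s", "x1") for 01:00.0 and ("unknown", "x16") for the other three functions. The "unknown" speed is most likely this lspci build not recognizing the newer link speed code, though that is a guess on my part.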

geerlingguy avatar Sep 08 '21 15:09 geerlingguy

pi@cm4:~ $ dmesg | grep pci
[    1.261278] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.261305] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.261373] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x063fffffff -> 0x00c0000000
[    1.261447] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0400000000
[    1.308507] brcm-pcie fd500000.pcie: link up, 5.0 GT/s PCIe x1 (SSC)
[    1.308896] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    1.308914] pci_bus 0000:00: root bus resource [bus 00-ff]
[    1.308940] pci_bus 0000:00: root bus resource [mem 0x600000000-0x63fffffff] (bus address [0xc0000000-0xffffffff])
[    1.309028] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    1.309262] pci 0000:00:00.0: PME# supported from D0 D3hot
[    1.313103] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    1.313417] pci 0000:01:00.0: [1002:1478] type 01 class 0x060400
[    1.313474] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x00003fff]
[    1.313873] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[    1.313969] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0000:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    1.317679] pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    1.318042] pci 0000:02:00.0: [1002:1479] type 01 class 0x060400
[    1.318515] pci 0000:02:00.0: PME# supported from D0 D3hot D3cold
[    1.322211] pci 0000:02:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    1.322530] pci 0000:03:00.0: [1002:73df] type 00 class 0x030000
[    1.322595] pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    1.322637] pci 0000:03:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    1.322667] pci 0000:03:00.0: reg 0x20: [io  0x0000-0x00ff]
[    1.322695] pci 0000:03:00.0: reg 0x24: [mem 0x00000000-0x000fffff]
[    1.322724] pci 0000:03:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    1.323058] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[    1.323147] pci 0000:03:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0000:00:00.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    1.323306] pci 0000:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    1.323421] pci 0000:03:00.1: [1002:ab28] type 00 class 0x040300
[    1.323470] pci 0000:03:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
[    1.323795] pci 0000:03:00.1: PME# supported from D1 D2 D3hot D3cold
[    1.327530] pci_bus 0000:03: busn_res: [bus 03-ff] end is updated to 03
[    1.327555] pci_bus 0000:02: busn_res: [bus 02-ff] end is updated to 03
[    1.327576] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 03
[    1.327628] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    1.327644] pci 0000:00:00.0: BAR 8: assigned [mem 0x618000000-0x6182fffff]
[    1.327665] pci 0000:01:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    1.327680] pci 0000:01:00.0: BAR 8: assigned [mem 0x618000000-0x6181fffff]
[    1.327696] pci 0000:01:00.0: BAR 0: assigned [mem 0x618200000-0x618203fff]
[    1.327716] pci 0000:01:00.0: BAR 7: no space for [io  size 0x1000]
[    1.327729] pci 0000:01:00.0: BAR 7: failed to assign [io  size 0x1000]
[    1.327747] pci 0000:02:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    1.327761] pci 0000:02:00.0: BAR 8: assigned [mem 0x618000000-0x6181fffff]
[    1.327774] pci 0000:02:00.0: BAR 7: no space for [io  size 0x1000]
[    1.327786] pci 0000:02:00.0: BAR 7: failed to assign [io  size 0x1000]
[    1.327805] pci 0000:03:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    1.327844] pci 0000:03:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    1.327880] pci 0000:03:00.0: BAR 5: assigned [mem 0x618000000-0x6180fffff]
[    1.327902] pci 0000:03:00.0: BAR 6: assigned [mem 0x618100000-0x61811ffff pref]
[    1.327917] pci 0000:03:00.1: BAR 0: assigned [mem 0x618120000-0x618123fff]
[    1.327936] pci 0000:03:00.0: BAR 4: no space for [io  size 0x0100]
[    1.327949] pci 0000:03:00.0: BAR 4: failed to assign [io  size 0x0100]
[    1.327964] pci 0000:02:00.0: PCI bridge to [bus 03]
[    1.327987] pci 0000:02:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[    1.328007] pci 0000:02:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    1.328032] pci 0000:01:00.0: PCI bridge to [bus 02-03]
[    1.328053] pci 0000:01:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[    1.328072] pci 0000:01:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    1.328096] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[    1.328115] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[    1.328131] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    1.328349] pci 0000:03:00.1: D0 power state depends on 0000:03:00.0
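
The "4.000 Gb/s available ... capable of 252.048 Gb/s" figures in that log can be reproduced with a little arithmetic: per-lane line rate, minus encoding overhead, times lane count. A small Python sketch is below. The integer division mirrors how I understand the kernel computes it, which is an assumption on my part rather than something shown in the log, and the table and function names are made up for illustration.

```python
# Usable Mb/s per lane after line encoding, keyed by the GT/s rate as it
# appears in dmesg. Integer division is used on purpose (assumed to match
# the kernel's own rounding).
ENCODED_MBPS_PER_LANE = {
    "2.5": 2500 * 8 // 10,       # 8b/10b encoding
    "5.0": 5000 * 8 // 10,       # 8b/10b encoding
    "8.0": 8000 * 128 // 130,    # 128b/130b encoding
    "16.0": 16000 * 128 // 130,  # 128b/130b encoding
}


def pcie_bandwidth_gbps(speed_gts, lanes):
    """Return usable link bandwidth in Gb/s for a speed string and lane count."""
    return ENCODED_MBPS_PER_LANE[speed_gts] * lanes / 1000
```

pcie_bandwidth_gbps("5.0", 1) gives the Pi's 4.0 Gb/s, and pcie_bandwidth_gbps("16.0", 16) gives the 252.048 Gb/s the card could use in a full x16 slot, so the GPU is running on roughly 1/63 of its designed link bandwidth.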

geerlingguy avatar Sep 08 '21 15:09 geerlingguy

While compiling on kernel version 5.10 from the raspberrypi/linux tree, I noticed an error:

  AR      drivers/ptp/built-in.a
  CC [M]  drivers/i2c/busses/i2c-brcmstb.o
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c: In function 'amdgpu_dm_atomic_commit_tail':
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7757:4: error: implicit declaration of function 'is_hdr_metadata_different'; did you mean 'is_scaling_state_different'? [-Werror=implicit-function-declaration]
    is_hdr_metadata_different(old_con_state, new_con_state);
    ^~~~~~~~~~~~~~~~~~~~~~~~~
    is_scaling_state_different
  CC [M]  drivers/media/i2c/cx25840/cx25840-firmware.o
  CC [M]  drivers/media/i2c/cx25840/cx25840-vbi.o
  AR      drivers/i2c/muxes/built-in.a
...
  LD [M]  drivers/media/dvb-frontends/drxd.o
  LD [M]  drivers/media/dvb-frontends/stv0900.o
  LD [M]  drivers/media/dvb-frontends/cxd2820r.o
  LD [M]  drivers/media/dvb-frontends/drxk.o
make: *** [Makefile:1825: drivers] Error 2

geerlingguy avatar Sep 08 '21 16:09 geerlingguy

Looks like it was missed in https://github.com/raspberrypi/linux/commit/6bd46342fadfdfb0a40d674f9161104f2e691873, which removed is_hdr_metadata_different in favor of the generic helper function drm_connector_atomic_hdr_metadata_equal.

6by9 avatar Sep 08 '21 17:09 6by9

2nd Attempt:

  1. Recompiled kernel on rpi-5.14.y branch with AMDGPU selected. Seemed to work.
  2. Copied over to Pi.
  3. Installed the AMD firmware: sudo apt install -y firmware-amd-graphics
  4. Blacklisted amdgpu via /etc/modprobe.d/blacklist-amdgpu.conf, so the driver can be loaded by hand with modprobe after boot

Rebooting...

geerlingguy avatar Sep 10 '21 15:09 geerlingguy

Without the card plugged in, a sudo modprobe amdgpu gets me:

[  431.751110] [drm] amdgpu kernel modesetting enabled.

Now trying with the card plugged in...

geerlingguy avatar Sep 10 '21 15:09 geerlingguy

Good news! The Pi doesn't completely lock up and halt now... it errors out then goes back to letting me debug. Makes test cycles oh-so-much-simpler:

In one terminal:

pi@cm4:~ $ sudo modprobe amdgpu

And in the other:

pi@cm4:~ $ dmesg --follow
...
[   83.281692] [drm] amdgpu kernel modesetting enabled.
[   83.282319] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   83.282361] pci 0000:01:00.0: enabling device (0000 -> 0002)
[   83.282398] pci 0000:02:00.0: enabling device (0000 -> 0002)
[   83.282430] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[   83.282453] [drm] initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1002:0x0E36 0xC1).
[   83.282474] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   83.282543] [drm] register mmio base: 0x18000000
[   83.282554] [drm] register mmio size: 1048576
[   83.282578] [drm] PCIE atomic ops is not supported
[   83.284144] [drm] add ip block number 0 <nv_common>
[   83.284150] [drm] add ip block number 1 <gmc_v10_0>
[   83.284373] [drm] add ip block number 2 <navi10_ih>
[   83.284395] [drm] add ip block number 3 <psp>
[   83.284401] [drm] add ip block number 4 <smu>
[   83.284419] [drm] add ip block number 5 <gfx_v10_0>
[   83.284425] [drm] add ip block number 6 <sdma_v5_2>
[   83.284431] [drm] add ip block number 7 <vcn_v3_0>
[   83.284435] [drm] add ip block number 8 <jpeg_v3_0>
[   83.319061] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM
[   83.319078] amdgpu: ATOM BIOS: 113-D5121100-101
[   83.319115] [drm] VCN(0) decode is enabled in VM mode
[   83.319121] [drm] VCN(0) encode is enabled in VM mode
[   83.319127] [drm] JPEG decode is enabled in VM mode
[   83.319148] [drm] GPU posting now...
[   83.319230] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   83.319265] amdgpu 0000:03:00.0: BAR 2: releasing [mem 0x610000000-0x6101fffff 64bit pref]
[   83.319275] amdgpu 0000:03:00.0: BAR 0: releasing [mem 0x600000000-0x60fffffff 64bit pref]
[   83.319324] pci 0000:02:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319332] pci 0000:01:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319343] pci 0000:00:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319362] pci 0000:00:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   83.319369] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   83.319378] pci 0000:01:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   83.319383] pci 0000:01:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   83.319391] pci 0000:02:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   83.319397] pci 0000:02:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   83.319406] amdgpu 0000:03:00.0: BAR 0: no space for [mem size 0x400000000 64bit pref]
[   83.319411] amdgpu 0000:03:00.0: BAR 0: failed to assign [mem size 0x400000000 64bit pref]
[   83.319419] amdgpu 0000:03:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[   83.319424] amdgpu 0000:03:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[   83.319431] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[   83.319442] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[   83.319456] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[   83.319465] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[   83.319473] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319483] pci 0000:01:00.0: PCI bridge to [bus 02-03]
[   83.319494] pci 0000:01:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[   83.319504] pci 0000:01:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319517] pci 0000:02:00.0: PCI bridge to [bus 03]
[   83.319529] pci 0000:02:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[   83.319538] pci 0000:02:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   83.319566] [drm] Not enough PCI address space for a large BAR.
[   83.319573] amdgpu 0000:03:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[   83.319595] amdgpu 0000:03:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[   83.319625] amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[   83.319633] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   83.319641] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[   83.319649] [drm] Detected VRAM RAM=12272M, BAR=256M
[   83.319654] [drm] RAM width 192bits GDDR6
[   83.319767] [drm] amdgpu: 12272M of VRAM memory ready
[   83.319775] [drm] amdgpu: 2845M of GTT memory ready.
[   83.319794] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   83.319943] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   83.322016] amdgpu 0000:03:00.0: Direct firmware load for amdgpu/navy_flounder_sos.bin failed with error -2
[   83.322037] amdgpu 0000:03:00.0: amdgpu: failed to init sos firmware
[   83.322044] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
[   83.322472] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <psp> failed -2
[   83.322795] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[   83.322802] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[   83.322808] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[   83.323187] amdgpu: probe of 0000:03:00.0 failed with error -2
[   83.323329] [drm] amdgpu: ttm finalized
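
Two separate things are going on in that log, and a few lines of Python make them easier to see. First, the driver tries to resize BAR 0 to 0x400000000 bytes, but the Pi's whole outbound PCIe window (0x600000000-0x63fffffff in the boot log) is far smaller, so it falls back to the original 256M BAR. That part is not fatal. Second, the firmware load fails with error -2, which is what actually aborts the probe. The sketch below is just illustrative arithmetic and log parsing; window_size and missing_firmware are made-up helper names.

```python
import errno
import re

GIB = 1 << 30


def window_size(start, end):
    """Size in bytes of an inclusive address range."""
    return end - start + 1


def missing_firmware(dmesg_text):
    """Return (file, errno name) for each failed direct firmware load."""
    pattern = r"Direct firmware load for (\S+) failed with error -(\d+)"
    return [
        (name, errno.errorcode.get(int(code), "unknown"))
        for name, code in re.findall(pattern, dmesg_text)
    ]
```

window_size(0x600000000, 0x63fffffff) comes out to 1 GiB, while the requested BAR of 0x400000000 bytes is 16 GiB, hence "Not enough PCI address space for a large BAR". Running missing_firmware over the log maps error -2 to ENOENT, meaning the file amdgpu/navy_flounder_sos.bin simply is not present under the firmware directory.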

geerlingguy avatar Sep 10 '21 15:09 geerlingguy

Hmm... firmware-amd-graphics might not include firmware for the RX 6700 XT (see https://github.com/NixOS/nixpkgs/issues/122776), since the card is probably newer than the linux-firmware snapshot that package is built from :(

See more: Radeon RX 6700 XT "Navy Flounder" Microcode Lands In Linux-Firmware.Git, and the commit where firmware was added. (Good ol' Phoronix)

geerlingguy avatar Sep 10 '21 15:09 geerlingguy

First time doing this (grabbing newer firmware from the linux-firmware repo):

  1. git clone git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
  2. sudo cp linux-firmware/amdgpu/navy_flounder* /lib/firmware/amdgpu
  3. sudo reboot
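
To confirm the copy worked before rebooting, a quick check of the firmware directory is enough. Here is a minimal Python sketch; the function names are made up for illustration, and the only file name it looks for specifically is navy_flounder_sos.bin, the one the driver complained about above. If your distribution ships compressed firmware (for example .bin.xz), the exact-name check would need adjusting.

```python
from pathlib import Path


def firmware_files(fw_dir="/lib/firmware/amdgpu", prefix="navy_flounder"):
    """Return the sorted names of firmware files starting with the given prefix."""
    return sorted(p.name for p in Path(fw_dir).glob(f"{prefix}*"))


def has_sos_firmware(fw_dir="/lib/firmware/amdgpu", prefix="navy_flounder"):
    """True if the <prefix>_sos.bin blob requested by the driver is present."""
    return f"{prefix}_sos.bin" in firmware_files(fw_dir, prefix)
```

Calling firmware_files() lists whatever navy_flounder blobs are installed, and has_sos_firmware() should return True once step 2 has run.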

And now trying again...

geerlingguy avatar Sep 10 '21 15:09 geerlingguy

Okay, the earlier firmware error gave me false hope. We're still crashing and burning:

[   85.221462] [drm] amdgpu kernel modesetting enabled.
[   85.221843] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   85.221866] pci 0000:01:00.0: enabling device (0000 -> 0002)
[   85.221886] pci 0000:02:00.0: enabling device (0000 -> 0002)
[   85.221904] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[   85.221916] [drm] initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1002:0x0E36 0xC1).
[   85.221929] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   85.221965] [drm] register mmio base: 0x18000000
[   85.221970] [drm] register mmio size: 1048576
[   85.221984] [drm] PCIE atomic ops is not supported
[   85.223501] [drm] add ip block number 0 <nv_common>
[   85.223508] [drm] add ip block number 1 <gmc_v10_0>
[   85.223513] [drm] add ip block number 2 <navi10_ih>
[   85.223518] [drm] add ip block number 3 <psp>
[   85.223524] [drm] add ip block number 4 <smu>
[   85.223530] [drm] add ip block number 5 <gfx_v10_0>
[   85.223535] [drm] add ip block number 6 <sdma_v5_2>
[   85.223540] [drm] add ip block number 7 <vcn_v3_0>
[   85.223545] [drm] add ip block number 8 <jpeg_v3_0>
[   85.258238] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM
[   85.258256] amdgpu: ATOM BIOS: 113-D5121100-101
[   85.258293] [drm] VCN(0) decode is enabled in VM mode
[   85.258298] [drm] VCN(0) encode is enabled in VM mode
[   85.258304] [drm] JPEG decode is enabled in VM mode
[   85.258324] [drm] GPU posting now...
[   85.258413] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   85.258451] amdgpu 0000:03:00.0: BAR 2: releasing [mem 0x610000000-0x6101fffff 64bit pref]
[   85.258461] amdgpu 0000:03:00.0: BAR 0: releasing [mem 0x600000000-0x60fffffff 64bit pref]
[   85.258510] pci 0000:02:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258517] pci 0000:01:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258524] pci 0000:00:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258545] pci 0000:00:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   85.258551] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   85.258560] pci 0000:01:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   85.258566] pci 0000:01:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   85.258574] pci 0000:02:00.0: BAR 9: no space for [mem size 0x600000000 64bit pref]
[   85.258580] pci 0000:02:00.0: BAR 9: failed to assign [mem size 0x600000000 64bit pref]
[   85.258588] amdgpu 0000:03:00.0: BAR 0: no space for [mem size 0x400000000 64bit pref]
[   85.258594] amdgpu 0000:03:00.0: BAR 0: failed to assign [mem size 0x400000000 64bit pref]
[   85.258601] amdgpu 0000:03:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[   85.258607] amdgpu 0000:03:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[   85.258614] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[   85.258624] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[   85.258638] pci 0000:00:00.0: PCI bridge to [bus 01-03]
[   85.258647] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6182fffff]
[   85.258655] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258665] pci 0000:01:00.0: PCI bridge to [bus 02-03]
[   85.258676] pci 0000:01:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[   85.258686] pci 0000:01:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258699] pci 0000:02:00.0: PCI bridge to [bus 03]
[   85.258710] pci 0000:02:00.0:   bridge window [mem 0x618000000-0x6181fffff]
[   85.258720] pci 0000:02:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[   85.258747] [drm] Not enough PCI address space for a large BAR.
[   85.258754] amdgpu 0000:03:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[   85.258775] amdgpu 0000:03:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[   85.258804] amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[   85.258813] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   85.258820] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[   85.258828] [drm] Detected VRAM RAM=12272M, BAR=256M
[   85.258834] [drm] RAM width 192bits GDDR6
[   85.258945] [drm] amdgpu: 12272M of VRAM memory ready
[   85.258953] [drm] amdgpu: 2845M of GTT memory ready.
[   85.258971] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   85.259113] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).

geerlingguy avatar Sep 10 '21 15:09 geerlingguy

It does seem like it's running out of address space for a large BAR:

[   85.258747] [drm] Not enough PCI address space for a large BAR.
[   85.258828] [drm] Detected VRAM RAM=12272M, BAR=256M

But that doesn't seem to be the issue here.
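
For reference, the sizes behind that message can be worked out from the addresses in the logs. This is only a rough sketch of the arithmetic: every constant is copied from a kernel log line in this thread (the outbound window comes from the `brcm-pcie` boot messages), while the helper names `range_size`, `large_bar_fits`, and `print_bar_summary` are made up for illustration. The 16 GiB request is presumably the 12 GiB of VRAM rounded up to a power of two, but that is an assumption.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MIB (1024ULL * 1024ULL)

/* Size in bytes of an inclusive [start, end] address range. */
static uint64_t range_size(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return end - start + 1;
}

/* All constants below are copied from kernel log lines in this thread. */
static const uint64_t outbound_start = 0x0600000000ULL; /* brcm-pcie MEM window */
static const uint64_t outbound_end   = 0x063fffffffULL;
static const uint64_t pref_start     = 0x600000000ULL;  /* bridge window, 64bit pref */
static const uint64_t pref_end       = 0x617ffffffULL;
static const uint64_t bar0_start     = 0x600000000ULL;  /* BAR 0 as finally assigned */
static const uint64_t bar0_end       = 0x60fffffffULL;
static const uint64_t bar0_requested = 0x400000000ULL;  /* "BAR 0: no space for [mem size ...]" */

/* Returns 1 if the requested large BAR would fit in the outbound window. */
static int large_bar_fits(void)
{
    return bar0_requested <= range_size(outbound_start, outbound_end);
}

/* Print each region in MiB so the mismatch is easy to see. */
static void print_bar_summary(void)
{
    printf("outbound window : %" PRIu64 " MiB\n",
           range_size(outbound_start, outbound_end) / MIB);
    printf("pref bridge win : %" PRIu64 " MiB\n",
           range_size(pref_start, pref_end) / MIB);
    printf("BAR 0 requested : %" PRIu64 " MiB\n", bar0_requested / MIB);
    printf("BAR 0 assigned  : %" PRIu64 " MiB\n",
           range_size(bar0_start, bar0_end) / MIB);
    printf("large BAR fits  : %s\n", large_bar_fits() ? "yes" : "no");
}
```

Calling `print_bar_summary()` shows a 1024 MiB outbound window and a 384 MiB prefetchable bridge window against a 16384 MiB request, so the driver falls back to the 256 MiB BAR reported as `BAR=256M`.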

geerlingguy avatar Sep 10 '21 15:09 geerlingguy

Added a few debug lines, and things were a little different!

[  115.560635] [drm] amdgpu: 12272M of VRAM memory ready
[  115.560677] [drm] amdgpu: 2845M of GTT memory ready.
[  115.560718] [drm] GART: num cpu pages 131072, num gpu pages 131072
[  115.560755] DEBUG: Passed gmc_v10_0_hw_init 1069 
[  115.560973] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[  115.560984] DEBUG: Passed gmc_v10_0_hw_init 1078 
[  115.587372] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[  116.615220] ------------[ cut here ]------------
[  116.615231] Firmware transaction timeout
[  116.615282] WARNING: CPU: 3 PID: 37 at drivers/firmware/raspberrypi.c:67 rpi_firmware_transaction+0xdc/0x108
[  116.615301] Modules linked in: amdgpu(+) drm_ttm_helper ttm i2c_algo_bit rfcomm bnep hci_uart btbcm bluetooth ecdh_generic ecc fuse 8021q garp stp llc snd_soc_hdmi_codec brcmfmac brcmutil v3d vc4 cec cfg80211 bcm2835_codec(C) drm_kms_helper gpu_sched rfkill snd_soc_core drm raspberrypi_hwmon v4l2_mem2mem snd_compress snd_bcm2835(C) bcm2835_v4l2(C) drm_panel_orientation_quirks bcm2835_isp(C) videobuf2_vmalloc snd_pcm_dmaengine bcm2835_mmal_vchiq(C) videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common i2c_brcmstb snd_pcm videodev snd_timer dwc2 mc vc_sm_cma(C) snd syscopyarea sysfillrect sysimgblt roles fb_sys_fops backlight rpivid_mem uio_pdrv_genirq uio nvmem_rmem i2c_dev aes_neon_bs sha256_generic aes_neon_blk crypto_simd cryptd ip_tables x_tables ipv6
[  116.615461] CPU: 3 PID: 37 Comm: kworker/3:1 Tainted: G         C        5.14.2-v8+ #1
[  116.615467] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[  116.615472] Workqueue: events dbs_work_handler
[  116.615485] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[  116.615490] pc : rpi_firmware_transaction+0xdc/0x108
[  116.615495] lr : rpi_firmware_transaction+0xdc/0x108
[  116.615499] sp : ffffffc0117639c0
[  116.615502] x29: ffffffc0117639c0 x28: ffffffc011763d20 x27: 0000000000000000
[  116.615512] x26: ffffff8042fddd00 x25: ffffff80409cdd00 x24: ffffffc011a7e008

Not sure what "PSP runtime database doesn't exist" means, but the "Firmware transaction timeout" seems related to the Pi's own firmware?

geerlingguy avatar Sep 10 '21 17:09 geerlingguy

Tried: sudo SKIP_KERNEL=1 rpi-update, then rebooted. Now it's just hanging at:

[  115.560984] DEBUG: Passed gmc_v10_0_hw_init 1078 

And the green ACT light on the IO board just stays lit.

geerlingguy avatar Sep 10 '21 17:09 geerlingguy

Trying a few more times, with various debug statements. I can definitely get to gmc_v10_0_hw_init but I'm trying to dig around and see where the code is calling that through the amd_ip_funcs struct.

Anyways, sometimes I get back to:

[   96.885394] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist

geerlingguy avatar Sep 10 '21 18:09 geerlingguy

Another run with some more debugging:

[   59.061056] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   59.061084] DEBUG: Passed gmc_v10_0_hw_init 1075 
[   59.061216] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   59.061222] DEBUG: Passed gmc_v10_0_hw_init 1084 
[   59.061784] DEBUG: Passed psp_sw_init 250 
[   59.083186] DEBUG: Passed psp_sw_init 266 
[   59.083216] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   59.083223] DEBUG: Passed psp_sw_init 289 
[   61.088295] ------------[ cut here ]------------
[   61.088317] Firmware transaction timeout
[   61.088366] WARNING: CPU: 3 PID: 98 at drivers/firmware/raspberrypi.c:67 rpi_firmware_transaction+0xdc/0x108
[   61.088392] Modules linked in: amdgpu(+) drm_ttm_helper ttm i2c_algo_bit rfcomm bnep hci_uart btbcm bluetooth ecdh_generic ecc fuse 8021q garp stp llc snd_soc_hdmi_codec brcmfmac vc4 brcmutil cec v3d drm_kms_helper gpu_sched drm cfg80211 rfkill drm_panel_orientation_quirks bcm2835_codec(C) bcm2835_v4l2(C) bcm2835_isp(C) bcm2835_mmal_vchiq(C) v4l2_mem2mem videobuf2_vmalloc videobuf2_dma_contig raspberrypi_hwmon videobuf2_memops videobuf2_v4l2 snd_soc_core i2c_brcmstb videobuf2_common dwc2 roles videodev snd_compress snd_bcm2835(C) mc snd_pcm_dmaengine vc_sm_cma(C) snd_pcm snd_timer snd syscopyarea sysfillrect sysimgblt fb_sys_fops rpivid_mem backlight uio_pdrv_genirq uio nvmem_rmem i2c_dev aes_neon_bs sha256_generic aes_neon_blk crypto_simd cryptd ip_tables x_tables ipv6
[   61.088679] CPU: 3 PID: 98 Comm: kworker/3:2 Tainted: G         C        5.14.2-v8+ #1
[   61.088690] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[   61.088698] Workqueue: events dbs_work_handler
[   61.088718] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[   61.088727] pc : rpi_firmware_transaction+0xdc/0x108
[   61.088736] lr : rpi_firmware_transaction+0xdc/0x108
[   61.088744] sp : ffffffc011be39c0
[   61.088749] x29: ffffffc011be39c0 x28: ffffffc011be3d20 x27: 0000000000000000
[   61.088768] x26: ffffff8058594d80 x25: ffffff80409cdd00 x24: ffffffc011a7d008
[   61.088785] x23: 0000000000001000 x22: ffffff80409cdd00 x21: 00000000ffffff92
[   61.088802] x20: ffffffc01146f520 x19: ffffffc0112f8948 x18: 0000000000000000
[   61.088818] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   61.088833] x14: 0000000000000000 x13: 74756f656d697420 x12: ffffffc0113862c8
[   61.088849] x11: 0000000000000003 x10: ffffffc01136e288 x9 : ffffffc0100e6f00
[   61.088866] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : ffffffc011be3650
[   61.088882] x5 : ffffffc0ea7b0000 x4 : 0000000000000000 x3 : 0000000000000001
[   61.088897] x2 : 0000000000000000 x1 : 20ef52a5bc805600 x0 : 0000000000000000
[   61.088913] Call trace:
[   61.088918]  rpi_firmware_transaction+0xdc/0x108
[   61.088926]  rpi_firmware_property_list+0xc0/0x180
[   61.088935]  rpi_firmware_property+0x78/0x110
[   61.088942]  raspberrypi_fw_set_rate+0x5c/0xd8
[   61.088953]  clk_change_rate+0xdc/0x4e8
[   61.088965]  clk_core_set_rate_nolock+0x1e4/0x238
[   61.088975]  clk_set_rate+0x44/0xb8
[   61.088984]  _set_opp+0x230/0x4f8
[   61.088996]  dev_pm_opp_set_rate+0x128/0x190
[   61.089007]  set_target+0x38/0x48

(Hit that same Pi firmware issue, but system is still hard locked up.)

Looks like it might be failing somewhere in here:

static int psp_sw_init(void *handle)
...
	if (mem_training_ctx->enable_mem_training) {
		ret = psp_memory_training_init(psp);
		if (ret) {
			DRM_ERROR("Failed to initialize memory training!\n");
			return ret;
		}

		ret = psp_mem_training(psp, PSP_MEM_TRAIN_COLD_BOOT);
		if (ret) {
			DRM_ERROR("Failed to process memory training!\n");
			return ret;
		}
	}

geerlingguy avatar Sep 10 '21 19:09 geerlingguy

Opened an issue on the 'official' tracker: Freedesktop GitLab - Can't get RX 6700 XT running on Raspberry Pi CM4.

geerlingguy avatar Sep 10 '21 19:09 geerlingguy

The way I read this log is that the actual panic occurs when the Raspberry Pi itself is setting some clock speed (PCIe bus? its own CPU? But why would that fail…) through a firmware call that times out. I think that’s why we’re not seeing that DRM error about failed memory training being printed, which leads me to believe we’re seeing the crashes occur at random points again? Smells familiar…

elmeyer avatar Sep 10 '21 19:09 elmeyer

Which leads me to believe we’re seeing the crashes occur at random points again? Smells familiar…

Indeed, I'm running through a few more tests just to see if I can get consistent results (with a ton of 0.5 s delays mixed in).

I just checked before I was going to load amdgpu again, and saw these two errors too (completely random, a few minutes after booting the Pi, hadn't touched it):

[  610.888425] ------------[ cut here ]------------
[  610.888447] fw-clk-m2mc already disabled
[  610.888492] WARNING: CPU: 3 PID: 86 at drivers/clk/clk.c:960 clk_core_disable+0x258/0x290
...
[  610.889440] fw-clk-m2mc already unprepared
[  610.889474] WARNING: CPU: 3 PID: 86 at drivers/clk/clk.c:819 clk_core_unprepare+0x23c/0x260

And looking back, those same two errors occurred 10 seconds into the boot cycle. PCIe bus seems to not be up either on this boot:

[    1.228140] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.228179] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.228265] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x063fffffff -> 0x00c0000000
[    1.228355] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0400000000
[    1.545482] brcm-pcie fd500000.pcie: link down

But a reboot brings it right back.

geerlingguy avatar Sep 10 '21 19:09 geerlingguy

I'm also adding 0.5 s delays after each debug statement, using two lines like the following:

	printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
	msleep(500);
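
If the same two lines get pasted in many places, they could be wrapped in a macro. Below is a userspace analog of that idea, not kernel code: `fprintf` and `nanosleep` stand in for `printk` and `msleep`, and the names `DEBUG_CHECKPOINT`, `sleep_ms`, `checkpoints_passed`, and `demo_init` are hypothetical. It uses the standard `__func__` rather than the GCC-specific `__FUNCTION__`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Count of checkpoints reached so far (illustrative only). */
static int checkpoints_passed;

/* Sleep for the given number of milliseconds; stands in for msleep(). */
static void sleep_ms(long ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/*
 * Print the current function and line, then pause so the message has
 * time to reach the console before a possible hard lockup.
 */
#define DEBUG_CHECKPOINT(delay_ms)                                      \
    do {                                                                \
        fprintf(stderr, "DEBUG: Passed %s %d\n", __func__, __LINE__);   \
        sleep_ms(delay_ms);                                             \
        checkpoints_passed++;                                           \
    } while (0)

/* Example: bracket a suspect step with two checkpoints. */
static int demo_init(long delay_ms)
{
    DEBUG_CHECKPOINT(delay_ms);   /* before the suspect step */
    /* ... the operation under suspicion would run here ... */
    DEBUG_CHECKPOINT(delay_ms);   /* after the suspect step */
    return checkpoints_passed;
}
```

The last checkpoint that appears in the log brackets the statement that hung, which is how the line numbers in the output below narrow things down.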

And it looks like I can very consistently reach:

[   76.507503] [drm] Detected VRAM RAM=12272M, BAR=256M
[   76.507508] [drm] RAM width 192bits GDDR6
[   76.507617] [drm] amdgpu: 12272M of VRAM memory ready
[   76.507625] [drm] amdgpu: 2845M of GTT memory ready.
[   76.507643] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   76.507672] DEBUG: Passed gmc_v10_0_hw_init 1075 
[   76.507796] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   76.507803] DEBUG: Passed gmc_v10_0_hw_init 1084 
[   76.508260] DEBUG: Passed psp_sw_init 262 
[   77.046534] DEBUG: Passed psp_sw_init 279 
[   77.564552] DEBUG: Passed psp_get_runtime_db_entry 201 
[   78.076551] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   78.076566] DEBUG: Passed psp_sw_init 303 
[   78.588509] DEBUG: Passed psp_sw_init 308 
[   79.100496] DEBUG: Passed psp_sw_init 317

The next block of code, which does not run, is:

		ret = psp_mem_training(psp, PSP_MEM_TRAIN_COLD_BOOT);
		if (ret) {
			DRM_ERROR("Failed to process memory training!\n");
			return ret;
		}

geerlingguy avatar Sep 10 '21 19:09 geerlingguy

Debugging psp_v11_0_memory_training now:

[   26.845578] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   26.845590] DEBUG: Passed psp_sw_init 303 
[   27.357576] DEBUG: Passed psp_sw_init 308 
[   27.869578] DEBUG: Passed psp_sw_init 317 
[   28.381584] DEBUG: Passed psp_v11_0_memory_training 612 
[   28.893586] DEBUG: Passed psp_v11_0_memory_training 623 
[   29.405609] DEBUG: Passed psp_v11_0_memory_training 634 
[   29.917580] DEBUG: Passed psp_v11_0_memory_training 642 
[   30.429593] DEBUG: Passed psp_v11_0_memory_training 650 
[   30.941605] DEBUG: Passed psp_v11_0_memory_training 658 
[   31.453586] DEBUG: Passed psp_v11_0_memory_training 667 
[   31.965598] DEBUG: Passed psp_v11_0_memory_training 677 
[   32.477579] DEBUG: Passed psp_v11_0_memory_training 686 
[   32.989579] DEBUG: Passed psp_v11_0_memory_training 694 
[   33.501583] DEBUG: Passed psp_v11_0_memory_training 708 
[   34.013581] DEBUG: Passed psp_v11_0_memory_training 718 
[   34.526817] DEBUG: Passed psp_v11_0_memory_training 727 

It looks like it's hitting this portion of code:

static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
...
	if (drm_dev_enter(&adev->ddev, &idx)) {
			memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
			ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
			if (ret) {
				DRM_ERROR("Send long training msg failed.\n");
				vfree(buf);
				drm_dev_exit(idx);
				return ret;
			}

memcpy_fromio() seems the likely culprit?

Edit: With debug statements around it, the system seems to halt on this line every time:

memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);

geerlingguy avatar Sep 10 '21 20:09 geerlingguy

Maybe it's time for me to read through the entire Linux Device Drivers book on PCIe memory access?

geerlingguy avatar Sep 10 '21 20:09 geerlingguy

Trimming down the debug to just before the memcpy_fromio() line:

static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
...
		if (drm_dev_enter(&adev->ddev, &idx)) {
			printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
			printk(KERN_ALERT "DEBUG: addr %p, value %u, count %d \n",buf,adev->mman.aper_base_kaddr,sz);
			msleep(500);

			memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);

I see:

[   48.987688] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   48.988976] DEBUG: Passed psp_v11_0_memory_training 692 
[   48.988991] DEBUG: addr 0000000022ac6957, value 536870912, count 33554432 
[   51.837474] ------------[ cut here ]------------
[   51.837490] Firmware transaction timeout
[   51.837532] WARNING: CPU: 1 PID: 177 at drivers/firmware/raspberrypi.c:67 rpi_firmware_transaction+0xdc/0x108

geerlingguy avatar Sep 10 '21 20:09 geerlingguy

Added an issue on the Raspberry Pi Forums too: Having trouble with AMD Radeon RX 6700 XT on CM4.

geerlingguy avatar Sep 10 '21 20:09 geerlingguy

This smells so familiar that you may have to start writing single-byte loops.

elmeyer avatar Sep 10 '21 21:09 elmeyer

@elmeyer - Seems like a different issue.

I replaced the single memcpy_fromio() call with a byte-at-a-time loop:

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index bc133db2d538..3c34949222a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -689,7 +689,19 @@ static int psp_v11_0_memory_training(struct psp_context *psp, uint32_t ops)
 		}
 
 		if (drm_dev_enter(&adev->ddev, &idx)) {
-			memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
+			printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
+			printk(KERN_ALERT "DEBUG: addr %p, value %u, count %d \n",buf,adev->mman.aper_base_kaddr,sz);
+			msleep(500);
+
+			int pos;
+			for(pos = 0;pos < sz; pos++){
+				memcpy_fromio(buf+pos,adev->mman.aper_base_kaddr+pos,1);
+			}
+			// memcpy_fromio(buf, adev->mman.aper_base_kaddr, sz);
+
+			printk(KERN_ALERT "DEBUG: Passed %s %d \n",__FUNCTION__,__LINE__);
+			msleep(500);
+
 			ret = psp_v11_0_memory_training_send_msg(psp, PSP_BL__DRAM_LONG_TRAIN);
 			if (ret) {
 				DRM_ERROR("Send long training msg failed.\n");

And it output:

[   70.605375] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[   70.632286] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[   70.633553] DEBUG: Passed psp_v11_0_memory_training 692 
[   70.633569] DEBUG: addr 00000000a6e09a10, value 536870912, count 33554432

But didn't get any further AFAICT.

geerlingguy avatar Sep 10 '21 21:09 geerlingguy

Just noting someone else who was working with memcpy_fromio/toio and experiencing hard crashes: https://stackoverflow.com/questions/28518336/how-do-i-use-memcpy-toio-fromio#comment45366546_28518336

geerlingguy avatar Sep 10 '21 21:09 geerlingguy

To find where it hangs without recompiling the kernel each time, you could use a JTAG adapter, or a second Pi acting as one with OpenOCD. You'll have to limit the kernel to a single core, because at least with the configuration I used, I could only access the first one. You might also want to suppress RCU stall warnings, as the kernel complains about them when single-stepping.

Crashing at memcpy_*io seems right to me as it's the same with the radeon driver.

Coreforge avatar Sep 11 '21 08:09 Coreforge

Over on the Pi forums, got the following response from jdb:

"Firmware transaction timeout" usually means the VPU has crashed.

From your linked issue on freedesktop.org, the memcpy_fromio boils down to this: https://elixir.bootlin.com/linux/latest/source/arch/arm/lib/copy_template.S

Which uses optimised loads and stores to access the PCIe outbound window.

This won't work on a CM4. At best you get garbage in the read data, at worst you trash the internal bus between CPU and PCIe - which is what seems to be happening because the VPU sometimes fails while the CPU trundles on.

You need to use dword-sized transfers only, readl()/writel().

It looks like @Coreforge did something similar here for the radeon driver.

geerlingguy avatar Sep 11 '21 17:09 geerlingguy

There are a few memcpy_toio and memcpy_fromio calls in the driver that will likely cause issues. The best approach is probably to add 32-bit versions of the two functions and replace the existing calls with calls to those (instead of doing what I did and having multiple functions that do exactly the same thing in multiple places).
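
A minimal sketch of what such 32-bit helpers could look like, written as plain userspace C so it can be compiled and tested off the Pi. The names `memcpy_fromio32`, `memcpy_toio32`, `io_read32`, and `io_write32` are hypothetical and are not existing kernel APIs. In the driver, `io_read32`/`io_write32` would be replaced by `readl()`/`writel()` on `__iomem` pointers. The sketch assumes the device-side address and the byte count are both multiples of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for readl(): one aligned 32-bit load. */
static inline uint32_t io_read32(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}

/* Userspace stand-in for writel(): one aligned 32-bit store. */
static inline void io_write32(uint32_t val, volatile void *addr)
{
    *(volatile uint32_t *)addr = val;
}

/* Copy count bytes from "device" memory using only 32-bit reads. */
static void memcpy_fromio32(void *dst, const volatile void *src, size_t count)
{
    uint8_t *d = dst;
    const volatile uint8_t *s = src;
    size_t i;

    /* Assumption: device address and length are dword-aligned. */
    assert(((uintptr_t)src & 3) == 0 && (count & 3) == 0);

    for (i = 0; i < count; i += 4) {
        uint32_t v = io_read32(s + i);

        /* dst is ordinary RAM, so a plain memcpy is fine here. */
        memcpy(d + i, &v, sizeof(v));
    }
}

/* Copy count bytes to "device" memory using only 32-bit writes. */
static void memcpy_toio32(volatile void *dst, const void *src, size_t count)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;
    size_t i;

    /* Assumption: device address and length are dword-aligned. */
    assert(((uintptr_t)dst & 3) == 0 && (count & 3) == 0);

    for (i = 0; i < count; i += 4) {
        uint32_t v;

        memcpy(&v, s + i, sizeof(v));
        io_write32(v, d + i);
    }
}
```

The count of 33554432 bytes printed in the debug output earlier is a multiple of 4, so that particular copy would become 8388608 dword reads. Whether that alone gets the driver past memory training on the CM4 is untested here.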

Coreforge avatar Sep 11 '21 18:09 Coreforge