raspberry-pi-pcie-devices

Test GPU (Sapphire Radeon RX 550 2GB)

Open geerlingguy opened this issue 3 years ago • 174 comments

I just ordered the Sapphire Radeon Pulse RX 550 2GB card after this recommendation by Djhg2000, and I'd like to see if it uses UEFI, works without BIOS/IO BAR space, and might have a better chance of working on the Pi.

Because it definitely requires more power, I'm going to try it in a 1x to 16x riser with external power, which I'm still waiting to test later. Will definitely need to expand the BAR space, maybe beyond 1 GB this time :O

geerlingguy avatar Oct 27 '20 20:10 geerlingguy

Will be curious if you can allocate enough BAR for this card. Looking forward to finding out! :-)

dtischler avatar Oct 27 '20 22:10 dtischler

Ha, well I half wonder if I'll need a CM4 8GB... which I have not ordered (I have a couple more 4GB models on the way but they aren't shipping for a couple weeks!).

geerlingguy avatar Oct 27 '20 22:10 geerlingguy

Over in the raspberrypi/linux project, it looks like this commit (https://github.com/raspberrypi/linux/commit/54db4b2fa4d17251c2f6e639f849b27c3b553939) has increased the default BAR allocation to 1 GB. Nice!

geerlingguy avatar Oct 28 '20 20:10 geerlingguy

It has arrived.

But I have been full bore on a few other things today, so it will have to wait. I set it next to the CM4 IO Board so it can start getting 'familiar' with it.

geerlingguy avatar Oct 30 '20 22:10 geerlingguy

Cards for the Mac market also shouldn't have that I/O section, because they don't use the whole BIOS system at all - and not even the x86 set but that was a long time ago. So Mac branded cards (they do exist..) are also an option.

sinetek avatar Nov 01 '20 00:11 sinetek

Some AMD cards can be flashed to Mac / EFI mode too, if you’re that way inclined.

clarkalastair avatar Nov 01 '20 11:11 clarkalastair

Cards for the Mac market also shouldn't have that I/O section, because they don't use the whole BIOS system at all - and not even the x86 set but that was a long time ago. So Mac branded cards (they do exist..) are also an option.

Hi,

I think that potentially you can patch the driver to ignore this problem.

The problem is at line 1423 of drivers/gpu/drm/radeon/radeon_device.c, which in the latest driver version is:

rdev->rio_mem = pci_iomap(rdev->pdev, i, rdev->rio_mem_size);

pci_iomap is returning NULL; however, I don't think you really need the iomap, since it's only needed for legacy AMD cards.

In fact, I think you could just continue executing by removing this if (although, as far as I can see, it isn't what stops the driver from initializing):

if (rdev->rio_mem == NULL)
	DRM_ERROR("Unable to find PCI I/O BAR\n");

As far as I can see, the code is already prepared for rio_mem being NULL, as you can see in amdgpu_atombios.c:1988:

int amdgpu_atombios_init(struct amdgpu_device *adev)

It falls back to MMIO (as the I/O BAR region that the driver would use is also mapped in the BAR):

/* needed for iio ops */
if (adev->rio_mem) {
	atom_card_info->ioreg_read = cail_ioreg_read;
	atom_card_info->ioreg_write = cail_ioreg_write;
} else {
	DRM_DEBUG("PCI I/O BAR is not found. Using MMIO to access ATOM BIOS\n");
	atom_card_info->ioreg_read = cail_reg_read;
	atom_card_info->ioreg_write = cail_reg_write;
}

Also, I can see that this check is not only here, but in various places.

I can't test it myself, but I think it's worth a try.
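To make the fallback concrete, here is a minimal Python sketch of the same selection logic. It is an illustration only, not driver code: `FakeCard`, `select_accessors`, and the dictionary-backed register maps are hypothetical stand-ins for `adev->rio_mem`, the `cail_ioreg_*` helpers, and the `cail_reg_*` helpers quoted above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class FakeCard:
    """Hypothetical stand-in for the device struct; not a real driver type."""
    mmio: Dict[int, int] = field(default_factory=dict)  # always mapped
    rio_mem: Optional[Dict[int, int]] = None             # None when the I/O BAR is missing


def select_accessors(
    card: FakeCard,
) -> Tuple[Callable[[int], int], Callable[[int, int], None], str]:
    """Pick register accessors the way the quoted branch does:
    use the I/O BAR when it was mapped, otherwise fall back to MMIO."""
    if card.rio_mem is not None:
        backing, path = card.rio_mem, "io"
    else:
        backing, path = card.mmio, "mmio"

    def read(reg: int) -> int:
        return backing.get(reg, 0)

    def write(reg: int, value: int) -> None:
        backing[reg] = value

    return read, write, path


# A card on a host without I/O space ends up on the MMIO path.
card = FakeCard()
read, write, path = select_accessors(card)
write(0x10, 0xABCD)
print(path, hex(read(0x10)))
```

The point of the sketch is only that a missing I/O BAR changes which accessors are installed; it does not by itself explain whether initialization succeeds afterwards.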

However, I think the problem is more likely located in:

radeon_get_bios in radeon_bios.c

That function is definitely returning true, since I can see in the log that it's expecting an Evergreen GPU but failing at it.

That check is done in this part of the code:

if (!memcmp(rdev->bios + tmp, "ATOM", 4) ||
    !memcmp(rdev->bios + tmp, "MOTA", 4)) {
	rdev->is_atom_bios = true;
} else {
	rdev->is_atom_bios = false;
}

If is_atom_bios is true, the code in evergreen.c will continue initializing.

The problem is that is_atom_bios is set to false, so I think it's reading garbage. (I would love to debug it.)

Also, I'm pretty sure that it's not failing earlier, because if it detects an incorrect BIOS signature or is unable to allocate the BIOS map, it returns false in a previous if, which causes a return code of -EINVAL.

I don't know if this is useful or if my conclusions are utter garbage, since I'm by no means an expert in this topic.
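The signature test quoted above can be tried on a saved ROM image with a few lines of Python. This is a sketch under assumptions: `is_atom_bios` and its `offset` parameter are hypothetical names, `offset` plays the role of `tmp` (the excerpt does not show how `tmp` is computed, so it is passed in), and the two byte strings are synthetic, not real VBIOS dumps.

```python
def is_atom_bios(bios: bytes, offset: int) -> bool:
    """Return True when the 4 bytes at `offset` are "ATOM" or "MOTA",
    mirroring the memcmp() pair in the quoted check."""
    signature = bios[offset:offset + 4]
    return signature in (b"ATOM", b"MOTA")


# Synthetic images for illustration only.
good = bytes(16) + b"ATOM" + bytes(12)
garbage = bytes([0xFF]) * 32  # assumption: what an unreadable ROM might look like

print(is_atom_bios(good, 16))
print(is_atom_bios(garbage, 16))
```

If the bytes at that offset are garbage, the check simply falls through to the non-ATOM branch, which matches the behaviour described above.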

Rucadi avatar Nov 02 '20 21:11 Rucadi

@Rucadi Good analysis – in practice I fear it won't be this easy. But it might just be. It's also worth checking how the Raptor Talos II folks are handling this case.

sinetek avatar Nov 03 '20 16:11 sinetek

Hmm, first plugins aren't going super well...

[    1.010470] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.010490] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.010547] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x067fffffff -> 0x00c0000000
[    1.010601] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    1.329328] brcm-pcie fd500000.pcie: link down

After some plugging, unplugging, and rebooting, boot sometimes just halts when it hits the following:

IMG_2667

I have a couple other powered PCIe adapters I may try. And maybe just for grins, try out the plain unpowered adapter too... at least I know it works with all my other devices.

geerlingguy avatar Nov 04 '20 19:11 geerlingguy

Well that's better... so far so good with the plain (unpowered) adapter:

$ sudo lspci -vvv
...
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7) (prog-if 00 [VGA controller])
	Subsystem: Sapphire Technology Limited Lexa PRO [Radeon RX 550/550X] (Lexa PRO [Radeon RX 550])
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 255
	Region 0: Memory at 640000000 (64-bit, prefetchable) [disabled] [size=256M]
	Region 2: Memory at 650000000 (64-bit, prefetchable) [disabled] [size=2M]
	Region 4: I/O ports at <unassigned> [disabled]
	Region 5: Memory at 600000000 (32-bit, non-prefetchable) [disabled] [size=256K]
	[virtual] Expansion ROM at 600040000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [200 v1] #15
	Capabilities: [270 v1] #19
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable-, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] Page Request Interface (PRI)
		PRICtl: Enable- Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000020, Page Request Allocation: 00000000
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [320 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [370 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us

01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
	Subsystem: Sapphire Technology Limited Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin B routed to IRQ 255
	Region 0: Memory at 600060000 (64-bit, non-prefetchable) [disabled] [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0

geerlingguy avatar Nov 04 '20 19:11 geerlingguy

No kernel / boot errors reaching this point, when using the unpowered adapter?

dtischler avatar Nov 04 '20 19:11 dtischler

And dmesg logs:

[    1.011261] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.011281] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.011338] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x067fffffff -> 0x00c0000000
[    1.011392] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    1.059289] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    1.059578] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    1.059593] pci_bus 0000:00: root bus resource [bus 00-ff]
[    1.059610] pci_bus 0000:00: root bus resource [mem 0x600000000-0x67fffffff] (bus address [0xc0000000-0x13fffffff])
[    1.059663] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    1.059884] pci 0000:00:00.0: PME# supported from D0 D3hot
[    1.063495] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    1.063695] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    1.063809] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    1.063851] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    1.063879] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    1.063907] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x0003ffff]
[    1.063935] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    1.063965] pci 0000:01:00.0: enabling Extended Tags
[    1.064241] pci 0000:01:00.0: supports D1 D2
[    1.064253] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    1.064317] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    1.064459] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    1.064545] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    1.064635] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    1.064745] pci 0000:01:00.1: enabling Extended Tags
[    1.064937] pci 0000:01:00.1: supports D1 D2
[    1.068388] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    1.068430] pci 0000:00:00.0: BAR 9: assigned [mem 0x640000000-0x657ffffff 64bit pref]
[    1.068444] pci 0000:00:00.0: BAR 8: assigned [mem 0x600000000-0x6000fffff]
[    1.068463] pci 0000:01:00.0: BAR 0: assigned [mem 0x640000000-0x64fffffff 64bit pref]
[    1.068501] pci 0000:01:00.0: BAR 2: assigned [mem 0x650000000-0x6501fffff 64bit pref]
[    1.068536] pci 0000:01:00.0: BAR 5: assigned [mem 0x600000000-0x60003ffff]
[    1.068556] pci 0000:01:00.0: BAR 6: assigned [mem 0x600040000-0x60005ffff pref]
[    1.068571] pci 0000:01:00.1: BAR 0: assigned [mem 0x600060000-0x600063fff 64bit]
[    1.068607] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    1.068618] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    1.068632] pci 0000:00:00.0: PCI bridge to [bus 01]
[    1.068650] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[    1.068666] pci 0000:00:00.0:   bridge window [mem 0x640000000-0x657ffffff 64bit pref]
[    1.068767] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0

Always with that silly I/O BAR. Well, let's go cross-compile the kernel and see what the amdgpu driver gives us...

And to @dtischler - no, no issues so far. My power supply seems to be happy to put out enough juice at least to get things started (and get the fan on the card moving).
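The two bandwidth figures in the `pci 0000:01:00.0` line of the log (4.000 Gb/s available, 63.008 Gb/s capable) can be reproduced with a little arithmetic. The sketch below assumes the figures come from per-lane rates after line encoding (8b/10b at 2.5 and 5 GT/s, 128b/130b at 8 GT/s) computed with integer division; that derivation is an assumption, though it does reproduce both logged numbers. The function names are made up for the example.

```python
# Per-lane payload rate in Mb/s after line encoding, using integer division.
# Assumption: 2.5 and 5 GT/s links use 8b/10b, 8 GT/s links use 128b/130b.
PER_LANE_MBPS = {
    "2.5": 2500 * 8 // 10,
    "5": 5000 * 8 // 10,
    "8": 8000 * 128 // 130,
}


def link_bandwidth_mbps(speed_gts: str, width: int) -> int:
    """Usable bandwidth of a PCIe link in Mb/s for a given speed and lane count."""
    return PER_LANE_MBPS[speed_gts] * width


def format_gbps(mbps: int) -> str:
    """Format Mb/s in the 'N.NNN Gb/s' style seen in the log."""
    return f"{mbps // 1000}.{mbps % 1000:03d} Gb/s"


negotiated = link_bandwidth_mbps("5", 1)  # what the link trained to
capable = link_bandwidth_mbps("8", 8)     # what the card advertises
print(format_gbps(negotiated), "of", format_gbps(capable))
```

In other words, the card is running with roughly one sixteenth of the link bandwidth it could use in an x8 Gen 3 slot.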

geerlingguy avatar Nov 04 '20 19:11 geerlingguy

All right, recompiled the kernel, now where does that get us:

[    4.194363] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x600000000 -> 0x60003ffff
[    4.194377] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: passed res_id (0) is not a memory bar
[    4.194435] pci 0000:00:00.0: enabling device (0000 -> 0002)
[    4.194464] amdgpu 0000:01:00.0: enabling device (0000 -> 0002)
[    4.344078] brcmfmac: brcmf_fw_alloc_request: using brcm/brcmfmac43456-sdio for chip BCM4345/9
[    4.357361] brcmfmac: brcmf_c_preinit_dcmds: Firmware: BCM4345/9 wl0: May 14 2020 17:26:08 version 7.84.17.1 (r871554) FWID 01-3d9e1d87
[    4.360945] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[    4.360991] [drm] register mmio base: 0x00000000
[    4.361000] [drm] register mmio size: 262144
[    4.361009] [drm] PCI I/O BAR is not found.
[    4.361021] [drm] PCIE atomic ops is not supported
[    4.361044] [drm] add ip block number 0 <vi_common>
[    4.361053] [drm] add ip block number 1 <gmc_v8_0>
[    4.361062] [drm] add ip block number 2 <tonga_ih>
[    4.361070] [drm] add ip block number 3 <gfx_v8_0>
[    4.361078] [drm] add ip block number 4 <sdma_v3_0>
[    4.361087] [drm] add ip block number 5 <powerplay>
[    4.361096] [drm] add ip block number 6 <dm>
[    4.361104] [drm] add ip block number 7 <uvd_v6_0>
[    4.361112] [drm] add ip block number 8 <vce_v3_0>
[    4.609386] ATOM BIOS: 113-36764-U61
[    4.609527] [drm] UVD is enabled in VM mode
[    4.609536] [drm] UVD ENC is enabled in VM mode
[    4.609549] [drm] VCE enabled in VM mode
[    4.609574] [drm] GPU posting now...
[    4.729868] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[    4.729982] amdgpu 0000:01:00.0: Direct firmware load for amdgpu/polaris12_mc.bin failed with error -2
[    4.729997] mc: Failed to load firmware "amdgpu/polaris12_mc.bin"
[    4.730341] [drm:gmc_v8_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
[    4.730641] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v8_0> failed -2
[    4.730653] amdgpu 0000:01:00.0: amdgpu_device_ip_init failed
[    4.730666] amdgpu 0000:01:00.0: Fatal error during GPU init
[    4.730676] [drm] amdgpu: finishing device.
[    4.730763] ------------[ cut here ]------------
[    4.730773] sysfs group 'fw_version' not found for kobject '0000:01:00.0'
[    4.730821] WARNING: CPU: 2 PID: 163 at fs/sysfs/group.c:280 sysfs_remove_group+0x94/0xa0
[    4.730826] Modules linked in: amdgpu(+) brcmfmac brcmutil sha256_generic libsha256 i2c_algo_bit ttm vc4 cec cfg80211 v3d drm_kms_helper gpu_sched rfkill bcm2835_codec(C) bcm2835_isp(C) bcm2835_v4l2(C) v4l2_mem2mem raspberrypi_hwmon bcm2835_mmal_vchiq(C) snd_soc_core videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops drm snd_bcm2835(C) snd_compress videobuf2_v4l2 snd_pcm_dmaengine videobuf2_common backlight drm_panel_orientation_quirks snd_pcm videodev mc snd_timer vc_sm_cma(C) snd syscopyarea sysfillrect sysimgblt fb_sys_fops rpivid_mem uio_pdrv_genirq uio i2c_dev ip_tables x_tables ipv6
[    4.730931] CPU: 2 PID: 163 Comm: systemd-udevd Tainted: G         C        5.4.74-v8gpu+ #1
[    4.730936] Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
[    4.730944] pstate: 80000005 (Nzcv daif -PAN -UAO)
[    4.730953] pc : sysfs_remove_group+0x94/0xa0
[    4.730961] lr : sysfs_remove_group+0x94/0xa0
[    4.730966] sp : ffffffc01165b790
[    4.730971] x29: ffffffc01165b790 x28: 0000000000000000 
[    4.730981] x27: 0000000000000000 x26: ffffffc0092ad198 
[    4.730990] x25: ffffff80f0474d70 x24: ffffffc00921d440 
[    4.730998] x23: ffffffc0092ad000 x22: 00000000ffffffff 
[    4.731005] x21: ffffff80f66468a0 x20: ffffffc0091cb6a8 
[    4.731013] x19: 0000000000000000 x18: 0000000000000004 
[    4.731021] x17: 0000000000000fff x16: 0000000000000009 
[    4.731028] x15: ffffff80f6d0b890 x14: ffffff80ef04cca8 
[    4.731035] x13: 0000000000000000 x12: ffffffc010fa5000 
[    4.731043] x11: ffffffc010ea1000 x10: ffffffc010fa5958 
[    4.731051] x9 : 0000000000000000 x8 : 0000000000000003 
[    4.731058] x7 : 0000000000000163 x6 : ffffffc01165b480 
[    4.731066] x5 : 0000000000000001 x4 : ffffff80f79c3150 
[    4.731074] x3 : 0000000000000006 x2 : 0000000000000007 
[    4.731081] x1 : e6d12f7aeb88b200 x0 : 0000000000000000 
[    4.731090] Call trace:
[    4.731099]  sysfs_remove_group+0x94/0xa0
[    4.731401]  amdgpu_ucode_sysfs_fini+0x28/0x38 [amdgpu]
[    4.731692]  amdgpu_device_fini+0x424/0x46c [amdgpu]
[    4.731988]  amdgpu_driver_unload_kms+0x54/0xa8 [amdgpu]
[    4.732297]  amdgpu_driver_load_kms+0x11c/0x178 [amdgpu]
[    4.732405]  drm_dev_register+0x144/0x1c8 [drm]
[    4.732738]  amdgpu_pci_probe+0xe0/0x178 [amdgpu]
[    4.732760]  pci_device_probe+0xb8/0x180
[    4.732769]  really_probe+0xe0/0x330
[    4.732776]  driver_probe_device+0x5c/0xf0
[    4.732783]  device_driver_attach+0x74/0x80
[    4.732790]  __driver_attach+0x64/0xe0
[    4.732800]  bus_for_each_dev+0x84/0xd8
[    4.732806]  driver_attach+0x30/0x40
[    4.732812]  bus_add_driver+0x188/0x1e8
[    4.732819]  driver_register+0x64/0x110
[    4.732828]  __pci_register_driver+0x58/0x68
[    4.733152]  amdgpu_init+0x70/0x7c [amdgpu]
[    4.733165]  do_one_initcall+0x54/0x2b8
[    4.733174]  do_init_module+0x5c/0x230
[    4.733181]  load_module+0x1ddc/0x2078
[    4.733188]  __do_sys_finit_module+0xd0/0xe8
[    4.733195]  __arm64_sys_finit_module+0x28/0x38
[    4.733207]  el0_svc_common.constprop.1+0x98/0x1a0
[    4.733215]  el0_svc_handler+0x34/0xa0
[    4.733223]  el0_svc+0x8/0x204
[    4.733231] ---[ end trace d9b9d6fba13c699e ]---

geerlingguy avatar Nov 04 '20 20:11 geerlingguy

Do you have firmware-amd-graphics installed?

The error is -2 (file not found). That's the binary blob for the GPU, so you have to install the firmware package or add it manually.
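A quick way to confirm both halves of that diagnosis is sketched below: decode the return code, then check whether the blob named in the log exists. `describe_errno` and `firmware_present` are hypothetical helpers, and `/lib/firmware` as the default search root is an assumption about the distribution's layout. The demonstration runs against a temporary directory so it does not depend on the machine it runs on.

```python
import errno
import os
import tempfile
from pathlib import Path


def describe_errno(code: int) -> str:
    """Map a negative kernel-style return code such as -2 to its errno name."""
    return errno.errorcode.get(abs(code), "unknown")


def firmware_present(name: str, root: Path = Path("/lib/firmware")) -> bool:
    """Check whether a blob such as 'amdgpu/polaris12_mc.bin' exists under root."""
    return (root / name).is_file()


print(describe_errno(-2), "-", os.strerror(2))

# Demonstrate against a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = firmware_present("amdgpu/polaris12_mc.bin", root)
    (root / "amdgpu").mkdir()
    (root / "amdgpu" / "polaris12_mc.bin").write_bytes(b"\x00")
    after = firmware_present("amdgpu/polaris12_mc.bin", root)
    print(before, after)
```

On the Pi itself, calling `firmware_present("amdgpu/polaris12_mc.bin")` with the default root would be the equivalent check, assuming that is where the firmware package installs its files.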

Rucadi avatar Nov 04 '20 20:11 Rucadi

Just tried sudo apt install -y firmware-amd-graphics after seeing this post, rebooted and... now it gets stuck during boot (no HDMI output) and the D2 activity LED just stays solid green.

So then I tried pulling the microSD card and commenting out the vc4-fkms-v3d dtoverlay in config.txt, and... it wouldn't boot.

I unplugged the card and got it to boot again, and then created /etc/modprobe.d/blacklist-amdgpu.conf with the contents blacklist amdgpu, then shut down, plugged in the card, and booted using the jumper at the end of J2 and... now it's booting all the way, so I'm going to modprobe this sucker and see if I can figure out what's going on.

geerlingguy avatar Nov 04 '20 20:11 geerlingguy

Well that's odd, I'm also getting some MEM space allocation failures again:

[    0.945205] pci 0000:00:00.0: BAR 9: no space for [mem size 0x18000000 64bit pref]
[    0.945218] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0x18000000 64bit pref]
[    0.945231] pci 0000:00:00.0: BAR 8: assigned [mem 0x600000000-0x6000fffff]
[    0.945251] pci 0000:01:00.0: BAR 0: no space for [mem size 0x10000000 64bit pref]
[    0.945261] pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x10000000 64bit pref]
[    0.945275] pci 0000:01:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[    0.945285] pci 0000:01:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[    0.945297] pci 0000:01:00.0: BAR 5: assigned [mem 0x600000000-0x60003ffff]
[    0.945317] pci 0000:01:00.0: BAR 6: assigned [mem 0x600040000-0x60005ffff pref]
[    0.945331] pci 0000:01:00.1: BAR 0: assigned [mem 0x600060000-0x600063fff 64bit]

Going to dig into that first, before I run modprobe amdgpu to see what happens at that point.

Edit: Heh, I forgot that when I copied the generated dtb files I had to re-adjust the BAR space again... oops. Doing that now, will see what happens.

Edit 2: BAR MEM space is allocated again (using 1 GB, 0x40000000). I was planning on testing 2 GB (0x80000000), but that seems unnecessary, and besides, 0x40000000 is the value that will be in the next version of the Pi kernel, so it'd be nice to confirm that works.
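To see how much space those failed requests actually add up to, the sizes can be pulled out of the log lines and converted from hex. This is a rough sketch: the regular expression only covers the "no space for [mem size ...]" form shown above, and `unassigned_requests` is a made-up helper name.

```python
import re

# Matches lines like:
#   pci 0000:00:00.0: BAR 9: no space for [mem size 0x18000000 64bit pref]
NO_SPACE = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): "
    r"no space for \[mem size (?P<size>0x[0-9a-f]+)"
)

LOG = """\
[    0.945205] pci 0000:00:00.0: BAR 9: no space for [mem size 0x18000000 64bit pref]
[    0.945218] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0x18000000 64bit pref]
[    0.945251] pci 0000:01:00.0: BAR 0: no space for [mem size 0x10000000 64bit pref]
[    0.945275] pci 0000:01:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
"""

MIB = 1 << 20


def unassigned_requests(log: str):
    """Return (device, bar, size_in_bytes) for every 'no space' memory request."""
    return [
        (m["dev"], int(m["bar"]), int(m["size"], 16))
        for m in NO_SPACE.finditer(log)
    ]


requests = unassigned_requests(LOG)
for dev, bar, size in requests:
    print(f"{dev} BAR {bar}: {size // MIB} MiB")

# The largest request is the bridge's prefetchable window (BAR 9 on 00:00.0).
bridge_request = max(size for _, _, size in requests)
print("fits in 0x40000000:", bridge_request <= 0x40000000)
```

The bridge window request works out to 384 MiB, which is comfortably below the 1 GB (0x40000000) setting mentioned in Edit 2.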

geerlingguy avatar Nov 04 '20 20:11 geerlingguy

Good luck! e.e"

Rucadi avatar Nov 04 '20 20:11 Rucadi

All right, so I have a terminal open with dmesg --follow, and another one where I run modprobe amdgpu (as root):

[  173.558495] [drm] amdgpu kernel modesetting enabled.
[  173.558693] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x600000000 -> 0x60fffffff
[  173.558699] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x610000000 -> 0x6101fffff
[  173.558704] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x618000000 -> 0x61803ffff
[  173.558790] pci 0000:00:00.0: enabling device (0000 -> 0002)
[  173.558804] amdgpu 0000:01:00.0: enabling device (0000 -> 0002)
[  173.559150] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[  173.559176] [drm] register mmio base: 0x18000000
[  173.559179] [drm] register mmio size: 262144
[  173.559183] [drm] PCI I/O BAR is not found.
[  173.559188] [drm] PCIE atomic ops is not supported
[  173.559201] [drm] add ip block number 0 <vi_common>
[  173.559205] [drm] add ip block number 1 <gmc_v8_0>
[  173.559209] [drm] add ip block number 2 <tonga_ih>
[  173.559213] [drm] add ip block number 3 <gfx_v8_0>
[  173.559217] [drm] add ip block number 4 <sdma_v3_0>
[  173.559221] [drm] add ip block number 5 <powerplay>
[  173.559225] [drm] add ip block number 6 <dm>
[  173.559229] [drm] add ip block number 7 <uvd_v6_0>
[  173.559233] [drm] add ip block number 8 <vce_v3_0>
[  173.805864] ATOM BIOS: 113-36764-U61
[  173.805941] [drm] UVD is enabled in VM mode
[  173.805945] [drm] UVD ENC is enabled in VM mode
[  173.805951] [drm] VCE enabled in VM mode
[  173.805976] [drm] GPU posting now...
[  173.926955] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[  173.932337] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0x610000000-0x6101fffff 64bit pref]
[  173.932346] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0x600000000-0x60fffffff 64bit pref]
[  173.932390] pci 0000:00:00.0: BAR 9: releasing [mem 0x600000000-0x617ffffff 64bit pref]
[  173.932407] pci 0000:00:00.0: BAR 9: no space for [mem size 0xc0000000 64bit pref]
[  173.932412] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0xc0000000 64bit pref]
[  173.932420] amdgpu 0000:01:00.0: BAR 0: no space for [mem size 0x80000000 64bit pref]
[  173.932425] amdgpu 0000:01:00.0: BAR 0: failed to assign [mem size 0x80000000 64bit pref]
[  173.932431] amdgpu 0000:01:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[  173.932435] amdgpu 0000:01:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[  173.932440] pci 0000:00:00.0: PCI bridge to [bus 01]
[  173.932449] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6180fffff]
[  173.932460] pci 0000:00:00.0: PCI bridge to [bus 01]
[  173.932467] pci 0000:00:00.0:   bridge window [mem 0x618000000-0x6180fffff]
[  173.932473] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[  173.932516] [drm] Not enough PCI address space for a large BAR.
[  173.932523] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[  173.932542] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[  173.932570] amdgpu 0000:01:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[  173.932576] amdgpu 0000:01:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[  173.932582] [drm] Detected VRAM RAM=2048M, BAR=256M
[  173.932586] [drm] RAM width 64bits GDDR5
[  173.932754] [TTM] Zone  kernel: Available graphics memory: 1944480 KiB
[  173.932759] [TTM] Initializing pool allocator
[  173.932780] [TTM] Initializing DMA pool allocator
[  173.932854] [drm] amdgpu: 2048M of VRAM memory ready
[  173.932864] [drm] amdgpu: 2848M of GTT memory ready.
[  173.932930] [drm] GART: num cpu pages 65536, num gpu pages 65536
[  173.934178] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[  173.937749] [drm] Chained IB support enabled!

At that moment, the Pi just completely locks up. So... something going on here that's killing the Pi, maybe a power issue? I'm going to pop the card in a couple different adapters and see if I can overcome it. Otherwise it could be a driver/SoC problem, and that ain't going to be fun.
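Before blaming power, it can help to sanity-check that the memory figures in the log are at least self-consistent. The sketch below recomputes the VRAM and GART sizes from the address ranges the driver printed and derives the page size implied by the GART page count. The regular expression and `range_sizes` are illustrative helpers, and the log lines are copied from the output above with timestamps removed.

```python
import re

MIB = 1 << 20

LOG = """\
amdgpu 0000:01:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
amdgpu 0000:01:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[drm] GART: num cpu pages 65536, num gpu pages 65536
"""

RANGE = re.compile(r"(VRAM|GART): (\d+)M (0x[0-9A-Fa-f]+) - (0x[0-9A-Fa-f]+)")


def range_sizes(log: str):
    """Return {name: (claimed_MiB, computed_MiB)} for each 'start - end' range."""
    sizes = {}
    for name, claimed, start, end in RANGE.findall(log):
        computed = (int(end, 16) - int(start, 16) + 1) // MIB  # end is inclusive
        sizes[name] = (int(claimed), computed)
    return sizes


sizes = range_sizes(LOG)
print(sizes)

# 256 MiB of GART split across 65536 pages implies the page size in use.
pages = int(re.search(r"num gpu pages (\d+)", LOG).group(1))
page_size = sizes["GART"][1] * MIB // pages
print("implied page size:", page_size)
```

The claimed and computed sizes agree, so nothing in these particular lines points at a bad address map; the hang happens after them.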

geerlingguy avatar Nov 04 '20 20:11 geerlingguy

Err... upon further reading of the above log from dmesg, it's getting more BAR MEM space errors. Going to try 2 GB like I did earlier and see if that might help.

Edit: Nope, same thing, same BAR MEM space allocation failures. I might try for 4 GB instead of 2 GB...

Edit 2: Apparently 0xffffffff is the maximum value allowed for that bit of the array, as I got an error that any higher values were out of the 32-bit range. So if it won't work in 4 GB, I might be outta luck, at least assuming it is a BAR issue.
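The limit is easy to verify with a short calculation, assuming each element of that array is an unsigned 32-bit value, as the error described above implies. `fits_in_u32` is a hypothetical helper for the illustration.

```python
GIB = 1 << 30
U32_MAX = 0xFFFFFFFF


def fits_in_u32(value: int) -> bool:
    """True when value can be stored in a single unsigned 32-bit cell."""
    return 0 <= value <= U32_MAX


for label, size in [("1 GB", 0x40000000), ("2 GB", 0x80000000), ("4 GB", 4 * GIB)]:
    print(f"{label}: {size:#x} fits={fits_in_u32(size)}")

# The largest storable value is one byte short of a full 4 GiB.
shortfall = 4 * GIB - U32_MAX
print("shortfall in bytes:", shortfall)
```

So 1 GB and 2 GB fit, a full 4 GB does not, and 0xffffffff is the closest value that does.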

geerlingguy avatar Nov 04 '20 20:11 geerlingguy

Dangit, same issue, so it seems I'm hitting:

[   73.734651] [drm] Not enough PCI address space for a large BAR.

It keeps trying to initialize, but stops at [ 73.739897] [drm] Chained IB support enabled! and won't progress any further; meanwhile, the entire Pi locks up. That message comes from here: https://github.com/raspberrypi/linux/blob/69b14a2e6d4e840c7609370dbf0bac847c3bb15c/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c#L1062

So maybe even if there's not enough address space for a large BAR, could it still work with a 'small BAR'? After all, it would be much safer to have no BAR in this time of Covid.

Alternatively, I'm building from the default branch of the raspberrypi/linux project on GitHub (rpi-5.4.y) — is it possible I need to be on a newer version? It looks like that's the latest version of that file, at least.

geerlingguy avatar Nov 04 '20 20:11 geerlingguy

Reading through some mailing list messages, I found this:

Now your Polaris 10 cards have either 8GB or 4GB installed on each board and additionally to the installed memory we need 2MB for each card for the doorbell bar. Since the assignments can basically only be done as a power of two we end up with a requirement of 16GB address space for the 8GB card and 8GB address space for the 4GB.

For compatibility reasons the cards only advertise a 256MB window for the video memory BAR to the BIOS on boot and we later try to resize that to the real size of the installed memory.

Following that to its conclusion, it seems this card requires 4 GB of BAR space, which I'm providing (well, maybe one byte less than that, dumb 32-bit integer!)... but maybe it doesn't like that there's one byte less. Or maybe it's hoping for 8 GB, which I just can't provide.

In any case:

Fortunately the driver manages to fallback to the original 256MB configuration and continues with that. That is a bit sub-optimal, but still not a real problem.

So it's something else. Going to try a powered connector and see if maybe it's a power issue.

geerlingguy avatar Nov 04 '20 21:11 geerlingguy

With the PCE164P-NO3 VER 006, I'm getting:

[    1.206474] brcm-pcie fd500000.pcie: link down

Also, after boot, the fan on the card goes to 100% and puts out quite a bit of air!

geerlingguy avatar Nov 04 '20 21:11 geerlingguy

Interesting, with this other adapter (a 2 port PCIe switch), I'm not getting the link down issue, and I see:

$ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 20)
01:00.0 PCI bridge: Pericom Semiconductor PI7C9X2G304 EL/SL PCIe2 3-Port/4-Lane Packet Switch (rev 05)
02:01.0 PCI bridge: Pericom Semiconductor PI7C9X2G304 EL/SL PCIe2 3-Port/4-Lane Packet Switch (rev 05)
02:02.0 PCI bridge: Pericom Semiconductor PI7C9X2G304 EL/SL PCIe2 3-Port/4-Lane Packet Switch (rev 05)
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]

So let's give it a go: modprobe amdgpu

[   75.713936] [drm] amdgpu kernel modesetting enabled.
[   75.714124] pci 0000:00:00.0: of_irq_parse_pci: failed with rc=-22
[   75.714153] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x640000000 -> 0x64fffffff
[   75.714161] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x650000000 -> 0x6501fffff
[   75.714169] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x600000000 -> 0x60003ffff
[   75.714270] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   75.714293] pci 0000:01:00.0: enabling device (0000 -> 0002)
[   75.714313] pci 0000:02:01.0: enabling device (0000 -> 0002)
[   75.714332] amdgpu 0000:03:00.0: enabling device (0000 -> 0002)
[   75.714806] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[   75.714841] [drm] register mmio base: 0x00000000
[   75.714846] [drm] register mmio size: 262144
[   75.714851] [drm] PCI I/O BAR is not found.
[   75.714860] [drm] PCIE atomic ops is not supported
[   75.714882] [drm] add ip block number 0 <vi_common>
[   75.714887] [drm] add ip block number 1 <gmc_v8_0>
[   75.714892] [drm] add ip block number 2 <tonga_ih>
[   75.714897] [drm] add ip block number 3 <gfx_v8_0>
[   75.714903] [drm] add ip block number 4 <sdma_v3_0>
[   75.714908] [drm] add ip block number 5 <powerplay>
[   75.714913] [drm] add ip block number 6 <dm>
[   75.714919] [drm] add ip block number 7 <uvd_v6_0>
[   75.714924] [drm] add ip block number 8 <vce_v3_0>
[   75.972711] ATOM BIOS: 113-36764-U61
[   75.972794] [drm] UVD is enabled in VM mode
[   75.972798] [drm] UVD ENC is enabled in VM mode
[   75.972805] [drm] VCE enabled in VM mode
[   75.972852] [drm] GPU posting now...
[   76.091407] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[   76.096689] amdgpu 0000:03:00.0: BAR 2: releasing [mem 0x650000000-0x6501fffff 64bit pref]
[   76.096698] amdgpu 0000:03:00.0: BAR 0: releasing [mem 0x640000000-0x64fffffff 64bit pref]
[   76.096749] pci 0000:02:01.0: BAR 9: releasing [mem 0x640000000-0x657ffffff 64bit pref]
[   76.096756] pci 0000:01:00.0: BAR 9: releasing [mem 0x640000000-0x657ffffff 64bit pref]
[   76.096761] pci 0000:00:00.0: BAR 9: releasing [mem 0x640000000-0x657ffffff 64bit pref]
[   76.096781] pci 0000:00:00.0: BAR 9: no space for [mem size 0xc0000000 64bit pref]
[   76.096785] pci 0000:00:00.0: BAR 9: failed to assign [mem size 0xc0000000 64bit pref]
[   76.096793] pci 0000:01:00.0: BAR 9: no space for [mem size 0xc0000000 64bit pref]
[   76.096797] pci 0000:01:00.0: BAR 9: failed to assign [mem size 0xc0000000 64bit pref]
[   76.096803] pci 0000:02:01.0: BAR 9: no space for [mem size 0xc0000000 64bit pref]
[   76.096807] pci 0000:02:01.0: BAR 9: failed to assign [mem size 0xc0000000 64bit pref]
[   76.096840] amdgpu 0000:03:00.0: BAR 0: no space for [mem size 0x80000000 64bit pref]
[   76.096845] amdgpu 0000:03:00.0: BAR 0: failed to assign [mem size 0x80000000 64bit pref]
[   76.096851] amdgpu 0000:03:00.0: BAR 2: no space for [mem size 0x00200000 64bit pref]
[   76.096856] amdgpu 0000:03:00.0: BAR 2: failed to assign [mem size 0x00200000 64bit pref]
[   76.096864] pci 0000:02:02.0: PCI bridge to [bus 04]
[   76.096885] pci 0000:00:00.0: PCI bridge to [bus 01-04]
[   76.096911] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[   76.096933] pci 0000:00:00.0: PCI bridge to [bus 01-04]
[   76.096940] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[   76.096958] pci 0000:00:00.0:   bridge window [mem 0x640000000-0x657ffffff 64bit pref]
[   76.096966] pci 0000:01:00.0: PCI bridge to [bus 02-04]
[   76.096987] pci 0000:01:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[   76.096994] pci 0000:01:00.0:   bridge window [mem 0x640000000-0x657ffffff 64bit pref]
[   76.097016] pci 0000:02:01.0: PCI bridge to [bus 03]
[   76.097025] pci 0000:02:01.0:   bridge window [mem 0x600000000-0x6000fffff]
[   76.097043] pci 0000:02:01.0:   bridge window [mem 0x640000000-0x657ffffff 64bit pref]
[   76.097079] [drm] Not enough PCI address space for a large BAR.
[   76.097098] amdgpu 0000:03:00.0: BAR 0: assigned [mem 0x640000000-0x64fffffff 64bit pref]
[   76.097131] amdgpu 0000:03:00.0: BAR 2: assigned [mem 0x650000000-0x6501fffff 64bit pref]
[   76.097185] amdgpu 0000:03:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[   76.097190] amdgpu 0000:03:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
[   76.097209] [drm] Detected VRAM RAM=2048M, BAR=256M
[   76.097213] [drm] RAM width 64bits GDDR5
[   76.100942] [TTM] Zone  kernel: Available graphics memory: 1944480 KiB
[   76.100948] [TTM] Initializing pool allocator
[   76.100960] [TTM] Initializing DMA pool allocator
[   76.101058] [drm] amdgpu: 2048M of VRAM memory ready
[   76.101069] [drm] amdgpu: 2848M of GTT memory ready.
[   76.101134] [drm] GART: num cpu pages 65536, num gpu pages 65536
[   76.102413] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[   76.106273] [drm] Chained IB support enabled!

A couple differences in the output... and when I ran modprobe, I noticed the fan started spinning slower. Not sure what to make of it. But this is using an external 5v-molex-adapted-to-floppy-connector power supply. It doesn't seem the most reliable contraption in any sense, as these connectors are very cheap quality, and it's a far cry from working inside a computer with a 300W+ quality power supply :)

geerlingguy avatar Nov 04 '20 21:11 geerlingguy

Interesting that it stops on the 'Chained IB support enabled' message, as I've noticed your MEM and IB MEM sections have overlapping PCI address space mappings (assuming IB refers to the same thing):

[    1.011281] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.011338] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x067fffffff -> 0x00c0000000
[    1.011392] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
...
[    1.059610] pci_bus 0000:00: root bus resource [mem 0x600000000-0x67fffffff] (bus address [0xc0000000-0x13fffffff])

You have 2GiB of BAR space, and the last 1GiB is in the IB MEM range.
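
To show where that 1GiB figure comes from, here is a small Python sketch that computes the overlap from the dmesg values above. The helper names are hypothetical, and it only does interval arithmetic on the PCI-side addresses:

```python
def pci_window(pci_start, size):
    """Return an inclusive (start, end) PCI address range."""
    return pci_start, pci_start + size - 1


def overlap(a, b):
    """Return the inclusive overlap of two ranges, or None if they are disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


# Sizes are derived from the CPU-side ranges in the dmesg lines above.
mem = pci_window(0x00C0000000, 0x067FFFFFFF - 0x0600000000 + 1)     # outbound MEM
ib_mem = pci_window(0x0100000000, 0x00FFFFFFFF - 0x0000000000 + 1)  # inbound IB MEM

shared = overlap(mem, ib_mem)
print(hex(shared[0]), hex(shared[1]))          # 0x100000000 0x13fffffff
print((shared[1] - shared[0] + 1) // 1024**3)  # 1
```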

Could you paste the 'ranges' and 'dma-ranges' lines from the pcie section in your device tree? I'm not sure how the IB MEM section ended up there.

elFarto avatar Nov 06 '20 11:11 elFarto

So today for fun I tried the following:

  1. Flashed Pi OS 64-bit (full GUI) to microSD card.
  2. Cross-compiled with amdgpu driver enabled
  3. Booted the device.
  4. Blacklisted amdgpu modules by creating /etc/modprobe.d/blacklist-amdgpu.conf with contents blacklist amdgpu.
  5. Installed AMD firmware: sudo apt install -y firmware-amd-graphics
  6. Increased BAR space to maximum of 4 GB (originally written as 2 GB; value 0xffffffff).
  7. Rebooted (card still not plugged in). Made sure Pi booted correctly. Then shut down.
  8. Plugged in the card via dumb 16x to 1x adapter.

The card started its 'normal' fan routine (where it spins up, stops, then spins at a nice calm rate). Sometimes it goes into 'EVIL FAN' mode where it goes max speed and I know the card didn't power up correctly.

$ lspci
00:00.0 PCI bridge: Broadcom Limited Device 2711 (rev 20)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]

And @elFarto the lines from the decompiled device tree are:

                pcie@7d500000 {
                        compatible = "brcm,bcm2711-pcie";
                        reg = < 0x00 0x7d500000 0x00 0x9310 >;
                        device_type = "pci";
                        #address-cells = < 0x03 >;
                        #interrupt-cells = < 0x01 >;
                        #size-cells = < 0x02 >;
                        interrupts = < 0x00 0x94 0x04 0x00 0x94 0x04 >;
                        interrupt-names = "pcie\0msi";
                        interrupt-map-mask = < 0x00 0x00 0x00 0x07 >;
                        interrupt-map = < 0x00 0x00 0x00 0x01 0x01 0x00 0x8f 0x04 >;
                        msi-controller;
                        msi-parent = < 0x2a >;
                        ranges = <0x02000000 0x0 0xc0000000 0x6 0x00000000 0x0 0xffffffff>;
                        dma-ranges = < 0x2000000 0x00 0x00 0x00 0x00 0x00 0xc0000000 >;
                        brcm,enable-ssc;
                        brcm,enable-l1ss;
                        phandle = < 0x2a >;
                };

I ran sudo modprobe amdgpu and dmesg --follow died on [ 133.508246] [drm] Chained IB support enabled! again.

@elFarto - Are you thinking my ranges/dma-ranges may be out of whack, maybe causing some memory addresses to be overwritten? Wouldn't be the first time (to be honest my brain kind of collapses sometimes working with this stuff).

geerlingguy avatar Nov 11 '20 19:11 geerlingguy

I'm not entirely sure what's going on, but I don't think those ranges are correct. Firstly 0xffffffff is 4GiB - 1, not 2GiB :). Next, based on your dmesg output way above, here's where everything gets mapped (reformatted to make it easier to read):

           CPU Addresses		 PCI Addresses
ranges     0x0600000000..0x067fffffff -> 0x00c0000000..0x13fffffff
dma-ranges 0x0000000000..0x00ffffffff -> 0x0100000000..0x1ffffffff

But...based on your decompiled device tree, I can't see why the dma-ranges gets pushed up to 0x01'0000'0000, since that's not what's specified (assuming you didn't change the dma-ranges in the device tree that dmesg came from). PhilE did say on the RasPi forums that something (firmware?) patches the device tree, so maybe that's what's happening here (maybe you can retrieve the device tree that's actually loaded from sysfs, rather than from the filesystem?).
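
A rough Python sketch of reading the live values, assuming the running tree is exposed under /proc/device-tree (on most Linux systems a link into /sys/firmware/devicetree/base) and that properties are stored as big-endian 32-bit cells. The node path is taken from the dmesg output; the function names are made up:

```python
import struct
from pathlib import Path

# Assumed location of the live tree; the node path comes from the dmesg output.
NODE = Path("/proc/device-tree/scb/pcie@7d500000")


def decode_cells(raw):
    """Decode a raw device tree property into a list of big-endian 32-bit cells."""
    if len(raw) % 4:
        raise ValueError("property length is not a multiple of 4 bytes")
    return list(struct.unpack(f">{len(raw) // 4}I", raw))


def read_property(name, node=NODE):
    """Return the cells of one property, or None when the node is not present."""
    path = node / name
    if not path.exists():
        return None
    return decode_cells(path.read_bytes())


for prop in ("ranges", "dma-ranges"):
    cells = read_property(prop)
    print(prop, [hex(c) for c in cells] if cells is not None else "not found")
```

Comparing that output against the decompiled .dtb on disk should show whether anything was patched at boot.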

With that said, the last device tree you've pasted has this layout (assuming the dma-ranges gets changed the same way):

           CPU Addresses		 PCI Addresses
ranges     0x0600000000..0x06ffffffff -> 0x00c0000000..0x1bfffffff
dma-ranges 0x0000000000..0x00ffffffff -> 0x0100000000..0x1ffffffff

Add to that the MSI target address which is either 0x0'ffff'fffc if dma-ranges start address is >= 0x01'0000'0000 or 0xf'ffff'fffc if it's less, we end up with 3 overlaps...I think.

You can have sizes over 4GB. Here's roughly how the ranges and dma-ranges fields are structured:

ranges = <0x02000000 0x0 0xc0000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x0 0xffffffff>;//Size (4GiB - 1)
                     
dma-ranges = < 0x2000000 0x00 0x00 //PCI address
			 0x00 0x00 //CPU address
			 0x00 0xc0000000 >; //Size (3GiB)

After the first cell, the remaining cells are paired, each pair making a 64-bit integer. You can also have multiple ranges (but not multiple dma-ranges, that's not supported on the Pi). So if you wanted an 8GiB BAR size, you could do this (not sure this one will work due to the alignment):

ranges = <0x02000000 0x0 0xc0000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x2 0x00000000>;//Size (8GiB)
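
To make the cell pairing concrete, here is a small Python sketch that decodes one 7-cell entry into 64-bit values. The `Range` type and the function names are hypothetical, and it assumes exactly the layout described above (one flags cell followed by three high/low pairs):

```python
from typing import NamedTuple


class Range(NamedTuple):
    flags: int     # first cell (0x02000000 in the examples above)
    pci_addr: int  # 64-bit PCI address
    cpu_addr: int  # 64-bit CPU address
    size: int      # 64-bit size


def pair(hi, lo):
    """Join two 32-bit cells into one 64-bit integer."""
    return (hi << 32) | lo


def parse_range(cells):
    """Parse one 7-cell entry: flags, then three (high, low) pairs."""
    if len(cells) != 7:
        raise ValueError("expected 7 cells")
    flags, *rest = cells
    return Range(flags, pair(rest[0], rest[1]), pair(rest[2], rest[3]), pair(rest[4], rest[5]))


r = parse_range([0x02000000, 0x0, 0xC0000000, 0x6, 0x00000000, 0x2, 0x00000000])
print(hex(r.pci_addr), hex(r.cpu_addr), r.size // 1024**3, "GiB")
# 0xc0000000 0x600000000 8 GiB
```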

Now what we need is an address space that fits everything in, without overlapping. Maybe something like this:

ranges = <0x02000000 0x2 0x00000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x2 0x00000000>;//Size (8GiB)
                     
dma-ranges = < 0x2000000 0x00 0x00 //PCI address
			 0x00 0x00 //CPU address
			 0x00 0xc0000000 >; //Size

I have no idea if that'll work, we're using a lot of PCI address space, and I can't see any details on how much it supports, so :shrug:, but nothing should be overlapping.

You could pare back the ranges size to 4GiB if that doesn't work.

elFarto avatar Nov 11 '20 20:11 elFarto

After applying the ranges in your post (8GB) it did seem to boot, and I got:

[    0.901049] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.901068] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.901126] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0200000000
[    0.901182] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    1.218107] brcm-pcie fd500000.pcie: link down

Note that I'm on a CM4 with 4 GB of memory—can I set the BAR space larger than the system RAM?

Now when I try sudo modprobe amdgpu I get:

[   37.923447] [drm] amdgpu kernel modesetting enabled.
[   37.923669] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x600000000 -> 0x60fffffff
[   37.923675] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x610000000 -> 0x6101fffff
[   37.923708] pci 0000:00:00.0: enabling device (0000 -> 0002)
[   37.923722] amdgpu 0000:01:00.0: enabling device (0000 -> 0002)
[   37.924048] [drm] initializing kernel modesetting (POLARIS12 0x1002:0x699F 0x1DA2:0xE367 0xC7).
[   37.924062] amdgpu 0000:01:00.0: Fatal error during GPU init
[   37.926646] amdgpu: probe of 0000:01:00.0 failed with error -12

Edit: Also, after reboots sometimes the GPU fan just goes ballistic (highest speed) and I get the PCIe 'link is down' in dmesg. I have to completely power off before the card seems to go back into not-panicking mode.

geerlingguy avatar Nov 11 '20 21:11 geerlingguy

Yes, you can set it larger than the RAM size, since we're only allocating address space, not RAM itself, and the Pi has 32GiB worth of address space.

The "link down" issue might just be that the driver isn't waiting long enough. Currently it's hardcoded (in the pcie-brcmstb.c file) to wait 100ms for the link to establish.

Not sure on the other error, it seems to occur just before the 'register mmio base' line is printed, which seems to have something to do with the PCI BARs. Could you paste the dmesg for that boot, specifically the BAR mappings it was assigned?

elFarto avatar Nov 11 '20 21:11 elFarto

With:

                        ranges = <0x02000000 0x2 0x00000000 0x6 0x00000000 0x2 0x00000000>;
                        dma-ranges = < 0x2000000 0x00 0x00 0x00 0x00 0x00 0xc0000000 >;

I end up getting the following in dmesg after reboot with the card connected:

[    0.900642] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.900661] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.900718] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x07ffffffff -> 0x0200000000
[    0.900774] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    0.948085] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    0.948383] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    0.948399] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.948414] pci_bus 0000:00: root bus resource [mem 0x600000000-0x7ffffffff] (bus address [0x200000000-0x3ffffffff])
[    0.948466] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    0.948684] pci 0000:00:00.0: PME# supported from D0 D3hot
[    0.952283] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    0.952485] pci 0000:01:00.0: [1002:699f] type 00 class 0x030000
[    0.952600] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
[    0.952641] pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
[    0.952669] pci 0000:01:00.0: reg 0x20: [io  0x0000-0x00ff]
[    0.952696] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x0003ffff]
[    0.952723] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[    0.952753] pci 0000:01:00.0: enabling Extended Tags
[    0.953028] pci 0000:01:00.0: supports D1 D2
[    0.953039] pci 0000:01:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.953101] pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x1 link at 0000:00:00.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    0.953242] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.953319] pci 0000:01:00.1: [1002:aae0] type 00 class 0x040300
[    0.953408] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[    0.953516] pci 0000:01:00.1: enabling Extended Tags
[    0.953706] pci 0000:01:00.1: supports D1 D2
[    0.957171] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    0.957211] pci 0000:00:00.0: BAR 9: assigned [mem 0x600000000-0x617ffffff 64bit pref]
[    0.957225] pci 0000:00:00.0: BAR 8: no space for [mem size 0x00100000]
[    0.957236] pci 0000:00:00.0: BAR 8: failed to assign [mem size 0x00100000]
[    0.957254] pci 0000:01:00.0: BAR 0: assigned [mem 0x600000000-0x60fffffff 64bit pref]
[    0.957291] pci 0000:01:00.0: BAR 2: assigned [mem 0x610000000-0x6101fffff 64bit pref]
[    0.957325] pci 0000:01:00.0: BAR 5: no space for [mem size 0x00040000]
[    0.957335] pci 0000:01:00.0: BAR 5: failed to assign [mem size 0x00040000]
[    0.957348] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00020000 pref]
[    0.957358] pci 0000:01:00.0: BAR 6: failed to assign [mem size 0x00020000 pref]
[    0.957370] pci 0000:01:00.1: BAR 0: no space for [mem size 0x00004000 64bit]
[    0.957381] pci 0000:01:00.1: BAR 0: failed to assign [mem size 0x00004000 64bit]
[    0.957391] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0100]
[    0.957402] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0100]
[    0.957414] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.957437] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x617ffffff 64bit pref]
[    0.957535] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0

geerlingguy avatar Nov 11 '20 21:11 geerlingguy

Ok, I was worried about that. The ranges setting is purely 64-bit, and if there are 32-bit only BARs there are no valid addresses for them to use. So....I guess we allocate MORE BAR SPACE!:

ranges = <0x02000000 0x2 0x00000000  //PCI address
                     0x6 0x00000000  //CPU address
                     0x2 0x00000000  //Size (8GiB 64-bit only)
          0x02000000 0x0 0x00000000  //PCI address
                     0x4 0x00000000  //CPU address
                     0x0 0x80000000  //Size (2GiB 32-bit)
                     >;
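
A simplified Python sketch of why the second window matters: a 32-bit BAR can only hold a PCI address below 4GiB, so a window is usable by such a BAR only if it starts below that limit. This ignores prefetchable flags and alignment, and the names are made up for illustration:

```python
FOUR_GIB = 1 << 32


def usable_by_32bit_bar(pci_start, size):
    """True when at least part of the window lies below the 4 GiB PCI address limit."""
    return size > 0 and pci_start < FOUR_GIB


windows = {
    "8GiB window": (0x2_0000_0000, 0x2_0000_0000),  # (PCI address, size)
    "2GiB window": (0x0_0000_0000, 0x0_8000_0000),
}
for name, (pci_start, size) in windows.items():
    print(name, "32-bit capable" if usable_by_32bit_bar(pci_start, size) else "64-bit only")
```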

Now, do we need 10GiB of BAR space? To that I answer, who are you and what have you done with the real Jeff :)

Edit: Might need to make the CPU address 0x5'0000'0000 on the second allocation, 0x4'0000'0000 is mapped to 'L2 Cached (allocating)', not sure what that is.

elFarto avatar Nov 11 '20 21:11 elFarto