freebsd-wifibox

Qualcomm Atheros QCNFA765: ath11k_pci crashing

Open Defenso-QTH opened this issue 1 year ago • 7 comments

Description

Appliance seems to detect the hardware but firmware crashes on load.

mhi mhi0: Direct firmware load for ath11k/WCN6855/hw2.1/amss.bin failed with error -2

Host operating system

FreeBSD 13.3-RELEASE-p7 GENERIC amd64

Wireless NIC

Qualcomm Technologies, Inc
QCNFA765 Wireless Network Adapter

Wifibox version

0.14.0

Disk image type and version

wifibox-alpine 20240911

Changes to the default configuration files

No response

Logs

Unable to post now (no network)

Additional context

Add any other context about the problem here that might help the investigation.

Have you tried to turn it on and off?

  • [X] Yes, I have read all the manual pages first!

Defenso-QTH avatar Oct 30 '24 15:10 Defenso-QTH

It seems /lib/firmware/ath11k/WCN6855/hw2.1/ directory is missing from the Alpine appliance image.

Defenso-QTH avatar Oct 31 '24 04:10 Defenso-QTH

Thanks for reporting the issue! Based on this, I believe I was able to identify the root cause. Some of the files shipped with linux-firmware are symbolic links that are not stored in the git repository (from which the contents of the respective tarball are extracted); they have to be added by calling a package builder script.

I created a fix for that in the fix/net/wifibox-alpine/linux-firmware-symlinks branch of the pgj/freebsd-wifibox-port repository. Please try it by reinstalling the net/wifibox-alpine port (version 20241101) from there:

https://github.com/pgj/freebsd-wifibox-port/tree/fix/net/wifibox-alpine/linux-firmware-symlinks

pgj avatar Nov 01 '24 18:11 pgj

Thank you. Now it does not crash but it seems the driver is waiting for something that never happens:

[    0.603295] ath11k_pci 0000:00:06.0: MSI vectors: 1
[    0.603356] ath11k_pci 0000:00:06.0: wcn6855 hw2.1
[    0.605637] NET: Registered PF_QIPCRTR protocol family
[    0.761336] mhi mhi0: Requested to power ON
[    0.761345] mhi mhi0: Power on setup success
[    0.847684] mhi mhi0: Wait for device to enter SBL or Mission mode

I also upgraded the host to 14.1-RELEASE but I do not think that makes any difference.

Defenso-QTH avatar Nov 04 '24 09:11 Defenso-QTH

I have done some investigation, and it seems this is a known bug of the ath11k driver. Essentially, the driver does not cope well with running in a virtualized environment because it assumes that the location of the MSI table matches that of the host. There is a patch that may address this boot issue; we can take a chance with it if you are available for testing.

pgj avatar Nov 05 '24 07:11 pgj

Thanks a lot for the research! Of course I would be glad to help with the testing. Can you apply the patch to the branch?

Defenso-QTH avatar Nov 06 '24 04:11 Defenso-QTH

Unfortunately, I have just noticed that this patch has been made part of both Linux 6.6.50 and Linux 6.10.9 that are integrated into wifibox-alpine 20240911. This means that the problem must be with something else in this case.

pgj avatar Nov 06 '24 07:11 pgj

Well that's sad. Please let me know if you think of anything else or if there is some more data I could give to help solve this.

For the record this is the wifi chip for Lenovo T14s Gen4 AMD laptops.

Defenso-QTH avatar Nov 06 '24 15:11 Defenso-QTH

This also appears to be an issue for the Thinkpad P14s Gen 5 AMD, which may have very similar internals to the T14s. I can help debug if there's anything to try.

spuriousdata avatar Apr 09 '25 03:04 spuriousdata

Have you tried with the latest alpine image? I saw it was updated a couple weeks ago.

Defenso-QTH avatar Apr 09 '25 07:04 Defenso-QTH

I was using whichever one pkg installed yesterday. I think it was dated from March.

spuriousdata avatar Apr 09 '25 07:04 spuriousdata

Based on my earlier comment above, I believe the reason for what you see is that the ath11k driver still does not know how to deal with virtualization. There seems to be a workaround, though: a patch that lets one at least feed the MSI information to the driver through the host_msi_vector_addr and host_msi_vector_data module parameters.

The patch is not present in either 6.12 or 6.13, not even in 6.15. So this is something we could try to add to the Wifibox/Alpine kernel. Looks like the required parameters might be obtained using the FreeBSD port of the lspci utility (as part of the sysutils/pciutils package):

# lspci -s 03:00.00 -vv | fgrep "Data"
                Address: 00000000fee00000  Data: 003d
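To make it easier to collect the two values for the host_msi_vector_addr and host_msi_vector_data parameters, the lspci output can be filtered a bit further. The following is only a minimal sketch: msi_addr_data is a hypothetical helper name, and it assumes the "Address: ... Data: ..." line layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output on stdin and print the MSI
# address and data as "address data", one pair per matching line.
msi_addr_data() {
    awk '/Address:/ && /Data:/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "Address:") addr = $(i + 1)
            if ($i == "Data:")    data = $(i + 1)
        }
        print addr, data
    }'
}

# Sample line taken from the output quoted above.
printf '%s\n' '                Address: 00000000fee00000  Data: 003d' | msi_addr_data
```

On a real system the input would come from the command itself, for example `lspci -s 03:00.0 -vv | msi_addr_data`.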

pgj avatar Apr 10 '25 06:04 pgj

That Data line shows all zeros for both the address and the data.

root@thinkpad:~ # lspci -s 2:00.00 -vv | grep Data
                Address: 00000000  Data: 0000

lspci-full.txt

spuriousdata avatar Apr 10 '25 07:04 spuriousdata

Same here:

	Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit-
		Address: 00000000  Data: 0000
		Masking: 00000000  Pending: 00000000

qualcomm_lspci.txt

Defenso-QTH avatar Apr 10 '25 09:04 Defenso-QTH

Uhm, that is interesting. I assume that the patch above would not help that much.

pgj avatar Apr 10 '25 21:04 pgj

Aw, now I see. MSI is not enabled for the card, hence we have these empty values. That is indicated by Enable-.
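For repeated checks, the flag can be pulled out of the lspci output with a small filter. This is a sketch only: msi_state is a hypothetical name, and it assumes the "MSI: Enable+" / "MSI: Enable-" notation seen in the outputs above.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output on stdin and print "enabled"
# or "disabled" for each MSI capability line. MSI-X lines are not matched.
msi_state() {
    sed -n 's/.*MSI: Enable\([+-]\).*/\1/p' | while read -r flag; do
        case "$flag" in
            +) echo enabled ;;
            -) echo disabled ;;
        esac
    done
}

# Sample line taken from the reports above.
printf '%s\n' 'Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit-' | msi_state
```

With live data this would be something like `lspci -s 2:0.0 -vv | msi_state`.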

pgj avatar Apr 10 '25 21:04 pgj

I am curious then what the following command says for you:

sysctl hw.pci.enable_msi

pgj avatar Apr 10 '25 21:04 pgj

That sysctl seems to be enabled for me:

root@thinkpad:~ $ sysctl hw.pci.enable_msi
hw.pci.enable_msi: 1

spuriousdata avatar Apr 11 '25 00:04 spuriousdata

Yeah, thanks for the response. For what it is worth, that is what I expected. The Linux MSI documentation (since lspci is a Linux tool) has some hints about why MSI may be marked as disabled for a device, but they cannot be applied verbatim to FreeBSD.

Another chance that could save us here is disabling the MSI blacklisting mechanism, for which the following should be added to loader.conf(5) and as such requires a reboot:

hw.pci.honor_msi_blacklist=0

pgj avatar Apr 11 '25 06:04 pgj

I added that and rebooted and tried again. Basically the same result in that there is no wlan0 device inside the vm, but this time there is slightly more info in dmesg:

wifibox:~# dmesg | egrep 'wlan|ath11k'
[    0.654536] ath11k_pci 0000:00:06.0: BAR 0 [mem 0x800000000-0x8001fffff 64bit]: assigned
[    0.654787] ath11k_pci 0000:00:06.0: MSI vectors: 1
[    0.654846] ath11k_pci 0000:00:06.0: wcn6855 hw2.1
[   22.622319] ath11k_pci 0000:00:06.0: failed to power up mhi: -110
[   22.622324] ath11k_pci 0000:00:06.0: failed to start mhi: -110
[   22.622327] ath11k_pci 0000:00:06.0: failed to power up :-110
[   22.625492] ath11k_pci 0000:00:06.0: failed to create soc core: -110
[   22.625500] ath11k_pci 0000:00:06.0: failed to init core: -110
[   22.625521] Modules linked in: qrtr ath11k_pci(+) ath11k qmi_helpers mhi
[   22.625602]  ath11k_pcic_free_irq+0x51/0xe0 [ath11k]
[   22.625617]  ath11k_pci_probe+0x7d8/0x800 [ath11k_pci]
[   22.625665]  ? ath11k_pci_get_msi_irq+0x10/0x10 [ath11k_pci]
[   22.625670]  ath11k_pci_init+0x1b/0x30 [ath11k_pci]
[   22.625674]  ? ath11k_pci_get_msi_irq+0x10/0x10 [ath11k_pci]
[   22.659158] ath11k_pci 0000:00:06.0: probe with driver ath11k_pci failed with error -110
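For reference, the negative numbers in these messages are Linux errno values, since the log comes from the Linux guest. The sketch below maps only the two codes that appear in this thread; linux_errno_name is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: translate the negative Linux errno values seen in
# this thread into their symbolic names. Other values are not covered.
linux_errno_name() {
    case "$1" in
        -2)   echo "ENOENT (No such file or directory)" ;;
        -110) echo "ETIMEDOUT (Connection timed out)" ;;
        *)    echo "unknown" ;;
    esac
}

linux_errno_name -2     # the original firmware load failure
linux_errno_name -110   # the mhi power-up failure above
```

So the -110 here would be a timeout, which fits the roughly 22 seconds between the "wcn6855 hw2.1" line and the first failure message.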

spuriousdata avatar Apr 11 '25 09:04 spuriousdata

Changing anything around the MSI will not help immediately, I believe. That is because the ath11k_pci Linux driver will not be able to discover the MSI vectors automatically. We will have to add the patch I was talking about above and feed the driver with the address and data values (which should not be zeroes).

That said, I would first check if lspci can see that MSI is enabled, just as you have queried it earlier. If the answer is positive, and now we have a non-zero address and data for the card, I could add the kernel patch and work out a mechanism to configure the MSI address and data, and then we could try again.

pgj avatar Apr 12 '25 06:04 pgj

Unfortunately, the MSI stuff still shows up as all zeros with that tunable set.

root@thinkpad:~ # sysctl hw.pci.honor_msi_blacklist
hw.pci.honor_msi_blacklist: 0

root@thinkpad:~ # lspci -s 02:00.00 -vv | grep -A2 MSI
	Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit-
		Address: 00000000  Data: 0000
		Masking: 00000000  Pending: 00000000

spuriousdata avatar Apr 12 '25 06:04 spuriousdata

What does pciconf say by the way?

pciconf -lc pci0:2:0:0 | fgrep MSI

pgj avatar Apr 12 '25 18:04 pgj

Apparently, pciutils has another tool, called setpci, which can be used to enable MSI explicitly as follows:

setpci -s 2:0.0 COMMAND=0510

pgj avatar Apr 12 '25 19:04 pgj

root@thinkpad:~ # pciconf -lc pci0:2:0:0 | grep MSI
    cap 05[50] = MSI supports 32 messages, vector masks

setpci doesn't seem to have changed anything:

root@thinkpad:~ # setpci -s 2:0.0 COMMAND=0510

root@thinkpad:~ # lspci -s 2:0.0 -vv | grep -A2 MSI
	Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit-
		Address: 00000000  Data: 0000
		Masking: 00000000  Pending: 00000000

I wouldn't be surprised if tools like that didn't work correctly on FreeBSD, even though the port exists and runs. I think they're really expecting to be talking to a Linux kernel.

spuriousdata avatar Apr 12 '25 19:04 spuriousdata

For what it is worth, it worked for me... Even though the MSI was enabled, setpci seemed to trigger the recreation of the vector as its coordinates changed. It confused the running Wifibox instance for a moment.

Anyhow, pciconf (FreeBSD) itself says that MSI is not enabled for that card. If it were enabled, it would have shown something like this:

    cap 05[d0] = MSI supports 1 message, 64 bit enabled with 1 message

The source code of pciconf indicates that the enabled with N message is displayed only if MSI is enabled for the device. Perhaps FreeBSD itself has a tool to turn that on.
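That check can be scripted as well. A minimal sketch, assuming the "enabled with" wording quoted above; msi_enabled_pciconf is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read `pciconf -lc` output on stdin and succeed only
# if an MSI capability line reports "enabled with". MSI-X lines are ignored.
msi_enabled_pciconf() {
    grep 'MSI supports' | grep -q 'enabled with'
}

# Sample line taken from the report above.
if printf '%s\n' '    cap 05[50] = MSI supports 32 messages, vector masks' | msi_enabled_pciconf; then
    echo "MSI enabled"
else
    echo "MSI not enabled"
fi
```

On the host, the input would be `pciconf -lc pci0:2:0:0 | msi_enabled_pciconf`.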

pgj avatar Apr 12 '25 19:04 pgj

In the meantime, you could try to boot your FreeBSD system with verbose messages and check dmesg for information about the MSI allocation and activation. Maybe it will show why MSI could not be enabled.

pgj avatar Apr 12 '25 20:04 pgj

Interesting. I stand corrected. I can't seem to find any way in FreeBSD to enable or disable MSI per device, other than the MSI blacklist.

spuriousdata avatar Apr 12 '25 21:04 spuriousdata

The dmesg with verbose kernel output has tons and tons of stuff, but there don't seem to be any errors. As far as I can tell, the relevant blocks about this card are these two:

found-> vendor=0x17cb, dev=0x1103, revid=0x01
        domain=0, bus=2, slot=0, func=0
        class=02-80-00, hdrtype=0x00, mfdev=0
        cmdreg=0x0000, statreg=0x0010, cachelnsz=8 (dwords)
        lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
        powerspec 3  supports D0 D3  current D0
        MSI supports 32 messages, vector masks
        map[10]: type Memory, range 64, base 0x90600000, size 21, memory disabled
pcib2: allocated memory range (0x90600000-0x907fffff) for rid 10 of pci0:2:0:0

and

found-> vendor=0x17cb, dev=0x1103, revid=0x01
        domain=0, bus=2, slot=0, func=0
        class=02-80-00, hdrtype=0x00, mfdev=0
        cmdreg=0x0002, statreg=0x0010, cachelnsz=8 (dwords)
        lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
        powerspec 3  supports D0 D3  current D0
        MSI supports 32 messages, vector masks
pci0:2:0:0: reprobing on driver added

The second one occurs multiple times. I thought maybe the part about memory disabled was an indication of a problem but there are many other devices that show that as well.
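The two blocks differ only in cmdreg (0x0000 versus 0x0002). As a rough sketch, the value can be decoded with the standard PCI command register bit layout; decode_pci_command and the short bit names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the PCI command register bits set in a value
# such as 0x0002. Bit names are informal labels for bits 0..10.
decode_pci_command() {
    value=$(( $1 ))
    names="io_space mem_space bus_master special_cycles mwi vga_snoop parity_resp reserved serr fast_b2b intx_disable"
    bit=0
    out=""
    for name in $names; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out }$name"
        fi
        bit=$(( bit + 1 ))
    done
    echo "${out:-none}"
}

decode_pci_command 0x0000   # none
decode_pci_command 0x0002   # mem_space
decode_pci_command 0x0510   # mwi serr intx_disable
```

Read this way, the second block only shows that memory decoding got switched on. The command register does not carry the MSI enable bit, which lives in the MSI capability itself, so cmdreg says nothing about why MSI stays off. The last call decodes the value written with setpci earlier.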

spuriousdata avatar Apr 12 '25 23:04 spuriousdata

I have done some further investigation, and it seems that pciconf itself could be used to send commands to PCI devices with the -w flag (and replace setpci). But since it does not seem to be documented well enough, it looks risky to do so.

Nevertheless, I collected verbose MSI-related events from the dmesg output and they looked as follows:

found-> vendor=0x8086, dev=0x1916, revid=0x07
        domain=0, bus=0, slot=2, func=0
        class=03-00-00, hdrtype=0x00, mfdev=0
        cmdreg=0x0007, statreg=0x0010, cachelnsz=0 (dwords)
        lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
        intpin=a, irq=11
        powerspec 2  supports D0 D3  current D0
        MSI supports 1 message
        map[10]: type Memory, range 64, base 0xe0000000, size 24, enabled

and then later:

ppt0: attempting to allocate 1 MSI vectors (1 supported)
msi: routing MSI IRQ 133 to local APIC 0 vector 59
ppt0: using IRQ 133 for MSI

I have unloaded the vmm kernel module and then lspci showed me MSI: Enable- for the device (as it had no driver attached). This suggests to me that it is the driver itself that should be activating MSI, which is ppt (the bhyve PCI pass-through driver) in this case. It is part of the vmm module, so once it is loaded and ppt is configured for the device, it should just work.

You can try reserving the device for ppt by adding the following line to loader.conf(5) (and restart your system), if you have not done so:

pptdevs="2/0/0"
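Note that lspci prints the bus, slot, and function numbers in hexadecimal, while the pptdevs value is assumed here to use the decimal bus/slot/function form that pciconf shows. For devices on higher bus numbers the two differ, so a small conversion may help. This is only a sketch and bdf_to_pptdev is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: convert an lspci-style "bus:slot.func" address
# (hexadecimal) into a decimal "bus/slot/func" string.
bdf_to_pptdev() {
    bus=${1%%:*}
    rest=${1#*:}
    slot=${rest%%.*}
    func=${rest#*.}
    echo "$(( 0x$bus ))/$(( 0x$slot ))/$(( 0x$func ))"
}

bdf_to_pptdev 02:00.0   # 2/0/0
bdf_to_pptdev 0a:1f.3   # 10/31/3
```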

Another possible reason for MSI being disabled is the BIOS: MSI may be turned off either somewhere in the settings or in the firmware itself due to instabilities.

pgj avatar Apr 13 '25 08:04 pgj

Another update. Looks like one does not have to unload vmm, it is enough if no bhyve VM runs with PCI pass-through configured in order to get the MSI: Enable- status in the lspci output.

pgj avatar Apr 13 '25 10:04 pgj