freebsd-wifibox
Qualcomm Atheros QCNFA765: ath11k_pci crashing
Description
Appliance seems to detect the hardware but firmware crashes on load.
mhi mhi0: Direct firmware load for ath11k/WCN6855/hw2.1/amss.bin failed with error -2
Host operating system
FreeBSD 13.3-RELEASE-p7 GENERIC amd64
Wireless NIC
Qualcomm Technologies, Inc
QCNFA765 Wireless Network Adapter
Wifibox version
0.14.0
Disk image type and version
wifibox-alpine 20240911
Changes to the default configuration files
No response
Logs
Unable to post now (no network)
Additional context
Add any other context about the problem here that might help the investigation.
Have you tried to turn it on and off?
- [X] Yes, I have read all the manual pages first!
It seems /lib/firmware/ath11k/WCN6855/hw2.1/ directory is missing from the Alpine appliance image.
Thanks for reporting the issue! Based on this, I believe I was able to identify the root cause. Some of the files shipped with linux-firmware are symbolic links that are not stored in the git repository (from which the contents of the respective tarball are extracted); instead, they have to be created by running a package builder script.
I created a fix for that in the fix/net/wifibox-alpine/linux-firmware-symlinks branch of the pgj/freebsd-wifibox-port repository. Please try it by reinstalling the net/wifibox-alpine port (version 20241101) from there:
https://github.com/pgj/freebsd-wifibox-port/tree/fix/net/wifibox-alpine/linux-firmware-symlinks
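To confirm from inside the guest whether the files requested by the driver are actually present, something along these lines might help. This is only a minimal sketch: `missing_firmware` is a hypothetical helper name, and the firmware root and file list are taken from the error message in the report.

```shell
# Print every firmware file (relative path) that is missing under a root.
# Usage: missing_firmware ROOT FILE...
missing_firmware() {
    root=$1
    shift
    for f in "$@"; do
        # -e follows symbolic links, so a dangling link counts as missing.
        [ -e "$root/$f" ] || printf '%s\n' "$f"
    done
}

# Inside the guest, for example:
#   missing_firmware /lib/firmware ath11k/WCN6855/hw2.1/amss.bin
```

An empty output means that every listed file could be resolved, either directly or through a symbolic link.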
Thank you. Now it does not crash but it seems the driver is waiting for something that never happens:
[ 0.603295] ath11k_pci 0000:00:06.0: MSI vectors: 1
[ 0.603356] ath11k_pci 0000:00:06.0: wcn6855 hw2.1
[ 0.605637] NET: Registered PF_QIPCRTR protocol family
[ 0.761336] mhi mhi0: Requested to power ON
[ 0.761345] mhi mhi0: Power on setup success
[ 0.847684] mhi mhi0: Wait for device to enter SBL or Mission mode
I also upgraded the host to 14.1-RELEASE but I do not think that makes any difference.
I have done some investigation, and it seems this is a known bug of the ath11k driver. Essentially, the driver does not cope well with running in a virtualized environment because it assumes that the location of the MSI table matches that of the host. There is a patch that may address this boot issue; we can take a chance with it if you are available for testing.
Thanks a lot for the research! Of course I would be glad to help with the testing. Can you apply the patch to the branch?
Unfortunately, I have just noticed that this patch has been made part of both Linux 6.6.50 and Linux 6.10.9 that are integrated into wifibox-alpine 20240911. This means that the problem must be with something else in this case.
Well that's sad. Please let me know if you think of anything else or if there is some more data I could give to help solve this.
For the record this is the wifi chip for Lenovo T14s Gen4 AMD laptops.
This also appears to be an issue for the ThinkPad P14s Gen 5 AMD, which may have very similar internals to the T14s. I can help debug if there's anything to try.
Have you tried with the latest alpine image? I saw it was updated a couple weeks ago.
I was using whichever one pkg installed yesterday. I think it was dated from March
Based on my earlier comment above, I believe the reason for what you see is that the ath11k driver still does not know how to deal with virtualization. There seems to be a workaround, though: a patch that lets one at least feed the MSI information to the driver through the host_msi_vector_addr and host_msi_vector_data module parameters.
The patch is not present in either 6.12 or 6.13, not even in 6.15. So this is something we could try to add to the Wifibox/Alpine kernel. Looks like the required parameters might be obtained using the FreeBSD port of the lspci utility (as part of the sysutils/pciutils package):
# lspci -s 03:00.00 -vv | fgrep "Data"
Address: 00000000fee00000 Data: 003d
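For anyone who wants to pull those two values out of the lspci output in a script, here is a rough sketch. The helper names `msi_addr_data` and `msi_values_usable` are made up for illustration, and the parsing assumes the `Address: ... Data: ...` layout shown above.

```shell
# Read `lspci -vv` text on stdin and print "ADDRESS DATA" (hex, no prefix)
# taken from the first "Address: ... Data: ..." line.
msi_addr_data() {
    sed -n 's/.*Address: *\([0-9a-fA-F]*\) *Data: *\([0-9a-fA-F]*\).*/\1 \2/p' | head -n 1
}

# Succeed only if both values contain at least one non-zero hex digit.
# Usage: msi_values_usable ADDRESS DATA
msi_values_usable() {
    case $1 in *[1-9a-fA-F]*) ;; *) return 1 ;; esac
    case $2 in *[1-9a-fA-F]*) ;; *) return 1 ;; esac
}

# On the host, for example:
#   lspci -s 03:00.00 -vv | msi_addr_data
```

All-zero values would be rejected by `msi_values_usable`, since they would not be meaningful input for the module parameters.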
That Data line shows all zeros for both the address and the data.
root@thinkpad:~ # lspci -s 2:00.00 -vv | grep Data
Address: 00000000 Data: 0000
Same here:
Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit-
Address: 00000000 Data: 0000
Masking: 00000000 Pending: 00000000
Uhm, that is interesting. I assume that the patch above would not help that much.
Aw, now I see. MSI is not enabled for the card, hence we have these empty values. That is indicated by Enable-.
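The Enable flag can also be checked mechanically. A small sketch, assuming the `MSI: Enable+` / `MSI: Enable-` notation that lspci prints in the capability line (`msi_state` is a hypothetical helper name):

```shell
# Read `lspci -vv` text on stdin and print the state of the MSI capability:
# "enabled", "disabled", or "absent" if no MSI capability line was seen.
msi_state() {
    awk '
        /MSI: Enable\+/ { state = "enabled" }
        /MSI: Enable-/  { if (state == "") state = "disabled" }
        END { if (state == "") state = "absent"; print state }
    '
}

# On the host, for example:
#   lspci -s 2:00.0 -vv | msi_state
```

MSI-X capability lines are deliberately not matched here, because only plain MSI is discussed in this thread.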
I am curious then what the following command says for you:
sysctl hw.pci.enable_msi
That sysctl seems to be enabled for me:
root@thinkpad:~ $ sysctl hw.pci.enable_msi
hw.pci.enable_msi: 1
Yeah, thanks for the response. For what it is worth, that is what I expected. The Linux MSI documentation (since lspci is a Linux tool) has some hints about why MSI may end up disabled for a device, but they cannot be applied verbatim to FreeBSD.
Another chance that could save us here is disabling the MSI blacklisting mechanism, for which the following should be added to loader.conf(5) and as such requires a reboot:
hw.pci.honor_msi_blacklist=0
I added that and rebooted and tried again. Basically the same result in that there is no wlan0 device inside the vm, but this time there is slightly more info in dmesg:
wifibox:~# dmesg | egrep 'wlan|ath11k'
[ 0.654536] ath11k_pci 0000:00:06.0: BAR 0 [mem 0x800000000-0x8001fffff 64bit]: assigned
[ 0.654787] ath11k_pci 0000:00:06.0: MSI vectors: 1
[ 0.654846] ath11k_pci 0000:00:06.0: wcn6855 hw2.1
[ 22.622319] ath11k_pci 0000:00:06.0: failed to power up mhi: -110
[ 22.622324] ath11k_pci 0000:00:06.0: failed to start mhi: -110
[ 22.622327] ath11k_pci 0000:00:06.0: failed to power up :-110
[ 22.625492] ath11k_pci 0000:00:06.0: failed to create soc core: -110
[ 22.625500] ath11k_pci 0000:00:06.0: failed to init core: -110
[ 22.625521] Modules linked in: qrtr ath11k_pci(+) ath11k qmi_helpers mhi
[ 22.625602] ath11k_pcic_free_irq+0x51/0xe0 [ath11k]
[ 22.625617] ath11k_pci_probe+0x7d8/0x800 [ath11k_pci]
[ 22.625665] ? ath11k_pci_get_msi_irq+0x10/0x10 [ath11k_pci]
[ 22.625670] ath11k_pci_init+0x1b/0x30 [ath11k_pci]
[ 22.625674] ? ath11k_pci_get_msi_irq+0x10/0x10 [ath11k_pci]
[ 22.659158] ath11k_pci 0000:00:06.0: probe with driver ath11k_pci failed with error -110
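For reference, the negative numbers in these kernel messages are Linux errno values. The -2 from the original report is ENOENT (no such file), and the -110 here is ETIMEDOUT, which fits the roughly 22-second gap in the timestamps before the driver gives up. A tiny lookup sketch covering just the two codes seen in this thread (`linux_errno_name` is a hypothetical helper, not part of any tool):

```shell
# Map the Linux errno values seen in this thread to their symbolic names.
# Accepts the value with or without the leading minus sign.
linux_errno_name() {
    case ${1#-} in
        2)   echo ENOENT ;;     # "No such file or directory"
        110) echo ETIMEDOUT ;;  # "Connection timed out"
        *)   echo unknown ;;
    esac
}

# Example: linux_errno_name -110
```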
Changing anything around the MSI will not help immediately, I believe. That is because the ath11k_pci Linux driver will not be able to discover the MSI vectors automatically. We will have to add the patch I was talking about above and feed the driver with the address and data values (which should not be zeroes).
That said, I would first check if lspci can see that MSI is enabled, just as you have queried it earlier. If the answer is positive, and now we have a non-zero address and data for the card, I could add the kernel patch and work out a mechanism to configure the MSI address and data, and then we could try again.
Unfortunately, the MSI stuff still shows up as all zeros with that tunable set.
root@thinkpad:~ # sysctl hw.pci.honor_msi_blacklist
hw.pci.honor_msi_blacklist: 0
root@thinkpad:~ # lspci -s 02:00.00 -vv | grep -A2 MSI
Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit-
Address: 00000000 Data: 0000
Masking: 00000000 Pending: 00000000
What does pciconf say by the way?
pciconf -lc pci0:2:0:0 | fgrep MSI
Apparently, pciutils has another tool, called setpci, which can be used to enable MSI explicitly as follows:
setpci -s 2:0.0 COMMAND=0510
root@thinkpad:~ # pciconf -lc pci0:2:0:0 | grep MSI
cap 05[50] = MSI supports 32 messages, vector masks
setpci doesn't seem to have changed anything:
root@thinkpad:~ # setpci -s 2:0.0 COMMAND=0510
root@thinkpad:~ # lspci -s 2:0.0 -vv | grep -A2 MSI
Capabilities: [50] MSI: Enable- Count=1/32 Maskable+ 64bit-
Address: 00000000 Data: 0000
Masking: 00000000 Pending: 00000000
I wouldn't be surprised if tools like that didn't work correctly on FreeBSD, even though the port exists and runs. I think they're really expecting to be talking to a Linux kernel.
For what it is worth, it worked for me... Even though the MSI was enabled, setpci seemed to trigger the recreation of the vector as its coordinates changed. It confused the running Wifibox instance for a moment.
Anyhow, pciconf (FreeBSD) itself says that MSI is not enabled for that card. Otherwise, it would have shown something like this:
cap 05[d0] = MSI supports 1 message, 64 bit enabled with 1 message
The source code of pciconf indicates that the "enabled with N message" part is displayed only if MSI is enabled for the device. Perhaps FreeBSD itself has a tool to turn that on.
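The same check can be scripted against the pciconf output. This is only a sketch based on the two sample lines quoted in this thread; `pciconf_msi_state` is a hypothetical helper name, and other pciconf versions may word the capability line differently.

```shell
# Read `pciconf -lc` text on stdin and print the MSI state:
# "enabled" if the MSI capability line carries "enabled with N message",
# "disabled" if the capability is listed without it, "absent" otherwise.
pciconf_msi_state() {
    awk '
        /= MSI supports/ {
            found = 1
            if ($0 ~ /enabled with [0-9]+ message/) enabled = 1
        }
        END {
            if (!found) print "absent"
            else if (enabled) print "enabled"
            else print "disabled"
        }
    '
}

# On the host, for example:
#   pciconf -lc pci0:2:0:0 | pciconf_msi_state
```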
In the meantime, you could try to boot your FreeBSD system with verbose messages and check dmesg for information about the MSI allocation and activation. Maybe it will show why MSI could not be enabled.
Interesting. I stand corrected. I can't seem to find any way in FreeBSD to enable or disable MSI per device, other than the MSI blacklist.
The dmesg with verbose kernel output has tons and tons of stuff, but there don't seem to be any errors. As far as I can tell, the relevant blocks about this card are these two:
found-> vendor=0x17cb, dev=0x1103, revid=0x01
domain=0, bus=2, slot=0, func=0
class=02-80-00, hdrtype=0x00, mfdev=0
cmdreg=0x0000, statreg=0x0010, cachelnsz=8 (dwords)
lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
powerspec 3 supports D0 D3 current D0
MSI supports 32 messages, vector masks
map[10]: type Memory, range 64, base 0x90600000, size 21, memory disabled
pcib2: allocated memory range (0x90600000-0x907fffff) for rid 10 of pci0:2:0:0
and
found-> vendor=0x17cb, dev=0x1103, revid=0x01
domain=0, bus=2, slot=0, func=0
class=02-80-00, hdrtype=0x00, mfdev=0
cmdreg=0x0002, statreg=0x0010, cachelnsz=8 (dwords)
lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
powerspec 3 supports D0 D3 current D0
MSI supports 32 messages, vector masks
pci0:2:0:0: reprobing on driver added
The second one occurs multiple times. I thought maybe the part about memory disabled was an indication of a problem but there are many other devices that show that as well.
I have done some further investigation, and it seems that pciconf itself could be used to send commands to PCI devices with the -w flag (and replace setpci). But since it does not seem to be documented well enough, it looks risky to do so.
Nevertheless, I collected the verbose MSI-related events from the dmesg output, and they look as follows:
found-> vendor=0x8086, dev=0x1916, revid=0x07
domain=0, bus=0, slot=2, func=0
class=03-00-00, hdrtype=0x00, mfdev=0
cmdreg=0x0007, statreg=0x0010, cachelnsz=0 (dwords)
lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
intpin=a, irq=11
powerspec 2 supports D0 D3 current D0
MSI supports 1 message
map[10]: type Memory, range 64, base 0xe0000000, size 24, enabled
and then later:
ppt0: attempting to allocate 1 MSI vectors (1 supported)
msi: routing MSI IRQ 133 to local APIC 0 vector 59
ppt0: using IRQ 133 for MSI
I have unloaded the vmm kernel module and then lspci showed me MSI: Enable- for the device (as it had no driver attached). This suggests to me that it is the driver itself that should be activating MSI, which is ppt (the bhyve PCI pass-through driver) in this case. It is part of the vmm module, so once that is loaded and ppt is configured for the device, it should just work.
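To see quickly whether ppt managed to allocate MSI on a given system, the verbose dmesg output can be filtered for the "using IRQ N for MSI" message quoted above. A minimal sketch, assuming that exact message format (`ppt_msi_irqs` is a hypothetical helper name):

```shell
# Read dmesg text on stdin and print "DEVICE IRQ" for every ppt device
# that reports using an MSI interrupt.
ppt_msi_irqs() {
    sed -n 's/^\(ppt[0-9][0-9]*\): using IRQ \([0-9][0-9]*\) for MSI.*/\1 \2/p'
}

# On the host, after a verbose boot, for example:
#   dmesg | ppt_msi_irqs
```

No output would suggest that ppt either did not attach to the device or did not get an MSI vector.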
You can try reserving the device for ppt by adding the following line to loader.conf(5) (and restart your system), if you have not done so:
pptdevs="2/0/0"
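After the reboot, it may be worth verifying that the device really ended up with ppt. The sketch below only converts the bus/slot/function notation of pptdevs into the selector form that pciconf accepts, assuming PCI domain 0 as elsewhere in this thread; `pptdevs_to_selectors` is a hypothetical helper name.

```shell
# Convert a pptdevs value ("bus/slot/func", space-separated) into pciconf
# selectors, one per line. PCI domain 0 is assumed.
pptdevs_to_selectors() {
    for dev in $1; do
        printf 'pci0:%s\n' "$(printf '%s' "$dev" | tr '/' ':')"
    done
}

# On the host, for example:
#   pciconf -l "$(pptdevs_to_selectors '2/0/0')"
```

If the reservation worked, the line printed by pciconf should start with a ppt device name (such as ppt0) in front of the selector.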
Another reason for MSI being disabled might be the BIOS: MSI may be turned off either somewhere in the settings or by the firmware itself due to instabilities.
Another update: it looks like one does not have to unload vmm. To get the MSI: Enable- status in the lspci output, it is enough that no bhyve VM is running with PCI pass-through configured.