dom0 Kernel latest (6.16+) doesn't start - amdxdna expects iommu handling in dom0

Open b90g opened this issue 3 months ago • 34 comments

Qubes OS release

4.3rc2

Brief summary

With kernel latest installed, boot doesn't reach the point of asking me for the LUKS encryption passphrase.

Steps to reproduce

install kernel latest

Expected behavior

being asked for disc encryption passphrase

Actual behavior

kernel stops at 7 seconds: amdxdna: probe with driver amdxdna failed with error -5

followed by amdgpu complaining not finding optional firmware (amdgpu/isp_4_1_0.bin) at second 10.

then it stops for minutes.

https://c.ymy.be/s/zGEeKEoiZANdFw9 (ipv6 required)

Additional information

on 6.15.11-1 it still works:

...
[    7.840194] hid-generic 0018:093A:0255.0001: input,hidraw0: I2C HID v1.00 Mouse [UNIW0001:00 093A:0255] on i2c-UNIW0001:00
[    7.863081] ACPI: video: Video Device [VGA] (multi-head: yes  rom: no  post: no)
[    7.864326] amdxdna 0000:67:00.1: enabling device (0000 -> 0002)
[    7.867003] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:1a/LNXVIDEO:00/input/input6
[    7.867631] xen: registering gsi 43 triggering 0 polarity 1
[    7.870125] xen: --> pirq=43 -> irq=43 (gsi=43)
[    7.872792] amdxdna 0000:67:00.1: [drm] *ERROR* aie2_init: Enable PASID failed, ret -19
[    7.876202] amdxdna 0000:67:00.1: [drm] *ERROR* amdxdna_probe: Hardware init failed, ret -19
[    7.882079] sdhci: Secure Digital Host Controller Interface driver
...
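The return codes in these messages are negated Linux errno values: -19 is ENODEV and -5 (from the failing 6.16+ boot) is EIO. As a rough aid for reading such logs, here is a minimal sketch that pulls the amdxdna error lines out of dmesg output and names the code. The helper names `errno_name` and `amdxdna_errors` are made up for this example, and the mapping covers only the two codes seen in this thread.

```shell
#!/bin/sh
# Hypothetical helper: map the errno values seen in this thread to names.
errno_name() {
    case "${1#-}" in
        5)  echo EIO ;;
        19) echo ENODEV ;;
        *)  echo "unknown" ;;
    esac
}

# Hypothetical helper: read dmesg text on stdin and summarise lines of the
# form "amdxdna ... *ERROR* <function>: <message>, ret <code>".
amdxdna_errors() {
    sed -n -E 's/^.*amdxdna .*\*ERROR\* ([A-Za-z0-9_]+): (.*), ret (-[0-9]+)$/\1 \3/p' |
    while read -r func code; do
        printf '%s failed with %s (%s)\n' "$func" "$code" "$(errno_name "$code")"
    done
}

# Demo on the two error lines quoted above (normally: dmesg | amdxdna_errors).
amdxdna_errors <<'EOF'
[    7.872792] amdxdna 0000:67:00.1: [drm] *ERROR* aie2_init: Enable PASID failed, ret -19
[    7.876202] amdxdna 0000:67:00.1: [drm] *ERROR* amdxdna_probe: Hardware init failed, ret -19
EOF
```

On 6.15.11 the probe fails with ENODEV and the boot carries on, which is why the log above is from a working system.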

Device is: AMD Ryzen AI 9 HX 370 w/ Radeon 890M

b90g avatar Sep 27 '25 13:09 b90g

I also confirm there is something wrong with 6.16.8-1. I also get errors about amdxdna but it lets me enter the LUKS password and then hangs on the 3 dots screen. Booting with nomodeset and qubes.skip_autostart and then Alt+F2 let me login to Dom0 so I imagine this is an amdgpu driver issue. Reverting to 6.15.11-1 boots normally. I'm on a laptop with an 8xxxHS series processor.

Tehvan avatar Oct 10 '25 03:10 Tehvan

i'm not sure if it's amdgpu, amdxdna or something else entirely

today i installed

amd-gpu-firmware.noarch                              1:20251011-1.fc41                    qubes-dom0-cached
amd-ucode-firmware.noarch                            1:20251011-1.fc41                    qubes-dom0-cached
linux-firmware-whence.noarch                         1:20251011-1.fc41                    qubes-dom0-cached

no improvement.

here is dmesg dom0 from the working 6.15.11 2025-10-12_61511.txt

i don't know how i could get dmesg from the 6.16.8 kernel, it's not listed in journalctl --list-boots :/

b90g avatar Oct 12 '25 08:10 b90g

tested 6.17.4 from today, no success.

here are 2 screenpictures (i couldnt focus the whole screen... ) :

Image Image

b90g avatar Oct 23 '25 08:10 b90g

Upstream bug report: https://gitlab.freedesktop.org/drm/amd/-/issues/4656

marmarek avatar Oct 23 '25 09:10 marmarek

See new comments in the above issue. I can also prepare kernel built with requested commit reverted.

marmarek avatar Oct 23 '25 21:10 marmarek

You can get a patched kernel via the unstable repo:

sudo qubes-dom0-update --enablerepo=qubes-dom0-unstable --action=update kernel-latest

You should get version 6.17.4-1.qubes.1.fc41 (note the 1 after qubes).

marmarek avatar Oct 24 '25 08:10 marmarek

thanks i will try to reboot in lunch break :)

( also i will try to set the grub boot things from upstream issue comments )

b90g avatar Oct 24 '25 08:10 b90g

booted the qubes.1 kernel, same result. it gets stuck at the following screen:

Image

in a few hours i'll make the next attempt with the dcdebugmask values mentioned upstream; i guess i put them into grub.

strong indicator for me that the suspicion of a graphics issue could be true: the external monitors dont switch on.

b90g avatar Oct 24 '25 11:10 b90g

back with new dmesg xl-dmesg and beautiful screenpictures (:

nomodeset:

2025-10-24_1718_nomodeset.xen.txt 2025-10-24_1718_nomodeset.txt

it did not boot with dcdebugmask set:

0x400: i didn't notice a difference, so i didn't take a picture.

0x800:

Image

0x10:

Image

b90g avatar Oct 24 '25 15:10 b90g

Hopefully this helps. Diff between boot on kernel 6.15.11 and 6.17.4

diff.html

Tehvan avatar Oct 24 '25 22:10 Tehvan

I see you managed to capture errors from amdgpu driver, yes, that is likely very helpful

marmarek avatar Oct 24 '25 22:10 marmarek

a framework user seems to run into the same issue.

https://forum.qubes-os.org/t/hcl-framework-13-2025-ryzen-ai-300/34846/2

b90g avatar Oct 25 '25 21:10 b90g

Just tried installing QubesOS R4.3.0-rc3-x86_64 using the latest kernel option. Including the two optional components for the desktop environment.

What works:

  • Setup boots and installs to disk all as it should.
  • Boot password gets asked.

What doesn't:

  • I get a black screen with a single "_" at the top left corner after entering the disk encryption password. No login screen is reached.

When switching to:

  • tty1 all I see is [ 2.789723] dracut-cmdline[829]: Warning: USB in dom0 is not restricted. Consider rd.qubes.hide_all_usb or usbcore.authorized_default=0
  • tty2 to 6 and tty8 to 12: A blinking "_" in the top left corner.
  • tty7: Entirely black, not even the blinking cursor.

agowa avatar Oct 28 '25 03:10 agowa

@agowa what hardware? The behavior you describe (black screen after disk password, instead of freeze before) suggests a different issue.

marmarek avatar Oct 28 '25 03:10 marmarek

  • CPU: 2x AMD EPYC
  • GPU: Intel A380 (+ a small onboard one for the iKVM)
  • Motherboard: SuperMicro H11DSi
  • Installed onto one of two NVMe (with ArchLinux installed on the other one).

agowa avatar Oct 28 '25 03:10 agowa

So, significantly different platform (Intel GPU instead of AMD, EPYC instead of Ryzen). That's a different issue. Anyway, check if the system isn't simply using different output - try connecting monitor to a different port (or check if it's visible on iKVM). If still nothing, open a separate issue, or maybe even better ask on https://forum.qubes-os.org.

marmarek avatar Oct 28 '25 03:10 marmarek

Sorry, my bad then. Misread and thought it was AMD CPU and not GPU related...

And no, it's not on a different output (sadly). I already checked that.

agowa avatar Oct 28 '25 03:10 agowa

i noticed that a reseller of the tongfang laptop model i have issues a warning to not upgrade the GPU drivers:

https://go.xmg.gg/xmg-evo-e25-amd-driver-notice_en

maybe this applies not only to windows but also to the linux kernel?

b90g avatar Nov 03 '25 21:11 b90g

november firmware no change

b90g avatar Nov 13 '25 08:11 b90g

addition: booted a cachyos 6.17.8 kernel (and wayland) bare metal successfully

lmk if i should provide dmesg from that.

b90g avatar Nov 27 '25 17:11 b90g

finally got qubes builder to run - and it breaks apparently with merges from 6.15.11 to 6.16 (.0)

now i try to figure out this bisecting thing...

6.18 on bare metal seems to work. (see earlier remarks on cachyos)

b90g avatar Dec 07 '25 11:12 b90g

It will be most effective if you build directly from Linux git clone, skipping qubes builder... You'll need either a trusted VM, or building it in dom0. The former is more correct approach, the latter is slightly easier. Generally, the approach would be:

  1. git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git (it will take a few GB of space)
  2. git verify-tag v6.16 (you'll need to import keys first - you have them in builderv2/artifacts/sources/linux-kernel/*.asc already)
  3. Take kernel config from /boot/config-* and copy to .config in kernel sources.
  4. git bisect start v6.16 v6.15 (torvalds/linux.git carries no v6.15.x stable tags, so use v6.15 as the good end)

And then test the version you get this way, after which you select git bisect good or git bisect bad to report the result and get new version to test. The build itself would be something like this (make a short script, it will be easier this way):

set -xe
make olddefconfig
make -j$(nproc) WERROR=0
make modules_install install

The above would work in dom0. If building in VM, replace the last line with:

rm -rf out
mkdir -p out
make INSTALL_MOD_PATH=$PWD/out INSTALL_PATH=$PWD/out INSTALLKERNEL=/bin/true modules_install install

Then copy the contents of the out dir to the matching places in dom0 (/lib/modules and /boot, following the layout in the out dir), and run:

dracut -f --kver $NEW_KERNEL_VERSION && grub2-mkconfig -o /boot/grub2/grub.cfg

where $NEW_KERNEL_VERSION is the version as in the out/lib/modules/ dir name.

Note you'll need to manually remove old (test) kernels after this whole operation...
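The $NEW_KERNEL_VERSION value described above can be read from the out dir instead of being typed by hand. A minimal sketch, assuming the out dir holds exactly one build (it is wiped with rm -rf before each install step); the function name `kernel_version_from_out` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the single directory under
# <out>/lib/modules, which is the kernel version string to pass to dracut.
# Fails if there is not exactly one such directory.
kernel_version_from_out() {
    out_dir=${1:-out}
    count=0
    ver=
    for d in "$out_dir"/lib/modules/*/; do
        [ -d "$d" ] || continue
        count=$((count + 1))
        ver=$(basename "$d")
    done
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one modules dir in $out_dir, found $count" >&2
        return 1
    fi
    printf '%s\n' "$ver"
}

# Assumed usage from the kernel source tree, after the modules_install step:
#   NEW_KERNEL_VERSION=$(kernel_version_from_out out)
```

Refusing to guess when several versions are present is deliberate: during a bisect, running dracut for the wrong version would silently test the wrong kernel.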

marmarek avatar Dec 07 '25 13:12 marmarek

At each iteration, make sure you boot the kernel you just built, not just the newest one - might be easier if you remove the one from previous attempt from /boot before copying new one in.
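One way to guard against booting the wrong kernel is to compare the running release with the one just built before answering git bisect good or bad. A minimal sketch; `booted_expected_kernel` is a made-up name, and it assumes the expected string is obtained with `make -s kernelrelease` in the source tree (or from the modules dir name).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the running kernel is the one just built.
#   $1 = expected release string
#   $2 = running release (optional, defaults to `uname -r`)
booted_expected_kernel() {
    expected=$1
    running=${2:-$(uname -r)}
    if [ "$expected" = "$running" ]; then
        echo "OK: running $running"
        return 0
    fi
    echo "MISMATCH: built $expected but running $running" >&2
    return 1
}

# Assumed usage in dom0 after rebooting into the test kernel:
#   booted_expected_kernel "$NEW_KERNEL_VERSION" && echo "safe to record a result"
```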

marmarek avatar Dec 07 '25 13:12 marmarek

ah thank you. that makes things easier, i will try that soon, this weekend i did 2 iterations of bad, each took me several hours :)

note/edit for future generations: use the config file from dom0.... it's much faster than the whole kernel for a generic domU... now compile runs take 20 minutes or so, not 2 hrs....
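Since each bisect step halves the remaining range, the number of test builds can be estimated up front and multiplied by the per-build time. A rough sketch: `bisect_steps` is a made-up helper that computes the ceiling of log2 of the commit count, and the count itself is assumed to come from `git rev-list --count`. Git's own estimate can differ by a step or so because kernel history is not linear.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many builds a bisect over N commits needs,
# by halving N (rounding up) until a single commit remains.
bisect_steps() {
    commits=$1
    steps=0
    while [ "$commits" -gt 1 ]; do
        commits=$(( (commits + 1) / 2 ))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Assumed usage inside the kernel git clone:
#   n=$(git rev-list --count v6.15..v6.16)
#   echo "about $(bisect_steps "$n") builds, at roughly 20 minutes each"
```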

b90g avatar Dec 08 '25 07:12 b90g

Oh well, i guess its IOMMU again?

[user@kernel linux]$ git bisect good
7c8896dd4a2a27c84b04dcf0990e6f6b118cb6b2 is the first bad commit
commit 7c8896dd4a2a27c84b04dcf0990e6f6b118cb6b2
Author: Jason Gunthorpe <[email protected]>
Date:   Fri Apr 18 16:01:24 2025 +0800

    iommu: Remove IOMMU_DEV_FEAT_SVA
    
    None of the drivers implement anything here anymore, remove the dead code.
    
    Signed-off-by: Jason Gunthorpe <[email protected]>
    Signed-off-by: Lu Baolu <[email protected]>
    Reviewed-by: Kevin Tian <[email protected]>
    Reviewed-by: Yi Liu <[email protected]>
    Tested-by: Zhangfei Gao <[email protected]>
    Link: https://lore.kernel.org/r/[email protected]
    Signed-off-by: Joerg Roedel <[email protected]>

 drivers/accel/amdxdna/aie2_pci.c            | 13 ++-----------
 drivers/dma/idxd/init.c                     |  8 +-------
 drivers/iommu/amd/iommu.c                   |  2 --
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  2 --
 drivers/iommu/intel/iommu.c                 |  6 ------
 drivers/iommu/iommu-sva.c                   |  3 ---
 drivers/misc/uacce/uacce.c                  |  9 ---------
 include/linux/iommu.h                       |  9 +--------
 8 files changed, 4 insertions(+), 48 deletions(-)
[user@kernel linux]$ git bisect log 
git bisect start
# status: waiting for both good and bad commits
# bad: [038d61fd642278bab63ee8ef722c50d10ab01e8f] Linux 6.16
git bisect bad 038d61fd642278bab63ee8ef722c50d10ab01e8f
# status: waiting for good commit(s), bad commit known
# good: [0ff41df1cb268fc69e703a08a57ee14ae967d0ca] Linux 6.15
git bisect good 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
# good: [43db1111073049220381944af4a3b8a5400eda71] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
git bisect good 43db1111073049220381944af4a3b8a5400eda71
# bad: [11fcf368506d347088e613edf6cd2604d70c454f] uapi: bitops: use UAPI-safe variant of BITS_PER_LONG again
git bisect bad 11fcf368506d347088e613edf6cd2604d70c454f
# bad: [ec71f661a572a770d7c861cd52a50cbbb0e1a8d1] Merge tag 'soc-dt-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad ec71f661a572a770d7c861cd52a50cbbb0e1a8d1
# bad: [9d49da438819c5dd82840eb63d929edbdccb80d8] Revert "iommu: make inclusion of arm/arm-smmu-v3 directory conditional"
git bisect bad 9d49da438819c5dd82840eb63d929edbdccb80d8
# good: [d8441523f21375b11a4593a2d89942b407bcb44f] Merge tag 'f2fs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
git bisect good d8441523f21375b11a4593a2d89942b407bcb44f
# good: [eafd95ea74846eda3e3eac6b2bb7f34619d8a6f8] Merge tag 'pinctrl-v6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect good eafd95ea74846eda3e3eac6b2bb7f34619d8a6f8
# good: [dd91b5e1d6448794c07378d1be12e3261c8769e7] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
git bisect good dd91b5e1d6448794c07378d1be12e3261c8769e7
# bad: [879b141b7cfa09763f932f15f19e9bc0bcb020d5] Merge branches 'fixes', 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'fsl/pamu', 'mediatek', 'renesas/ipmmu', 's390', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
git bisect bad 879b141b7cfa09763f932f15f19e9bc0bcb020d5
# bad: [21c03574df19f0d77cb2e4d28bc02c79b21e656a] iommu: Hide ops.domain_alloc behind CONFIG_FSL_PAMU
git bisect bad 21c03574df19f0d77cb2e4d28bc02c79b21e656a
# good: [d50aaa4a9ffb0149d2187dfe3477300561f06fec] iommu: Update various drivers to pass in lg2sz instead of order to iommu pages
git bisect good d50aaa4a9ffb0149d2187dfe3477300561f06fec
# bad: [17fce9d2336d952b95474248303e5e7d9777f2e0] iommu/vt-d: Put iopf enablement in domain attach path
git bisect bad 17fce9d2336d952b95474248303e5e7d9777f2e0
# good: [249d3327f0236302a92d9eccb2b32f64c8daaf86] iommu/vtd: Remove iommu_alloc_pages_node()
git bisect good 249d3327f0236302a92d9eccb2b32f64c8daaf86
# good: [0da188c8468d8fe544d0aa2a5f610c78b8d34819] iommu: Split out and tidy up Arm Kconfig
git bisect good 0da188c8468d8fe544d0aa2a5f610c78b8d34819
# bad: [7c8896dd4a2a27c84b04dcf0990e6f6b118cb6b2] iommu: Remove IOMMU_DEV_FEAT_SVA
git bisect bad 7c8896dd4a2a27c84b04dcf0990e6f6b118cb6b2
# good: [cfea71aea921311350aabd7d5fc92269a052410e] iommu/arm-smmu-v3: Put iopf enablement in the domain attach path
git bisect good cfea71aea921311350aabd7d5fc92269a052410e
# first bad commit: [7c8896dd4a2a27c84b04dcf0990e6f6b118cb6b2] iommu: Remove IOMMU_DEV_FEAT_SVA

i entered good when the boot got past the point where it used to hang and the graphics changed, and bad only when it got stuck in exactly the same way that brought me here in the first place. it was my first bisect, so maaaybe i am mistaken here.

b90g avatar Dec 09 '25 00:12 b90g

Hm, interesting, theoretically it shouldn't matter, as dom0 is not managing IOMMU (Xen is). But maybe there is some side effect. Normally I'd propose to test v6.16 with that commit reverted to be sure, but it doesn't revert cleanly...

marmarek avatar Dec 09 '25 00:12 marmarek

maybe just the 2 lines in amd/iommu.c ?

b90g avatar Dec 09 '25 00:12 b90g

Posted some findings on the gitlab side. I have an idea: try to blacklist the whole amdxdna module - add to your kernel cmdline module_blacklist=amdxdna

marmarek avatar Dec 09 '25 00:12 marmarek

module_blacklist=amdxdna worked for my laptop on 6.17.9-1 (didn't try others). System booted normally and with display =-)
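When testing a workaround like this, it helps to confirm that the parameter really reached the booted kernel. A minimal sketch that checks /proc/cmdline for an exact, whole-word match; `has_cmdline_param` is a made-up name, and the optional second argument exists only so the check can be tried on a sample string.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given parameter appears as a whole word
# on the kernel command line.
#   $1 = parameter, e.g. module_blacklist=amdxdna
#   $2 = command line to inspect (optional, defaults to /proc/cmdline)
has_cmdline_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
    esac
    return 1
}

# Assumed usage in dom0 after reboot:
#   has_cmdline_param module_blacklist=amdxdna && echo "blacklist is active"
```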

Tehvan avatar Dec 09 '25 03:12 Tehvan

yes, blocking the amdxdna module worked (will put this in etc/default, as i personally don't want amdxdna, especially in qubes :) )

should i provide xl dmesg, linux dmesg with blocked amdxdna?
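To make the workaround permanent, the parameter has to be added to the dom0 kernel command line that grub generates. A minimal sketch, assuming the grub defaults file is /etc/default/grub and that it is a shell fragment defining GRUB_CMDLINE_LINUX; the function name `add_cmdline_param` is made up for this example. It appends a new assignment rather than editing the existing line in place.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX in a
# grub defaults file, unless a GRUB_CMDLINE_LINUX line already mentions it.
#   $1 = path to the defaults file
#   $2 = parameter, e.g. module_blacklist=amdxdna
add_cmdline_param() {
    file=$1
    param=$2
    if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$param"; then
        echo "$param already present in $file"
        return 0
    fi
    # Make sure the file ends with a newline before appending.
    if [ -n "$(tail -c 1 "$file")" ]; then
        echo >> "$file"
    fi
    # The file is sourced as shell, so this extends the existing value.
    printf 'GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX %s"\n' "$param" >> "$file"
}

# Assumed usage in dom0, as root, after backing up the file:
#   add_cmdline_param /etc/default/grub module_blacklist=amdxdna
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

Appending keeps the original line untouched, so the change is easy to spot and to remove once a fixed kernel is available.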

b90g avatar Dec 09 '25 07:12 b90g