X11 display staying black trying to unlock screen, on AMD Renoir with 4.17.5-6

Open ydirson opened this issue 10 months ago • 4 comments

Qubes OS release

Qubes OS 4.2

Brief summary

Earlier this month I started having problems when coming back to my laptop and trying to unlock the screen. The xss-lock banner is not displayed and the screen stays black (IIRC the mouse cursor was still moving). I can switch to another virtual console and log in there; killing the screensaver does not change anything.

At one point, while working on a nearby machine, I saw the locked screensaver get what looked like spurious wake-ups (the unlock banner popping up for no apparent reason)... I did not touch it, and the next morning I found that black screen again.

I could not see any obvious problem reported in the system logs or Xen logs. The dom0 kernel does spit out these slightly scary lines quite often though (which in my mind resonates with the spurious xss-lock wake-ups), but it does the same in a working setup:

Mar 11 10:03:26 dom0 kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
Mar 11 10:03:26 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
Mar 11 10:03:26 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: reserve 0x900000 from 0x80fd000000 for PSP TMR
Mar 11 10:03:26 dom0 lvm[5948]: Monitoring thin pool qubes_dom0-pool00-tpool.
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
Mar 11 10:03:27 dom0 kernel: [drm] kiq ring mec 2 pipe 1 q 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 8
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 8
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 8
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
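
One possible way to check whether these amdgpu resume bursts line up with the spurious wake-ups is to pull their timestamps out of a saved log and count them per hour. The following is only a minimal Python sketch: the function names are hypothetical, it assumes the syslog-style line format shown in the excerpt above, and it assumes that the "PSP is resuming" message marks the start of each burst.

```python
import re
from collections import Counter

# Matches the syslog-style prefix seen in the excerpt above, e.g.
# "Mar 11 10:03:26 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: PSP is resuming..."
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+([^:]+):\s+(.*)$"
)

# Assumption: this message marks the start of one amdgpu resume burst.
RESUME_MARKER = "PSP is resuming"


def find_resume_events(lines):
    """Return the timestamps of kernel lines that contain the resume marker."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the expected format
        timestamp, _host, source, message = match.groups()
        if source.startswith("kernel") and RESUME_MARKER in message:
            events.append(timestamp)
    return events


def resumes_per_hour(timestamps):
    """Count resume events per hour, keyed like 'Mar 11 10'."""
    counts = Counter()
    for ts in timestamps:
        # Everything before the first colon is "month day hour".
        counts[ts[: ts.index(":")]] += 1
    return counts


sample = [
    "Mar 11 10:03:26 dom0 kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).",
    "Mar 11 10:03:26 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: PSP is resuming...",
    "Mar 11 10:03:26 dom0 lvm[5948]: Monitoring thin pool qubes_dom0-pool00-tpool.",
    "Mar 11 10:03:27 dom0 kernel: amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!",
]

events = find_resume_events(sample)
for hour, count in resumes_per_hour(events).items():
    print(hour, count)
```

Comparing the per-hour counts from a session that ended in a black screen with those from a working session might show whether the bursts are actually more frequent in the failing case, or whether they are unrelated noise.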

After checking the dnf history and some trial-and-error downgrading, it turns out the problem goes away if I downgrade the xen packages from 4.17.5-6 back to 4.17.5-5.

Any idea for an experiment that would yield useful info?

Steps to reproduce

No response

Expected behavior

No response

Actual behavior

No response

Additional information

No response

ydirson avatar Mar 12 '25 08:03 ydirson

While the system was much more stable after downgrading Xen, the problem still manifested after 9 days of uptime. Since there have been quite a few kernel+firmware updates since then, I've upgraded and we'll see how that plays out.

ydirson avatar Mar 21 '25 08:03 ydirson

> While the system was much more stable after downgrading Xen, the problem still manifested after 9 days of uptime. Since there have been quite a few kernel+firmware updates since then, I've upgraded and we'll see how that plays out.

It took just a few hours and half a handful of "lock screen" actions to trigger it again.

ydirson avatar Mar 21 '25 14:03 ydirson

> After checking the dnf history and some trial-and-error downgrading, it turns out the problem goes away if I downgrade the xen packages from 4.17.5-6 back to 4.17.5-5.

The only difference between those two versions is https://github.com/QubesOS/qubes-vmm-xen/pull/202. It does look related to your hardware, but several people reported that it actually improves the situation (probably on different AMD systems)...

As for the log, it looks like messages related to system resume. Does the issue happen around S3 suspend/resume? Or maybe you have enabled automatic suspend after some inactivity time? Is there some message (a bit earlier) about failing to suspend?

marmarek avatar Mar 21 '25 14:03 marmarek

> As for the log, it looks like messages related to system resume. Does the issue happen around S3 suspend/resume? Or maybe you have enabled automatic suspend after some inactivity time? Is there some message (a bit earlier) about failing to suspend?

Suspend is disabled on this machine (still the same Bravo 17 that essentially fails to resume, except that it sometimes agrees to suspend/resume once). The timing does feel like the point at which the system would be suspended if that were enabled, but the XScreensaver settings show PM disabled and only "blank after 10min" configured, and the blocked screens definitely happen after a much longer time.

New data points:

  • while with 4.17.6 it happens systematically, with 4.17.5 it happens "during the first long break (aka lunch or night) or not at all"
  • in https://forum.qubes-os.org/t/vm-with-pci-passthrough-refusing-to-restart/34078/2 I described how my sys-usb would sometimes refuse to start after a reboot (recent behaviour, seen twice) and how, after several days of uptime, it would finally succeed in booting. Now the catch: after the night I was stuck on my black screen again, so I was wondering whether PCI passthrough could fit in here ... digging through the logs, this time I did not in fact have any amdgpu complaints (or any unusual logs) showing, and nothing in hypervisor.log either.

ydirson avatar Jun 10 '25 07:06 ydirson