Screen does not wake up after resume (AMD Ryzen 7 Pro 4750U)

Open isodude opened this issue 3 years ago • 155 comments

Qubes OS release

R4.1, kernel 5.14.7-1 (Fedora 5.14; same behavior with older kernels), Xen 4.14.3 (built from @marmarek's branch)

Brief summary

The laptop does not resume after the third suspend/resume cycle. The problem seems to be related to:

[drm] psp command (0x7) failed and response status is (0xFFFF0007)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmp failed!

It feels like there's a hung process in the amdgpu drivers for some reason.
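To check whether a given resume hit the same failure, the kernel log can be filtered for the PSP messages quoted in this report. This is only a minimal sketch: the function name psp_failures is made up for illustration, the log is assumed to arrive on stdin, and the patterns cover just the message shapes seen in this thread.

```shell
# Hypothetical helper (not part of any package): print kernel log lines that
# look like the amdgpu PSP failures quoted in this report.
# Reads log text on stdin; exits non-zero if nothing matched (grep semantics).
psp_failures() {
    grep -E 'psp (gfx )?command .*failed|PSP (load|resume)[a-z ]* failed'
}
```

Typical use would be something like `journalctl -k -b | psp_failures` right after a resume.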

I'm not sure how to debug this properly; Xen is not giving me much information at all. The problem also shows up with X started, but I'm trying to keep the bug surface as small as possible.

Steps to reproduce

Boot the laptop with X disabled and no VMs started. Run systemctl suspend three times, resuming after each one. Run reboot to restore the system.
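The reproduction steps can be scripted roughly as follows. This is a sketch under assumptions: the function name suspend_cycles and the SUSPEND_CMD and SETTLE_SECONDS variables are hypothetical, the machine still has to be woken by hand after each suspend, and the settle delay is a guess rather than a measured value.

```shell
# Hypothetical helper: trigger a suspend command a given number of times.
# SUSPEND_CMD defaults to "systemctl suspend"; SETTLE_SECONDS is the pause
# after each cycle so the system can finish resuming before the next one.
suspend_cycles() {
    count="$1"
    cmd="${SUSPEND_CMD:-systemctl suspend}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i of $count"
        $cmd || return 1            # intentionally unquoted so the command splits into words
        sleep "${SETTLE_SECONDS:-10}"
        i=$((i + 1))
    done
}
```

Running `suspend_cycles 3` from a text console would correspond to the three cycles described here; suspending again too quickly is reported in this thread to cause an instant reboot, so a generous delay is the safer choice.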

Expected behavior

It should be possible to suspend and resume an unlimited number of times.

Actual behavior

The screen does not wake up on the third resume. It is still possible to type reboot and restart the machine.

Notes

Works well when the kernel is booted without Xen. Attached logs: crash.filtered.log, crash.filtered.xen.log

Workarounds

A bit more testing is needed but I do have sort of stable suspend/resume now. It even survives when everything goes south. There's a bit of tearing, but I'd rather have suspend than tearing.

cat << EOF > /etc/X11/xorg.conf.d/50-video.conf
Section "Device"
	Identifier "card0"
	Driver "amdgpu"
	Option "AccelMethod" "none"
EndSection
EOF

Compile xorg-x11-drv-amdgpu from https://github.com/freedesktop/xorg-xf86-video-amdgpu. Run make install and install amdgpu_drv.so into /usr/lib64/xorg/modules/drivers in dom0.

For more stability, boot with preempt=none on the kernel command line.
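To confirm that a parameter such as preempt=none actually reached the running kernel, the command line can be checked word by word. A minimal sketch; has_cmdline_param is a hypothetical name, and the command line is passed in as a string so the helper can be tried on any text.

```shell
# Hypothetical helper: succeed if the parameter (first argument) appears as a
# whole word in the kernel command line string (second argument).
has_cmdline_param() {
    param="$1"
    cmdline="$2"
    for word in $cmdline; do        # unquoted on purpose: split on whitespace
        [ "$word" = "$param" ] && return 0
    done
    return 1
}
```

For the running dom0 kernel this would be called as `has_cmdline_param preempt=none "$(cat /proc/cmdline)" && echo "preempt=none is set"`.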

Note that, for example, a 4K external screen will be very sluggish.

Sometimes the screen comes up black. Type in the password anyway, then either switch to tty2 and back or suspend/resume once more, and it will most likely come back to life. Suspending and resuming too quickly can lead to an instant reboot.

isodude avatar Sep 30 '21 20:09 isodude

The xen_acpi_processor "Hypervisor error (-19)" ACPI messages go away if I boot the kernel with nosmt, obviously.

In the console with lightdm never started it can survive at least 5-6 suspend-resume-cycles now.

Now compiling the kernel with CONFIG_DRM_AMD_DC_HDCP=n CONFIG_HSA_AMD_SVM=n CONFIG_AMD_MEM_ENCRYPT=n

isodude avatar Sep 30 '21 21:09 isodude

There is a problem with installing xorg-x11-drv-amdgpu: X won't start, with errors about unwind information not existing. I tried installing kernel-devel to make the amdgpu driver happy, but it did not work out.

isodude avatar Sep 30 '21 21:09 isodude

With the kernel compiled without the flags mentioned above, suspend/resume kept working for a lot more cycles.

When X is running it still dies on 'failed to terminate hdcp ta' anyhow though.

I still can't get the Xorg amdgpu driver to work, even when booting older kernels.

isodude avatar Oct 01 '21 05:10 isodude

For those wondering how to build xen, here is my builder.conf.

# Since it's a very upstream branch
INSECURE_SKIP_CHECKING = vmm-xen
GIT_URL_vmm_xen = https://github.com/marmarek/qubes-vmm-xen
BRANCH_vmm_xen = update-4.14.3
COMPONENTS = \
builder \
builder-rpm \
vmm-xen

BUILDER_PLUGINS += builder-rpm

isodude avatar Oct 01 '21 06:10 isodude

The amdgpu Xorg driver now works with xorg-x11-drv-amdgpu-21.0.0-1 (https://fedora.pkgs.org/33/fedora-updates-x86_64/xorg-x11-drv-amdgpu-21.0.0-1.fc33.x86_64.rpm.html), but it does not make suspend/resume stable or remove the jitter after resume.

isodude avatar Oct 01 '21 06:10 isodude

Thanks for the help, isodude. I tried Xen 4.14.3 and kernel 5.13.13, and resuming from suspend is still broken (Ryzen 2400G). SMT is off.

dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:462 switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Modules linked in: loop nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 snd_hda_codec_realtek cfg80211 snd_hda_codec_gener>
dom0 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.13-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: RIP: e030:switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Code: 00 00 65 48 89 05 e7 8f fa 7e e9 77 fd ff ff b9 49 00 00 00 b8 01 00 00 00 31 d2 0f 30 e9 57 fd ff ff 41 89 f6 e9 9d fe ff ff <0f> 0b e8 >
dom0 kernel: RSP: e02b:ffffc900400afeb8 EFLAGS: 00010006
dom0 kernel: RAX: 000000000ea3c000 RBX: ffff8881002c4f00 RCX: 0000000000000040
dom0 kernel: RDX: ffff8881002c4f00 RSI: 0000000000000000 RDI: ffff88808ea3c000
dom0 kernel: RBP: ffffffff829d84e0 R08: 0000000000000000 R09: 0000000000000000
dom0 kernel: R10: 0000000000000004 R11: 0000000000000000 R12: ffff888100236a40
dom0 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
dom0 kernel: FS: 0000000000000000(0000) GS:ffff888127240000(0000) knlGS:0000000000000000
dom0 kernel: CS: 10000e030 DS: 002b ES: 002b CR0: 0000000080050033
dom0 kernel: CR2: 00005bde388bd0e8 CR3: 0000000002810000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel: switch_mm+0x1c/0x30
dom0 kernel: play_dead_common+0xa/0x20
dom0 kernel: xen_pv_play_dead+0xa/0x60
dom0 kernel: do_idle+0xd1/0xe0
dom0 kernel: cpu_startup_entry+0x19/0x20
dom0 kernel: asm_cpu_bringup_and_idle+0x5/0x1000
dom0 kernel: ---[ end trace 75177836fdaa3aca ]---
...
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU1
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU3
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU5
dom0 kernel: xen_acpi_processor: (_PXX): Hypervisor error (-19) for ACPI CPU7
dom0 kernel: cpu 1 spinlock event irq 67
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: cpu 2 spinlock event irq 73
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: cpu 3 spinlock event irq 79
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: [drm] psp command (0x5) failed and response status is (0x0)
dom0 kernel: [drm:psp_hw_start [amdgpu]] ERROR PSP load tmr failed!
dom0 kernel: [drm:psp_resume [amdgpu]] ERROR PSP resume failed
dom0 kernel: [drm:amdgpu_device_fw_loading [amdgpu]] ERROR resume of IP block failed -22
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
dom0 kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -22
dom0 kernel: amdgpu 0000:06:00.0: PM: failed to resume async: error -22

johnnyboy-3 avatar Oct 01 '21 08:10 johnnyboy-3

With SMT on:

dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 1 PID: 0 at arch/x86/mm/tlb.c:462 switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Modules linked in: nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_>
dom0 kernel: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.13-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: RIP: e030:switch_mm_irqs_off+0x381/0x3a0
dom0 kernel: Code: 00 00 65 48 89 05 e7 8f fa 7e e9 77 fd ff ff b9 49 00 00 00 b8 01 00 00 00 31 d2 0f 30 e9 57 fd ff ff 41 89 f6 e9 9d fe ff ff <0f> 0b e8 >
dom0 kernel: RSP: e02b:ffffc900400afeb8 EFLAGS: 00010006
dom0 kernel: RAX: 00000001023e0000 RBX: ffff8881002c8000 RCX: 0000000000000040
dom0 kernel: RDX: ffff8881002c8000 RSI: 0000000000000000 RDI: ffff8881823e0000
dom0 kernel: RBP: ffffffff829d84e0 R08: 0000000000000000 R09: 0000000000000000
dom0 kernel: R10: 0000000000000008 R11: 0000000000000000 R12: ffff88810a523300
dom0 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
dom0 kernel: FS: 0000000000000000(0000) GS:ffff888127240000(0000) knlGS:0000000000000000
dom0 kernel: CS: 10000e030 DS: 002b ES: 002b CR0: 0000000080050033
dom0 kernel: CR2: 00007202ec011726 CR3: 0000000002810000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel: switch_mm+0x1c/0x30
dom0 kernel: play_dead_common+0xa/0x20
dom0 kernel: xen_pv_play_dead+0xa/0x60
dom0 kernel: do_idle+0xd1/0xe0
dom0 kernel: cpu_startup_entry+0x19/0x20
dom0 kernel: asm_cpu_bringup_and_idle+0x5/0x1000
dom0 kernel: ---[ end trace 38fb75148761bdb4 ]---
...
dom0 kernel: cpu 1 spinlock event irq 67
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 2 spinlock event irq 73
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 3 spinlock event irq 79
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 4 spinlock event irq 85
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 5 spinlock event irq 91
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 6 spinlock event irq 97
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
dom0 kernel: cpu 7 spinlock event irq 103
dom0 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
...
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=8448, emitted seq=8450
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xorg pid 3765 thread X:cs0 pid 3839
...
dom0 kernel: amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring gfx test failed (-110)
dom0 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] ERROR resume of IP block <gfx_v9_0> failed -110
...
dom0 kernel: kfd kfd: amdgpu: error getting iommu info. is the iommu enabled?
dom0 kernel: kfd kfd: amdgpu: Error initializing iommuv2
dom0 kernel: kfd kfd: amdgpu: device 1002:15dd NOT added due to errors
...
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, but soft recovered

johnnyboy-3 avatar Oct 01 '21 09:10 johnnyboy-3

@johnnyboy-3 do you have xorg-x11-drv-amdgpu installed?

isodude avatar Oct 01 '21 13:10 isodude

xorg-x11-drv-amdgpu v19.1.0-3 installed.

Also tried Linux Kernel 5.14.9-1 with the same bug. This time with new errors in journalctl on resume:

dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, signaled seq=9917, emitted seq=9919
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xorg pid 3038 thread X:cs0 pid 3819
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
dom0 kernel: amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] ERROR ring kiq_2.1.0 test failed (-110)
dom0 kernel: [drm] free PSP TMR buffer
dom0 kernel: [drm] psp command (0x7) failed and response status is (0x0)
dom0 kernel: [drm:psp_suspend [amdgpu]] ERROR Failed to terminate tmr
dom0 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] ERROR suspend of IP block failed -22
dom0 kernel: ------------[ cut here ]------------
dom0 kernel: WARNING: CPU: 3 PID: 4326 at include/drm/ttm/ttm_bo_api.h:580 amdgpu_bo_unpin+0x5a/0xa0 [amdgpu]
dom0 kernel: Modules linked in: nf_tables nfnetlink rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 cfg80211 rfkill libarc4 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec intel_rapl_msr intel_rapl_common snd_hda_core snd_hwdep joydev snd_seq snd_seq_device snd_pcm snd_timer snd soundcore wmi_bmof r8169 pcspkr sp5100_tco i2c_piix4 k10temp gpio_amdpt gpio_generic wmi video xenfs fuse ip_tables dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt trusted asn1_encoder amdgpu crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm_ttm_helper ttm iommu_v2 ccp gpu_sched i2c_algo_bit drm_kms_helper cec drm xhci_pci xhci_pci_renesas xhci_hcd xen_acpi_processor xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
dom0 kernel: CPU: 3 PID: 4326 Comm: kworker/3:4 Tainted: G W 5.14.9-1.fc32.qubes.x86_64 #1
dom0 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Pro4, BIOS P3.60 07/31/2019
dom0 kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
dom0 kernel: RIP: e030:amdgpu_bo_unpin+0x5a/0xa0 [amdgpu]
dom0 kernel: Code: 75 25 48 8b bd 48 01 00 00 48 85 ff 74 05 e8 3d e2 5e c1 48 8b 85 c0 01 00 00 8b 40 10 83 f8 02 74 24 83 f8 01 74 0d 5b 5d c3 <0f> 0b 8b 85 04 02 00 00 eb ca 48 8b 85 30 01 00 00 f0 48 29 83 50
dom0 kernel: RSP: e02b:ffffc9004242fcb0 EFLAGS: 00010246
dom0 kernel: RAX: 0000000000000000 RBX: ffff88810d385288 RCX: 0000000000000000
dom0 kernel: RDX: ffff888013cc8000 RSI: 0000000000000000 RDI: ffff88810a717800
dom0 kernel: RBP: ffff88810a717800 R08: 0000000000000003 R09: 000000000036d488
dom0 kernel: R10: ffffc9004242fad8 R11: ffffffff82947168 R12: ffff88810d385288
dom0 kernel: R13: ffff88810a717800 R14: ffff888107321c00 R15: 0000000000000000
dom0 kernel: FS: 0000000000000000(0000) GS:ffff8881272c0000(0000) knlGS:0000000000000000
dom0 kernel: CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
dom0 kernel: CR2: 000074eab8891fb8 CR3: 0000000107554000 CR4: 0000000000050660
dom0 kernel: Call Trace:
dom0 kernel: amdgpu_gart_table_vram_unpin+0x54/0xc0 [amdgpu]
dom0 kernel: gmc_v9_0_hw_fini+0x5f/0x80 [amdgpu]
dom0 kernel: amdgpu_device_ip_suspend_phase2+0xc5/0x150 [amdgpu]
dom0 kernel: amdgpu_device_ip_suspend+0x32/0x60 [amdgpu]
dom0 kernel: amdgpu_device_pre_asic_reset+0xa8/0x250 [amdgpu]
dom0 kernel: amdgpu_device_gpu_recover.cold+0x53d/0x78e [amdgpu]
dom0 kernel: amdgpu_job_timedout+0x17a/0x1a0 [amdgpu]
dom0 kernel: drm_sched_job_timedout+0x74/0x110 [gpu_sched]
dom0 kernel: process_one_work+0x1ec/0x390
dom0 kernel: worker_thread+0x4a/0x320
dom0 kernel: ? process_one_work+0x390/0x390
dom0 kernel: kthread+0x10f/0x130
dom0 kernel: ? set_kthread_struct+0x40/0x40
dom0 kernel: ret_from_fork+0x22/0x30
dom0 kernel: ---[ end trace d480e2c68621aa89 ]---
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
dom0 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15dd
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset(2) failed
dom0 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
dom0 kernel: kfd kfd: amdgpu: error getting iommu info. is the iommu enabled?
dom0 kernel: kfd kfd: amdgpu: Error initializing iommuv2
dom0 kernel: kfd kfd: amdgpu: device 1002:15dd NOT added due to errors
dom0 kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -6
dom0 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx timeout, but soft recovered

johnnyboy-3 avatar Oct 01 '21 14:10 johnnyboy-3

@johnnyboy-3 the correct kernel parameter should be nosmt, by the way. It's odd that xen_acpi_processor tries to send updates to Xen for the second thread of each core even though the kernel is booted with nosmt. It even says "SMT: Disabled" during boot.

isodude avatar Oct 05 '21 07:10 isodude

I think the kernel doesn't have full knowledge of which thread is running where; only Xen has direct access to that info. And in fact vcpu 2 of dom0 doesn't necessarily run on physical core/thread 2. This also means the "nosmt" kernel option is not an effective mitigation against speculative execution bugs when running under Xen.

marmarek avatar Oct 05 '21 09:10 marmarek

Cool, so like a normal VM then. So xen_acpi_processor trying to send up information about 16 cores can simply be ignored.

I'm trying to pin down exactly what makes the amdgpu driver give up and die on me when resuming, but I'm no longer sure which avenues are worth pursuing in the debugging hunt.

isodude avatar Oct 05 '21 20:10 isodude

I just tried out kernel 5.15-rc5 and it's still the same behavior; however, I had the laptop asleep for the whole night and it woke up fine. There are still occasional artifacts around text when text is written to the screen.

I did make one change, though: I moved ati_drv.so out of /usr/lib64/xorg/modules/drivers, and I feel that Xorg behaves much better now, even though I can't see any direct differences in Xorg.0.log. I managed a solid three suspend/resume cycles before the amdgpu driver gave up on the SETUP_TMR command (whose name is now written out in the log thanks to the newer kernel).

Just a note: one thing I'm concerned about is that I need to revert "PCI/MSI: Use new mask/unmask functions". Somewhere between 5.15-rc1 and 5.15-rc2 it was fixed, but between rc2 and rc4 it broke again. I'll have to bisect this. Since amdgpu dies hard on this, maybe it's a bug in their driver that only surfaces with the new mask/unmask functions.

The error that the kernel dies on this time is

[drm] psp gfx command SETUP_TMR(0x5) failed and response status is (0x0)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load tmr failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed

I'm not sure whether this is the culprit, or whether amdgpu just sometimes fails to load firmware on resume; I've seen HDCP fail as well. I've tried to disable TMR (Trusted Memory Region) by setting CONFIG_AMD_MEM_ENCRYPT=n. TBH I don't know what Xen's standpoint is on those features; maybe @marmarek knows? But in general the kernel dies on HDCP and TMR.

isodude avatar Oct 06 '21 05:10 isodude

Yay, latest kernel-ark with

CONFIG_SND_SOC_AMD_RENOIR=n
CONFIG_DRM_AMD_DC_HDCP=n
CONFIG_DRM_AMD_SECURE_DISPLAY=n
CONFIG_HSA_AMD_SVM=n
CONFIG_AMD_MEM_ENCRYPT=n

booting with kernel options pci=nomsi

Now it actually suspends/resumes correctly.
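When comparing kernels it helps to verify that a build really has these options off. The following is a minimal sketch, assuming a standard kernel config file where disabled options appear as "# CONFIG_X is not set" or are absent; config_state is a hypothetical helper name, and the config path in the usage note is an assumption about where the distribution installs it.

```shell
# Hypothetical helper: report how an option is set in a kernel config file.
# Prints the value after "=" (y, m, or a literal), or n when the option is
# marked "is not set" or does not appear at all.
config_state() {
    option="$1"
    file="$2"
    value=$(sed -n "s/^${option}=//p" "$file" | head -n 1)
    if [ -n "$value" ]; then
        echo "$value"
    else
        echo n
    fi
}
```

For example, `config_state CONFIG_DRM_AMD_DC_HDCP "/boot/config-$(uname -r)"` should print n on a kernel built with the settings listed in this post.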

Attached is lspci -vv with Enable+ selected. lscpi-msi.log

isodude avatar Oct 06 '21 14:10 isodude

Tried dom0 linux kernel 5.10.61 recompilation with mentioned kernel & boot options on R4.1 - no luck.

johnnyboy-3 avatar Oct 06 '21 20:10 johnnyboy-3

@johnnyboy-3 I guess you need to be past the new MSI mask/unmask patches (somewhere between 5.14 and 5.15). I tried 5.12.14 and it was no go there. I can update my linux-kernel-tree if you'd like.

I did manage to get a crash, somewhere around the 10th to 15th resume, pretty much at the moment the USB ports were reset. It feels like the problem may be in how USB is handled. I'm trying to ignore 02:00.4 (the USB ports in the expansion port), but I lack the expertise to tell Xen to just ignore them. Soon I'll rip ehci out of the kernel :)

Looking at my lspci log it seems that xhci and ehci got MSI disabled, but not the other AMD PCI devices.
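To make that comparison easier to repeat, the devices that still have MSI or MSI-X enabled can be pulled out of lspci -vv output. A rough sketch, assuming the usual lspci layout where each device header starts in column 0 and its capability lines are indented; msi_enabled_devices is a hypothetical name.

```shell
# Hypothetical helper: given `lspci -vv` output on stdin, print the header line
# of every device that has an MSI or MSI-X capability with Enable+.
msi_enabled_devices() {
    awk '
        /^[0-9a-f]/ { dev = $0 }                               # device header line
        /MSI(-X)?: Enable\+/ { if (!seen[dev]++) print dev }   # print each device once
    '
}
```

Usage would be along the lines of `sudo lspci -vv | msi_enabled_devices`; with pci=nomsi in effect the list would be expected to come back empty.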

isodude avatar Oct 06 '21 21:10 isodude

15 h of sleep with 0.277Wh, that's pretty solid for S3! 5.15 worked with these kernel configs and pci=nomsi.

CONFIG_AMD_PMC=y
CONFIG_HSA_AMD=n

These were set too, but I don't think they make any difference.

CONFIG_DRM_AMD_DC_HDCP=n
CONFIG_DRM_AMD_SECURE_DISPLAY=n

Text-jitter is almost gone completely compared to before.

I am going to compile 5.14.9 and see how well it fares with CONFIG_AMD_PMC=y and CONFIG_HSA_AMD=n, because that kernel doesn't need MSI disabled. Then I'm going to bisect the problems with MSI in 5.15.

isodude avatar Oct 08 '21 05:10 isodude

That's some good news!

> I can update my linux-kernel-tree if you'd like.

Thanks for your offer but I don't think that's necessary for now. I wonder if this problem can be fixed on older kernels in Qubes R4.0 too.

johnnyboy-3 avatar Oct 08 '21 08:10 johnnyboy-3

5.14.9 doesn't work that well out of the box. With pci=nomsi it's quirky (the external screen dies sometimes, the internal screen dies sometimes), but I've suspended/resumed at least 10 times now without a reboot. It doesn't work as well as 5.15 with pci=nomsi, though.

This is 5.14.9 (latest qubes-linux-kernel) with

CONFIG_AMD_PMC=y
CONFIG_HSA_AMD=n

Will try to get tip booted without pci=nomsi now, that should be fun!

isodude avatar Oct 09 '21 01:10 isodude

> Thanks for your offer but I don't think that's necessary for now. I wonder if this problem can be fixed on older kernels in Qubes R4.0 too.

I'm pessimistic! There are a lot of changes between those kernels and the new ones.

isodude avatar Oct 09 '21 01:10 isodude

With some patches in msi drivers I got kernel 5.15 working.

X is restarting once in a while, but that's fine since X running inside VMs survive :) I guess that relates to my hacked up X amdgpu drivers.

isodude avatar Oct 12 '21 20:10 isodude

Progress, yeah! ^^

> With some patches in msi drivers I got kernel 5.15 working.
>
> X is restarting once in a while, but that's fine since X running inside VMs survive :) I guess that relates to my hacked up X amdgpu drivers.

Are you running a clean R4.1 RC1, or did you add or change anything besides the modified kernel 5.15 and the MSI drivers? The kernel is self-compiled with CONFIG_AMD_PMC=y and CONFIG_HSA_AMD=n, right? Which MSI patches? Anything else?

I tried RC1 out of the box and with kernel-latest 5.14.10 (testing) but same issue as before, just to be sure ^^

bigdx avatar Oct 13 '21 15:10 bigdx

builder.conf:

GIT_URL_linux_kernel = https://github.com/isodude/qubes-linux-kernel
BRANCH_linux_kernel = devel-5.15

I can't figure out how to get make get-sources to work properly, so I download the source manually instead.

wget https://gitlab.com/cki-project/kernel-ark/-/archive/v5.15-rc5/kernel-ark-v5.15-rc5.tar.bz2

Unpack it, rename the folder to linux-5.15-rc5, and pack it again as a .tar.
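The unpack, rename, and repack steps could be wrapped up as follows. This is a sketch under assumptions: repack_as_tar is a hypothetical helper, the name of the archive's top-level directory has to be supplied by the caller (kernel-ark-v5.15-rc5 in the usage note is a guess, not something confirmed in this thread), and it relies on tar detecting the compression by itself, as GNU tar does.

```shell
# Hypothetical helper: unpack an archive, rename its top-level directory, and
# write a plain, uncompressed <new_dir>.tar into the current directory.
# Arguments: archive path, current top-level directory name, new directory name.
repack_as_tar() {
    archive="$1"
    old_dir="$2"
    new_dir="$3"
    work=$(mktemp -d) || return 1
    tar -xf "$archive" -C "$work" || return 1          # compression is auto-detected
    mv "$work/$old_dir" "$work/$new_dir" || return 1
    tar -cf "$new_dir.tar" -C "$work" "$new_dir" || return 1
    rm -rf "$work"
}
```

With the download above this might be invoked as `repack_as_tar kernel-ark-v5.15-rc5.tar.bz2 kernel-ark-v5.15-rc5 linux-5.15-rc5`, producing linux-5.15-rc5.tar.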

I'm compiling the kernel now to see if it really works with what I committed. It's quirky right now, but I haven't had to reboot the system yet.

isodude avatar Oct 14 '21 04:10 isodude

I updated the MSI patch a bit to reflect what was actually missing between the two commits (adding the new MSI functions vs. removing the old ones).

I still see these flip done timeouts, though. I thought it was going quite well with the new MSI patches, but I get flip done timeouts anyway, just not nearly as badly as without them.

Something is stuck somewhere and I have no clue how to even see what is wrong. All I know is that 5.15 with pci=nomsi is a good combination at least. Even without X started, 5.15 doesn't fare well with suspend, but 5.15 with pci=nomsi just keeps going even when there are errors.

If anyone has any idea about what to do or what to analyze, please do tell.

isodude avatar Oct 16 '21 18:10 isodude

@isodude, I don't understand a lot of the things you're trying and discussing here, but do you think this amdgpu issue I'm experiencing with Ryzen 7 4800H might be related to this one? FWIW, I also have the same "screen does not wake up after resume" issue as well, though I haven't actually tried to diagnose that at all... :sweat_smile:

na-- avatar Oct 18 '21 08:10 na--

@na-- Lately I've become quite pessimistic that there's one fix to rule them all; more likely it's a whole slew of small patches that together make up the forest. There are a bunch of upstreamed commits for the amdgpu driver that are worth testing out.

I just sent in the PCI/MSI patch this morning and I hope it gets accepted: https://lore.kernel.org/linux-pci/[email protected]/T/#u. That will make our lives a bit easier when testing the 5.15 kernel; it's already included in my qubes-linux-kernel branch.

isodude avatar Oct 18 '21 10:10 isodude

I've got some updates, hopefully better updates after a bit more trial.

Anyhow, I tried compiling without CCP (the AMD crypto co-processor) and the system survived all the resumes I threw at it. Sometimes SDMA just drops dead and I have to restart lightdm (doing it with a blank screen), but after that suspend/resume works anyway. I'm now doing some patching in the amdgpu driver around the SDMA resume area. Hopefully that will yield a good result.

It would be nice to see if disabling CCP has a good effect on versions below 5.15 as well.

isodude avatar Oct 23 '21 21:10 isodude

thanks for all the work you're putting into this

tzelch avatar Oct 23 '21 21:10 tzelch

Hopefully we will reach some sort of end to it all, thanks for following along :)

I realized that I had a config file installed that may not be common knowledge.

In /etc/X11/xorg.conf.d/50-video.conf:

Section "Device"
  Identifier "card0"
  Driver "amdgpu"
  Option "AccelMethod" "none"
EndSection

Do you guys have that? There's no acceleration to talk of in Xen anyhow, and without that snippet my suspend/resume has really odd effects and X suddenly hard-crashes.

isodude avatar Oct 24 '21 09:10 isodude

> There's no acceleration to talk of in Xen anyhow

This is actually not true. Hardware acceleration in the GUI qube should work just fine. Please report a bug if it does not.

DemiMarie avatar Oct 24 '21 19:10 DemiMarie