qubes-issues
Hard crashes on 5.15 and 5.18 kernels
Qubes OS release
4.1
Hardware: Thinkpad X1 Carbon 3rd generation
Perhaps relevant software: i3 window manager
Brief summary
On kernels newer than 5.10.x, I am seeing consistent hard crashes: X locks up with no mouse movement or keyboard response, and then the machine either hard reboots on its own or I have to reboot it manually by holding the power button. I get these about once a day with normal use.
I have seen it on these kernels in dom0 (the VM kernel doesn't seem to matter):
5.15.52-1 5.15.57-1 5.15.63-1 5.18.16-1
I'm happy to test newer versions of kernel-latest as well, but right now I am on 5.10 because I don't like crashes.
Steps to reproduce
No single activity seems to make it easily reproducible. However, I have noticed that load can sometimes contribute. For example, watching a movie over sshfs and dragging a VLC window around.
Expected behavior
No crashes.
Actual behavior
Crashes, as noted above. I see nothing in journalctl. Following a suggestion in IRC, I have tried to enable pstore following this guide https://blogs.oracle.com/linux/post/pstore-linux-kernel-persistent-storage-file-system. However, /sys/fs/pstore is still empty after a crash.
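For anyone repeating the pstore check after each crash, a small helper along these lines may save some typing. This is only a sketch: the function name `pstore_records` is made up for illustration, and it assumes pstore is mounted at the conventional `/sys/fs/pstore` unless another directory is passed in.

```shell
#!/bin/sh
# Report how many crash records are present in the pstore directory.
# $1 is the directory to inspect; it defaults to /sys/fs/pstore
# (assumed mount point, adjust if your system mounts it elsewhere).
pstore_records() {
    dir=${1:-/sys/fs/pstore}
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Count only direct children; each saved record appears as one file.
    count=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')
    echo "$count record(s) in $dir"
}
```

Running `pstore_records` right after rebooting from a crash shows whether anything was captured. Reading the records themselves may require root.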
I have also been getting these crashes for 2 months so maybe it is the same issue. On the Qubes user forum there are many others getting it too. My journal logs around the crash time generally have the following:
Sep 21 15:30:33 dom0 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
Sep 21 15:30:33 dom0 kernel: i915 0000:00:02.0: [drm] Xorg[3907] context reset due to GPU hang
Sep 21 15:30:33 dom0 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:00d7ff93, in Xorg [3907]
Sep 21 15:30:33 dom0 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
Sep 21 15:30:33 dom0 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:00d7ff93, in Xorg [3907]
Sep 21 15:30:33 dom0 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
Sep 21 15:30:33 dom0 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:0aa15006, in Xorg [3907]
Sep 21 15:30:33 dom0 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
.......
Sep 21 15:31:16 dom0 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:e757fefe, in Xorg [3907]
Sep 21 15:31:28 dom0 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Sep 21 15:31:28 dom0 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:e757fefe, in Xorg [3907]
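To compare notes with others seeing similar messages, it may help to condense the kernel log into a count per hang code. The following is a rough sketch, not a definitive tool: the function name `summarize_gpu_hangs` is hypothetical, and the pattern is taken from the `GPU HANG: ecode ...` lines in the excerpt above, so it will miss messages worded differently.

```shell
#!/bin/sh
# Read kernel log text on stdin and print each distinct i915 hang code
# with the number of times it occurred, most frequent first.
summarize_gpu_hangs() {
    # Keep only the "GPU HANG: ecode <code>" part of each matching line.
    grep -o 'GPU HANG: ecode [0-9a-f:]*' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Assuming the journal is persistent, so that the previous boot is still available, something like `journalctl -k -b -1 --no-pager | summarize_gpu_hangs` would summarize the boot that ended in the crash.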
I thought it might be due to frequency changing in the GPU, so I pinned the GPU frequency and that may have helped somewhat but did not resolve it:
sudo intel_gpu_frequency --max
For me, the issue seems to be triggered more often the more qubes are open. I can run for hours and hours with fewer than 8 qubes running, but more than 8 seems to cause system instability. I have also seen this with the audio system: sometimes audio stops working, and the level bars in the audio mixer show sound but nothing comes out. If I shut down several qubes, this somehow restores the audio. So maybe running lots of qubes causes some general system instability?
I have tested every kernel version and the crashes are about the same across them all. Generally the system will do one of 3 things, each preceded by graphical artifacts: 1) the windows freeze and nothing works; 2) it logs me out and makes me log back in, and logging back in generally results in a permanent freeze shortly after; 3) the machine just shuts down.
I have monitored temperature with "sensors" to see if it is a hard shutdown due to overheating, and sensors never reports a temperature close to the critical value. Typically it is < 70C.
From my googling it looks like the Thinkpad would also be using the i915 driver, and I've seen this on one of my boxes that also uses the i915 driver. Kernel was 5.18.9-1. This could also explain the box occasionally failing to unlock: the screen would just stay black and it ignored all input, requiring a hard reboot. @QUser534 - where did you find intel_gpu_frequency?
I'm not convinced that @QUser534 is seeing the same crashes as I am. I don't see those messages in the log, nor do I see the variable crash behavior.
One other thing about this machine of mine: It's using the intel driver, not modesetting, for Xorg, to avoid graphical tearing.
Still reproducible on 5.15.74-2.
Finally found some logs (for some reason I was getting nothing in journalctl, but I did a reinstall and then saw the logs) and discovered my crashes are a dupe of the existing bug about Xorg pages. https://github.com/QubesOS/qubes-issues/issues/7664