linux-surface
GPU hang on wakeup on 5.2.16
I tried qzed's 5.2.16 today and it's causing a GPU hang on wakeup from suspend - not every time, but frequently enough that I got it three times today during normal use. In particular, when I wake up, sometimes the Surface becomes unresponsive for a time (I can move the cursor, but can't do anything else, such as entering the password to unlock it) and then I get logged out and sent back to the login screen with no hardware acceleration.
On 5.2.5 @qzed suggested that we remove the modprobe -r intel_ipts and mei lines from the system-sleep/sleep script, and when I did that, I started getting GPU hangs on wakeup (see https://github.com/jakeday/linux-surface/issues/544#issuecomment-523133653). I added back the lines that remove the modules and the issue was fixed.
Now it's back again on 5.2.16 (I assume also on 5.2.15 - I didn't actually try that, but the patches didn't change). And it happens even though I do remove the intel_ipts, mei, and mei_me modules before suspend.
journal.txt
I can confirm that something's going on there, but I haven't had the time to look at it. On the SB2, (usually) the first resume takes a couple of seconds (which could indicate a GPU hang), during which I have a black screen, afterwards everything seems to be working okay again though. We should probably test if this also happens on a kernel without IPTS support. Problem is: In my case I haven't gotten an explicit message indicating a GPU hang... just 10s of total kernel message silence. Subsequent resumes don't seem to cause any trouble for me.
@tmarkov
If you're going to unload intel_ipts, make sure you also unload all the modules that are using intel_ipts:
$ lsmod | grep -ie ipts -ie "Used by"
Module Size Used by
ipts_surface 16384 0
intel_ipts 45056 1 ipts_surface
i915 1884160 18 intel_ipts
mei 122880 3 intel_ipts,mei_me
hid 143360 7 i2c_hid,usbhid,hid_multitouch,hid_sensor_hub,intel_ipts,hid_generic,surface_acpi
Recent kernel builds added ipts_surface, which uses intel_ipts, so intel_ipts can't be removed until ipts_surface is gone:
$ sudo modprobe -r intel_ipts
modprobe: FATAL: Module intel_ipts is in use.
$ sudo modprobe -r ipts_surface
$ sudo modprobe -r intel_ipts
# no error
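The "Used by" check above can be scripted. This is only a minimal sketch, assuming lsmod's usual layout (module name, size, use count, then a comma-separated list of users); module_users is a hypothetical helper name, not something shipped with linux-surface:

```shell
# Hypothetical helper: print the modules that use the module named in $1,
# one per line. It reads lsmod-formatted text on stdin.
module_users() {
    # Column 4 holds the comma-separated users; it is absent when there are none.
    awk -v m="$1" '$1 == m && NF >= 4 { print $4 }' | tr ',' '\n'
}
```

Running `lsmod | module_users intel_ipts` against the listing above would print ipts_surface, which is the module that has to be removed first.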
Note: I don't have this issue on 5.3.1 on the SB1, even without unloading any IPTS-related modules.
Unload/load ipts modules on newer build
If you're going to unload/load the IPTS modules on suspend: I noticed that on newer builds, unloading/loading only intel_ipts is not sufficient. You need to unload/load ipts_surface instead.
Unload ipts module on newer build:
sudo modprobe -r ipts_surface
# unloading ipts_surface also unloads intel_ipts
Load ipts module on newer build:
sudo modprobe ipts_surface
# Loading ipts_surface also loads intel_ipts
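Put together, the two commands could sit in a system-sleep hook roughly like this. It is a sketch, not the actual linux-surface sleep script: it relies on systemd calling such hooks with "pre" or "post" as the first argument, the function name ipts_sleep_hook and the MODPROBE override are made up for illustration, and the mei / mei_me handling mentioned earlier in the thread is left out.

```shell
#!/bin/sh
# Sketch of a systemd system-sleep hook body (hypothetical names).
# MODPROBE can be overridden, e.g. MODPROBE=echo for a dry run.
MODPROBE="${MODPROBE:-modprobe}"

ipts_sleep_hook() {
    case "$1" in
        pre)
            # Before suspend: removing ipts_surface also unloads intel_ipts.
            $MODPROBE -r ipts_surface
            ;;
        post)
            # After resume: loading ipts_surface also loads intel_ipts.
            $MODPROBE ipts_surface
            ;;
    esac
}
```

A real hook file would end with a call such as `ipts_sleep_hook "$@"` so that the arguments passed by systemd reach the function.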
Thanks, that fixed my immediate issue. I'll leave this open since the GPU hang shouldn't happen at all, but unloading the modules works fine for now.
Okay, my immediate issue seems to be somehow caused by the dGPU hot-plug driver. When I remove this (and do not unload the modules in the sleep script) I can reproduce the GPU hang on my SB2. Definitely caused by IPTS as this doesn't happen when I use a kernel without IPTS support.
I'll make a note here that the issue "occasionally IPTS will break resuming from suspend" (RIP: 0010:kmem_cache_alloc_trace, see https://github.com/jakeday/linux-surface/issues/544#issuecomment-513531566) still persists on 5.3.
I think it's somehow related to this issue.
An issue that may be related to this is that occasionally on resume for my SB2, if the dGPU is off, the keyboard and touchpad stop working and the whole base seems to get stuck in a hung state. The keyboard backlight and detach keys don't do anything and it requires a full power cycle to bring the base back. Unloading and reloading the IPTS modules didn't work for me.
Until the bug is fixed, I've found that power-cycling the dGPU in the sleep script:
if [ -n "$(surface dgpu get | grep off)" ]; then surface dgpu set on; surface dgpu set off; fi
seems to allow for an instant resume and solve the problem. I assume this is safe to add to the script for all systems because if the dGPU hotplug driver isn't loaded, nothing should happen.
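The same one-liner, spelled out as a function that is easier to read and to dry-run. This is just a sketch: dgpu_power_cycle and the SURFACE override are names invented here, and the explicit check for a missing surface command is an extra guard, not something the original line does.

```shell
# Hypothetical wrapper around the one-liner above.
# SURFACE can point at a stub for testing; it defaults to the surface tool.
SURFACE="${SURFACE:-surface}"

dgpu_power_cycle() {
    # Do nothing if the surface command is not available at all.
    command -v "$SURFACE" >/dev/null 2>&1 || return 0
    # Only cycle when the dGPU reports "off", as in the original check.
    if "$SURFACE" dgpu get | grep -q off; then
        "$SURFACE" dgpu set on
        "$SURFACE" dgpu set off
    fi
}
```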
If the dGPU is already on but the nvidia modules are NOT loaded, I can't reproduce the hang. However, if the nvidia modules were loaded, even if nothing is using them, the X server crashes on resume. I assume that's a different bug.
@chasecovello I'm currently working on the dGPU issue. Something in 5.2 changed and now it hangs. For me usually about 10s, occasionally requires a reboot though. The current solution for turning off the dGPU is more like a band-aid fix, but implementing this properly will likely take me a bit, as this time around I want to provide a proper solution handling all edge-cases. I agree that the X-server crash is probably a different bug.
To add a bit on the second "bug": I wouldn't really describe it as a bug, but rather as an issue. The problem is likely the following: X-server/wayland/gnome/... somehow access the device before suspending and expect it to still be present and in a similar enough state afterwards. Currently, the dGPU gets powered off during suspend (as it tends to get warm and consumes power if we don't do that), so the state changes. Apparently userland doesn't handle that gracefully. Ideally, we'd be able to suspend/resume the dGPU without completely forcing it off (and thus losing all state on it) and without it getting warm. You already know the workaround: unload the nvidia drivers before suspend (given that we force the dGPU off during suspend, applications that use it will crash anyway).
Anyway, that's off-topic here, the GPU hang discussed in this issue is on the integrated one and those two are not related.