
GPU hang on wakeup on 5.2.16

Open tmarkov opened this issue 5 years ago • 9 comments

I tried qzed's 5.2.16 today and it's causing a GPU hang on wakeup from suspend. It doesn't happen every time, but frequently enough that I got it three times today during normal use. In particular, when I wake the device, the Surface sometimes becomes unresponsive for a while (I can move the cursor, but can't do anything else, such as entering my password to unlock it), and then I get logged out and sent back to the login screen with no hardware acceleration.

On 5.2.5 @qzed suggested that we remove the modprobe -r lines for intel_ipts and mei from the system-sleep/sleep script, and when I did that, I started getting GPU hangs on wakeup (see https://github.com/jakeday/linux-surface/issues/544#issuecomment-523133653). I added back the lines that remove the modules and the issue was fixed.

Now it's back again on 5.2.16 (I assume also 5.2.15 - I didn't actually try that, but the patches didn't change). And it happens even though I do remove intel_ipts, mei, and mei_me modules before suspend. journal.txt

tmarkov avatar Sep 21 '19 18:09 tmarkov

I can confirm that something's going on there, but I haven't had the time to look at it. On the SB2, the first resume (usually) takes a couple of seconds, during which I have a black screen; this could indicate a GPU hang. Afterwards everything seems to be working okay again, though. We should probably test whether this also happens on a kernel without IPTS support. The problem is that in my case I haven't gotten an explicit message indicating a GPU hang, just about 10s of complete silence in the kernel log. Subsequent resumes don't seem to cause any trouble for me.

qzed avatar Sep 21 '19 20:09 qzed

@tmarkov If you're going to unload intel_ipts, make sure you unload all the modules which are using intel_ipts:

$ lsmod | grep -ie ipts -ie "Used by"
Module                  Size  Used by
ipts_surface           16384  0
intel_ipts             45056  1 ipts_surface
i915                 1884160  18 intel_ipts
mei                   122880  3 intel_ipts,mei_me
hid                   143360  7 i2c_hid,usbhid,hid_multitouch,hid_sensor_hub,intel_ipts,hid_generic,surface_acpi

Recent kernel builds added ipts_surface, which uses intel_ipts, so intel_ipts can't be removed on its own:

$ sudo modprobe -r intel_ipts
modprobe: FATAL: Module intel_ipts is in use.

$ sudo modprobe -r ipts_surface
$ sudo modprobe -r intel_ipts  
# no error

Note: I don't have this issue on 5.3.1 on SB1, even without unloading any IPTS-related modules.
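
As a complement to the lsmod check above, here is a minimal sketch that lists which modules currently hold a reference to a given module by reading the kernel's /sys/module/<name>/holders directory. The function name module_holders and the SYSFS_MODULE_DIR override are hypothetical and only there for illustration and dry runs; they are not part of any script in this repository.

```shell
#!/bin/sh
# Print the modules that depend on (hold) the module named in $1.
# SYSFS_MODULE_DIR is a hypothetical override used only for testing;
# by default the real /sys/module tree is read.
module_holders() {
    dir="${SYSFS_MODULE_DIR:-/sys/module}/$1/holders"
    # If the module is not loaded, the directory does not exist: print nothing.
    [ -d "$dir" ] || return 0
    ls "$dir"
}

# Every module printed here has to be removed before intel_ipts can be.
module_holders intel_ipts
```

On a system matching the lsmod output above, this would be expected to list ipts_surface, since that is the only module shown as using intel_ipts.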

kitakar5525 avatar Sep 24 '19 17:09 kitakar5525

Unload/load ipts modules on newer builds

If you're going to unload/load the ipts modules on suspend, note that on newer builds, unloading/loading only intel_ipts is not sufficient. You need to unload/load ipts_surface instead.

Unload the ipts modules on newer builds:

sudo modprobe -r ipts_surface
# Unloading ipts_surface also unloads intel_ipts

Load the ipts modules on newer builds:

sudo modprobe ipts_surface
# Loading ipts_surface also loads intel_ipts
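
Put together, the two commands could be wired into a sleep hook roughly as follows. This is only a sketch, assuming a systemd system-sleep hook, which systemd calls with "pre" or "post" as the first argument; the function name ipts_sleep_hook and the MODPROBE override are hypothetical and exist only so the logic can be dry-run. It is not the sleep script shipped in this repository.

```shell
#!/bin/sh
# Sketch of a systemd system-sleep hook (a file placed under
# /usr/lib/systemd/system-sleep/). systemd runs it with
# $1 = "pre" before sleeping and $1 = "post" after resuming.
# MODPROBE is a hypothetical override so the logic can be dry-run.
MODPROBE="${MODPROBE:-modprobe}"

ipts_sleep_hook() {
    case "${1:-}" in
        pre)
            # Unloading ipts_surface also unloads intel_ipts.
            $MODPROBE -r ipts_surface
            ;;
        post)
            # Loading ipts_surface also loads intel_ipts.
            $MODPROBE ipts_surface
            ;;
    esac
}

ipts_sleep_hook "${1:-}"
```

Setting MODPROBE=echo before calling ipts_sleep_hook prints the modprobe arguments instead of touching any modules, which is a cheap way to check the hook before installing it.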

kitakar5525 avatar Sep 25 '19 06:09 kitakar5525

Thanks, that fixed my immediate issue. I'll leave this open since the GPU hang shouldn't happen at all, but unloading the modules works fine for now.

tmarkov avatar Sep 25 '19 17:09 tmarkov

Okay, my immediate issue seems to be somehow caused by the dGPU hot-plug driver. When I remove this (and do not unload the modules in the sleep script) I can reproduce the GPU hang on my SB2. Definitely caused by IPTS as this doesn't happen when I use a kernel without IPTS support.

qzed avatar Oct 03 '19 12:10 qzed

I'll take a note here that the issue "occasionally IPTS will break resuming from suspend" (RIP: 0010:kmem_cache_alloc_trace) (https://github.com/jakeday/linux-surface/issues/544#issuecomment-513531566) still persists on 5.3.

I think it's somehow related to this issue.

kitakar5525 avatar Oct 10 '19 09:10 kitakar5525

An issue that may be related to this is that occasionally on resume for my SB2, if the dGPU is off, the keyboard and touchpad stop working and the whole base seems to get stuck in a hung state. The keyboard backlight and detach keys don't do anything and it requires a full power cycle to bring the base back. Unloading and reloading the IPTS modules didn't work for me.

Until the bug is fixed, I've found that power-cycling the dGPU in the sleep script:

if [ -n "$(surface dgpu get | grep off)" ]; then surface dgpu set on; surface dgpu set off; fi

seems to allow for an instant resume and solve the problem. I assume this is safe to add to the script for all systems because if the dGPU hotplug driver isn't loaded, nothing should happen.
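
For anyone adapting this, the same one-liner can be sketched as a small function with an explicit guard for systems where the surface tool is not installed. The function name cycle_dgpu_if_off and the SURFACE override are hypothetical; only the surface dgpu get/set commands come from the line above, and where in the sleep script to call it is left to the reader.

```shell
#!/bin/sh
# SURFACE is a hypothetical override so the logic can be dry-run;
# by default the real "surface" command is used.
SURFACE="${SURFACE:-surface}"

cycle_dgpu_if_off() {
    # Do nothing if the surface tool is not available on this system.
    command -v "$SURFACE" >/dev/null 2>&1 || return 0
    # Same check as the one-liner: only act when the dGPU reports "off".
    if "$SURFACE" dgpu get | grep -q off; then
        "$SURFACE" dgpu set on
        "$SURFACE" dgpu set off
    fi
}

cycle_dgpu_if_off
```

The grep -q form behaves like the original [ -n "$(... | grep off)" ] test but avoids capturing the output.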

If the dGPU is already on but the nvidia modules are NOT loaded, I can't reproduce the hang. However, if the nvidia modules were loaded, even if nothing is using them, the X server crashes on resume. I assume that's a different bug.

chasecovello avatar Oct 15 '19 18:10 chasecovello

@chasecovello I'm currently working on the dGPU issue. Something changed in 5.2 and now it hangs. For me the hang usually lasts about 10s, though occasionally it requires a reboot. The current solution for turning off the dGPU is more of a band-aid fix, but implementing this properly will likely take me a while, as this time around I want to provide a proper solution that handles all edge cases. I agree that the X server crash is probably a different bug.

qzed avatar Oct 16 '19 17:10 qzed

To add a bit on the second "bug": I wouldn't really describe it as a bug, but rather as an issue. The problem is likely the following: X server/Wayland/GNOME/... somehow access the device before suspending and expect it to still be present, and in a similar enough state as before, after resume. Currently, the dGPU gets powered off during suspend (as it tends to get warm and consumes power if we don't do that), so the state changes. Apparently userland doesn't handle that gracefully. Ideally, we'd be able to suspend/resume the dGPU without completely forcing it off (and thus losing all state on it) and without it getting warm. You already know the workaround: unload the nvidia drivers before suspend (given that we force the dGPU off during suspend, applications that use it will crash anyway).
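
A rough sketch of that workaround, assuming the proprietary NVIDIA driver with its usual module names (nvidia_drm, nvidia_modeset, nvidia_uvm, nvidia); the exact set on a given system may differ, so check lsmod first. The function name unload_nvidia and the MODPROBE override are hypothetical and only there for illustration.

```shell
#!/bin/sh
# MODPROBE is a hypothetical override so the logic can be dry-run.
MODPROBE="${MODPROBE:-modprobe}"

unload_nvidia() {
    # Assumed module names for the proprietary driver, listed so that
    # dependents are removed before the modules they depend on.
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        # Ignore failures, e.g. when a module is missing or still in use.
        $MODPROBE -r "$mod" 2>/dev/null || true
    done
}
```

Calling unload_nvidia before suspend only helps if nothing is using the dGPU at that point; a module that is still in use will simply stay loaded.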

Anyway, that's off-topic here, the GPU hang discussed in this issue is on the integrated one and those two are not related.

qzed avatar Oct 16 '19 20:10 qzed