Touch and pen issue persists

Open ghost opened this issue 6 years ago • 188 comments

#242

I'm sorry to say that despite your hard work, touch still crashes intermittently in 4.19.18-2. I haven't used the latest kernel for very long, but my initial impression is that it is no better or worse than the 4.18.7 kernels I was previously using.

My experience continues to be it most obviously crashes as a result of visiting a site that will load new information as you scroll down a page ad infinitum. For example, scrolling down someone's twitter feed at reading speed. Another example is if I visit a site that has a "long" home page packed with links, pictures and short video clips. An example that comes to mind is dailymail.co.uk which often will trigger me with rage, and my SP4's touch to crash.

In both examples, it doesn't crash the moment I venture onto the site; I usually have scrolled for a while before it crashes.

I don't know why both those examples reliably crash it.

ghost avatar Jan 29 '19 16:01 ghost

Oh, I can still reproduce this issue on SB1. Touch stopped after about 10 minutes of browsing dailymail.co.uk with the display in portrait orientation. Though I think this is much more stable than before.

And as before, there are no useful logs.

kitakar5525 avatar Jan 29 '19 17:01 kitakar5525

Same. Installed the latest kernel and touch crashes after a while on my SP4 running Ubuntu 18.04.

ghost avatar Jan 30 '19 07:01 ghost

I would assume @kitakar5525 was experiencing a different issue on 4.16+ (which I don't think I was, as it wasn't less stable for me), and which Jakeday fixed. But I can again confirm I have reproduced this on the original 4.9 code.

tmarkov avatar Jan 30 '19 09:01 tmarkov

@tmarkov I think you are right.

We could not set priority to GUC_CLIENT_PRIORITY_HIGH in guc_client_alloc() on 4.16+, which is fixed in the latest commit 1143fca. This change significantly improved stability for me. (Originally, this change was proposed here.)

But as you point out, the real problem is in the original IPTS code. I thought this original problem could be fixed with commit 1143fca, but it turned out to be only a mitigation of the real problem.

kitakar5525 avatar Jan 30 '19 10:01 kitakar5525

By the way, I created an IPTS patch from the original repository ipts-linux-org/ipts-linux-new for reference.

kitakar5525 avatar Jan 30 '19 11:01 kitakar5525

I'm really going to need some logs for debugging here. I can't reproduce this on an SB1, SB2 or Surface Laptop.

jakeday avatar Jan 30 '19 11:01 jakeday

Here is dmesg. I switched on without the keyboard attached, began browsing. I was reading a twitter feed at the time and I tapped a tweet to look at a still image, touch crashed simultaneously as the tweet was brought up.

20 minutes had elapsed since the last event when touch crashed, and I then attached the keyboard.

[Wed Jan 30 15:50:45 2019] usb 1-7: new full-speed USB device number 3 using xhci_hcd
[Wed Jan 30 15:50:45 2019] usb 1-7: New USB device found, idVendor=045e, idProduct=07e8, bcdDevice= 2.07
[Wed Jan 30 15:50:45 2019] usb 1-7: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[Wed Jan 30 15:50:45 2019] usb 1-7: Product: Surface Type Cover
[Wed Jan 30 15:50:45 2019] usb 1-7: Manufacturer: Microsoft
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input19
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover Mouse as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input20
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover Consumer Control as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input21
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input22
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover Touchpad as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input23
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input24
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input25
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input26
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input27
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input28
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input29
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input30
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input31
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input32
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input33
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input34
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input35
[Wed Jan 30 15:50:45 2019] input: Microsoft Surface Type Cover UNKNOWN as /devices/pci0000:00/0000:00:14.0/usb1/1-7/1-7:1.0/0003:045E:07E8.0003/input/input36
[Wed Jan 30 15:50:45 2019] hid-multitouch 0003:045E:07E8.0003: input,hiddev0,hidraw1: USB HID v1.11 Keyboard [Microsoft Surface Type Cover] on usb-0000:00:14.0-7/input0

Then I run these commands as a script.

xset dpms force off && xset dpms force on

or this also works

sudo rmmod intel_ipts
sudo rmmod mei_me
sudo rmmod mei
sudo modprobe intel_ipts
sudo modprobe mei_me
sudo modprobe mei

and the next events are:

[Wed Jan 30 15:52:18 2019] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: 0x80000004 failed status = 14
[Wed Jan 30 15:52:19 2019] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[Wed Jan 30 15:52:19 2019] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: touch enabled 4

To be clear, I ran dmesg once the keyboard was attached and again after I ran the script, to clearly identify the before and after.

I think dmesg has been provided before and as alluded to it doesn't offer a lot. Perhaps another command I'm not aware of will draw more info? Or perhaps something could be built into the kernel that would produce a more informative log in the event touch crashes?

ghost avatar Jan 30 '19 17:01 ghost

I'll make some tweaks to the ipts module to address this.

jakeday avatar Jan 31 '19 12:01 jakeday

Can you test this with the latest 4.19.19 kernel?

jakeday avatar Feb 04 '19 12:02 jakeday

This issue is still happening for me. And I am not sure if the commit a3a3ed3 changed the situation.

kitakar5525 avatar Feb 05 '19 13:02 kitakar5525

Yes, it is still crashing. It is difficult to say whether it is better or worse.

Some additional info, mentioned before and worth mentioning again: rotating the screen often kicks touch into working. Here is dmesg in that eventuality, where the previous event 15 minutes before is omitted as irrelevant.

[Tue Feb 5 03:34:42 2019] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: 0x80000004 failed status = 14
[Tue Feb 5 03:34:42 2019] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
[Tue Feb 5 03:34:42 2019] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: touch enabled 4

These events are identical to the events in dmesg occurring once the commands written into the script detailed in my previous post are run.

Let me know if anything more from the verbose dmesg is needed.

ghost avatar Feb 05 '19 17:02 ghost

Touch crashing for those who can't reproduce it. Note the flickering prior to crashing. Touch usually crashes for me while simultaneously interpreting I have clicked on something which I definitely haven't. In this case an image is brought up but it could otherwise be a link taking me away from the page I'm on.

https://vimeo.com/315696472

ghost avatar Feb 06 '19 17:02 ghost

Can you give this a try with 4.19.20-1? I'm curious if cleaning up some things helped with this.

jakeday avatar Feb 08 '19 12:02 jakeday

Used just under 1 1/2 hrs. By comparison, less flickering. Still crashed. I'll need to keep using it to judge how much better it works on that other site that regularly crashed touch.

ghost avatar Feb 08 '19 14:02 ghost

Setting the touch mode to "single touch" twice puts touch input back into "single touch" mode, and then setting it to "multi touch" once puts touch input back into "multi touch" mode.

$ sudo su -c "echo 0 > /sys/kernel/debug/ipts/mode"
# kern  :err   : [Fri Feb  8 23:39:10 2019] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: 0x80000004 failed status = 14
$ sudo su -c "echo 0 > /sys/kernel/debug/ipts/mode"
# kern  :err   : [Fri Feb  8 23:39:32 2019] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: touch enabled 3
$ sudo su -c "echo 1 > /sys/kernel/debug/ipts/mode"
# kern  :err   : [Fri Feb  8 23:39:51 2019] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: touch enabled 4

If you want to automate it, you need a sleep of about 0.5 seconds between the commands:

sudo su -c "echo 0 > /sys/kernel/debug/ipts/mode"
sleep 0.5
sudo su -c "echo 0 > /sys/kernel/debug/ipts/mode"
sleep 0.5
sudo su -c "echo 1 > /sys/kernel/debug/ipts/mode"
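
The same three-step sequence can be sketched in Python, which makes the delay and the target file easy to vary while experimenting. This is only a sketch under assumptions: the debugfs path is the one used in the commands above, the function name and the `write` hook are hypothetical (the hook exists only so the sequence can be exercised without root), and a real write to the mode file may block.

```python
import time
from pathlib import Path

# Path taken from the shell commands above; writing to it needs root.
IPTS_MODE = Path("/sys/kernel/debug/ipts/mode")


def reset_touch(mode_file=IPTS_MODE, delay=0.5, write=None):
    """Write 0, 0, 1 to the mode file, pausing `delay` seconds between writes.

    `write` is a hypothetical hook for dry runs; by default each value is
    written to `mode_file`, like `echo VALUE > mode_file`.
    """
    if write is None:
        def write(value):
            Path(mode_file).write_text(value + "\n")

    sequence = ["0", "0", "1"]  # single touch twice, then multi touch once
    for index, value in enumerate(sequence):
        if index:
            time.sleep(delay)  # about 0.5 s between writes, as noted above
        write(value)
    return sequence


if __name__ == "__main__":
    reset_touch()
```

Run as root, this should behave like the shell version; passing a different `delay` is a quick way to check how short the pause can be.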

The following two cases do not fix touch input.

$ sudo su -c "echo 1 > /sys/kernel/debug/ipts/mode"
# the command does not return... I interrupt with Ctrl+C
^CInterrupt
# in dmesg, nothing was recorded
$ sudo su -c "echo 0 > /sys/kernel/debug/ipts/mode"
# kern  :err   : [Fri Feb  8 23:45:01 2019] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: 0x80000004 failed status = 14
$ sudo su -c "echo 1 > /sys/kernel/debug/ipts/mode"
# the command does not return... I interrupt with Ctrl+C
^CInterrupt
# in dmesg, nothing was recorded

kitakar5525 avatar Feb 08 '19 15:02 kitakar5525

Issue persists in yesterday's release, probably no better or worse.

ghost avatar Feb 17 '19 19:02 ghost

Same issue on suspend (not always):

[ 6823.436509] IPTS ipts_mei_cl_exit() is called
[ 6823.436624] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: error in reading m2h msg
[ 6823.436710] IPTS removed
[ 6835.143692] IPTS ipts_mei_cl_init() is called
[ 6835.156343] probing Intel Precise Touch & Stylus
[ 6835.156345] IPTS using DMA_BIT_MASK(64)

Ubuntu 18.04, 4.19.23-surface-linux-surface-1, no UEFI.

korakios avatar Feb 27 '19 23:02 korakios

Same with 4.19.27-surface-linux-surface # 2 kernel in Deepin

EDIT: It always happens, no need to suspend; manually unloading and reloading the module makes touch stop working.
EDIT2: When hibernating and resuming, touch works.
EDIT3: Same behavior with 5.0.1-surface-linux-surface # 4

korakios avatar Mar 15 '19 19:03 korakios

@korakios Seems like you're talking about a different issue, see https://github.com/jakeday/linux-surface/issues/270 (it is closed, but maybe it's still happening). This issue is about the case when touchscreen spontaneously stops working and reloading modules/rotating screen fixes it.

tmarkov avatar Mar 21 '19 08:03 tmarkov

Oops, I will post there instead, thank you :)

korakios avatar Mar 21 '19 15:03 korakios

Seeing as this issue is so persistent, would it be possible to have a brief post explaining why it happens? When you refer back to posts describing the first debian installations on the SP4 there is no mention of touch failing. Is it possible this issue has always existed but wasn't documented? I don't understand how the failure happens, what triggers it and what makes it so difficult to fix. It would simply be a little helpful to know more. Googling hasn't helped me understand the IPTS issue.

I use my ubuntu partition with the keyboard always attached because I so regularly have to execute a script to re-start touch. It's not possible to exaggerate how often touch crashes.

How is it possible the issue doesn't occur on all Surface detachables? Do we know which Surface detachables touch doesn't crash on and is it possible to draw any conclusions from that?

Thanks in advance...

ghost avatar May 02 '19 23:05 ghost

why it happens?

From what I know, no one knows yet. If I ask the IPTS to change sensor mode to single touch mode after touch crashed, it replies 0x80000004 failed status = 14, which means TOUCH_SENSOR_QUIESCE_IO_RSP status is TOUCH_STATIS_TIMEOUT (maybe misspelling of TOUCH_STATUS_TIMEOUT).

https://github.com/jakeday/linux-surface/blob/2b206b56e125c9fda4661dbcc91095e9bf993d28/patches/5.0/0005-ipts.patch#L3516
https://github.com/jakeday/linux-surface/blob/2b206b56e125c9fda4661dbcc91095e9bf993d28/patches/5.0/0005-ipts.patch#L3576
https://github.com/jakeday/linux-surface/blob/2b206b56e125c9fda4661dbcc91095e9bf993d28/patches/5.0/0005-ipts.patch#L3832-L3846

I don't know what Quiesce I/O flow means yet.
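
To make that decoding repeatable on saved dmesg output, here is a small Python sketch. It is an illustration, not part of the driver: the function and table names are hypothetical, and the two lookup tables contain only the command code and status value named in this comment, so anything else is returned as a raw number.

```python
import re

# Names as given in this thread; only the two values discussed here are listed.
COMMANDS = {0x80000004: "TOUCH_SENSOR_QUIESCE_IO_RSP"}
STATUSES = {14: "TOUCH_STATIS_TIMEOUT"}

# Matches lines such as:
#   ipts mei::<uuid>:0F: 0x80000004 failed status = 14
FAILED = re.compile(r"ipts .*: (0x[0-9a-fA-F]+) failed status = (\d+)")


def decode_failure(line):
    """Return (command, status) for an IPTS failure line, or None if no match."""
    match = FAILED.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    status = int(match.group(2))
    # Unknown values fall back to their numeric form instead of a guessed name.
    return (COMMANDS.get(code, hex(code)), STATUSES.get(status, str(status)))
```

Feeding it each line of `dmesg` output and keeping the non-None results gives a short list of failed commands to compare between reports.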

kitakar5525 avatar May 30 '19 13:05 kitakar5525

Do we know which Surface detachables touch doesn't crash on

At least, I've heard this issue from SB1 and SP4 owners only (both are Skylake generation).

I want to hear from the owners of newer devices like SB2, SL, SP5 or later (or even non-surface devices which also use IPTS like HP Spectre x2), whether this issue happens/happened on your device or not...

kitakar5525 avatar May 30 '19 13:05 kitakar5525

@kitakar5525 That sounds like something that could be leveraged for a workaround, like a script that automatically detects touch has crashed and restarts it.

It's also strange that it doesn't always happen on SB1/SP4 devices. I feel like when I originally installed Linux on my SB it wasn't happening. But I had other issues and so switched to Windows. Eventually I went back to Linux; now it was happening. Then my device broke, I exchanged it, and now the issue is not happening.

tmarkov avatar May 30 '19 15:05 tmarkov

@kitakar5525 I'm not using touch/stylus that extensively on Linux, but I haven't had this happen on my SB2. Could be due to a difference in CPU/iGPU models?

qzed avatar May 30 '19 17:05 qzed

@tmarkov

a script that automatically detects touch has crashed

The IPTS behaves as if it is working well until ipts_send_sensor_quiesce_io_cmd() is sent (by turning the display off or changing the sensor mode manually from debugfs). So, from the IPTS side, I think it is difficult to detect the crash automatically… We have to find another way to detect the crash.

@qzed

Could be due to a difference in CPU/iGPU models?

Before, sebanc pointed out

https://github.com/jakeday/linux-surface/issues/31#issuecomment-366778166

priority mechanism which seems not supported on Skylake guc

I'm not sure if this is true, but the patch actually reduced the frequency of crashing (https://github.com/jakeday/linux-surface/issues/374#issuecomment-458886999)

kitakar5525 avatar May 31 '19 14:05 kitakar5525

This is good. And so, I want to focus on my SP4 and look at how it responds to different things, but it's difficult because there aren't any debugging options available focused specifically on IPTS, are there? There is no way for me to monitor live statistics coming from IPTS that would show some sort of threshold breach that causes touch to crash? I want to try to look at this myself if possible because I understand it isn't a priority for every SP/SB. Can I tweak things in a methodical way myself to rule out things that aren't part of the problem?

Is there anything to be learnt from Windows and how it handles IPTS? I've heard Skylake is troublesome but if I never have issues on Windows it must be possible to port over in such a way that it doesn't get maxed out in some way and crash in Ubuntu? Thanks again.

ghost avatar May 31 '19 15:05 ghost

There is no way for me to monitor live statistics coming from IPTS that would show some sort of threshold breach that causes touch to crash?

I know evemu.

list available devices:

sudo evemu-describe

For me, ipts 1B96:005E Touchscreen is the finger touch input device. Watch the device:

# for example. `13` may vary.
sudo evemu-record /dev/input/event13

...and I realized that when touch crashes, the last output is ~~always~~ almost always like this:

E: 2.755715 0000 0000 0000    # ------------ SYN_REPORT (0) ---------- +0ms
E: 2.856273 0003 0039 -001    # EV_ABS / ABS_MT_TRACKING_ID   -1
E: 2.856273 0001 014a 0000    # EV_KEY / BTN_TOUCH            0
E: 2.856273 0000 0000 0000    # ------------ SYN_REPORT (0) ---------- +101ms

whereas if I release my finger while touch is still alive, it looks like this:

E: 143.244776 0000 0000 0000    # ------------ SYN_REPORT (0) ---------- +7ms
E: 143.257205 0003 0039 -001    # EV_ABS / ABS_MT_TRACKING_ID   -1
E: 143.257205 0001 014a 0000    # EV_KEY / BTN_TOUCH            0
E: 143.257205 0004 0005 780000    # EV_MSC / MSC_TIMESTAMP        780000
E: 143.257205 0000 0000 0000    # ------------ SYN_REPORT (0) ---------- +13ms

When finger touch crashed:

  • The last SYN_REPORT (0) takes 100ms+
  • There is no # EV_MSC / MSC_TIMESTAMP
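
These two observations can be turned into a rough check on a saved evemu-record log. The Python sketch below is a heuristic under assumptions, not a reliable detector: the function names are hypothetical, the 100 ms threshold comes from the listing above, and the pattern was only "almost always" present, so false positives and negatives are possible.

```python
EV_SYN, SYN_REPORT = 0x00, 0x00
EV_MSC, MSC_TIMESTAMP = 0x04, 0x05


def parse_events(text):
    """Return (time, type, code) tuples for every 'E:' line of evemu output."""
    events = []
    for line in text.splitlines():
        # Drop the trailing '# ...' comment, then split the remaining fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) == 5 and fields[0] == "E:":
            events.append((float(fields[1]), int(fields[2], 16), int(fields[3], 16)))
    return events


def last_frame_looks_crashed(text, min_gap=0.1):
    """True if the final frame took >= min_gap seconds and has no MSC_TIMESTAMP."""
    events = parse_events(text)
    syn = [i for i, (_, etype, code) in enumerate(events)
           if (etype, code) == (EV_SYN, SYN_REPORT)]
    if len(syn) < 2:
        return False  # not enough frames to measure a gap
    prev, last = syn[-2], syn[-1]
    gap = events[last][0] - events[prev][0]
    frame = events[prev + 1:last]
    has_timestamp = any((etype, code) == (EV_MSC, MSC_TIMESTAMP)
                        for _, etype, code in frame)
    return gap >= min_gap and not has_timestamp
```

Saving the output of `sudo evemu-record /dev/input/event13` to a file after touch stops responding and passing its contents to `last_frame_looks_crashed` is one way to compare logs from different devices.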

kitakar5525 avatar May 31 '19 15:05 kitakar5525

@qzed @kitakar5525

I don't have that problem. Here's my /proc/cpuinfo (on SB1). Perhaps we can compare with someone who has the problem to see if there's any difference?

cpuinfo.txt

tmarkov avatar May 31 '19 16:05 tmarkov

As for the script, I was thinking maybe the touch status can be exposed to debugfs. I can't find how it's tracked though.

tmarkov avatar May 31 '19 19:05 tmarkov

Just looking around, I see Microsoft have touchscreen tests including buffering tests & reporting rate tests. This feeling that heavy sites cause touch to crash, could this be related in particular to how touch buffers? The idea being that it can't buffer enough then crashes? Just like filling up your RAM?

Is there a way to monitor or test this in this kernel do you think? Or how is buffering managed in these patches? Or am I barking up the wrong tree? If you could compare touchscreen test results with your Windows partition it might help direct us to a solution.

That evemu-tools utility is interesting.

Separately, I thought I'd post these here because they contain info on touch and event codes that you are probably familiar with but are interesting nonetheless.

ghost avatar May 31 '19 23:05 ghost

As an aside, do any of you get invited to download an html link rather than open it in your browser? I think this is related to touch issues rather than a separate issue with the kernel, and it is a lower priority for me than touch crashing, but I'm interested to know if you all experience the same thing. Often, around the same point that touch is frequently crashing, tapping on a web link is met by a relatively long pause, then a window offering to download the html link appears instead of the link opening and leaving the current page. It is the same window offered when you long press on a web link and tap "Save link as..."

Often once this begins, it doesn't stop and I need to use the touch pad on my keyboard to open links until I restart.

ghost avatar May 31 '19 23:05 ghost

@kitakar5525, @condemnedmeat

The IPTS behaves as if it is working well until ipts_send_sensor_quiesce_io_cmd() is sent (by turning the display off/change sensor mode manually from debugfs).

This could be a good lead to follow up on.

Also I don't think you'll get much debugging info from the IPTS driver without modifying it (e.g. putting print statements everywhere). HID (e.g. via hid-recorder) is probably the closest you can get to raw data, as the IPTS devices directly send HID data to the kernel. evemu shows the output after it has already been processed by the kernel.

qzed avatar Jun 01 '19 18:06 qzed

@qzed

e.g. putting print statements everywhere

This wouldn't be appropriate?

I've been using hid-replay. It is showing touch misbehaving. I'm almost of the opinion running hid-record stops touch crashing because it seems to do everything but, so rather strangely I can't tell you what happens in that instance. At the moment anyway.

I've attached an image which explains to the extent I understand these things what hid-record shows. Other numbers not referred to describe finger movements in directions other than just horizontal and vertical.

git

What happened today was that hid-record interpreted multiple phantom touches on the screen. (I.e., each of the ten fingers (except the first) is 20 pairs, and a number of these 20 pairs were being interpreted by hid-record. The number of fingers counted at the bottom of that image remained at one unless I touched the screen.) These phantom touches were stuck on. During this it caused recognisable issues with touch, but touch didn't crash. Instead I was able to scroll a web page but not, for example, minimise the browser; it wasn't recognised. The first time it happened, these phantom touches became unstuck by themselves minutes after they started. I rotated the screen from portrait to landscape to see what was output (nothing), and when I tried to rotate back to portrait it wouldn't go. Rotation had crashed at this point. I persevered and, shortly after, hid-record interpreted multiple phantom touches again. This time when I touched the screen, instead of staying at 01, registering one finger, it went to 02. (I.e., the first time around, while phantom touches were stuck on, touching the screen with one finger meant the listed number of fingers stayed at one. This time, touching the screen with one finger meant the touch count went to 02.)

While this isn't the main focus it perhaps shows hid-replay might be helpful if we can see what happens when touch crashes.

Edit 0706 It turns out the phantom touches were also present in Windows and I ran the hot-fix described later to get rid of them.

ghost avatar Jun 03 '19 22:06 ghost

....and running xset dpms force off && xset dpms force on stops the phantom touches

ghost avatar Jun 03 '19 22:06 ghost

Here you go, hid-record interpreting touch crashing....

Screenshot from 2019-06-04 12-55-38

And so following the explanation of what the different information presented means in the previously posted image you can see that I've only used one finger since I began hid-record, I haven't pinched-to-zoom etc. At 182 seconds my finger is not touching the screen, this is verified because it says "e4". I touch the screen at 192 seconds as is verified by it changing to "e7" and immediately touch crashed. I couldn't do anything with the screen, it wasn't recognised. When I run the command to reset touch the screen goes black then comes back on as is normal when running that command. You can see the next two entries then appear at 433.22 & 433.23 seconds. I did not touch the screen to create these entries. You can see the first entry says I'm touching the screen and the final entry says I'm not touching the screen. Touch now functions as expected.

One interesting thing to observe as a result of using hid-record would be when the records at 433.22 and 433.23 seconds are produced in relation to that command. Because the screen goes black I don't believe I will be able to see that but it might be seen if the Surface was duplicated to another screen.

Another interesting observation would be what happens in hid-record if I use the alternative commands to restart touch:

sudo rmmod intel_ipts
sudo rmmod mei_me
sudo rmmod mei
sudo modprobe intel_ipts
sudo modprobe mei_me
sudo modprobe mei

If I have the time to test the former I will add any useful information here.

What does this all tell us about the IPTS patches in relation to touch crashing? I only looked into hid-replay as a result of the response to my prompt for things we could try, are there any other resources in linux we have recourse to worth trying to add to our collective knowledge?

ghost avatar Jun 04 '19 13:06 ghost

@condemnedmeat

This wouldn't be appropriate?

I didn't want to criticize the method, just wanted to say that you'd have to re-compile the kernel every time you want to add/remove a print statement. It's always a good debugging technique, as long as you place them at the right points. Unfortunately figuring that out takes at least a couple of tries (and it's also not nice for normal users to get spammed with debug messages).

As a side-note: I found the hid-tools directly from the repo a bit more friendly as they also show some comments on the data, so you don't have to figure all out by yourself. They're python scripts, so easy to run without needing to install them.

To go further on this though, I think you'd need to check how the IPTS driver sends the HID messages. As far as I know (I might be wrong) there is a direct mapping of the MEI communication to HID (as in MEI actually sending HID data). I really know too little to even speculate, but could the phantom touches maybe result from missing or truncated MEI/HID messages? I think if the phantom touches can't be caused by missing or somehow malformed (i.e. truncated) HID messages and the HID messages are a 1:1 mapping from MEI messages, you'd have to consider the phantom touches to come from faulty hardware. Also note that they don't necessarily have to have anything in common with the crashes.

qzed avatar Jun 04 '19 22:06 qzed

Thanks very much for this @qzed. I'll have a look at all that.

you'd have to consider the phantom touches to come from faulty hardware

Most definitely, we need others with the same touch issue to use the hid tool of their choice and see if it records touch crashing in the same way. As dmesg shows nothing it will be the only thing we all have in common.

Overall I'm not sure how to interpret touch crashing now. According to hid-record, one conclusion that could be drawn is that it is related to phantom touches, because it "crashes" while recording an open-ended touch bracket. The bracket (e7 at one end and e4 at the other) doesn't close until I run those commands. Then these other two events appear and close the bracket. This doesn't match the number of fingers in contact with the screen listed at the bottom of the events within that bracket, which makes it look less like something to do with phantom touches and more to do with mismatched IPTS information. Tell me if you think my logic is wrong here. After all, I believe many people are affected by this issue, and it's hard to imagine we all have faulty hardware.

Also note that they don't necessarily have to have anything in common with the crashes.

Correct, I suppose we needed to find a thread to pull and then keep pulling it. At least hid-replay shows something.

I've only been using hid-replay for a short while so there is more to be learnt.

On Windows, I had phantom touches but there was a hot-fix (no longer available to download). Apparently some surface screens were made by Sony, their devices also experienced these phantom touches or dead spots and a fix is still available from their site. I haven't had any issues whatsoever on the Windows side since I applied the hot-fix. The phantom touches on the windows side were very obvious, on the linux side I wouldn't have described anything I've seen since I installed the kernel as a phantom touch. Only using hid-replay drew that out as a possibility. How does this lack-of-problems on the Windows side square with the faulty hardware outcome?

ghost avatar Jun 05 '19 01:06 ghost

If I followed breadcrumb trails correctly, the hotfix should be this.

mirh avatar Jun 05 '19 14:06 mirh

@condemnedmeat Right. At the moment I don't think the crashes are faulty hardware. The phantom touches might be, but I doubt the crashes are.

On the Microsoft page of the hot-fix, the phantom touches are described as a calibration issue. Further

This tool saves calibration information to the touch firmware in the Surface device. It does not change any settings in Windows. Therefore, you can reimage the device or reinstall Windows after you run the tool without having to run the tool again.

So if it's the same issue, you should be able to run the fix on Windows and if it's the same problem, the issues on Linux should go away.

qzed avatar Jun 05 '19 17:06 qzed

Thanks for following up on this @mirh

@qzed,

So if it's the same issue, you should be able to run the fix on Windows and if it's the same problem, the issues on Linux should go away.

This is right, in fact I solved the issue of phantom touches in the JakeDay kernel for another user

https://github.com/jakeday/linux-surface/issues/244#issuecomment-420960064

And I remarked

I note it says it fixes firmware outside of Windows.

I mentioned Windows phantom touches for the purpose of bringing information together in this thread. We should not be pursuing the idea this is a hardware issue currently in my view. As I said earlier today it looks like phantom touches and touch crashing have things in common with one another. You haven't said why you think the phantom touches might be hardware while the crashing isn't. Telling us why will give us more information to run with if you offer enough detail.

Hid-replay continues to be a way to gather evidence along with the new avenue of pursuit "To go further on this though, I think you'd need to check how the IPTS driver sends the HID messages."

My view is this issue is a mole hunt and it should be pursued by all stakeholders using powers of deduction until the cause is known.

I have posted a feature request as a result of difficulties with this issue. It surely won't be the last time an issue remains persistent despite kernel adjustment over and over again.

ghost avatar Jun 05 '19 18:06 ghost

You haven't said why you think the phantom touches might be hardware while the crashing isn't.

I thought I did (although it's a bit short):

As far as I know (I might be wrong) there is a direct mapping of the MEI communication to HID (as in MEI actually sending HID data). I really know too little to even speculate, but could the phantom touches maybe result from missing or truncated MEI/HID messages? I think if the phantom touches can't be caused by missing or somehow malformed (i.e. truncated) HID messages and the HID messages are a 1:1 mapping from MEI messages, you'd have to consider the phantom touches to come from faulty hardware.

Let me rephrase/elaborate that a bit: As far as I know, the MEI messages contain HID data which is not modified in the driver, and the MEI messages come directly from hardware. With the phantom touches, HID data is still being sent, so as long as this is true, there are four possibilities: misconfigured firmware, faulty firmware, faulty hardware, or some buffer overflow happens and the data gets truncated. If the issue can't happen due to truncated/malformed messages, only three remain. Also I think with truncated messages, the behavior would be a bit different (I think the HID data would not be as consistent). In short: I think it looks like the HID messages indicating phantom touches are coming directly from the hardware/controller firmware. Again, this is all as far as I know and some speculation. You'd have to verify that to be sure.

The crash on the other hand stops the flow of HID messages completely.

My view is this issue is a mole hunt and it should be pursued by all stakeholders using powers of deduction until the cause is known.

True. As already said, I think the way forward is to check the MEI messages (and verify that they're indeed directly mapped to HID and not modified). Much of that can be done by looking at the patches, but to get the message content you'd probably need to add some print statements to the IPTS driver and recompile the kernel. That's not something your feature-request is going to fix: As far as I know, the kernel already has run-time debugging enabled, so you can, e.g. by running dmesg -n 8, enable all debug-prints. Likely that's not going to be enough though and you'll have to add your own at the places that make most sense for this issue, e.g. by dumping some message-buffers etc. Those kinds of things are very issue-specific, so just randomly adding a few here and there and uploading a "debug-kernel" isn't going to help.

I think you should start at the ipts_handle_hid_data function and walk up the call-tree. This seems to directly forward the HID data via hid_input_report to the corresponding HID device instance.

qzed avatar Jun 05 '19 19:06 qzed

This is great @qzed thank you again. I will attempt to look at the IPTS driver.

Before that I have more questions. On phantom touches...

If the issue can't happen due to truncated/malformed messages, only three remain.

Why can they not happen due to truncated/malformed messages?

Also I think with truncated messages, the behavior would be a bit different (I think the HID data would not be as consistent)

What do you mean by consistent? What do you see in the information provided that makes you think phantom touches reported by hid-record are consistent?

On touch crashing...

The crash on the other hand stops the flow of HID messages completely.

Temporarily, we get one touch event then touch "crashes" with hid-record saying a finger is in contact with the screen when it isn't. When the command to fix crashed touch is run you get additional events and the touch bracket is closed. How do you explain these additional events, illustrated here by two additional events at 433 seconds (though I have found it can be many more events than that)? What does that command do that reverses touch crashing, and why do these events suddenly appear? To a layman it looks like the command "unblocks" touch which would explain to me why you get given these additional events despite not touching the screen. In that case you could describe them as queued.

ghost avatar Jun 05 '19 20:06 ghost

@condemnedmeat What I meant by consistent is that the HID data seems to be (at least from what you've described) syntactically valid and really shows a finger the way it is represented. If the message would be truncated (and directly forwarded), I think the HID data would likely not be syntactically correct or at least contain some weird values. Although, come to think about it, if the buffer is being re-used, it could also contain old values from earlier messages. If only HID messages are sent via that buffer, it also might be syntactically correct, depending on where the message is truncated. All depends on the data previously sent over the buffer.

Another possibility for the phantom touches could be something like a missing "finger lifted" message, but in that case you should have the same finger positions as you've had some time before.

That all can at least explain your first phantom touches. What's very interesting here though is something you've noted:

This time when I touched the screen, instead of staying at 01, registering one finger, it went to 02 instead.

Since the number of fingers is at the end, this can't be a truncated message, so it'd have to be stuck somewhere in the device/firmware (again only if HID messages are not modified by the IPTS driver).

To a layman it looks like the command "unblocks" touch which would explain to me why you get given these additional events despite not touching the screen. In that case you could describe them as queued.

Right, the command seems to reset some state, so what could happen is that the messages are already queued on the device, probably shortly before the crash, but the driver has not yet collected them. When the driver state is re-initialized (e.g. by re-loading the modules) it sees and collects the old messages. This could fit to the priority-issue @kitakar5525 mentioned. However the messages could also be the result of some internal state check, e.g. after a reset the device sees that it hasn't yet sent the "no more finger touching the screen" messages although there are no more fingers touching the screen. This also doesn't have to be a hard-reset, could also be something like "disable events for device X" and "enable events for device X" commands.

qzed avatar Jun 05 '19 20:06 qzed

I see. Do you think a truncated event in hid-record means the end of the list of number pairs in a single event (for clarity, an event record to me is the single block of numbers pictured here) is missing or wrong? Or could it mean errors earlier in the record? When I use hid-record it often contains lots of what I would describe as errors. Because I don't know what matters and what doesn't, I didn't mention them. Typically these errors don't occur early on after starting the OS but later. I can't show you them yet, but as an example, if you focus on the block of numbers reserved for a fifth finger touching the screen, despite having only used one or two fingers since booting up, hid-record will occasionally show the odd incomplete piece of information describing that finger. This sort of thing can pop up in any of the blocks reserved for fingers in an event record. I would say overall that the information being displayed by hid-record gets progressively messier the longer I've been booted up. Do you think on a device without these touch issues information being sent to hid-record would remain tidier?

Just for information the time between each event being sent to hid-record is approximately 0.009s. A quick tap on the screen makes a touch bracket of about four events. These little errors are difficult to look out for because of the volume of events for example, in a single swipe.

ghost avatar Jun 05 '19 22:06 ghost

When I get round to it I'll post a video of phantom touches so you can see for yourself what they look like.

@qzed

hidrecerrors

Here is what I was thinking are errors in hid-record. In these events I have one finger touching the screen which is being displayed by hid-record in the block reserved for a second finger. The first block is not in use but instead of saying "e4" it says "e7" which means it is interpreting a touch. This despite at the bottom of the event accurately saying I have one finger touching the screen. There is invalid data being shown in block three-eight & ten.

There were no issues with touch while hid-record was displaying these errors, no phantom touches no touch crashes.

ghost avatar Jun 05 '19 23:06 ghost

Do you think on a device without these touch issues information being sent to hid-record would remain tidier?

Probably, yeah. Although that's at this point only a guess. I'll try to upload a log of some touchscreen events from my device later, so you can compare.

qzed avatar Jun 06 '19 11:06 qzed

Looking at code or compiling kernels isn't something I've done before but I've modified the patch and got some new information and I just need some help understanding it.

This is the information:

[ +0.000000] >> tdt : fw status : A280505D 00000000 00000000 00000000 00000000 00000000
[ +0.000001] >> == DB s:1, c:0 ==
[ +0.000001] >> == WQ h:0, t:0 ==

it comes from:

https://github.com/jakeday/linux-surface/blob/a4a9b7ca2021b5b6948245ac69a4397964b7bb49/patches/5.1/0005-ipts.patch#L1726-L1740

I'm finding it difficult to follow the clues "up the tree" and am trying to understand what the numbers in bold are if someone with certain ability can say.

As to how I modified the patch, it wasn't the first thing I saw but eventually I realised that debugging was commented out from

https://github.com/jakeday/linux-surface/blob/a4a9b7ca2021b5b6948245ac69a4397964b7bb49/patches/5.1/0005-ipts.patch#L5899

&

https://github.com/jakeday/linux-surface/blob/a4a9b7ca2021b5b6948245ac69a4397964b7bb49/patches/5.1/0005-ipts.patch#L5911

So with the modified patch you get more debug printing now lol! Use dmesg -wH.

ghost avatar Jun 10 '19 09:06 ghost

@condemnedmeat Building the kernel for Debian-based distros is fairly easy, everything should be explained in the readme. Just one note on that: instead of the deb-pkg target I'd recommend the bindeb-pkg target. This allows incremental compilation, so only the changes you made after the previous make invocation need to be compiled. Basically you just need to run

make -j <your-number-of-cores-here> bindeb-pkg LOCALVERSION=-linux-surface

every time you want to build it. Also another tip: check out a kernel version (e.g. 5.1.7) from git, apply all linux-surface patches except the IPTS patches, commit, apply the IPTS patches, commit, and then make your changes (directly in the kernel source). This makes it easier to spot the differences or get a modified patch-set for IPTS directly from your commits.

Unfortunately I can't really help you with the numbers, but I'll try to have a better look at it at some point.

qzed avatar Jun 10 '19 23:06 qzed

Heya @qzed Thanks for this! I have already modified the patch and compiled it which has yielded loads more messages in dmesg, I recommend others try it. The issue has always been no useful logs. Had I known I could increase debugging easily by modifying the patch I would have done it months ago. I'm now getting the sort of information all who are collaborating on this repository have been asking for.

I don't think anyone had noticed that a debugging option in line 5911 had been commented out, correct me if I'm wrong. The other debugging option was alluded to but didn't yield anything.

The reason I mention those numbers is because they now appear as a live snapshot in dmesg, sampled every three seconds while they don't in the regular release. I have an idea what many of them are but I haven't a clue what "A280505D" (AX805XXD) means.

Touch crashes when 505D becomes 547D. I can tell just looking at dmesg -wH that touch has crashed, I don't have to touch the screen to find out.

Remember when you said

Right, the command seems to reset some state

Well the command you were referring to resets 547D back to 505D which re-enables touch.

https://github.com/jakeday/linux-surface/blob/a4a9b7ca2021b5b6948245ac69a4397964b7bb49/patches/5.1/0005-ipts.patch#L1732-L1733

These lines tell you what that number is, but I can't understand what they mean or why touch crashes when it changes from 505D to 547D. I can't offer much more than this so hopefully someone will be able to deduce why this annoying thing has remained unfixable to this point.

ghost avatar Jun 11 '19 00:06 ghost

How about, I don't know, asking this on the linux-input (or intel-gfx?) mailing list? Also perhaps inquiring if anybody in their right minds has even considered mainlining in the last handful of years.

mirh avatar Jun 11 '19 10:06 mirh

If anyone wants to deal with the mailing list, please feel free to go ahead. I doubt that you'll get much of a response from that though, the code has never been proposed for upstreaming. I also haven't seen any official-looking development past https://github.com/ipts-linux-org.

That being said, I think you'll get a better response from either

Those two are the authors from the IPTS org. Coskunses authored the commits in the old repo, Yang the ones in the new repo and the chrome-os version.

Unfortunately I'm a bit busy at the moment (and some other issues currently have precedence for me), so I'd really appreciate if someone else wants to do that.

qzed avatar Jun 11 '19 23:06 qzed

Hi Everyone,

Sorry to hear that touch stops working in the latest kernel. I cannot work on this problem hands-on at the moment, but it would help if you could narrow down the issue.

If it works and stops after a while, could you try to disable DMC and see if the issue goes away? I am not sure exactly how to remove it in the latest kernels but it should be easy to find. It used to be as easy as removing the firmware library but it might have changed.

Other than that: The logs mentioned by @condemnedmeat would help. Sorry it needs some deciphering but it fundamentally tells which HW block stopped working so we can isolate the issue and work from there.

ardacoskunses avatar Jun 12 '19 17:06 ardacoskunses

Thanks!! @ardacoskunses I can see mention of HW & DMC in the graphics stack recipe but I can't see what they are acronyms for, would you mind telling us?


Sorry it needs some deciphering but it fundamentally tells which HW block stopped working so we can isolate the issue and work from there.

That's it, if someone with this issue can understand C they should be able to understand what those numbers are defined as in the patch. I am looking at it myself but it's like translating a language I don't understand.


In the interim, if you come to this thread wondering if you have the same issue, you should swap the 0005-ipts.patch for this one, which will "start debug thread" on booting once compiled.

My experience with dmesg once the debug thread has begun is...

When touch is functioning you see fw status : AX805X5D 00XX00000 XXXXXXXX 00000000 00000000 00000000

Where X in bold can be a defined, narrow range of numbers and those not in bold any number or a letter up to f. Those in italic represent a kind of touch count.

When touch is not functioning effectively you see fw status : AX80547D 00XX00000 XXXXXXXX XXXX 000x 0000000x 00000000

Where the numbers in bold are frozen. The numbers in italic represent a kind of touch count that changes when you touch the screen in spite of the touch screen not working. Occasionally the touch count can climb relentlessly without you touching the screen. "x" represents a "sensor reset" count.
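The two patterns above can be turned into a small checker for a debug-thread line. This is only a sketch: it assumes the masks AX805X5D (working) and AX80547D (crashed) seen here hold more generally, which has not been confirmed on other devices, and the function name classify_fw_status is made up for illustration.

```python
import re

# Pulls the first 8-digit register out of a "fw status" line from dmesg.
FIRST_REGISTER_RE = re.compile(r"fw status\s*:\s*([0-9A-Fa-f]{8})\b")

# Masks taken from the observations above; X is any hex digit.
# Assumption: these only describe what has been seen so far in this thread.
CRASHED_RE = re.compile(r"A[0-9A-F]80547D")
WORKING_RE = re.compile(r"A[0-9A-F]805[0-9A-F]5D")


def classify_fw_status(line):
    """Return 'working', 'crashed' or 'unknown' for one debug-thread line."""
    match = FIRST_REGISTER_RE.search(line)
    if match is None:
        return "unknown"
    register = match.group(1).upper()
    if CRASHED_RE.fullmatch(register):
        return "crashed"
    if WORKING_RE.fullmatch(register):
        return "working"
    return "unknown"
```

Feeding it the lines from dmesg -wH one at a time would flag the moment 505D turns into 547D without having to touch the screen.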

Running xset dpms force off resets the fw status back to a functioning state.

You should see if your experience is the same or you may have a different problem.

Additionally, see this & this.

This issue probably shouldn't be closed even if discussion dries up for the time being.

ghost avatar Jun 13 '19 01:06 ghost

I am not following the thread. I guess since my name was mentioned, GitHub did me the courtesy of sending an email to let me know; otherwise I would have replied earlier :)

A couple more points: it has been more than 2 years since I worked on this and I am not with Intel anymore. I just feel responsible for helping the community get this working.

DMC: is graphics power controller, running as a separate HW block. https://01.org/linuxgraphics/downloads/firmware

That's it, if someone with this issue can understand C they should be able to understand what those numbers are defined as in the patch. I am looking at it myself but it's like translating a language I don't understand.

IPTS is two HW blocks, GuC and ME, talking to each other on their own terms over memory via DMA. Both of them have their own FW, so the CPU is not aware. The debug thread pulls some information from both HW blocks at certain intervals to see what is going on. So unfortunately understanding and making sense of it requires some insider information (or some sharp hacking skills... ). I wish I were able to share more information, but we simply never had the liberty beyond making the code publicly accessible.

Back to the problem:

1- Touch is working for sometime and crash later on right? could you confirm?

2 - any particular pattern like after sleep resume

3- Could you try disable DMC and try?

4-

Running xset dpms force off resets the fw status back to a functioning state.

Can you elaborate this, after this everything is working?

5- Can you share:

[ +0.000000] >> tdt : fw status : A280505D 00000000 00000000 00000000 00000000 00000000
[ +0.000001] >> == DB s:1, c:0 ==
[ +0.000001] >> == WQ h:0, t:0 ==

Logs, before, during and after.

6- Is there any FW update before this issue either in GuC or ME side?

Note: HID information would only be relevant if we can clear with logs both GuC and ME is working.

ardacoskunses avatar Jun 13 '19 16:06 ardacoskunses

@ardacoskunses Well you're being very generous with your time! I won't waste it. Thank you for the additional information.

1- Touch is working for sometime and crash later on right? could you confirm?

Yes, works from boot then "crashes" at some point. Sometimes very quickly, sometimes after a long time. At which point the device works as normal but the screen is apparently unresponsive to touch. Until that command is run and touch works again. It will crash again, sometimes very soon after reset, sometimes after a longer delay. From an earlier post:

My experience continues to be it most obviously crashes as a result of visiting a site that will load new information as you scroll down a page ad infinitum. For example, scrolling down someone's twitter feed at reading speed. Another example is if I visit a site that has a "long" home page packed with links, pictures and short video clips. An example that comes to mind is dailymail.co.uk

At some point touch becomes unrecoverable despite running that command and a restart is required - I can't yet show you this in a log.

2 - any particular pattern like after sleep resume

None other than described!

3- Could you try disable DMC and try?

On this, give us some time to look at how this is done, unless it is as easy as running a command or amending a patch. I don't think it's right to continually ask "how do you do that". The only thing concerning dmc I have found in dmesg is

"[drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin (v1.27)"

6- Is there any FW update before this issue either in GuC or ME side?

Could you mean outside of the OS? Clarification of that aside, the issue can be traced back to at least May last year.


Logs

This is an interesting one because, despite touch "crashing", when I touched the screen numbers changed in the fourth column and when I was not touching it numbers remained at their last state, which led me to believe the fw was detecting touches. Touch crashing and running the command to reset it is highlighted by *. This is a less common eventuality.

In this one, not the entire log, the numbers in the fourth column relentlessly climb once touch has crashed whether I touch the screen or not. I cannot tell yet if they increase quicker as a result of a touch. This is the most common outcome when touch crashes.

In this one, not the entire log, touch crashes while I'm not touching the screen. This is a much less common eventuality.

This, not the entire log, is what happens when you run the command and get a sensor reset. Log begins after command has been run. This eventuality is rare.

In all cases I am most likely just to have been browsing or doing something "light"

Thank you very much for taking a look, just do what you can with these.

ghost avatar Jun 13 '19 19:06 ghost

@condemnedmeat thanks for prompt reply. I could not read and digest whole message yet but quick suggestion. removing DMC should be as easy as removing this file from the filesystem.

"[drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin (v1.27)"

Once you no longer see this message, it means DMC is not loaded.

I will read rest of the message later today.

ardacoskunses avatar Jun 13 '19 22:06 ardacoskunses

@condemnedmeat I've had a chance to go through your logs.

The first step remains the same, "disable DMC". Actually, until this issue is resolved we should keep it disabled: one less moving part.

In the logs I can see touch stops but I cannot see an attempt at recovery. Could you add a log into the i915_guc_ipts_reacquire_doorbell method? When the doorbell control remains the same, the control thread should catch it and call the above method. Let's see if recovery is working. In fact, if you could add some logs along the stack calling this "reacquire" func, it would be great.

6- Is there any FW update before this issue either in GuC or ME side?

Firmwares of GuC, ME and DMC are all OS-agnostic. GuC and DMC firmwares are loaded from the file system during init. Was there any update to those? ME firmware comes along with the BIOS; AFAIK Linux cannot update the BIOS yet but Windows can. So was there any BIOS update coming from a Windows boot?

If this last step is overwhelming, could you simply answer this: if you revert only the kernel, does the crash issue go away?

ardacoskunses avatar Jun 14 '19 04:06 ardacoskunses

Heya @ardacoskunses,

So far I have failed to disable DMC by removing the files from /lib/firmware/i915, I assume it's baked into the kernel. I need to test that. I removed the HUC file from the same folder but it remained enabled. Edit unlink doesn't work either.

Looking at the patch in this repository, there is a log message already in line 399: https://github.com/jakeday/linux-surface/blob/a4a9b7ca2021b5b6948245ac69a4397964b7bb49/patches/4.19/0005-ipts.patch#L392-L399 It would be better to have others as well? You don't need to be more specific, I just wonder if this log should be showing up if recovery isn't working.

I'll work on DMC. Then revert to an earlier kernel, perhaps starting with one released beginning 2018 as this was the last 1/4 the firmware was being released I think.

I'll look at the bios updates last.

ghost avatar Jun 14 '19 06:06 ghost

The existing log only prints the error condition. We don't know if this function is ever called, whether the thread which is supposed to call it has been spun up, or how frequently it runs.

ardacoskunses avatar Jun 14 '19 13:06 ardacoskunses

I can't follow all the comments right now, but for disabling DMC, you can disable it by passing a kernel parameter like this: i915.dmc_firmware_path=/dev/null

$ dmesg -xH | grep -i DMC
kern  :notice: [  +0.000000] Setting dangerous option dmc_firmware_path - tainting kernel
kern  :notice: [  +0.000001] i915 0000:00:02.0: Failed to load DMC firmware /dev/null. Disabling runtime power management.
kern  :notice: [  +0.000001] i915 0000:00:02.0: DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915

$ sudo cat /sys/kernel/debug/dri/0/i915_dmc_info
fw loaded: no
path: /dev/null
program base: 0xbffb3fef
ssp base: 0x00000000
htp: 0x00000000

However, the issue still persists.

kitakar5525 avatar Jun 14 '19 15:06 kitakar5525

A script for fixing the crash automatically

Sorry to interrupt the conversation for finding the root cause, but I can confirm that when the touch crashed, the fw status contains 547D. For example: >> tdt : fw status : A280547D 00200000 4A474A41 214B0000 00000000 00000000. Thank you @condemnedmeat for finding it! So, we may ~~workaround~~ fix the crash by known ways[1] when it contains 547D.

I personally made a debugfs entry for printing debug information (the one that you can get on dmesg when RUN_DBG_THREAD is uncommented):

$ sudo cat /sys/kernel/debug/ipts/debug
>> tdt : fw status : A280545D 00990000 583E5805 CAFC0000 00000000 00000000
>> == DB s:1, c:356a9 ==
>> == WQ h:2704, t:2704 ==

Then, we can ~~work around~~ detect the crash automatically by reading the fw status somehow and fix the crash by a known way. I personally made a script for that. (I think there is a more effective way… Any suggestions are welcome!)

(Or, we may ~~work around~~ fix the crash inside the kernel module directly in the same way as the script above, but I don't know whether it is the right way.)

[1]:

  • reloading modules (https://github.com/jakeday/linux-surface/issues/374#issuecomment-459023490)
  • display off (xset dpms force off && xset dpms force on)
  • change sensor mode (https://github.com/jakeday/linux-surface/issues/374#issuecomment-461833074)
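
Of the known ways listed, the display off/on cycle is the easiest to script. A minimal sketch in Python, assuming an X session where xset is available; recover_touch and its run parameter are illustrative names only, not taken from any existing script:

```python
import subprocess

# The two xset calls from the "display off" workaround, in order.
RECOVERY_COMMANDS = (
    ["xset", "dpms", "force", "off"],
    ["xset", "dpms", "force", "on"],
)


def recover_touch(run=subprocess.run):
    """Cycle the display off and on; 'run' can be swapped out for testing."""
    for command in RECOVERY_COMMANDS:
        # check=True raises if xset exits with a non-zero status
        run(command, check=True)
```

Passing a different run callable makes it possible to dry-run the sequence without blanking the screen.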

kitakar5525 avatar Jun 19 '19 15:06 kitakar5525

Oh, we can get the fw status also by /sys/class/mei/mei0/fw_status. No need to add the debugfs entry with regard to the fw status.

$ cat /sys/class/mei/mei0/fw_status
A280505D
00070000
EB19EB18
00600000
00000000
00000000
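
For anyone who wants to read this from a script, here is a minimal Python sketch. It assumes the sysfs path shown above exists on your device and that the file holds one hex register per line as in the output; parse_fw_status and read_fw_status are hypothetical helper names.

```python
from pathlib import Path

# Path as shown above; it may differ on other devices.
FW_STATUS_PATH = Path("/sys/class/mei/mei0/fw_status")


def parse_fw_status(text):
    """Split the sysfs text into a list of upper-case hex register strings."""
    return [word.upper() for word in text.split()]


def read_fw_status(path=FW_STATUS_PATH):
    """Read and parse the fw_status file; needs the sysfs node to exist."""
    return parse_fw_status(Path(path).read_text())
```

The first element of the returned list is the register that changes from 505D to 547D in the logs above.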

kitakar5525 avatar Jun 19 '19 16:06 kitakar5525

@kitakar5525 Awesome, this is great news!!

This fix indicates something is wrong in ME FW. Is there any update in BIOS might be causing this? I will check how to check version numbers.

To run this in the kernel module, you can try adding it within the recovery thread, which is doing a very similar trick for GuC FW. This recovery thread runs at certain intervals, checks the status of the FWs and takes action.

Fix vs. workaround: bugs are mostly either in HW or FW; in the IPTS case we work around them very much like you did. They all get fixed generation after generation, but it takes time.

ardacoskunses avatar Jun 19 '19 17:06 ardacoskunses

Workaround using a standalone script

Updated the script and now it is working standalone, no need to apply a kernel patch to read fw_status anymore.

Workaround using a kernel patch

I also made a kernel patch which uses kthread. I realized that simply sending ipts_send_sensor_clear_mem_window_cmd(ipts) fixes the crash. So, inside the recovery thread, it checks the FW status and, if things go wrong, sends the command to recover the IPTS functionality. (There must be things to improve in the code. Any suggestions are welcome.) Thank you @ardacoskunses for the advice!

One possible problem is...

the output of fw_status could be completely different between devices. I need more information from devices other than SP4/SB1.

  • cat $(find /sys/devices/pci0000:00/0000:00:16.4 -name fw_status)

I used the sixth number (zero-based) of fw_status as an indicator of the touch crash. ('7' is hardcoded into the code.) However, '7' could indicate another status on the other devices (?).
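
In Python terms, the check described above amounts to roughly the following. This is a sketch of the idea, not the code from the patch; the constant and function names are invented here, and the '7' at index 6 is only known from SP4/SB1.

```python
CRASH_DIGIT_INDEX = 6  # zero-based position in the first register
CRASH_DIGIT = "7"      # value seen on SP4/SB1 when touch has crashed


def touch_crashed(first_register):
    """True if the first fw_status register carries the crash digit."""
    if len(first_register) <= CRASH_DIGIT_INDEX:
        return False
    return first_register[CRASH_DIGIT_INDEX] == CRASH_DIGIT
```

With the values seen so far, A280547D is reported as crashed while A280505D and A280545D are not.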


@ardacoskunses

recovery thread, which is doing a very similar trick for GuC FW.

If you know the usage in Linux kernel, let me know where to find it. I want to have a look at that code.

This fix indicates something is wrong in ME FW.

Yes, it seems that something goes wrong in the ME FW when the crash happens.

bugs are mostly either in HW or FW

However, I'm not sure HW or FW is completely to blame. Changing the IPTS GuC client priority to a higher one (1143fca) improved stability significantly for me (but this issue is still occasionally happening).

Is there any update in BIOS might be causing this? I will check how to check version numbers.

If I recall correctly, this issue has been happening since I bought my SB1 (March 2018). So, I'm not sure.

kitakar5525 avatar Jun 20 '19 08:06 kitakar5525

Ah... sorry, my recovery thread uses 100% of one of my CPU cores...

Again,

recovery thread, which is doing a very similar trick for GuC FW.

I want to look at proper recovery thread implementation on Linux or want someone to rewrite it.

EDIT Should I insert some sleep on every loop (?)
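
On the sleep question: a loop that polls without pausing will keep one core busy, which fits the 100% CPU observation. The sketch below shows the shape of a polling loop with a pause, written as user-space Python rather than kernel code. All names are hypothetical, the one-second default is arbitrary, and the crash test (digit '7' at index 6 of the first register) is the SP4/SB1 assumption from earlier in this thread.

```python
import time


def recovery_loop(read_first_register, recover, interval_s=1.0,
                  max_polls=None, sleep=time.sleep):
    """Poll the firmware status and call recover() when the crash digit shows.

    Returns the number of recoveries performed. With max_polls=None the
    loop runs forever, as a background watchdog would.
    """
    recoveries = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        register = read_first_register()
        # Assumption: '7' at zero-based index 6 marks a crash (SP4/SB1 only).
        if len(register) > 6 and register[6] == "7":
            recover()
            recoveries += 1
        polls += 1
        sleep(interval_s)  # yield the CPU between polls instead of spinning
    return recoveries
```

The status reader, the recovery action and the sleep function are passed in so the loop can be exercised without real hardware.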

EDIT2 Updated the kernel patch. I think not everyone wants to enable the recovery thread, so, I added a module parameter to enable the recovery thread. Also, I added a parameter to change the recovery thread loop interval.

Added module parameters

  • enable_recovery_thread (default:false)
  • recovery_sleep_msec (default:1000)

Pass a parameter intel_ipts.enable_recovery_thread=1 to your bootloader to use the recovery thread.

EDIT3 It causes a null pointer dereference after s2idle.

dmesg log
kern  :info  : [  +0.036636] PM: suspend entry (s2idle)
kern  :info  : [  +0.000004] PM: Syncing filesystems ... done.
kern  :debug : [  +0.024482] PM: Preparing system for sleep (s2idle)
kern  :info  : [  +0.001453] Freezing user space processes ... (elapsed 0.002 seconds) done.
kern  :info  : [  +0.002569] OOM killer disabled.
kern  :info  : [  +0.000001] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
kern  :debug : [  +0.001231] PM: Suspending system (s2idle)
kern  :info  : [  +0.000001] printk: Suspending console(s) (use no_console_suspend to debug)
kern  :info  : [  +0.105098] mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]: reason code 3
kern  :info  : [  +0.003261] mwifiex_pcie 0000:03:00.0: None of the WOWLAN triggers enabled
kern  :err   : [  +0.071315] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: error in reading m2h msg
kern  :info  : [  +0.000060] IPTS removed
kern  :debug : [  +0.110468] PM: suspend of devices complete after 187.411 msecs
kern  :debug : [  +0.019512] PM: late suspend of devices complete after 19.499 msecs
kern  :debug : [  +0.002404] PM: suspend-to-idle
kern  :debug : [  +0.037377] PM: noirq suspend of devices complete after 37.185 msecs
kern  :debug : [  +8.049661] PM: Timekeeping suspended for 8.221 seconds
kern  :info  : [  +0.000152] ACPI: \_PR_.PR00: LPI: Device not power manageable
kern  :info  : [  +0.000005] ACPI: \_PR_.PR01: LPI: Device not power manageable
kern  :info  : [  +0.000003] ACPI: \_PR_.PR02: LPI: Device not power manageable
kern  :info  : [  +0.000003] ACPI: \_PR_.PR03: LPI: Device not power manageable
kern  :info  : [  +0.000004] ACPI: \_SB_.PCI0.GFX0: LPI: Device not power manageable
kern  :info  : [  +0.000010] ACPI: \_SB_.PCI0.ISP0: LPI: Device not power manageable
kern  :info  : [  +0.000003] ACPI: \_SB_.PCI0.HECI: LPI: Device not power manageable
kern  :debug : [  +0.082954] PM: noirq resume of devices complete after 82.858 msecs
kern  :debug : [  +0.000142] PM: resume from suspend-to-idle
kern  :info  : [  +0.000090] mwifiex_pcie 0000:03:00.0: event: unknown event id: 0x0
kern  :debug : [  +0.061364] PM: early resume of devices complete after 4.917 msecs
kern  :info  : [  +0.001550] [drm] HuC: Loaded firmware i915/skl_huc_ver01_07_1398.bin (version 1.7)
kern  :warn  : [  +0.002481] ACPI: button: The lid device is not compliant to SW_LID.
kern  :info  : [  +0.000730] [drm] GuC: Loaded firmware i915/skl_guc_ver9_33.bin (version 9.33)
kern  :info  : [  +0.000160] i915 0000:00:02.0: GuC firmware version 9.33
kern  :info  : [  +0.000002] i915 0000:00:02.0: GuC submission enabled
kern  :info  : [  +0.000002] i915 0000:00:02.0: HuC enabled
kern  :debug : [  +0.085659] PM: resume of devices complete after 90.570 msecs
kern  :debug : [  +0.000767] PM: Finishing wakeup.
kern  :info  : [  +0.000003] OOM killer enabled.
kern  :info  : [  +0.000003] Restarting tasks ... 
kern  :info  : [  +0.004694] probing Intel Precise Touch & Stylus
kern  :info  : [  +0.000011] IPTS using DMA_BIT_MASK(64)
kern  :warn  : [  +0.000042] done.
kern  :info  : [  +0.019912] ipts: >> start recovery thread
kern  :info  : [  +0.030165] input: ipts 1B96:005E UNKNOWN as /devices/pci0000:00/0000:00:16.4/mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F/0044:1B96:005E.0004/input/input51
kern  :info  : [  +0.003514] input: ipts 1B96:005E as /devices/pci0000:00/0000:00:16.4/mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F/0044:1B96:005E.0004/input/input53
kern  :info  : [  +0.000664] input: ipts 1B96:005E Touchscreen as /devices/pci0000:00/0000:00:16.4/mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F/0044:1B96:005E.0004/input/input54
kern  :info  : [  +0.000912] input: ipts 1B96:005E Mouse as /devices/pci0000:00/0000:00:16.4/mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F/0044:1B96:005E.0004/input/input55
kern  :info  : [  +0.000899] hid-multitouch 0044:1B96:005E.0004: input,hidraw0: <UNKNOWN> HID v16900.00 Mouse [ipts 1B96:005E] on heci3
kern  :err   : [  +0.032268] ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: touch enabled 4
kern  :info  : [  +0.041576] nvme nvme0: 4/0/0 default/read/poll queues
kern  :alert : [  +0.035580] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
kern  :alert : [  +0.000005] #PF error: [normal kernel read fault]
kern  :info  : [  +0.000001] PGD 0 P4D 0 
kern  :warn  : [  +0.000004] Oops: 0000 [#1] PREEMPT SMP PTI
kern  :warn  : [  +0.000003] CPU: 2 PID: 421 Comm: ipts_recovery_t Tainted: G     U   C OE     5.1.9-arch1-1-surface #1
kern  :warn  : [  +0.000002] Hardware name: Microsoft Corporation Surface Book/Surface Book, BIOS 91.2439.769 12/07/2018
kern  :warn  : [  +0.000006] RIP: 0010:ipts_recovery_thread+0x50/0xf4 [intel_ipts]
kern  :warn  : [  +0.000003] Code: 82 24 f5 e8 f6 11 20 f5 84 c0 0f 85 95 00 00 00 80 3d b7 58 00 00 00 0f 84 88 00 00 00 48 8b 03 c6 44 24 22 00 48 8d 74 24 04 <48> 8b 78 10 48 8b 87 00 06 00 00 48 8b 40 28 e8 50 8f d5 f5 85 c0
kern  :warn  : [  +0.000002] RSP: 0018:ffffa171c2e53ea8 EFLAGS: 00010202
kern  :warn  : [  +0.000002] RAX: 0000000000000000 RBX: ffff9662079f4018 RCX: 0000000000000000
kern  :warn  : [  +0.000002] RDX: 0000000000000000 RSI: ffffa171c2e53eac RDI: 00000000ffffffff
kern  :warn  : [  +0.000001] RBP: ffff96621c9221c0 R08: 0000000000000000 R09: 0000000000000000
kern  :warn  : [  +0.000002] R10: 0000000000000000 R11: ffff96621f320d64 R12: ffffa171c1f97a48
kern  :warn  : [  +0.000002] R13: ffff966216c4bd80 R14: ffff9662079f4018 R15: ffffffffc0ea7ccc
kern  :warn  : [  +0.000002] FS:  0000000000000000(0000) GS:ffff96621f300000(0000) knlGS:0000000000000000
kern  :warn  : [  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern  :warn  : [  +0.000001] CR2: 0000000000000010 CR3: 0000000284a0e002 CR4: 00000000003606e0
kern  :warn  : [  +0.000002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kern  :warn  : [  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kern  :warn  : [  +0.000001] Call Trace:
kern  :warn  : [  +0.000009]  kthread+0x112/0x130
kern  :warn  : [  +0.000003]  ? kthread_park+0x80/0x80
kern  :warn  : [  +0.000004]  ret_from_fork+0x35/0x40
kern  :warn  : [  +0.000004] Modules linked in: nfnetlink_log nfnetlink rfcomm cmac bnep btusb btrtl btbcm btintel bluetooth ecdh_generic overlay input_leds iptable_filter usbhid vmnet(OE) joydev mousedev intel_rapl x86_pkg_temp_thermal mwifiex_pcie intel_powerclamp coretemp kvm_intel msr mwifiex snd_soc_skl kvm hid_sensor_als hid_sensor_gyro_3d hid_sensor_rotation hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio irqbypass snd_soc_hdac_hda hid_multitouch snd_hda_ext_core hid_sensor_hub snd_soc_skl_ipc nls_iso8859_1 nls_cp437 crct10dif_pclmul hid_generic snd_soc_sst_ipc vfat snd_soc_sst_dsp crc32_pclmul snd_soc_acpi_intel_match ghash_clmulni_intel fat snd_soc_acpi snd_hda_codec_hdmi mei_hdcp cfg80211 intel_ipts(OE) squashfs fuse snd_soc_core loop snd_compress snd_hda_codec_realtek ac97_bus snd_hda_codec_generic aesni_intel ledtrig_audio snd_pcm_dmaengine snd_hda_intel aes_x86_64 crypto_simd snd_hda_codec cryptd glue_helper intel_cstate pcspkr
kern  :warn  : [  +0.000035]  intel_uncore intel_rapl_perf snd_hda_core snd_hwdep snd_pcm ipu3_imgu(C) snd_timer snd ipu3_cio2 v4l2_fwnode soundcore videobuf2_dma_sg videobuf2_memops videobuf2_v4l2 videobuf2_common tpm_crb mei_me videodev rfkill i2c_hid idma64 mei hid media intel_xhci_usb_role_switch intel_pch_thermal intel_lpss_pci roles intel_lpss surfacepro3_button tpm_tis tpm_tis_core ac battery soc_button_array evdev tpm mac_hid rng_core pcc_cpufreq vmmon(OE) vmw_vmci vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) sg scsi_mod crypto_user binder_linux(OE) ashmem_linux(OE) acpi_call(OE) ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci crc32c_intel xhci_hcd i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm intel_agp intel_gtt agpgart
kern  :warn  : [  +0.000038] CR2: 0000000000000010
kern  :warn  : [  +0.000003] ---[ end trace 0c75f16f03ee17a2 ]---
kern  :warn  : [  +0.000005] RIP: 0010:ipts_recovery_thread+0x50/0xf4 [intel_ipts]
kern  :warn  : [  +0.000002] Code: 82 24 f5 e8 f6 11 20 f5 84 c0 0f 85 95 00 00 00 80 3d b7 58 00 00 00 0f 84 88 00 00 00 48 8b 03 c6 44 24 22 00 48 8d 74 24 04 <48> 8b 78 10 48 8b 87 00 06 00 00 48 8b 40 28 e8 50 8f d5 f5 85 c0
kern  :warn  : [  +0.000002] RSP: 0018:ffffa171c2e53ea8 EFLAGS: 00010202
kern  :warn  : [  +0.000002] RAX: 0000000000000000 RBX: ffff9662079f4018 RCX: 0000000000000000
kern  :warn  : [  +0.000002] RDX: 0000000000000000 RSI: ffffa171c2e53eac RDI: 00000000ffffffff
kern  :warn  : [  +0.000001] RBP: ffff96621c9221c0 R08: 0000000000000000 R09: 0000000000000000
kern  :warn  : [  +0.000002] R10: 0000000000000000 R11: ffff96621f320d64 R12: ffffa171c1f97a48
kern  :warn  : [  +0.000001] R13: ffff966216c4bd80 R14: ffff9662079f4018 R15: ffffffffc0ea7ccc
kern  :warn  : [  +0.000002] FS:  0000000000000000(0000) GS:ffff96621f300000(0000) knlGS:0000000000000000
kern  :warn  : [  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern  :warn  : [  +0.000002] CR2: 0000000000000010 CR3: 0000000284a0e002 CR4: 00000000003606e0
kern  :warn  : [  +0.000001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kern  :warn  : [  +0.000002] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kern  :info  : [  +0.051942] PM: suspend exit

kitakar5525 avatar Jun 20 '19 09:06 kitakar5525

Well done @kitakar5525! This looks like an interesting direction; it never occurred to me that a script might be a solution.

On fw status, here is more information:

  ipts_debug-420   [002] ....   359.244268: mei_pci_cfg_read: [0000:00:16.4] pci cfg read PCI_CFG_HSF_X:[0x40] = 0xa280505d
  ipts_debug-420   [002] ....   359.244275: mei_pci_cfg_read: [0000:00:16.4] pci cfg read PCI_CFG_HSF_X:[0x48] = 0xe0000
  ipts_debug-420   [002] ....   359.244280: mei_pci_cfg_read: [0000:00:16.4] pci cfg read PCI_CFG_HSF_X:[0x60] = 0x526f526f
  ipts_debug-420   [002] ....   359.244284: mei_pci_cfg_read: [0000:00:16.4] pci cfg read PCI_CFG_HSF_X:[0x64] = 0x0
  ipts_debug-420   [002] ....   359.244288: mei_pci_cfg_read: [0000:00:16.4] pci cfg read PCI_CFG_HSF_X:[0x68] = 0x0
  ipts_debug-420   [002] ....   359.244292: mei_pci_cfg_read: [0000:00:16.4] pci cfg read PCI_CFG_HSF_X:[0x6c] = 0x0

This came from ftrace.

Information on this is present in the kernel [1]

/* Host Firmware Status Registers in PCI Config Space */
#define PCI_CFG_HFS_1      0x40
#  define PCI_CFG_HFS_1_D0I3_MSK     0x80000000
#define PCI_CFG_HFS_2      0x48
#define PCI_CFG_HFS_3      0x60
#define PCI_CFG_HFS_4      0x64
#define PCI_CFG_HFS_5      0x68
#define PCI_CFG_HFS_6      0x6C
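To make the register dump easier to read, here is a small sketch that applies the D0i3 mask from the header quoted above to a PCI_CFG_HFS_1 value and pulls out its low 16 bits. The function name is made up for this example, and treating the low 16 bits as the "505D"/"547D" shorthand used elsewhere in this thread is an assumption, not something taken from Intel documentation.

```python
# Mask taken from drivers/misc/mei/hw-me-regs.h (quoted above).
PCI_CFG_HFS_1_D0I3_MSK = 0x80000000


def summarize_hfs1(value):
    """Return (d0i3_bit_set, low_16_bits_as_hex) for a PCI_CFG_HFS_1 value.

    Assumption: the "505D"/"547D" shorthand used in this thread is simply
    the low 16 bits of the register. This helper is hypothetical and not
    part of the mei driver.
    """
    d0i3 = bool(value & PCI_CFG_HFS_1_D0I3_MSK)
    low16 = format(value & 0xFFFF, "04X")
    return d0i3, low16


if __name__ == "__main__":
    # Value read at offset 0x40 in the ftrace output above.
    print(summarize_hfs1(0xA280505D))
```

Feeding it the values from successive traces should make it easier to spot when the status changes.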

Here is a sample output [2] from ftrace showing touch crashing. Having played around with ftrace a fair amount, I believe touch only ever "crashes" after rpm_idle. After it crashes there are no more "mei_reg_read" or "mei_reg_write" events. Information on rpm_idle et al. is also present within the kernel [3]. I haven't seen any other pattern. Crucially, although it only happens after rpm_idle, it doesn't happen after every rpm_idle, which has made identifying the cause very difficult. I don't currently know what values like "rpm_idle+0xd9/0x330" or "-13" mean. If my reasoning is correct, touch crashing may be due to power saving, and changes to the power configuration could stop it happening. I can't be sure until someone reproduces my result.
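For anyone else puzzling over those two notations: in kernel traces, `symbol+0xOFF/0xSIZE` conventionally means an offset of OFF bytes into a function that is SIZE bytes long, and a bare negative number is very often a negated errno. A minimal sketch for decoding both, assuming that `-13` really is an errno in this trace (the helper names are hypothetical):

```python
import errno
import re

# Matches strings such as "rpm_idle+0xd9/0x330".
SYMBOL_RE = re.compile(
    r"^(?P<name>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)$"
)


def parse_symbol(text):
    """Split 'name+0xOFF/0xSIZE' into (name, offset, size) as integers."""
    match = SYMBOL_RE.match(text)
    if match is None:
        raise ValueError("not in symbol+0xOFF/0xSIZE form: %r" % text)
    return (
        match.group("name"),
        int(match.group("off"), 16),
        int(match.group("size"), 16),
    )


def errno_name(ret):
    """Map a negative kernel-style return value to its errno name.

    Assumption: the value is a negated errno, which is the usual kernel
    convention but is not guaranteed for every field in a trace.
    """
    return errno.errorcode.get(-ret, "unknown")


if __name__ == "__main__":
    print(parse_symbol("rpm_idle+0xd9/0x330"))
    print(errno_name(-13))
```

If the errno reading is right, `-13` would be EACCES, but what that implies for rpm_idle still needs checking against the runtime PM code in [3].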


Using ftrace [4]

Mount
/# mount -t tracefs nodev /sys/kernel/tracing

Trace on
/tracing/events# echo 1 > enable

Trace off
/# echo 0 > enable

Enlarge size of trace file (necessary because you will capture enormous amounts of information)
echo 100000 > buffer_size_kb

Clear the trace (necessary because you will otherwise gradually overwrite old information which may make the trace harder to understand)
/tracing# echo > trace

Don't apply any filters. Just switch on and switch off, for example, once touch has crashed.

Once you have captured something you want to look at copy the trace file to another location with plenty of space.

Split the document into manageable chunks.

$ split -l 300000 trace
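Since the interesting question is which events go quiet after a crash, a small script can do the first pass over the captured trace. The sketch below assumes the default ftrace text layout seen in the lines above (`task-pid [cpu] flags timestamp: event: ...`) and records the last timestamp at which each event name appears; the function name is hypothetical.

```python
import re

# Default ftrace text layout, e.g.
#   ipts_debug-420   [002] ....   359.244268: mei_pci_cfg_read: ...
TRACE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+\S+\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):"
)


def last_seen(lines):
    """Return {event_name: last_timestamp} for an iterable of trace lines.

    Lines that do not match the assumed layout (headers, blank lines)
    are skipped. Works on an open file object, so a large trace does not
    need to be loaded into memory at once.
    """
    result = {}
    for line in lines:
        match = TRACE_RE.match(line)
        if match:
            result[match.group("event")] = float(match.group("ts"))
    return result
```

Calling `last_seen(open("xaa"))` on one of the chunks and sorting the result by timestamp would show which events (for example mei_reg_read) stopped first.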


[1] https://github.com/torvalds/linux/blob/master/drivers/misc/mei/hw-me-regs.h
[2] https://github.com/condemnedmeat/File/blob/master/ftrace.txt
[3] https://github.com/torvalds/linux/blob/master/drivers/base/power/runtime.c#L385
[4] https://www.kernel.org/doc/Documentation/trace/ftrace.txt

ghost avatar Jun 22 '19 20:06 ghost

  • Tested on Ubuntu 16.04.6 LTS (recent distros such as Arch Linux may not boot under the following kernels)

I built the original repository linux-ipts (Linux 4.4.0-rc8, itouch) with no modification, and so far touch has never crashed. (Thank you @ardacoskunses for the original itouch implementation!)

DMC version:

$ sudo cat /sys/kernel/debug/dri/0/i915_dmc_info
fw loaded: yes
path: i915/skl_dmc_ver1.bin
version: 1.26
DC3 -> DC5 count: 0
DC5 -> DC6 count: 0
program base: 0x09004040
ssp base: 0x00002fc0
htp: 0x00b40068

GuC version:

$ sudo cat /sys/kernel/debug/dri/0/i915_guc_load_status
GuC firmware status:
    path: i915/skl_guc_ver4.bin
    fetch: SUCCESS
    load: SUCCESS
    version wanted: 4.3
    version found: 4.3
    header: offset is 0; size = 128
    uCode: offset is 128; size = 127936
    RSA: offset is 128064; size = 256

GuC status 0x800330ec:
    Bootrom status = 0x76
    uKernel status = 0x30
    MIA Core status = 0x3

Scratch registers:
     0:     0xf0000000
     1:     0x1
     2:     0x0
     3:     0x5f5e100
     4:     0x600
     5:     0xf56d3
     6:     0x0
     7:     0x8
     8:     0x123
     9:     0x80203
    10:     0x0
    11:     0x0
    12:     0x0
    13:     0x0
    14:     0x0
    15:     0x8

I also built a kernel from ipts-linux-new (Linux 4.9-rc3, ipts) with no modification, but the touch crashed immediately (as already reported by @tmarkov). The kernel uses version 6.1 of GuC firmware.

Note: the original linux-ipts wiki page says:

Note: for current release we need GuC version v4.3 only. OTC website has GuC v6.1 which will not work with this kernel!!

Maybe the GuC firmware version is important. However, I don't know how to use a different version of GuC firmware🤔

EDIT

  • on Arch Linux

Passing i915.guc_firmware_path=i915/skl_guc_ver9_33.bin works:

[drm] GuC: Skipping firmware version check
[drm] GuC: Loaded firmware i915/skl_guc_ver9_33.bin (version 9.33)
i915 0000:00:02.0: GuC firmware version 9.33
i915 0000:00:02.0: GuC submission enabled

but passing i915.guc_firmware_path=i915/skl_guc_ver4.bin (which does exist) results in a blank screen:

i915 0000:00:02.0: Direct firmware load for i915/skl_guc_ver4.bin failed with error -2
[drm] GuC: Failed to fetch firmware i915/skl_guc_ver4.bin (error -2)
[drm] GuC: Firmware can be downloaded from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
i915 0000:00:02.0: GuC initialization failed -8
[drm:i915_gem_init_hw [i915]] *ERROR* Enabling uc failed (-8)

🤔

kitakar5525 avatar Jun 27 '19 19:06 kitakar5525

@ardacoskunses I have some anecdotal evidence (albeit pretty weak) that also suggests a firmware update may have something to do with this issue. Namely, when I first installed Linux on my SB (early 2017), I didn't have touch crashing issues. Then I used Windows for some time. When I went back to Linux (2018), I had the issue, even with old kernels like 4.9. After a warranty exchange this year, I no longer have the issue.

tmarkov avatar Jun 27 '19 19:06 tmarkov

@kitakar5525 In the original implementation, GuC is loaded from the filesystem (I think you've already noticed). It is responsible for all rendering as well as iTouch, not only for touch. A failure to load GuC would lead to a black screen.

@tmarkov Indeed, the firmwares do the actual work; we just set up the framework around them. Along the same lines, the firmwares are also responsible for crashes :) so we put in workarounds. That is why I was initially asking whether there were any updates to the firmwares: GuC, ME and DMC. Any of them can cause issues.

We never had access to the firmware code, so I do not know what is going on in there. However, the debug and recovery threads (I should say work items) can give us clues. Enabling or adding some logs in these work items could shed more light on the problems (even if it would not reveal the root cause).

I guess this issue is fixed now (thanks to @kitakar5525), so I'm saying this for future reference.

ardacoskunses avatar Jun 27 '19 21:06 ardacoskunses

For some reason, I have to build the firmware into the kernel to use a different version of the firmware:

CONFIG_EXTRA_FIRMWARE="i915/skl_guc_ver4.bin i915/skl_dmc_ver1_26.bin"

then, boot with i915.guc_firmware_path=i915/skl_guc_ver4.bin i915.dmc_firmware_path=i915/skl_dmc_ver1_26.bin, but the issue persists. Firmware is not relevant (?)

So, the possible causes of the issue I think are:

  • Changes between 4.4.0-rc8 and 4.9-rc3 (including Power Management as @condemnedmeat points out)
  • And/or changes between itouch and ipts

@ardacoskunses

I guess this issue is fixed now

The workaround I suggested above is not ideal, because touch input still stops for some time before the recovery thread restores it...

kitakar5525 avatar Jun 27 '19 23:06 kitakar5525

I always assumed the issue was related to the GuC (as it is not present on other Surface models), but your recent comments made me think it might actually be the IPTS driver. Removing the below call to ipts_send_feedback within the ipts-hid.c file seems to solve the issue on my SP4:

if (fb_buf) {
	ret = ipts_send_feedback(ipts, parallel_idx, transaction_id);
	if (ret)
		return ret;
}

Not sure if it might break the driver for other surface devices though.

sebanc avatar Jul 03 '19 19:07 sebanc

@sebanc Yes, the touch input seems stable now on my SB1, too!!! BTW, how did you figure out that those lines are the cause?

kitakar5525 avatar Jul 03 '19 20:07 kitakar5525

I put a debug print in the ipts_handle_resp function (ipts-msg-handler.c file) and tried to understand/modify the message workflow. At some point, I just decided to find out if the recurrent 0x80000006 (TOUCH_SENSOR_FEEDBACK_READY_RSP) responses triggered by the ipts_send_feedback function were actually necessary for touch to work.

sebanc avatar Jul 03 '19 20:07 sebanc

And so by what mechanism is this causing touch to "crash"?

ghost avatar Jul 03 '19 21:07 ghost

Calling the "ipts_send_feedback" function repeatedly seems to be what is causing touch to crash. Without the IPTS specs, I cannot say whether it's normal for the "ipts_send_feedback" function to be used at this point in the code. All I can say is that this call does not seem necessary for touch to work on my SP4 and that the issue stopped after removing it.

sebanc avatar Jul 04 '19 04:07 sebanc

I really could not wrap my head around this, but if it works without this call, that again points to ME; @kitakar5525's findings also indicated ME.

@sebanc Thank you(!) for providing this not-easy-to-see solution, even if the hows and whys are hard to explain. Without serious investigation I can only speculate that this is an ME enhancement with a side effect. I wish I could get to the bottom of this, but I have no time at all.

I hope this resolves everyone's issue.

ardacoskunses avatar Jul 04 '19 06:07 ardacoskunses

I propose a patch: ipts-fix-crash-caused-by-calling-ipts_send_feedback-.patch

What we don't know for now are:

  • any impact on the driver functionality?
    • especially on newer devices than SP4/SB1
  • what was the purpose of ipts_send_feedback(ipts, parallel_idx, transaction_id) ?

Please test this patch🙇 (especially on devices newer than SP4/SB1, to see whether this patch breaks the driver functionality)

kitakar5525 avatar Jul 04 '19 11:07 kitakar5525

I commented out the section you advise be removed by adding // at the start of every line (just after the +)

https://github.com/jakeday/linux-surface/blob/9d2772a7e8e86eb91464962c9be853457bc5cf11/patches/4.19/0005-ipts.patch#L2283-L2287

While the fw status hasn't crashed yet from 505D to 547D since I compiled, I wouldn't describe touch as completely stable. There comes a moment when I'm using the OS without the keyboard when the fw status appears to know I'm touching the screen but the screen behaves like touch has crashed. The gnome parts of the screen might sometimes work during these moments but not the rest of the screen. When I connect the keyboard I'm getting issues being able to use some of the keys. As an example, I wasn't able to get to terminal to save dmesg output after a "faux" touch crash, having connected the keyboard. I could not switch to terminal, nor open a new terminal using the keyboard. In this case I could tab between the clock and the wifi/power button cluster, I could rotate but couldn't do much else. This wasn't happening before.

How do I know what's causing this? Rather than speeding ahead wouldn't it be advisable to allow others the time to test and understand what's going on? Would it not be better to understand why this amendment appears (it's far too early to be certain) to stop 505D shifting to 547D? Based on my experience it looks like the amendment patches a leak in one place and instead it has just broken out elsewhere.

Acting too quickly without pursuing how and why you think this has resolved the issue and not allowing time to test variables against a control will make it more difficult to ensure this issue is nipped in the bud. It would be good to know why there is such keenness on this amendment only 18 hours after it was mentioned.

ghost avatar Jul 04 '19 14:07 ghost

I apologize for the misleading wording.

For now, I do not intend to create this patch as a pull request to jakeday. I just want everyone (including you) to test the finding.

I completely agree with you. Let's do it carefully.

kitakar5525 avatar Jul 04 '19 15:07 kitakar5525

Also, I should not have said: "seems stable now". I describe a more accurate current situation below:

I forgot to mention, I personally revert the commit 1143fca to reproduce this issue more quickly when I debug this issue. When the commit is reverted, the touch inputs will crash almost immediately.

Under that condition, I applied the finding and the touch inputs have not crashed yet.


I understand there are other IPTS problems (maybe outside of this issue):

  • Sometimes the touch inputs act as if the "Alt" button is being pressed. You also mentioned this behavior before:

    get invited to download an html link rather than open it in your browser?

    (Pressing the Alt key on the on-screen keyboard fixes it.)

    EDIT: I think this is rather a Chromium/Chrome issue, because it also happens on Surface 3 (which does not use IPTS) and does not happen in other browsers like GNOME Web and Firefox. BTW, you can reproduce this issue by pressing Alt+Tab. A recommendation is to use Super+Tab instead.

EDIT

Another problem (maybe outside of this issue) I understand is:

  • Performance is still laggy and touch inputs are occasionally dropped (not a crash; the next touch input is recognized), especially under high load, even after commit 1143fca

kitakar5525 avatar Jul 04 '19 15:07 kitakar5525

Oh, you've nothing to apologise for! The comment wasn't directly aimed at anyone, just my opinion. If there is disagreement I'm always open-minded, even if sometimes I can be blunt. I note what you say and I'll come back if I make any discoveries.

ghost avatar Jul 04 '19 19:07 ghost

Hi all, sorry I was silent for this long. I've had some other stuff to deal with and after that I've been working on getting IPTS ready for v5.2. Now that that's done I'll have a look at this.

First of all, thank you @ardacoskunses for your time and insights!

Unfortunately, the solution commenting out ipts_send_feedback breaks touch for me on the SB2 after the first touch. It feels like the device wants to get feedback before sending more events.

@kitakar5525 @sebanc @condemnedmeat I think we should try to get an exhaustive reconstruction of what's happening communication-wise. Like:

  • What messages are being sent and received prior to crash?
  • Does ipts_send_feedback trigger this directly or is it farther down the line?
  • What's the last response we get from ME (i.e. do we get a response for ipts_send_feedback)?
  • What mode is the device/driver in (HID or raw data)? Does changing the mode change something?

Also: Can we pin this down to a firmware version or something that we can determine from inside the driver that we could use to skip ipts_send_feedback for specific devices only?

Something like this should give us a bit more information:

--- a/drivers/misc/ipts/ipts-msg-handler.c
+++ b/drivers/misc/ipts/ipts-msg-handler.c
@@ -11,6 +11,8 @@ int ipts_handle_cmd(ipts_info_t *ipts, u32 cmd, void *data, int data_size)
        touch_sensor_msg_h2m_t h2m_msg;
        int len = 0;
 
+       ipts_dbg(ipts, "ipts_handle_cmd [cmd: %x]\n", cmd);
+
        memset(&h2m_msg, 0, sizeof(h2m_msg));
 
        h2m_msg.command_code = cmd;
@@ -220,6 +222,8 @@ int ipts_handle_resp(ipts_info_t *ipts, touch_sensor_msg_m2h_t *m2h_msg,
        rsp_status = m2h_msg->status;
        cmd = m2h_msg->command_code;
 
+       ipts_dbg(ipts, "ipts_handle_resp [cmd: %x, status: %d]\n", cmd, rsp_status);
+
        switch (cmd) {
                case TOUCH_SENSOR_NOTIFY_DEV_READY_RSP:
                        if (rsp_status != 0 &&

We may also need some more information about what data is being sent/received, but for now let's take it step by step.

qzed avatar Jul 25 '19 02:07 qzed

Thank you everybody for looking at this issue.

When IPTS is working and on low iGPU usage, the output is always ipts_handle_cmd then ipts_handle_resp:

[  593.866596] ipts: ipts_handle_cmd [cmd: 6]
[  593.868202] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  593.910084] ipts: ipts_handle_cmd [cmd: 6]
[  593.911744] ipts: ipts_handle_resp [cmd: 80000006, status: 0]

When IPTS crashes under high iGPU usage, such as watching YouTube 360 videos, the ipts_handle_cmd and ipts_handle_resp lines do not necessarily alternate, like this:

[  484.484695] ipts: ipts_handle_cmd [cmd: 6]
[  484.484835] ipts: ipts_handle_cmd [cmd: 6]
[  484.485149] ipts: ipts_handle_cmd [cmd: 6]
[  484.485622] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  484.487928] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  484.488475] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  484.514224] ipts: ipts_handle_cmd [cmd: 6]
[  484.514552] ipts: ipts_handle_cmd [cmd: 6]
[  484.514706] ipts: ipts_handle_cmd [cmd: 6]
[  484.514995] ipts: ipts_handle_cmd [cmd: 6]
[  484.516306] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  484.519873] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  484.520575] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  484.521306] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
# IPTS stopped

I think that sending ipts_send_feedback is too fast for ME to process the data.
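To compare logs more objectively, the number of feedback commands that are outstanding at any moment can be counted straight from dmesg. This is only a sketch: it assumes the `ipts_dbg` line format from the debug patch above, and it assumes a response code is the command code with bit 31 set (inferred from 6 and 80000006 in the logs, not from any spec). The function name is made up.

```python
import re

# Matches the debug lines added by the ipts_dbg patch, e.g.
#   ipts: ipts_handle_resp [cmd: 80000006, status: 0]
LINE_RE = re.compile(r"ipts_handle_(cmd|resp) \[cmd: ([0-9a-f]+)")


def max_in_flight(lines, cmd=0x6):
    """Return the peak number of `cmd` commands still awaiting a response.

    Assumption: the matching response code is `cmd | 0x80000000`.
    """
    rsp = cmd | 0x80000000
    in_flight = 0
    peak = 0
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        kind = match.group(1)
        code = int(match.group(2), 16)
        if kind == "cmd" and code == cmd:
            in_flight += 1
            peak = max(peak, in_flight)
        elif kind == "resp" and code == rsp and in_flight > 0:
            in_flight -= 1
    return peak
```

On the working log above this gives 1, and on the crashing log it gives 4, which at least fits the idea that several feedbacks pile up before ME answers.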

kitakar5525 avatar Jul 25 '19 12:07 kitakar5525

What mode is the device/driver in (HID or raw data)? Does changing the mode change something?

I assume HID/RAW data mode can be changed by Multi/Single touch mode change [1].

With single touch mode, touch crash will occur much less frequently but still occur (cannot reproduce it yet).

The output; these lines appear alternately:

ipts: ipts_handle_resp [cmd: 80000005, status: 0]
ipts: ipts_handle_cmd [cmd: 6]
ipts: ipts_handle_resp [cmd: 80000006, status: 0]
ipts: ipts_handle_cmd [cmd: 5]
ipts: ipts_handle_resp [cmd: 80000005, status: 0]
ipts: ipts_handle_cmd [cmd: 6]
ipts: ipts_handle_resp [cmd: 80000006, status: 0]
ipts: ipts_handle_cmd [cmd: 5]
# IPTS is still working

References

kitakar5525 avatar Jul 25 '19 12:07 kitakar5525

I assume HID/RAW data mode can be changed by Multi/Single touch mode change [1].

Right, should be the same, just named differently in the wiki vs. the code. Let's stick with the default for now (should be RAW/multitouch).

I think that sending ipts_send_feedback is too fast for ME to process the data.

This could be the case. I guess your other finding

With single touch mode, touch crash will occur much less frequently but still occur (cannot reproduce it yet).

could also fit into this as there may be more in-driver overhead involved which would push the feedback commands farther apart.

We could try to enforce that there are a maximum of 3 commands in flight (i.e. without a response), e.g. by delaying subsequent commands if necessary. Maybe we can also drop them? I guess a quick and dirty way to test this theory would be by rate-limiting ipts_send_feedback (see __ratelimit).

qzed avatar Jul 25 '19 15:07 qzed

I cannot figure out what is happening, but at least I can reproduce a similar situation by rate limiting. IPTS will stop immediately when I set lower values like 64 feedbacks per 30 seconds. When I set larger values like 8192 feedbacks per 30 seconds, IPTS will work longer.
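To illustrate why a small burst value starves the feedback path so quickly, here is a userspace model of an interval/burst limiter. It only mirrors the window logic as I understand it from the kernel's ratelimit helper (allow up to `burst` calls per `interval`, drop the rest until the window expires); it is not the kernel code, and the 10 ms feedback period is an assumption based on the spacing of the dmesg lines in this comment.

```python
class RateLimit:
    """Model of an interval/burst rate limiter (times in milliseconds)."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.begin = None   # start of the current window
        self.allowed = 0    # calls let through in the current window
        self.missed = 0     # calls dropped so far

    def allow(self, now):
        """Return True if a call at time `now` is let through."""
        # Open a new window on first use or once the old one has expired.
        if self.begin is None or now > self.begin + self.interval:
            self.begin = now
            self.allowed = 0
        if self.allowed < self.burst:
            self.allowed += 1
            return True
        self.missed += 1
        return False


def count_allowed(limiter, period_ms, duration_ms):
    """Count how many periodic calls get through within `duration_ms`."""
    return sum(limiter.allow(t) for t in range(0, duration_ms, period_ms))


if __name__ == "__main__":
    # Assumed: one feedback every 10 ms for 30 seconds (3000 calls).
    print(count_allowed(RateLimit(30_000, 64), 10, 30_000))
    print(count_allowed(RateLimit(30_000, 8192), 10, 30_000))
```

Under these assumptions, a limit of 64 per 30 seconds lets feedback through for well under a second and then drops everything until the window resets, while 8192 never triggers at this rate.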

ratelimit patch

From 7d6c80206e81d1636c6088e3bf0d46a8921ae6b5 Mon Sep 17 00:00:00 2001
From: kitakar5525 <[email protected]>
Date: Fri, 26 Jul 2019 17:55:32 +0900
Subject: [PATCH] ipts: ratelimit ipts_send_feedback()

---
 drivers/misc/ipts/ipts-hid.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/ipts/ipts-hid.c b/drivers/misc/ipts/ipts-hid.c
index e85844dc1..6082e371b 100644
--- a/drivers/misc/ipts/ipts-hid.c
+++ b/drivers/misc/ipts/ipts-hid.c
@@ -54,6 +54,11 @@ typedef struct kernel_output_payload_error {
     char string[128];
 } kernel_output_payload_error_t;
 
+/*
+ * Rate limiting to no more than 64 feedbacks per 30 seconds
+ */
+DEFINE_RATELIMIT_STATE(ipts_send_feedback_ratelimit, 30 * HZ, 64);
+
 static int ipts_hid_get_hid_descriptor(ipts_info_t *ipts, u8 **desc, int *size)
 {
     u8 *buf;
@@ -415,8 +420,10 @@ static int handle_outputs(ipts_info_t *ipts, int parallel_idx)
         }
     }
 
-    if (fb_buf) {
+    if (fb_buf && ___ratelimit(&ipts_send_feedback_ratelimit, "ipts send feedback")) {
+        pr_alert("DEBUG: before ipts_send_feedback\n");
         ret = ipts_send_feedback(ipts, parallel_idx, transaction_id);
+        pr_alert("DEBUG: after ipts_send_feedback\n");
         if (ret)
             return ret;
     }
-- 
2.22.0


dmesg log

[ 1815.525565] ipts: ipts_handle_cmd [cmd: 6]
[ 1815.525691] DEBUG: after ipts_send_feedback
[ 1815.527192] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[ 1815.535935] DEBUG: before ipts_send_feedback
[ 1815.535937] ipts: ipts_handle_cmd [cmd: 6]
[ 1815.536089] DEBUG: after ipts_send_feedback
[ 1815.537546] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[ 1815.545424] DEBUG: before ipts_send_feedback
[ 1815.545426] ipts: ipts_handle_cmd [cmd: 6]
[ 1815.545574] DEBUG: after ipts_send_feedback
[ 1815.547038] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
# IPTS stopped

Rather, ipts_send_feedback function call is too slow (?)

kitakar5525 avatar Jul 26 '19 09:07 kitakar5525

That's actually quite interesting. Considering it doesn't work at all if I comment out ipts_send_feedback here, I rather think it's the messages being dropped that are the issue. This could also be the reason IPTS breaks in the first place, e.g. if the IPTS controller can only store 3 messages and is forced to drop some when the function is called too often.

Can you try something like this:

diff --git a/drivers/misc/ipts/ipts-msg-handler.c b/drivers/misc/ipts/ipts-msg-handler.c
index db5356a1c84e..f3813633b0a8 100644
--- a/drivers/misc/ipts/ipts-msg-handler.c
+++ b/drivers/misc/ipts/ipts-msg-handler.c
@@ -1,3 +1,5 @@
+#include <linux/atomic.h>
+#include <linux/delay.h>
 #include <linux/mei_cl_bus.h>
 
 #include "ipts.h"
@@ -5,6 +7,10 @@
 #include "ipts-resource.h"
 #include "ipts-mei-msgs.h"
 
+static atomic64_t feedback_queued = ATOMIC64_INIT(0);
+static atomic64_t feedback_next = ATOMIC64_INIT(0);
+static atomic64_t feedback_in_flight = ATOMIC64_INIT(0);
+
 int ipts_handle_cmd(ipts_info_t *ipts, u32 cmd, void *data, int data_size)
 {
 	int ret = 0;
@@ -30,18 +36,33 @@ int ipts_handle_cmd(ipts_info_t *ipts, u32 cmd, void *data, int data_size)
 
 int ipts_send_feedback(ipts_info_t *ipts, int buffer_idx, u32 transaction_id)
 {
+	bool warned = false;
+	u64 token;
 	int ret;
 	int cmd_len;
 	touch_sensor_feedback_ready_cmd_data_t fb_ready_cmd;
 
+	token = atomic64_inc_return(&feedback_queued) - 1;
+
 	cmd_len = sizeof(touch_sensor_feedback_ready_cmd_data_t);
 	memset(&fb_ready_cmd, 0, cmd_len);
 
 	fb_ready_cmd.feedback_index = buffer_idx;
 	fb_ready_cmd.transaction_id = transaction_id;
 
+	while (token != atomic64_read(&feedback_next) || atomic64_read(&feedback_in_flight) >= 3) {
+		if (!warned) {
+			warned = true;
+			printk("IPTS: ipts_send_feedback: sleeping\n");
+		}
+
+		msleep(1);
+	}
+
+	atomic64_inc(&feedback_in_flight);
 	ret = ipts_handle_cmd(ipts, TOUCH_SENSOR_FEEDBACK_READY_CMD,
 				&fb_ready_cmd, cmd_len);
+	atomic64_inc(&feedback_next);
 
 	return ret;
 }
@@ -363,6 +384,8 @@ int ipts_handle_resp(ipts_info_t *ipts, touch_sensor_msg_m2h_t *m2h_msg,
 			break;
 		}
 		case TOUCH_SENSOR_FEEDBACK_READY_RSP:
+			atomic64_dec(&feedback_in_flight);
+
 			if (rsp_status != 0 &&
 			  rsp_status != TOUCH_STATUS_COMPAT_CHECK_FAIL) {
 				rsp_failed(ipts, cmd, rsp_status);

qzed avatar Jul 26 '19 14:07 qzed

I've updated the diff above to address the ordering problem. Feedback messages will now be sent in the order in which ipts_send_feedback is called.
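For reference, the intended behaviour of the diff above can be written down as a small single-threaded model. This is only a sketch for reasoning about the logic: the class and method names are made up, and the `msleep` loop in the driver is replaced by a `may_send` check that the caller would poll.

```python
class FeedbackGate:
    """Model of the ticket ordering and in-flight cap from the diff above."""

    def __init__(self, max_in_flight=3):
        self.max_in_flight = max_in_flight
        self.queued = 0      # tickets handed out (feedback_queued)
        self.next = 0        # ticket allowed to send next (feedback_next)
        self.in_flight = 0   # sent but not yet answered (feedback_in_flight)

    def take_ticket(self):
        """Called on entry to ipts_send_feedback; returns this call's ticket."""
        ticket = self.queued
        self.queued += 1
        return ticket

    def may_send(self, ticket):
        """True once it is this ticket's turn and fewer than the cap are pending."""
        return ticket == self.next and self.in_flight < self.max_in_flight

    def send(self, ticket):
        """Record that the command for `ticket` has been handed to ME."""
        if not self.may_send(ticket):
            raise RuntimeError("the driver would still be sleeping here")
        self.in_flight += 1
        self.next += 1

    def on_response(self):
        """Called when TOUCH_SENSOR_FEEDBACK_READY_RSP arrives."""
        self.in_flight -= 1
```

With the default cap, tickets 0 to 2 go out immediately in order, and ticket 3 has to wait until at least one response has come back.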

qzed avatar Jul 27 '19 19:07 qzed

Unfortunately, the issue still persists.

dmesg log 1

[  403.313469] ipts: ipts_handle_cmd [cmd: 6]
[  403.313802] ipts: ipts_handle_cmd [cmd: 6]
[  403.313968] ipts: ipts_handle_cmd [cmd: 6]
[  403.314420] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  403.316582] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  403.317369] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  403.322020] ipts: ipts_handle_cmd [cmd: 6]
[  403.323717] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  403.328549] ipts: ipts_handle_cmd [cmd: 6]
[  403.330206] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  403.348983] ipts: ipts_handle_cmd [cmd: 6]
[  403.349169] ipts: ipts_handle_cmd [cmd: 6]
[  403.349904] ipts: ipts_handle_cmd [cmd: 6]
[  403.351616] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  403.354025] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  403.354738] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
# ipts stopped

dmesg log 2

[  439.509707] ipts: ipts_handle_cmd [cmd: 6]
[  439.509846] ipts: ipts_handle_cmd [cmd: 6]
[  439.509988] ipts: ipts_handle_cmd [cmd: 6]
[  439.510149] IPTS: ipts_send_feedback: sleeping
[  439.510629] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  439.512720] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  439.513472] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  439.517883] ipts: ipts_handle_cmd [cmd: 6]
[  439.518091] ipts: ipts_handle_cmd [cmd: 6]
[  439.518265] ipts: ipts_handle_cmd [cmd: 6]
[  439.518432] IPTS: ipts_send_feedback: sleeping
[  439.518597] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  439.521382] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  439.522134] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  439.527985] ipts: ipts_handle_cmd [cmd: 6]
[  439.528721] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
# ipts stopped

dmesg log 3

[  502.871399] ipts: ipts_handle_cmd [cmd: 6]
[  502.871572] ipts: ipts_handle_cmd [cmd: 6]
[  502.871758] ipts: ipts_handle_cmd [cmd: 6]
[  502.871931] IPTS: ipts_send_feedback: sleeping
[  502.875187] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  502.878198] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  502.879079] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  502.879143] ipts: ipts_handle_cmd [cmd: 6]
[  502.879522] ipts: ipts_handle_cmd [cmd: 6]
[  502.880047] ipts: ipts_handle_cmd [cmd: 6]
[  502.883578] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  502.884281] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
[  502.885057] ipts: ipts_handle_resp [cmd: 80000006, status: 0]
# ipts stopped

EDIT: This was with the GuC priority part of commit 1143fca reverted, to reproduce this issue more quickly.

kitakar5525 avatar Jul 28 '19 12:07 kitakar5525

@qzed Could you also try reverting the GuC priority part by applying my patch and passing these kernel parameters via your bootloader:

i915.ipts_guc_priority=3 i915.disable_priority_mechanism=0 i915.enable_high_priority_PRIORITY_HIGH_only=0

Then please watch a GPU-heavy video such as Best VR 360 Video - YouTube and move your pen/finger to see whether you can reproduce this issue on SB2.
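To double-check that the parameters actually reached the kernel after a reboot, a small sketch like the following can pull the i915.* options out of a kernel command line. The helper name i915_params is hypothetical, and the parameter names are the ones from the patch mentioned above, so they only exist on a kernel built with that patch.

```python
def i915_params(cmdline):
    """Return the i915.* options of a kernel command line as a dict."""
    params = {}
    for token in cmdline.split():
        if token.startswith("i915.") and "=" in token:
            key, _, value = token.partition("=")
            params[key[len("i915."):]] = value
    return params


# Example command line; on a running system the same string can be
# read from /proc/cmdline.
EXAMPLE = (
    "quiet splash i915.ipts_guc_priority=3 "
    "i915.disable_priority_mechanism=0 "
    "i915.enable_high_priority_PRIORITY_HIGH_only=0"
)

print(i915_params(EXAMPLE))
```

If the dict comes back empty, the bootloader configuration was probably not regenerated or the wrong entry was booted.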

EDIT

On SB1, touch input crashes within 10 to 30 seconds of watching the video while using touch when I revert IS_SKYLAKE(dev_priv) || IS_KABYLAKE(dev_priv) ? GUC_CLIENT_PRIORITY_HIGH : GUC_CLIENT_PRIORITY_NORMAL, back to GUC_CLIENT_PRIORITY_NORMAL, in commit 1143fca.

My patch also reverts two other parts, though those are not necessary to reproduce this issue more quickly.

EDIT2

If you can reproduce this issue on SB2 with commit 1143fca reverted, then SB2 is potentially also affected by this issue even with the commit applied. On SB1/SP4, touch input still occasionally crashes even with the commit applied.

kitakar5525 avatar Jul 28 '19 12:07 kitakar5525

@kitakar5525 I'll try that shortly.

I guess your findings could mean that the processing takes too long. We should be able to confirm that by adding something like msleep(100) in raw_data_work_func in ipts-mei.c.

qzed avatar Jul 28 '19 19:07 qzed

@kitakar5525 Okay, I've applied the patch and the kernel parameters, and with those I went through the video (highest settings) twice in Chromium and once in Firefox. In Chromium there are some issues with touch, such as drags being interpreted as clicks or double-clicks (the pen works fine, though), which are not present in Firefox, and touch seems to have slightly more lag. I think this is because Firefox might not use the GPU for decoding.

I've also gone through the video with Chromium another time without the parameters and the issues are gone. So I think the thing that's causing the crash on the SB1 causes the misinterpretation on the SB2.

qzed avatar Jul 28 '19 21:07 qzed

I noted issues in Chrome here that sound rather similar in description. This is among the reasons I use Firefox instead. You could also try a long webpage, as I couldn't reliably crash touch by viewing a video, whereas with a long webpage I could within about 10 minutes.

This is from my opening comment:

My experience continues to be it most obviously crashes as a result of visiting a site that will load new information as you scroll down a page ad infinitum. For example, scrolling down someone's twitter feed at reading speed. Another example is if I visit a site that has a "long" home page packed with links, pictures and short video clips. An example that comes to mind is dailymail.co.uk

ghost avatar Jul 28 '19 23:07 ghost

@condemnedmeat I just tried dailymail.co.uk and reddit, can't reproduce the problems there. I'm not sure about the details, but as far as I know chromium and firefox both use the GPU for 2D acceleration, so that could make the connection.

I'm currently thinking that the crashes are caused by (at least temporarily) high GPU usage, which delays the processing. This would at least fit with the issue getting worse when the priority is reduced to "normal". As mentioned above, I think we could try to simulate this scenario by adding msleep(100) in raw_data_work_func in ipts-mei.c. If that doesn't change anything, we should probably also try adding some mdelay(...) in gfx_processing_complete (ipts-gfx.c). However, the second function is called in an interrupt, so that may have other side effects (the first function is called from a workqueue).

qzed avatar Jul 29 '19 01:07 qzed

Adding msleep(100) into raw_data_work_func() caused the same issue even with the commit 1143fca applied!

I narrowed down the problematic function.

I inserted msleep(10) (msleep(100) is too long) into any one of handle_outputs(), ipts_send_feedback() or ipts_handle_cmd(); in each case it recognized a small amount of touch input, then stopped working. If I then trigger ipts_send_sensor_quiesce_io_cmd(ipts); in one of the known ways [1], it prints this message to dmesg:

ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: 0x80000004 failed status = 14

So, this is the same issue we are discussing now.
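When grepping dmesg for this kind of failure, it can help to split the message into its fields. The sketch below is only an illustration: parse_failure is a hypothetical helper, and treating the top bit of the code as a "response" marker is an assumption inferred from the cmd 6 / 80000006 pairs in the earlier logs, not something confirmed by the driver source here. The meaning of status 14 is not decoded.

```python
import re

# Shape of the message quoted above:
#   ipts mei::<uuid>:<suffix>: 0x<code> failed status = <status>
ERR_RE = re.compile(
    r"ipts mei::(?P<uuid>[0-9a-fA-F-]{36}):(?P<suffix>[0-9a-fA-F]+): "
    r"0x(?P<code>[0-9a-fA-F]+) failed status = (?P<status>\d+)"
)

# Assumption: responses carry the request code with bit 31 set,
# as the "cmd: 6" / "cmd: 80000006" pairs in the logs suggest.
RESPONSE_BIT = 0x80000000


def parse_failure(line):
    """Split a 'failed status' line into its fields, or return None."""
    match = ERR_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"), 16)
    return {
        "uuid": match.group("uuid"),
        "code": code,
        "is_response": bool(code & RESPONSE_BIT),
        "request": code & 0x7FFFFFFF,
        "status": match.group("status"),  # kept as text, not interpreted
    }


LINE = "ipts mei::3e8d0870-271a-4208-8eb5-9acb9402ae04:0F: 0x80000004 failed status = 14"
print(parse_failure(LINE))
```

Under that assumption, the failing code 0x80000004 would be the response to request code 4.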

[1]:

  • display off (run xset dpms force off && xset dpms force on under X, or go to the lock screen on Wayland/X)
  • change sensor mode (https://github.com/jakeday/linux-surface/issues/374#issuecomment-461833074)

@qzed

In chromium, there are some issues with touch, like interpreting drag as click or double-click (pen works fine though)

It sounds super weird. Did you have that chromium issue before the commit 1143fca was introduced (Jan 29, 2019)?

What happens if you pass

i915.ipts_guc_priority=3 i915.disable_priority_mechanism=1 i915.enable_high_priority_PRIORITY_HIGH_only=1

instead?

Anyway, the touch input crash does not seem to be an issue on SB2, maybe because it has more CPU/GPU processing capability. What will happen if you also insert msleep to those functions?

kitakar5525 avatar Jul 29 '19 03:07 kitakar5525

Hi Everyone,

I lost track of what is working and what is not working.

However, there are some changes I'd like to comment on.

By design, IPTS "must" always be the highest-priority GPU workload in the system, and there should be a priority system and preemption. With the current design this is not (much) open to discussion; otherwise it just leads to non-deterministic behavior.

The reason is that the touch GPU workloads are prepared during init and are triggered to execute only by the ME. If any other 3D app (Mesa, Chrome, Firefox, etc.) preempts them, touch is never resumed (hard to repro, but possible).

Also, without priorities you may notice the weird touch behavior you mentioned above.

There used to be four priorities for GuC: high and low UMD, and high and low KMD, with KMD low > UMD high, and touch being KMD high. I don't know the current scheme, but the main rule applies: touch is highest and there must be preemption.

Going back to previous observations: the ME single-touch to multi-touch switch. There is a long story behind this, but put simply, it triggers an ME reset. @kitakar5525 mentioned that this fixes the issue but is hard to automate. If everyone agrees that this works for everyone, you should pursue it. My suggestion is to add this to the recovery thread: find the reacquire GuC doorbell function (the name may be different, but these are the keywords), add this trick there, and trigger an ME reset.

I can help more if you decide to go in this direction, which I think is very sensible.

My 2 cents.

ardacoskunses avatar Jul 29 '19 04:07 ardacoskunses

@ardacoskunses Thank you for looking at this issue!

what is working and what is not working.

The most effective workaround so far for SB1/SP4 [1] (Skylake) is to comment out ipts_send_feedback() in handle_outputs(), as found by @sebanc (https://github.com/jakeday/linux-surface/issues/374#issuecomment-508234110): https://github.com/jakeday/linux-surface/blob/3d0abed6c461fd269694b66b9bb6372be230fa20/patches/5.1/0005-ipts.patch#L2287-L2291 The workaround works: the touch input crash has not happened since I applied it.

However, this change breaks IPTS functionality at least on SB2 (Kaby Lake R), as reported by @qzed (https://github.com/jakeday/linux-surface/issues/374#issuecomment-514865751)

The current discussion, started by @qzed, is about the root cause of this issue (https://github.com/jakeday/linux-surface/issues/374#issuecomment-514865751)

another option?

If we could use GUC_CLIENT_PRIORITY_KMD_HIGH instead of IS_SKYLAKE(dev_priv) || IS_KABYLAKE(dev_priv) ? GUC_CLIENT_PRIORITY_HIGH : GUC_CLIENT_PRIORITY_NORMAL, it might also change the situation: https://github.com/jakeday/linux-surface/blob/3d0abed6c461fd269694b66b9bb6372be230fa20/patches/5.1/0005-ipts.patch#L350-L366

Currently, IPTS is not working with GUC_CLIENT_PRIORITY_KMD_HIGH (corresponds to i915.ipts_guc_priority=0 with my patch)
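As a quick reference for the i915.ipts_guc_priority values, here is a small sketch that maps the numbers to the GUC_CLIENT_PRIORITY_* names. Only 0 (KMD_HIGH) and 3 (NORMAL) are confirmed by this thread; the values 1 and 2 are filled in from my recollection of the upstream i915 GuC firmware interface header of that era and should be treated as an assumption. The class and function names are illustrative.

```python
from enum import IntEnum


class GucClientPriority(IntEnum):
    KMD_HIGH = 0    # confirmed above: i915.ipts_guc_priority=0
    HIGH = 1        # assumption (recalled from upstream i915)
    KMD_NORMAL = 2  # assumption (recalled from upstream i915)
    NORMAL = 3      # i915.ipts_guc_priority=3, the reverted setting


def describe(param_value):
    """Translate a numeric parameter value to its macro name."""
    prio = GucClientPriority(int(param_value))
    return "GUC_CLIENT_PRIORITY_" + prio.name


for value in ("0", "3"):
    print(value, "->", describe(value))
```

This deliberately says nothing about which value wins under preemption; it is only a naming aid for reading the kernel parameters.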

recovery thread

Sorry, but an ME reset via the recovery thread is not an option for me, because I don't want IPTS to stop even for milliseconds...

[1]

  • SB stands for Surface Book, SP stands for Surface Pro

kitakar5525 avatar Jul 29 '19 05:07 kitakar5525

@kitakar5525

What will happen if you also insert msleep to those functions?

  • Without kernel parameters: With msleep(10) I don't notice any impact; with msleep(100) there is noticeable lag and I experience the same issues as in Chrome when watching the video, but no crashes or anything.

  • With kernel parameters (as in https://github.com/jakeday/linux-surface/issues/374#issuecomment-515760844): Same as above (although I haven't been running any stuff on the GPU, i.e. no video playing, so that was to be expected).

Did you have that chromium issue before the commit 1143fca was introduced (Jan 29, 2019)?

I honestly can't say. I normally use Firefox and I don't watch many videos on this device, let alone 360 VR at the highest quality. So I guess the issues would have been present then as well.

What happens if you pass

i915.ipts_guc_priority=3 i915.disable_priority_mechanism=1 i915.enable_high_priority_PRIORITY_HIGH_only=1

instead?

Same issues as before, again no crash.

qzed avatar Jul 29 '19 14:07 qzed

@ardacoskunses

By design, IPTS "must" always be the highest-priority GPU workload in the system, and there should be a priority system and preemption. With the current design this is not (much) open to discussion; otherwise it just leads to non-deterministic behavior.

The reason is that the touch GPU workloads are prepared during init and are triggered to execute only by the ME. If any other 3D app (Mesa, Chrome, Firefox, etc.) preempts them, touch is never resumed (hard to repro, but possible).

I think that's exactly what we're experiencing: during high GPU usage, IPTS gets preempted. What I'd expect is that preemption causes the IPTS workload to take longer to execute, which in turn delays the call to gfx_processing_complete and thus the call to ipts_send_feedback. That's what I wanted to simulate via msleep in raw_data_work_func (@kitakar5525 noted that this causes the same issue). Are there any other side effects of preemption that could be causing issues?

qzed avatar Jul 29 '19 15:07 qzed