NV41 touch-pad unresponsive and erratic during certain operations with high I/O on external storage

Open UndeadDevel opened this issue 2 years ago • 11 comments

Qubes OS release

4.1.2

Brief summary

When doing a lot of I/O with an external drive (in this case HDD; cannot currently try with external SSD), such as restoring a backup or wiping files, the touch-pad on this NV41 laptop becomes almost unusable, i.e. extremely choppy, unresponsive in moving the cursor and doesn't register "click"-events properly, so it will click randomly when I'm just trying to move the cursor.

This doesn't happen when doing a lot of I/O on the internal SSD or when stressing the system in other ways, such as having it crunch pi to a high accuracy on all cores or opening multiple qubes simultaneously to bring dom0 CPU usage up. I've reported this before as an aside of the paranoid mode backup restore issue, but I've noticed it several times now under various circumstances, always when doing a lot of I/O on an external HDD.

Curiously, the keyboard still works very smoothly; typing fast is no problem; all other system components seem to work without issue, e.g. I can get very smooth scrolling in a browser using the arrow keys, while scrolling with the touch-pad is basically impossible. Opening new VMs etc. also works fine and fast (if done with the keyboard), so this is not a "combined input problem".

Another curiosity is that the degree to which the problem exists during external I/O varies, sometimes greatly. E.g. during this very long shred operation I'm doing, I've noticed that it was very bad for the first ~40 mins and then for ~30 mins the issue completely vanished, even though the operation was still ongoing and CPU usage didn't change (xl top still showing ~170-210% on that qube). I may have found a way to trigger it to come back when I/O is already ongoing: by connecting a second HDD to another USB port...so far when I did that (twice) the problem came back within about one minute, but I suppose it could have been a coincidence, as later after the problem had vanished again, it came back without me connecting something.

Steps to reproduce

  1. Connect external drive, attach to some qube
  2. Start doing a lot of I/O on it, such as shred on a large file / backup restore

Expected behavior

Minimal or no impact on touch-pad usability

Actual behavior

Touch-pad almost unusable much of the time

Possibly related issues

#7932 #7893 Both of these seem to be about keyboard+mouse, however, while in my case it's only the touch-pad that is the problem.

UndeadDevel avatar Oct 22 '23 00:10 UndeadDevel

I can only think of three causes for this behavior:

  1. Xen gets bogged down doing some sort of emulation.
  2. High interrupt load.
  3. Bug in firmware or hardware.

DemiMarie avatar Oct 22 '23 00:10 DemiMarie

> I can only think of three causes for this behavior:

Do you know a good way to test for 1 or 2? I don't have a suitable second machine to test 3 right now.

UndeadDevel avatar Oct 22 '23 00:10 UndeadDevel

> Do you know a good way to test for 1 or 2? I don't have a suitable second machine to test 3 right now.

/proc/interrupts in sys-usb? There is probably a way to get better data from Xen that I am not aware of right now.
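
One way to act on this is to snapshot /proc/interrupts twice inside sys-usb and compare the counters, since the raw totals only show counts since boot. The following is a minimal Python sketch, not something posted in this thread; the names parse_interrupts and busiest are made up for illustration, and it assumes the usual layout with a CPU header row followed by one row per interrupt source.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts content into {label: (total_count, description)}."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, rest = line.split(":", 1)
        fields = rest.split()
        counts = []
        # Rows such as ERR/MIS carry fewer counters than there are CPUs.
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def busiest(before, after, seconds, top=10):
    """Return the `top` interrupt sources by rate (interrupts per second)."""
    rates = []
    for label, (count, description) in after.items():
        previous = before.get(label, (0, ""))[0]
        rates.append((label, (count - previous) / seconds, description))
    rates.sort(key=lambda row: row[1], reverse=True)
    return rates[:top]
```

To use it, read /proc/interrupts into a string, wait a few seconds while the I/O is running, read it again, and pass both parsed snapshots to busiest together with the interval in seconds.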

DemiMarie avatar Oct 22 '23 01:10 DemiMarie

Here's the bottom end of that log...no idea if this is normal or not (millions of interrupts reported):

NMI:          0          0          0          0          0          0   Non-maskable interrupts
LOC:          0          0          0          0          0          0   Local timer interrupts
SPU:          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0   Performance monitoring interrupts
IWI:          1          1          0          0          0          1   IRQ work interrupts
RTR:          0          0          0          0          0          0   APIC ICR read retries
RES:      10599       4883       4085       3589       5896       4603   Rescheduling interrupts
CAL:    1840355    3598385    1575170    3688326    1611273    2991577   Function call interrupts
TLB:        519        182        143        154        342        223   TLB shootdowns
TRM:          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0   Threshold APIC interrupts
DFR:          0          0          0          0          0          0   Deferred Error APIC interrupts
MCE:          0          0          0          0          0          0   Machine check exceptions
MCP:        140        140        140        140        140        140   Machine check polls
HYP:    2244326    4581534    3438130    4950174    2432874    3931114   Hypervisor callback interrupts
ERR:          0
MIS:          0
PIN:          0          0          0          0          0          0   Posted-interrupt notification event
NPI:          0          0          0          0          0          0   Nested posted-interrupt event
PIW:          0          0          0          0          0          0   Posted-interrupt wakeup event

Other lines with lots of interrupts:

 39:         25          0          0          0          0          0  xen-pirq    -ioapic-level  ehci_hcd:usb1
 48:     406112          0          0          0          0          0  xen-percpu    -virq      timer0
 49:      10599          0          0          0          0          0  xen-percpu    -ipi       resched0
 50:       1787          0          0          0          0          0  xen-percpu    -ipi       callfunc0
 51:    1838568          0          0          0          0          0  xen-percpu    -ipi       callfuncsingle0
 52:          0          0          0          0          0          0  xen-percpu    -ipi       spinlock0
 53:          0     987012          0          0          0          0  xen-percpu    -virq      timer1
 54:          0       4883          0          0          0          0  xen-percpu    -ipi       resched1
 55:          0       1349          0          0          0          0  xen-percpu    -ipi       callfunc1
 56:          0    3597036          0          0          0          0  xen-percpu    -ipi       callfuncsingle1
 57:          0          0          0          0          0          0  xen-percpu    -ipi       spinlock1
 58:          0          0    1852406          0          0          0  xen-percpu    -virq      timer2
 59:          0          0       4085          0          0          0  xen-percpu    -ipi       resched2
 60:          0          0       2054          0          0          0  xen-percpu    -ipi       callfunc2
 61:          0          0    1573116          0          0          0  xen-percpu    -ipi       callfuncsingle2
 62:          0          0          0          0          0          0  xen-percpu    -ipi       spinlock2
 63:          0          0          0    1255261          0          0  xen-percpu    -virq      timer3
 64:          0          0          0       3589          0          0  xen-percpu    -ipi       resched3
 65:          0          0          0       1886          0          0  xen-percpu    -ipi       callfunc3
 66:          0          0          0    3686440          0          0  xen-percpu    -ipi       callfuncsingle3
 67:          0          0          0          0          0          0  xen-percpu    -ipi       spinlock3
 68:          0          0          0          0     813953          0  xen-percpu    -virq      timer4
 69:          0          0          0          0       5896          0  xen-percpu    -ipi       resched4
 70:          0          0          0          0       1887          0  xen-percpu    -ipi       callfunc4
 71:          0          0          0          0    1609386          0  xen-percpu    -ipi       callfuncsingle4
 72:          0          0          0          0          0          0  xen-percpu    -ipi       spinlock4
 73:          0          0          0          0          0     956074  xen-percpu    -virq      timer5
 74:          0          0          0          0          0       4603  xen-percpu    -ipi       resched5
 75:          0          0          0          0          0       1891  xen-percpu    -ipi       callfunc5
 76:          0          0          0          0          0    2989686  xen-percpu    -ipi       callfuncsingle5
 77:          0          0          0          0          0          0  xen-percpu    -ipi       spinlock5


97:          0          0   14872883          0          0          0  PCI-MSI-0000:00:06.0   0-edge      xhci_hcd
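
To see how a row like this is spread across vCPUs, the counts can be turned into per-CPU fractions. This is a minimal Python sketch under the assumption that the row has one count per CPU (six here, matching the output above); per_cpu_share is a hypothetical helper name.

```python
def per_cpu_share(line, ncpus):
    """Return (label, list with the fraction of this IRQ counted on each CPU)."""
    label, rest = line.split(":", 1)
    counts = [int(field) for field in rest.split()[:ncpus]]
    total = sum(counts)
    shares = [count / total if total else 0.0 for count in counts]
    return label.strip(), shares


# Row copied from the output above (xhci_hcd in sys-usb).
XHCI_ROW = (
    " 97:          0          0   14872883          0          0          0"
    "  PCI-MSI-0000:00:06.0   0-edge      xhci_hcd"
)

label, shares = per_cpu_share(XHCI_ROW, 6)
print(label, shares)
```

For the xhci_hcd row this gives a share of 1.0 in the third column, i.e. every one of those interrupts was counted on a single vCPU. Whether that matters for the symptoms is not established here.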

UndeadDevel avatar Oct 22 '23 01:10 UndeadDevel

Which kernel version do you have in dom0 and sys-usb?

marmarek avatar Oct 22 '23 01:10 marmarek

> Which kernel version do you have in dom0 and sys-usb?

6.4.8-1 in both dom0 and sys-usb

Edit: I'll add another "data point": I was just running a btrfs scrub on a large volume (again on an external HDD) and this time the touch-pad was only mildly affected, as were the other system components, e.g. the keyboard and starting VMs were slower...so a more expected effect of stress on the system. CPU usage in that VM according to xl top was even higher, usually >230%, but the touch-pad was not more affected than the other system components this time.

This was on a different HDD from a different manufacturer, with a different capacity and different file system, in case it matters. I think I'll try a shred on that one today just to see how the behavior is; should it be different than on the other HDD then I'll report back.

UndeadDevel avatar Oct 22 '23 09:10 UndeadDevel

How does the CPU usage of sys-usb and sys-usb-dm look in xl top during the shred?

marmarek avatar Oct 22 '23 12:10 marmarek

> How does the CPU usage of sys-usb and sys-usb-dm look in xl top during the shred?

On the same drive that I ran the btrfs scrub on earlier (so a different one from the first shred), both sys-usb and sys-usb-dm stay below 20%, during the scrub as well as now during the shred, with sys-usb always higher than sys-usb-dm. The VM doing the shred is at >170%. However, there is no impact on the touch-pad now; instead I'm seeing an impact on the keyboard, with lag during typing and some letters repeated many times even though I only pressed the key briefly. The system is still usable.

So quite a curious result and I suppose it indicates that this has indeed to do with the hardware or firmware of the external drive...I may do another shred today on the first HDD to see if I get the same result as yesterday, which I expect to, since this was also the one I restored a backup from, where the touch-pad problems also occurred; I will report if the result is not as expected.

UndeadDevel avatar Oct 22 '23 13:10 UndeadDevel

Still a problem on Qubes 4.2rc5. Having observed more occurrences, I can say that larger files seem to be a bigger problem, i.e. they have a worse effect on the system than a lot of smaller files of approximately the same total size being copied from an external HDD. Restoring a backup on 4.2rc5 again made the system only partially usable, especially due to the negative impact on the touchpad, while the keyboard was mostly fine.

UndeadDevel avatar Dec 03 '23 18:12 UndeadDevel

Another update: I just got a Crucial X9 Pro external SSD and it's even worse than the HDD. It takes more than a minute to even mount, during which time the system is extremely unresponsive to touchpad input and, to a lesser but still significant degree, to keyboard input. Copying files from it to the internal SSD has the same effects, though less severe than when mounting; copying from internal to external works best, with barely any impact and the best copy speed (almost the maximum for the X9 Pro and the ports). I've also noticed that if I pool all three USB controllers in sys-usb (two of them are Thunderbolt 4 controllers), SSD performance seems to be better, even on the non-TB4 port, but that may be anecdotal.

As with the external HDDs, those problems don't exist on a machine running bare-metal Fedora 38, where there is no impact on input device performance or high mount times.

Strangely, I don't see any errors or warnings in sudo dmesg or sudo journalctl, not in dom0, not in sys-usb and not in the qube I mounted in. Curiously, one of the times I mounted the SSD, it had almost no impact on the input devices, but still took over a minute to complete.

Edit: Just had another case where it mounted without creating problems (and without taking long, though this time I mounted it with VeraCrypt rather than normally, as I had created a partition volume)...and now there is still no impact whatsoever on input device performance even though I'm doing a lot of I/O on it...very strange.

Edit2: after more testing it now works well in all cases, i.e. both with VeraCrypt or without and from both ports with different kinds of loads...there is no more negative impact on input device performance or long mount times. The only thing I can think of that changed is the fact that I reformatted it in QubesOS using either a debian-xfce or fedora-xfce based qube running gparted. Will make another post once I reformat my HDD (end of month or so).

UndeadDevel avatar Dec 06 '23 16:12 UndeadDevel

I'm having this issue also when transferring data to and from my phone via USB:

[Comparison]

  1. Old desktop running bare-metal Fedora 38 (USB transfer): up to 28 MiB/s (large file)
  2. NV41 while phone is mounted in sys-usb via Thunar (USB): up to 15 MiB/s (large file)
  3. NV41 while phone is mounted in sys-usb via Thunar (USB): up to 11 MiB/s (many smaller files)
  4. NV41 while phone is mounted in sys-usb via jmtpfs (USB): up to 50 MiB/s (large file), but hangs several seconds after file copied; 50 MiB/s is the speed without counting those seconds
  5. NV41 while phone is mounted in sys-usb via jmtpfs (USB): up to 250 KiB/s (!!!) (many smaller files), though the copy progress window claims much higher speeds and only approaches the real speed toward the end of the transfer (the number I gave is what I calculated from the total size of all files and how long it took)
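
The "total size divided by elapsed time" figure can be reproduced with a small script instead of relying on the copy progress window. This is a minimal Python sketch, not the method actually used for the numbers above; tree_size and timed_copy are hypothetical names, and it assumes the destination directory does not exist yet. os.sync() asks the kernel to flush buffered writes so that a trailing hang is counted in the elapsed time; whether an MTP mount fully honors that is not something this sketch verifies.

```python
import os
import shutil
import time


def tree_size(path):
    """Total size in bytes of all regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def timed_copy(src, dst):
    """Copy directory `src` to new directory `dst`.

    Returns (bytes_copied, elapsed_seconds, effective_rate_in_KiB_per_s).
    """
    start = time.monotonic()
    shutil.copytree(src, dst)
    if hasattr(os, "sync"):
        os.sync()  # include time spent flushing buffered writes
    elapsed = max(time.monotonic() - start, 1e-9)  # avoid division by zero
    size = tree_size(dst)
    return size, elapsed, size / 1024 / elapsed
```

Calling timed_copy with the phone's mounted folder as src and a fresh local directory as dst yields an effective rate that includes any hang after the last file, which is the figure that matters for comparing the cases listed above.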

So it seems some operation after the copying, especially when mounted via jmtpfs, causes some kind of hang, which totally kills performance; that last case (multiple smaller files copied while mounted via jmtpfs) produces by far the most impact on keyboard and touchpad responsiveness. When mounted via jmtpfs there is also significant impact and lags when just browsing the phone's file system, while when mounted via Thunar there is barely any impact and good responsiveness.

I also have terrible transfer speed via Bluetooth (up to 60 KiB/s for large or multiple files; up to 1 MiB/s for a single, smaller file, i.e. <1 MiB file size), but no impact on touchpad / keyboard, so that is probably a different issue.

I haven't gotten around to formatting my other HDD yet, but maybe next week or so I'll do that and report the result as promised in my previous post.

UndeadDevel avatar Feb 27 '24 11:02 UndeadDevel

Some more testing results:

Unfortunately, with the HDD I've mentioned several times above, the issue persists even after reformatting it. Curiously, when I was creating a large VeraCrypt volume on it (CPU usage in sys-usb and the "veracrypt-vm" was rather low, below 30% throughout in xl top and below 10% in the domains widget), there were almost no symptoms in the first ~2 hours, but then it got worse and worse. The same goes for a USB thumb drive I tested...so the formatting solution has so far only worked for my external SSD, which continues to show good performance and (almost?) no symptoms.

I also noticed a new symptom: audio hangs, including repeats of short fragments up to the point where it's impossible to listen to...I didn't notice it until now since I don't play audio / video often and, as with the other symptoms, it doesn't always occur.

I've now updated the title and OP.

UndeadDevel avatar Mar 11 '24 20:03 UndeadDevel

Can you try xl sched-credit2 -d 0 -w 2000 in dom0? Does it help?

marmarek avatar Mar 11 '24 21:03 marmarek

> Can you try xl sched-credit2 -d 0 -w 2000 in dom0? Does it help?

It's possible that there is slight improvement when executing that command, but barely, if any. It's hard to tell, because the impact on keyboard and touchpad functionality varies according to mysterious patterns. The command definitely doesn't fix the issue.

UndeadDevel avatar Mar 11 '24 21:03 UndeadDevel