NV41 touch-pad unresponsive and erratic during certain operations with high I/O on external storage
Qubes OS release
4.1.2
Brief summary
When doing a lot of I/O with an external drive (in this case an HDD; I cannot currently try with an external SSD), such as restoring a backup or wiping files, the touch-pad on this NV41 laptop becomes almost unusable: cursor movement is extremely choppy and unresponsive, and "click" events aren't registered properly, so it clicks randomly when I'm just trying to move the cursor.
This doesn't happen when doing a lot of I/O on the internal SSD or when stressing the system in other ways, such as having it crunch pi to a high accuracy on all cores or opening multiple qubes simultaneously to bring dom0 CPU usage up. I've reported this before as an aside of the paranoid mode backup restore issue, but I've noticed it several times now under various circumstances, always when doing a lot of I/O on an external HDD.
Curiously, the keyboard still works very smoothly; typing fast is no problem; all other system components seem to work without issue, e.g. I can get very smooth scrolling in a browser using the arrow keys, while scrolling with the touch-pad is basically impossible. Opening new VMs etc. also works fine and fast (if done with the keyboard), so this is not a "combined input problem".
Another curiosity is that the degree to which the problem exists during external I/O varies, sometimes greatly. E.g. during this very long shred operation I'm doing, I've noticed that it was very bad for the first ~40 mins and then for ~30 mins the issue completely vanished, even though the operation was still ongoing and CPU usage didn't change (xl top still showing ~170-210% on that qube). I may have found a way to trigger it to come back when I/O is already ongoing: by connecting a second HDD to another USB port...so far when I did that (twice) the problem came back within about one minute, but I suppose it could have been a coincidence, as later after the problem had vanished again, it came back without me connecting something.
Steps to reproduce
- Connect external drive, attach to some qube
- Start doing a lot of I/O on it, such as shred on a large file / backup restore
Expected behavior
Minimal or no impact on touch-pad usability
Actual behavior
Touch-pad almost unusable much of the time
Possibly related issues
#7932, #7893. Both of these seem to be about keyboard and mouse, whereas in my case only the touch-pad is affected.
I can only think of three causes for this behavior:
1. Xen gets bogged down doing some sort of emulation.
2. High interrupt load.
3. Bug in firmware or hardware.
Do you know a good way to test for 1 or 2? I don't have a suitable second machine to test 3 right now.
/proc/interrupts in sys-usb? There is probably a way to get better data from Xen that I am not aware of right now.
Here's the bottom end of that log...no idea if this is normal or not (millions of interrupts reported):
NMI: 0 0 0 0 0 0 Non-maskable interrupts
LOC: 0 0 0 0 0 0 Local timer interrupts
SPU: 0 0 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 0 0 Performance monitoring interrupts
IWI: 1 1 0 0 0 1 IRQ work interrupts
RTR: 0 0 0 0 0 0 APIC ICR read retries
RES: 10599 4883 4085 3589 5896 4603 Rescheduling interrupts
CAL: 1840355 3598385 1575170 3688326 1611273 2991577 Function call interrupts
TLB: 519 182 143 154 342 223 TLB shootdowns
TRM: 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 0 0 Machine check exceptions
MCP: 140 140 140 140 140 140 Machine check polls
HYP: 2244326 4581534 3438130 4950174 2432874 3931114 Hypervisor callback interrupts
ERR: 0
MIS: 0
PIN: 0 0 0 0 0 0 Posted-interrupt notification event
NPI: 0 0 0 0 0 0 Nested posted-interrupt event
PIW: 0 0 0 0 0 0 Posted-interrupt wakeup event
Other lines with lots of interrupts:
39: 25 0 0 0 0 0 xen-pirq -ioapic-level ehci_hcd:usb1
48: 406112 0 0 0 0 0 xen-percpu -virq timer0
49: 10599 0 0 0 0 0 xen-percpu -ipi resched0
50: 1787 0 0 0 0 0 xen-percpu -ipi callfunc0
51: 1838568 0 0 0 0 0 xen-percpu -ipi callfuncsingle0
52: 0 0 0 0 0 0 xen-percpu -ipi spinlock0
53: 0 987012 0 0 0 0 xen-percpu -virq timer1
54: 0 4883 0 0 0 0 xen-percpu -ipi resched1
55: 0 1349 0 0 0 0 xen-percpu -ipi callfunc1
56: 0 3597036 0 0 0 0 xen-percpu -ipi callfuncsingle1
57: 0 0 0 0 0 0 xen-percpu -ipi spinlock1
58: 0 0 1852406 0 0 0 xen-percpu -virq timer2
59: 0 0 4085 0 0 0 xen-percpu -ipi resched2
60: 0 0 2054 0 0 0 xen-percpu -ipi callfunc2
61: 0 0 1573116 0 0 0 xen-percpu -ipi callfuncsingle2
62: 0 0 0 0 0 0 xen-percpu -ipi spinlock2
63: 0 0 0 1255261 0 0 xen-percpu -virq timer3
64: 0 0 0 3589 0 0 xen-percpu -ipi resched3
65: 0 0 0 1886 0 0 xen-percpu -ipi callfunc3
66: 0 0 0 3686440 0 0 xen-percpu -ipi callfuncsingle3
67: 0 0 0 0 0 0 xen-percpu -ipi spinlock3
68: 0 0 0 0 813953 0 xen-percpu -virq timer4
69: 0 0 0 0 5896 0 xen-percpu -ipi resched4
70: 0 0 0 0 1887 0 xen-percpu -ipi callfunc4
71: 0 0 0 0 1609386 0 xen-percpu -ipi callfuncsingle4
72: 0 0 0 0 0 0 xen-percpu -ipi spinlock4
73: 0 0 0 0 0 956074 xen-percpu -virq timer5
74: 0 0 0 0 0 4603 xen-percpu -ipi resched5
75: 0 0 0 0 0 1891 xen-percpu -ipi callfunc5
76: 0 0 0 0 0 2989686 xen-percpu -ipi callfuncsingle5
77: 0 0 0 0 0 0 xen-percpu -ipi spinlock5
97: 0 0 14872883 0 0 0 PCI-MSI-0000:00:06.0 0-edge xhci_hcd
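Note that the counters in /proc/interrupts are cumulative since the qube booted, so the totals above don't show by themselves whether the interrupt rate spikes while the I/O is running. A minimal sketch (Python, not something posted in this thread; the function names are hypothetical) that diffs two snapshots taken a few seconds apart and ranks the IRQ lines by interrupts per second, assuming the usual layout of one header row of CPU names followed by one row per IRQ:

```python
import time


def parse_interrupts(text):
    """Return {irq_label: count summed over all CPUs} for /proc/interrupts text."""
    totals = {}
    lines = text.strip().splitlines()
    if not lines:
        return totals
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        # The first ncpu fields are per-CPU counts; rows such as ERR/MIS
        # carry a single count, and the description starts afterwards.
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        if counts:
            totals[label.strip()] = sum(counts)
    return totals


def interrupt_rates(before, after, seconds):
    """Return [(irq_label, interrupts_per_second)], highest rate first."""
    first = parse_interrupts(before)
    second = parse_interrupts(after)
    rates = {irq: (second[irq] - first.get(irq, 0)) / seconds for irq in second}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


def sample_rates(seconds=5.0, path="/proc/interrupts"):
    """Take two snapshots `seconds` apart and return the ranked rates."""
    with open(path) as handle:
        before = handle.read()
    time.sleep(seconds)
    with open(path) as handle:
        after = handle.read()
    return interrupt_rates(before, after, seconds)
```

Calling `sample_rates()` inside sys-usb once while the touch-pad is misbehaving and once while it is fine would make the two cases comparable, e.g. for the xhci_hcd line (IRQ 97 in the log above).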
Which kernel version do you have in dom0 and sys-usb?
6.4.8-1 in both dom0 and sys-usb
Edit: I'll add another "data point": I was just running a btrfs scrub on a large volume (again on an external HDD) and this time the touch-pad was only mildly affected, and so were the other system components, e.g. keyboard input and starting VMs were slower...so a more expected effect of stress on the system. CPU usage in that VM according to xl top was even higher, usually >230%, but the touch-pad is not more affected than the other system components this time.
This was on a different HDD from a different manufacturer, with a different capacity and different file system, in case it matters. I think I'll try a shred on that one today just to see how the behavior is; should it be different than on the other HDD then I'll report back.
How does the CPU usage of sys-usb and sys-usb-dm look in xl top during the shred?
On the same drive that I did the btrfs scrub on earlier (so a different one from the first shred), both sys-usb and sys-usb-dm are below 20%, both during the scrub and now during the shred, with sys-usb always higher than sys-usb-dm. The VM doing the shred is at >170%. But, as I'm noticing, there is no impact on the touch-pad now, though I'm seeing an impact on the keyboard, with lags during typing and some letters repeated many times even though I only pressed the key briefly; the system is still usable.
So quite a curious result, and I suppose it indicates that this does indeed have to do with the hardware or firmware of the external drive...I may do another shred today on the first HDD to see if I get the same result as yesterday, which I expect, since this was also the one I restored a backup from, where the touch-pad problems also occurred; I will report if the result is not as expected.
Still a problem on Qubes 4.2rc5. Having observed more occurrences, I can say that larger files seem to be a bigger problem, i.e. they have a worse effect on the system than a lot of smaller files being copied from an external HDD with approximately the same total file size. Restoring a backup on 4.2rc5 again made the system only partially usable, especially due to the negative impact on the touchpad, while the keyboard was mostly fine.
Another update: I just got a Crucial X9 Pro external SSD and it's even worse than the HDD: it takes more than a minute to even mount, during which time the system is extremely unresponsive to touchpad input, and keyboard input is less affected but still significantly impacted. Copying files from it to the internal SSD also has these impacts, though less severe than when mounting; copying from internal to external works best, with barely any impact and the best copy speed (almost the maximum for the X9 Pro and the ports). I've also noticed that if I pool all three USB controllers in sys-usb (two of them are Thunderbolt 4 controllers) then SSD performance seems to be better, even on the non-TB4 port, but that may be anecdotal.
As with the external HDDs, those problems don't exist on a machine running bare-metal Fedora 38, where there is no impact on input device performance or high mount times.
Strangely, I don't see any errors or warnings in sudo dmesg or sudo journalctl, not in dom0, not in sys-usb and not in the qube I mounted in. Curiously, one of the times I mounted the SSD, it had almost no impact on the input devices, but still took over a minute to complete.
Edit: Just had another case where it mounted without creating problems (and without taking long, though this time I didn't mount it the normal way but with VeraCrypt, as I had created a partition volume)...and now there is still no impact whatsoever on input device performance even though I'm doing a lot of I/O on it...very strange.
Edit2: after more testing it now works well in all cases, i.e. both with VeraCrypt or without and from both ports with different kinds of loads...there is no more negative impact on input device performance or long mount times. The only thing I can think of that changed is the fact that I reformatted it in QubesOS using either a debian-xfce or fedora-xfce based qube running gparted. Will make another post once I reformat my HDD (end of month or so).
I'm having this issue also when transferring data to and from my phone via USB:
[Comparison] old desktop running bare-metal Fedora 38 (USB transfer): up to 28 MiB/s (large file)
NV41 while phone is mounted in sys-usb via Thunar (USB): up to 15 MiB/s (large file)
NV41 while phone is mounted in sys-usb via Thunar (USB): up to 11 MiB/s (many smaller files)
NV41 while phone is mounted in sys-usb via jmtpfs (USB): up to 50 MiB/s (large file), but hangs several seconds after file copied; 50 MiB/s is the speed without counting those seconds
NV41 while phone is mounted in sys-usb via jmtpfs (USB): up to 250 KiB/s (!!!) (many smaller files), though the copy progress window claims much higher speeds and only approaches the real speed toward the end of the transfer (the number I gave is what I calculated from the total size of all files and how long it took)
So it seems some operation after the copying, especially when mounted via jmtpfs, causes some kind of hang, which totally kills performance; that last case (multiple smaller files copied while mounted via jmtpfs) produces by far the most impact on keyboard and touchpad responsiveness. When mounted via jmtpfs there is also significant impact and lags when just browsing the phone's file system, while when mounted via Thunar there is barely any impact and good responsiveness.
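The real-world speeds quoted above were worked out from the total size of the files and the wall-clock time, since the progress window overstates the rate. A rough sketch of that calculation (Python; the helper names are hypothetical and this is not how the numbers in this post were actually produced), which also forces each copied file to be flushed so that any hang after the copy is counted in the elapsed time:

```python
import os
import shutil
import time


def mib_per_s(total_bytes, elapsed_seconds):
    """Effective throughput in MiB/s from total size and wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_bytes / (1024 * 1024) / elapsed_seconds


def timed_copy(sources, dest_dir):
    """Copy files into dest_dir; return (total_bytes, elapsed_seconds)."""
    start = time.monotonic()
    total = 0
    for src in sources:
        dst = os.path.join(dest_dir, os.path.basename(src))
        shutil.copyfile(src, dst)
        # Flush to the device so buffered writes are part of the timing.
        # Assumption: the target file system honours fsync; a FUSE mount
        # such as jmtpfs may not.
        with open(dst, "r+b") as handle:
            os.fsync(handle.fileno())
        total += os.path.getsize(dst)
    return total, time.monotonic() - start
```

For example, `mib_per_s(*timed_copy(files, "/mnt/phone"))` would give one comparable number per mount method, where `files` and the mount point are placeholders.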
I also have terrible transfer speed via Bluetooth (up to 60 KiB/s for large or multiple files; up to 1 MiB/s for a single, smaller file, i.e. <1 MiB file size), but no impact on touchpad / keyboard, so probably a different issue.
I haven't gotten around to formatting my other HDD yet, but maybe next week or so I'll do that and report the result as promised in my previous post.
Some more testing results:
Unfortunately, with the HDD I've mentioned several times above, the issue persists even after reformatting it. Curiously, when I was creating a large VeraCrypt volume on it (CPU usage in sys-usb and the "veracrypt-vm" was rather low, below 30% throughout in xl top and below 10% in the domains widget), there were almost no symptoms in the first ~2 hours, but then it got worse and worse.
Same goes for a USB thumb drive I tested...so the formatting solution so far only worked for my external SSD, which continues to show good performance and (almost?) no symptoms.
I also noticed a new symptom: audio hangs, including repeats of short fragments up to the point where it's impossible to listen to...I didn't notice it until now since I don't play audio / video often and, as with the other symptoms, it doesn't always occur.
I've now updated the title and OP.
Can you try xl sched-credit2 -d 0 -w 2000 in dom0? Does it help?
It's possible that there is a slight improvement when executing that command, but barely, if at all. It's hard to tell, because the impact on keyboard and touchpad functionality varies according to mysterious patterns. The command definitely doesn't fix the issue.