usbipd-win icon indicating copy to clipboard operation
usbipd-win copied to clipboard

USBIPd lockup about once a month

Open rhermsen opened this issue 1 year ago • 32 comments

I'm using usbipd between Windows-11 and WSL2-Ubuntu to connect a serial device to WSL2. For the most part it works great, but about once a month it locks up. I have a Python program running that obtains data from the USB connected device every 5 seconds. To recover from this lockup I have to detach/attach (usbipd wsl detach --busid 1-4 / usbipd wsl attach --busid 1-4). And than, after a restart of my program it works again. There is no need to powercycle the device with USB (serial) port. Dates seen a lockup so far (so often far apart):

2023-06-22 15:46
2023-06-26 23:48
2023-08-09 19:15
2023-09-17 15:34
2023-10-10 01:44

I looked at https://github.com/dorssel/usbipd-win/wiki/Troubleshooting and this time I started with trace level logging. I see that the logging is without timestamps, so I wonder if this logging would be helpful for my issue? Any additional advice how I could debug this issue?

Version details: Windows:

systeminfo | findstr /B /C:"OS Name" /B /C:"OS Version"
OS Name:                   Microsoft Windows 11 Pro
OS Version:                10.0.22000 N/A Build 22000
usbipd --version
3.2.0+38.Branch.master.Sha.7ad7b9904ac0db8d759e3a53a22b1126da01b2be

WSL2:

$ cat /etc/os-release | grep VERSION=
VERSION="22.04.3 LTS (Jammy Jellyfish)"
$ uname -r
5.15.90.1-microsoft-standard-WSL2
$ apt list --installed hwdata linux-tools*
Listing... Done
hwdata/jammy,now 0.357-1 all [installed]
linux-tools-5.15.0-60-generic/now 5.15.0-60.66 amd64 [installed,local]
linux-tools-5.15.0-60/now 5.15.0-60.66 amd64 [installed,local]
linux-tools-5.15.0-86-generic/jammy-updates,jammy-security,now 5.15.0-86.96 amd64 [installed,automatic]
linux-tools-5.15.0-86/jammy-updates,jammy-security,now 5.15.0-86.96 amd64 [installed,automatic]
linux-tools-common/jammy-updates,jammy-security,now 5.15.0-86.96 all [installed,automatic]
linux-tools-virtual/jammy-updates,jammy-security,now 5.15.0.86.83 amd64 [installed]
$ update-alternatives --display usbip
usbip - auto mode
  link best version is /usr/lib/linux-tools/5.15.0-60-generic/usbip
  link currently points to /usr/lib/linux-tools/5.15.0-60-generic/usbip
  link usbip is /usr/local/bin/usbip
/usr/lib/linux-tools/5.15.0-60-generic/usbip - priority 20

USB device/driver: A particulate matter meter (fijnstof meter). https://github.com/msillano/tuyaDAEMON/wiki/custom-device-'PM-detector':-case-study

$ lsusb -s 001:005
Bus 001 Device 005: ID 1a86:7523 QinHeng Electronics CH340 serial converter

rhermsen avatar Oct 14 '23 01:10 rhermsen

Any additional advice how I could debug this issue?

https://github.com/dorssel/usbipd-win/wiki/Troubleshooting also shows how you can create captures (PCAPs). That's the most detailed logging available. You can also capture on the Linux side (usbmon), but it's a bit harder to set up. syslog messages are also helpful.

dorssel avatar Oct 14 '23 11:10 dorssel

4.0.0 has been released which fixes a (very small) memory leak. This could explain your issue. Can you confirm?

dorssel avatar Dec 06 '23 12:12 dorssel

Thanks for the update. I installed version 4.0.0 on 7-Dec. At that time 3.2.0 was running for 6 wks.

rhermsen avatar Dec 12 '23 22:12 rhermsen

Any news on this?

dorssel avatar Feb 04 '24 00:02 dorssel

Hi, Uptime is now: 'active (running) since Wed 2024-01-03 13:45:54 CET; 1 month 1 day ago' (I did had to restart a few times after installing 4.0.0 for other reasons ...)

rhermsen avatar Feb 04 '24 01:02 rhermsen

Are you monitoring the memory usage of the usbipd service?

dorssel avatar Feb 04 '24 12:02 dorssel

Hi, No I was not..., only overall memory usage (which goes up and down due to WSL2). Just added a graph to MRTG for usbipd.exe

rhermsen avatar Feb 04 '24 21:02 rhermsen

Update on a likely unrelated occurrence.

On 2024-02-29 at 20:28 the PM-Monitor stopped working. This was after 1 month and 26 days uptime (started 2024-01-03 13:45:54). This time with a different failure-mode than for which I opened the ticket. I cannot exclude that this got raised due to a connection glitch (e.g. connector accidentally touched). The Win11 event log also show some logs at the time of the issue.

Memory usage looks to increase from 10MB to about 11MB at the moment the issue occurred (2024-02-29 20:26:48 CET). Because the CH340-based serial-USB device was detached from usbipd.exe the setup stopped (no auto restart). After attaching the CH340-based serial-USB device and restarting PM-Monitor process (2024-03-01 19:19:55 CET) I see another memory usage increase from 11MB to 12MB. I have a graph of this ~20% memory usage increase. usbipd.exe wasn't restarted, so maybe it is normal to have a bit more memory in use?

Because I didn't have to restart usbipd.exe for this occurrence I assume it to be a different issue than for which I opened the ticket.

PM-Monitor restart scripts

startUSBipD.bat

@echo off
echo "Starting usbipd..."

:search
tasklist | find "wsl.exe"
IF %ERRORLEVEL% == 0 GOTO :found
TIMEOUT /T 1
GOTO :search

:found
TIMEOUT /T 3
python C:\Users\rhermsen\usbipd_start.py
TIMEOUT /T 30
wsl.exe -u root service pm_monitor restart
echo "Done"

usbipd_start.py

# -*- coding: utf-8 -*-
"""
# execute usbipd at startup
#
# rhermsen
# version 2.0
# 
Created on Fri Jun 16 21:20:07 2023


Available commands upto 3.2.0:
    usbipd wsl list
    usbipd wsl
        gives usage info
    usbipd wsl list
    usbipd wsl attach -h
    usbipd wsl detach -h
Available commands from 4.0.0:
    usbipd list
    usbipd
        gives usage info
    usbipd list
    usbipd attach -h
    usbipd detach -h


@author: rhermsen

"""

import subprocess

def executeCliCmd(command):
    '''
    Parameters
    ----------
    command : string, The full command line, including the parameters to execute.
    
        DESCRIPTION.
        Executes a command with optional parameters. The resulting output is returned in a list of strings.

    Returns
    -------
    List of strings where each line is an entry in this list.
    '''
    p = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
    out, err = p.communicate()
    outString = str(out)
    # result = outString.split('\\r\\n')
    # print(result)
    return outString.split('\\r\\n')



def findBusidAndState(list, match='CH340'):
    attached = False
    busid = None
    for line in list:
        wordList = line.split()
        if match in wordList:
            busid = wordList[0]
            if "Attached" in wordList:
                attached = True
            # print("busid =", busid, "attached", attached)
    return busid, attached

list_command = 'usbipd list'
usb_list = executeCliCmd(list_command)

busid, attached = findBusidAndState(usb_list)
attach_command = f'usbipd attach --wsl --busid {busid}'
#print("attach_command:",attach_command)
if busid != None and attached == False:
    executeCliCmd(attach_command)
elif attached == True:
    print("PM-Monitor USB already attached to WSL2")
else:
    print("Could not attach PM-Monitor.")

MRTG USBIPd memory usage

Logging

WSL

Error messages in journalctl:

Feb 29 20:26:48 DESKTOP-E6DGEM9 kernel: ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
Feb 29 20:26:48 DESKTOP-E6DGEM9 kernel: vhci_hcd: connection reset by peer
Feb 29 20:26:48 DESKTOP-E6DGEM9 kernel: vhci_hcd: stop threads
Feb 29 20:26:48 DESKTOP-E6DGEM9 kernel: vhci_hcd: release socket
Feb 29 20:26:48 DESKTOP-E6DGEM9 kernel: vhci_hcd: disconnect device
Feb 29 20:26:48 DESKTOP-E6DGEM9 kernel: usb 1-1: USB disconnect, device number 2
Feb 29 20:26:48 DESKTOP-E6DGEM9 kernel: usb 1-1: failed to send control message: -19
Feb 29 20:26:48 DESKTOP-E6DGEM9 kernel: ch341-uart ttyUSB0: ch341-uart converter now disconnected from ttyUSB0
Feb 29 20:26:48 DESKTOP-E6DGEM9 kernel: ch341 1-1:1.0: device disconnected

similar in dmesg:

[Thu Feb 29 20:27:01 CET 2024] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[Thu Feb 29 20:27:01 CET 2024] vhci_hcd: connection reset by peer
[Thu Feb 29 20:27:01 CET 2024] vhci_hcd: stop threads
[Thu Feb 29 20:27:01 CET 2024] vhci_hcd: release socket
[Thu Feb 29 20:27:01 CET 2024] vhci_hcd: disconnect device
[Thu Feb 29 20:27:01 CET 2024] usb 1-1: USB disconnect, device number 2
[Thu Feb 29 20:27:01 CET 2024] usb 1-1: failed to send control message: -19
[Thu Feb 29 20:27:01 CET 2024] ch341-uart ttyUSB0: ch341-uart converter now disconnected from ttyUSB0
[Thu Feb 29 20:27:01 CET 2024] ch341 1-1:1.0: device disconnected

Win11 event logs

(sometimes with a few events before the issue, and more after the issue): USBIPd.exe event 2024-02-29 2026.evtx: USBIPd.exe event 2024-02-29 2026.evtx.zip Win Application logs USBIPd.exe event 2024-02-29 2026.evtx: Win Application logs USBIPd.exe event 2024-02-29 2026.evtx.zip Win System logs USBIPd.exe event 2024-02-29 2026.evtx: Win System logs USBIPd.exe event 2024-02-29 2026.evtx.zip

rhermsen avatar Mar 02 '24 23:03 rhermsen

Thanks for the elaborate report.

Memory usage seems correct. usbipd-win is written in C#, a managed language using a garbage collector. It's normal for a new connection to have a temporary increase in memory consumption. It will take a while before the garbage collector will kick in.

The memory bug that I fixed earlier leaked 40 bytes per URB + a kernel event handle. I could see it in memory consumption clearly after about 100.000 URBs. I guess in your case it was the kernel handles that caused the original crash. I went up to 1 million URBs, but couldn't actually reproduce the crash. Can you guestimate your URB-rate? If that is indeed > 1 million per month it could explain the original crash.

With the current behavior, I think we can safely say the original problem is fixed.

dorssel avatar Mar 02 '24 23:03 dorssel

Is it only in one direct this issue was present?

My stab at guestimate the URB-rate: I have a tcpdump capture and assuming each URB ends up in a separate TCP segment I count

  • Win > WSL 10 segments from which 1 TCP segment without data (e.g. TCP Ack only) = 9
  • WSL > Win 12 segments from which 3 TCP segment without data (e.g. TCP Ack only) = 9
  • Same amount of data is requested every 5 seconds
  • ( 9 / 5 ) * ( 60 * 60 * 24 * 365) /12 = 4730400 URB/month in each direction

rhermsen avatar Mar 03 '24 11:03 rhermsen

Is it only in one direct this issue was present?

Usually, for each URB there is one USBIP request and one USBIP reply. It also makes sense that a single "poll" from your software consists of several URBs. Of course, when either the request or the reply doesn't fit in the MTU there will be several TCP fragments. But as a "guestimate", the 4 million makes sense.

dorssel avatar Mar 03 '24 12:03 dorssel

To comment on your last update: Indeed I see a single poll to be split in several URB's each in its own TCP segment. There is no fragmentation (largest packet is 146 MB only).

rhermsen avatar Mar 07 '24 21:03 rhermsen

Unfortunately there was another similar occurrence as on 2024-02-29.

On 2024-03-07 11:04 the PM-Monitor stopped working in a similar way as on 2024-02-29 at 20:28. This was after only 5 days uptime (started 2024-03-01 19:19:55). This time nobody was near the device. The Win11 event log doesn't show much this time...

Category: Usbipd.ConnectedClient
EventId: 2

Client 192.168.188.87 released device at 1-4 (USB\VID_1A86&PID_7523\5&9BB4271&0&4).

Memory usage was the same as after last week issue, 12MB until I restarted USBIPd.exe. But after I restarted with debug and packet capture (pcap) enabled it jumped to 22MB. (I assume a bit this is due debug and/or a packet capture running).

MRTG USBIPd memory usage 1

I hope the capture isn't growing too fast, but from the 'guestimate' it probably will...

rhermsen avatar Mar 07 '24 22:03 rhermsen

I observed another occurrence (2024-05-14, 23:11) after 1 month and 6 days. This time with capturing in place. I was still using usbipd 4.0.0 (capturing timestamps incorrect). This was just before my holiday, so I only had little time to investigate. Now I'm back from holiday.

At the moment the issue starts I see an 'address' change at the server side (windows) from 1.4.2 to 1.4.1 The requests coming from WSL2 (Ubuntu) are still going to 1.4.2. There is a response packet from 1.4.2 followed by 1.4.1 from the server side. All these packets are marked as 'URB transfer type' 'URB_INTERRUPT' in the IN direction. This repeats at every request coming from WSL2, every 5 seconds.

In the mean time I upgraded to USBIPd 4.2.0.

The complete capture is 1.19GB. I extracted the last 2 correct sequences, and 2 incorrect sequences (named: pm-monitor_20240514__2311__last_successful_and_failed2.pcapng.txt ):

packet 21 to 38 (=18), expected request response
packet 39 and 40 look a bit unexpected
packet 41 to 70 (=30, so much more packets) the request looks good, but response not.
packet 71 to 114 (=44 even more packets) the request looks good, but response not.
and this repeats every 5 seconds

Do you see what goes wrong here? pm-monitor_20240514__2311__last_successful_and_failed2.pcapng.txt

rhermsen avatar Jun 20 '24 16:06 rhermsen

1.4.2 is a bulk endpoint. In Linux terminology, it acts like a block device. You request a read, and you get a reply, always. 1.4.1 is an interrupt endpoint. In Linux terminology, it acts like a character device. You request a read, and it blocks until there is actually something to read., which may be immediately or maybe never at all.

Normally, your device apparently uses 1.4.2. And every IN request completes successfully. Then, in your log line 39. It does not respond. This is the error. A bulk pipe should always respond immediately to a read request. But this time it doesn't. It ends with ETIME, which comes from VBoxUSB where the error code is called USBSUP_XFER_DNR, which it in turn got from Windows where the error code is called USBD_STATUS_DEV_NOT_RESPONDING.

Then your client side driver again submits an IN request for 1.4.2, but it never completes. Besides that, it starts sending requests to 1.4.1.

The root cause is the device not responding (ETIME) in the first place.

dorssel avatar Jun 22 '24 18:06 dorssel

If I get it correct it the USB device (the PM-monitor) connected to the Windows machine that is 'misbehaving' by having a timeout. (this is seen from packet 39 having status ETIME (=Timer expired)). It is at the hardware side the timeout occurred, so I'm not able to solve it for future similar events.

For all previous events I never had to power-cycle the PM-Monitor to restore communication (required steps: usbipd detach/attach and restart of my python script). This time I'm not sure as I powered off all involved equipment. Is there a retry mechanism possible, or a workaround to initiate a restart e.g. from WSL2?

rhermsen avatar Jun 29 '24 21:06 rhermsen

If I get it correct it the USB device (the PM-monitor) connected to the Windows machine that is 'misbehaving' by having a timeout.

Looks like it. It is in any case the error that the VBoxUSB driver got from Windows. Normally, when devices get "stuck" it is called a STALL. And the error for that is EPIPE. Client drivers usually handle that by sending a "clear halt" (pipe reset) via channel 0 (the control channel). Maybe (just maybe) that may work for ETIME as well. And I have to admit, the translation for this error to ETIME is mine, based on https://www.kernel.org/doc/html/latest/driver-api/usb/error-codes.html. I/You/We could try to return EPIPE instead, hoping that the driver responds to that with a pipe reset, which just may work (or not)...

Translation happens here: https://github.com/dorssel/usbipd-win/blob/master/Usbipd/Tools.cs#L109

dorssel avatar Jun 30 '24 11:06 dorssel

@dorssel If it is possible to have an build with EPIPE returned i.s.o. ETIME that would be great. (I see you did that sometime in the past https://github.com/dorssel/usbipd-win/actions/runs/7308106135?pr=812)

Btw, are there any instructions available to built usbipd from source?

rhermsen avatar Jul 03 '24 19:07 rhermsen

Btw, are there any instructions available to built usbipd from source?

Simply run the commands from the GitHub build script: https://github.com/dorssel/usbipd-win/blob/master/.github/workflows/build-installer.yml

dorssel avatar Jul 03 '24 21:07 dorssel

First time I look at GitHub Actions...

I did manage to build an msi-installer, only it is for version 0.2.0 (usbipd-win_0.2.0.msi). Steps done:

  • Forked your usbipd-win repo: https://github.com/rhermsen/usbipd-win
  • enabled actions for all workflows
  • changed and committed line https://github.com/rhermsen/usbipd-win/blob/master/Usbipd/Tools.cs#L115
  • Found I also need to change and commit line https://github.com/rhermsen/usbipd-win/blob/master/UnitTests/Tools_Tests.cs#L193
  • After a few minutes this resulted in a successful installer: https://github.com/rhermsen/usbipd-win/actions/runs/9800638801
  • Uninstalled the current 4.2.0 version, and installed the new build version 0.2.0

But after this I get an exception regarding VBoxUsb driver (see debug log below). But now also after installing 4.2.0 I get the same exception message.

usbipd --version
0.2.0+1052.Branch.master.Sha.716112dce471b303d6de1ddd8ba7273823bd816e
C:\WINDOWS\system32>usbipd server Logging:LogLevel:Default=Trace "usbipd:PcapNg:Path=C:\Users\rhermsen\pm-monitor_20240706__0155.pcapng"
dbug: Microsoft.Extensions.Hosting.Internal.Host[1]
      Hosting starting
dbug: Usbipd.PcapNg[1000]
      usbipd:PcapNg:Path = 'C:\Users\rhermsen\pm-monitor_20240706__0155.pcapng'
dbug: Usbipd.PcapNg[1000]
      usbipd:PcapNg:SnapLength = unlimited
dbug: Usbipd.Server[1000]
      usbipd:Port = 3240
dbug: Usbipd.Server[1000]
      0.2.0+1052.Branch.master.Sha.716112dce471b303d6de1ddd8ba7273823bd816e
info: Microsoft.Hosting.Lifetime[0]
      Application started. Press Ctrl+C to shut down.
info: Microsoft.Hosting.Lifetime[0]
      Hosting environment: Production
info: Microsoft.Hosting.Lifetime[0]
      Content root path: C:\Program Files\usbipd-win\
dbug: Microsoft.Extensions.Hosting.Internal.Host[2]
      Hosting started
dbug: Usbipd.PcapNg[1000]
      Flushing
dbug: Usbipd.Server[1000]
      new connection from 192.168.188.87
dbug: Usbipd.Server[1000]
      connection close: Unable to read beyond the end of the stream.
dbug: Usbipd.Server[1000]
      connection closed
dbug: Usbipd.Server[1000]
      new connection from 192.168.188.87
dbug: Usbipd.ConnectedClient[1000]
      Received opcode: OP_REQ_IMPORT
fail: Usbipd.ConnectedClient[3]
      An exception occurred while communicating with the client:
      System.IO.FileNotFoundException: Unable to find the specified file.
         at Usbipd.ConfigurationManager.GetVBoxDevice(BusId) + 0x128
         at Usbipd.VBoxUsb.<ClaimDeviceOnce>d__0.MoveNext() + 0x6d
      --- End of stack trace from previous location ---
         at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() + 0x20
         at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task) + 0xb2
         at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task, ConfigureAwaitOptions) + 0x4b
         at Usbipd.VBoxUsb.<ClaimDevice>d__1.MoveNext() + 0x145
      --- End of stack trace from previous location ---
         at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() + 0x20
         at Usbipd.VBoxUsb.<ClaimDevice>d__1.MoveNext() + 0x202
      --- End of stack trace from previous location ---
         at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() + 0x20
         at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task) + 0xb2
         at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task, ConfigureAwaitOptions) + 0x4b
         at Usbipd.ConnectedClient.<HandleRequestImportAsync>d__8.MoveNext() + 0xd99
      --- End of stack trace from previous location ---
         at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() + 0x20
         at Usbipd.ConnectedClient.<HandleRequestImportAsync>d__8.MoveNext() + 0x1823
      --- End of stack trace from previous location ---
         at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() + 0x20
         at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task) + 0xb2
         at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task, ConfigureAwaitOptions) + 0x4b
         at Usbipd.ConnectedClient.<RunAsync>d__5.MoveNext() + 0x300
dbug: Usbipd.Server[1000]
      connection close: Unable to find the specified file.
dbug: Usbipd.Server[1000]
      connection closed

rhermsen avatar Jul 06 '24 00:07 rhermsen

The version number is inconsequential. It is automatically determined from the tags (using GitVersion). Your clone does not have the release tags, so it "restarts" from 0.x. That's not a problem.

Drivers have been bumped, and for development builds that causes issues. Simply: uninstall whatever usbipd-win you have installed now. And then install another version (your dev build, for example). And if that fails (like you reported), Do a repair install. That should fix any driver versions.

EDIT: And as always with Windows, when things go wrong with driver installs: a reboot in between steps helps ...

dorssel avatar Jul 06 '24 08:07 dorssel

I uninstalled, repaired, downgraded/upgraded, rebooted usbipd-win multiple times, but only got it resolved after uninstalling a VirturalBox USB device via 'Device Manager': VirtualBox USB No drivers are installed

Now it is running with version 0.2.0

rhermsen avatar Jul 06 '24 11:07 rhermsen

Oh well, dev versions .... Good luck with this one.

dorssel avatar Jul 06 '24 16:07 dorssel

I guess you had it forced bound?

dorssel avatar Jul 06 '24 16:07 dorssel

No, didn't use a 'force' option. Only used the 'usbipd bind --busid 1-4' and 'usbipd attach --wsl --busid 1-4'.

Started from a working usbipd 4.2.0 > tried install usbipd 0.2.0 but this failed due to lower version as the one already installed > uninstall 4.2.0 > (didn't reboot...) installed 0.2.0 (don't think I rebooted) and at this point ran into that issue until I found (after a lot of steps) the 'VituralBox USB' device in de 'Device Manager'.

rhermsen avatar Jul 06 '24 20:07 rhermsen