nut icon indicating copy to clipboard operation
nut copied to clipboard

Poll UPS failed - Data stale

Open HyP3r- opened this issue 4 years ago • 6 comments

Hello I have a Bluewalker UPS Powerwalker VI 2200 LCD UPS. Unfortunately, always after some time comes the following error messages:

usbhid-ups[2387]: libusb_get_interrupt: could not claim interface 0: Device or resource busy usbhid-ups[2387]: libusb_get_report: error sending control message: No such device usbhid-ups[2387]: libusb_get_interrupt: could not claim interface 0: Device or resource busy usbhid-ups[2387]: libusb_get_report: error sending control message: No such device usbhid-ups[2387]: libusb_get_interrupt: error submitting URB: No such device usbhid-ups[2387]: libusb_get_interrupt: could not claim interface 0: Device or resource busy usbhid-ups[2387]: libusb_get_interrupt: error submitting URB: No such device usbhid-ups[2387]: libusb_get_interrupt: could not claim interface 0: Device or resource busy usbhid-ups[2387]: libusb_get_report: error sending control message: No such device usbhid-ups[2387]: libusb_get_interrupt: error submitting URB: No such device systemd[1]: /lib/systemd/system/nut-monitor.service:6: PIDFile= references path below legacy directory /var/run/, updating /var/run/nut/upsmon.pid → /run/nut/upsmon.pid; pleas systemd[1]: nut-monitor.service: Supervising process 2399 which is not our child. We'll most likely not notice when it exits. systemd[1]: /lib/systemd/system/nut-monitor.service:6: PIDFile= references path below legacy directory /var/run/, updating /var/run/nut/upsmon.pid → /run/nut/upsmon.pid; pleas systemd[1]: nut-monitor.service: Supervising process 2399 which is not our child. We'll most likely not notice when it exits. usbhid-ups[2387]: libusb_get_interrupt: error sending control message: Value too large for defined data type usbhid-ups[2387]: libusb_get_report: error sending control message: No such device usbhid-ups[2387]: libusb_get_interrupt: error submitting URB: No such device usbhid-ups[2387]: libusb_get_interrupt: error submitting URB: No such device usbhid-ups[2387]: libusb_get_interrupt: error submitting URB: No such device upsd[2391]: Data for UPS [myups] is stale - check driver upsmon[2399]: Poll UPS [myups@localhost] failed - Data stale upsmon[2399]: Communications with UPS myups@localhost lost upsmon[2399]: Poll UPS [myups@localhost] failed - Data stale

In this case the server has been running for 55 days. So after 55 days of operation now these error messages come. What could be the reason for this?

lsusb:

Bus 001 Device 050: ID 06da:ffff Phoenixtec Power Co., Ltd.

cat ups.conf:

[myups]
    driver = usbhid-ups
    port = auto
    vendorid = 06da
    desc = "myups"
    pollinterval = 15

upsd -v:

Network UPS Tools upsd 2.7.4

HyP3r- avatar Apr 21 '21 19:04 HyP3r-

For the most part the messages indicate that the OS lost the USB device, not much that NUT can do to fix that part. Might be some power-saving disabling the chip, or something else frozen on the hub and a USB reset restarting every connection (either intentionally, or because some vendors cut corners and implement a port reset same as hub reset).

Guessing by your device name, containing "vm", I might add potentially some delays or flaws in virtualization/passthrough layer as a possible problem source.

On a busy machine (hypervisor with many guests, or other compute/IO/context-switch heavy systems) or on an under-powered machine (preempted VM just not on CPU) there might also be some exceeded hardcoded timeouts in any layer of the OS, drivers and software down to libusb used by NUT driver, perceiving a loss of device because last data exchange was a bit too long ago.

In that log I see the lost device reported all over April, but the driver only aborted once? I believe that means retries succeeded to find (maybe reinitialize) the USB device except once.

For context: which OS and version of NUT (packaged or custom built) is that happening with? If you do/can run a custom build, are you in position to check if native libusb-1.0 support from some of the competing branches in issue #300 would behave better?

jimklimov avatar Apr 21 '21 20:04 jimklimov

For the most part the messages indicate that the OS lost the USB device, not much that NUT can do to fix that part. Might be some power-saving disabling the chip, or something else frozen on the hub and a USB reset restarting every connection (either intentionally, or because some vendors cut corners and implement a port reset same as hub reset).

This is a Debian 10 which runs on a HP ProLiant DL360. I mention this because I can't imagine that such a server has energy saving mechanisms for the USB system. How can I find out if something like this happened at this moment?

Guessing by your device name, containing "vm", I might add potentially some delays or flaws in virtualization/passthrough layer as a possible problem source.

It is not a virtual machine but the hypervisor. As platform I use Debian 10 with Proxmox VE.

On a busy machine (hypervisor with many guests, or other compute/IO/context-switch heavy systems) or on an under-powered machine (preempted VM just not on CPU) there might also be some exceeded hardcoded timeouts in any layer of the OS, drivers and software down to libusb used by NUT driver, perceiving a loss of device because last data exchange was a bit too long ago.

As already described, the system actually runs a hypervisor. And the system is also relatively busy. But somehow I can't imagine that this is enough to hang up so much.

In that log I see the lost device reported all over April, but the driver only aborted once? I believe that means retries succeeded to find (maybe reinitialize) the USB device except once.

It seems so yes: the last few months the connection is repeatedly interrupted, only unfortunately today it did not work to restore the connection.

For context: which OS and version of NUT (packaged or custom built) is that happening with? If you do/can run a custom build, are you in position to check if native libusb-1.0 support from some of the competing branches in issue #300 would behave better?

I just tried a few times to compile the branches "libusb-1.0+0.1", "libusb-1.0" and "libusb-compat-1.0" but with no full success: The branches "libusb-1.0+0.1" and "libusb-1.0" I could compile but were unable to use them. I always get the following error message:

upsdrvctl[44892]: Failed to read pid from /var/state/ups/usbhid-ups1-myups.pid
upsdrvctl[44892]: Network UPS Tools - Generic HID driver 0.42 (2.7.4-420-g2999c95f)
upsdrvctl[44892]: USB communication driver (libusb 1.0) 0.02
upsdrvctl[44892]: No matching HID UPS found
upsdrvctl[44892]: Driver failed to start (exit status=1)
upsmon[18049]: Poll UPS [myups@localhost] failed - Driver not connected

HyP3r- avatar Apr 21 '21 22:04 HyP3r-

That failure to start probably means that it probed available USB devices and did not find a supported vendor/product ID. Might be that as of those branches the IDs for your UPS were not known yet, which is unlikely at least if you were running last release before that, or that the device remains not seen on the OS side.

Try starting the driver with -DDDDD argument for debug with higher verbosity; it would mention IDs it saw in particular.

On Thu, Apr 22, 2021, 00:20 HyP3r @.***> wrote:

For the most part the messages indicate that the OS lost the USB device, not much that NUT can do to fix that part. Might be some power-saving disabling the chip, or something else frozen on the hub and a USB reset restarting every connection (either intentionally, or because some vendors cut corners and implement a port reset same as hub reset). This is a Debian 10 which runs on a HP ProLiant DL360. I mention this because I can't imagine that such a server has energy saving mechanisms for the USB system. How can I find out if something like this happened at this moment?

Guessing by your device name, containing "vm", I might add potentially some delays or flaws in virtualization/passthrough layer as a possible problem source. It is not a virtual machine but the hypervisor. As platform I use Debian 10 with Proxmox VE.

On a busy machine (hypervisor with many guests, or other compute/IO/context-switch heavy systems) or on an under-powered machine (preempted VM just not on CPU) there might also be some exceeded hardcoded timeouts in any layer of the OS, drivers and software down to libusb used by NUT driver, perceiving a loss of device because last data exchange was a bit too long ago. As already described, the system actually runs a hypervisor. And the system is also relatively busy. But somehow I can't imagine that this is enough to hang up so much.

In that log I see the lost device reported all over April, but the driver only aborted once? I believe that means retries succeeded to find (maybe reinitialize) the USB device except once. It seems so yes: the last few months the connection is repeatedly interrupted, only unfortunately today it did not work to restore the connection.

For context: which OS and version of NUT (packaged or custom built) is that happening with? If you do/can run a custom build, are you in position to check if native libusb-1.0 support from some of the competing branches in issue #300 https://github.com/networkupstools/nut/issues/300 would behave better? I just tried a few times to compile the branches "libusb-1.0+0.1", "libusb-1.0" and "libusb-compat-1.0" but with no full success: The branches "libusb-1.0+0.1" and "libusb-1.0" I could compile but were unable to use them. I always get the following error message:

upsdrvctl[44892]: Failed to read pid from /var/state/ups/usbhid-ups1-myups.pid upsdrvctl[44892]: Network UPS Tools - Generic HID driver 0.42 (2.7.4-420-g2999c95f) upsdrvctl[44892]: USB communication driver (libusb 1.0) 0.02 upsdrvctl[44892]: No matching HID UPS found upsdrvctl[44892]: Driver failed to start (exit status=1) upsmon[18049]: Poll UPS @.*** failed - Driver not connected

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/networkupstools/nut/issues/1022#issuecomment-824396176, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMPTFBGJEM4EYPAMYYYDFTTJ5FT3ANCNFSM43LALQWA .

jimklimov avatar Apr 21 '21 22:04 jimklimov

After I called configure Debian conform the library could be compiled and used. I will now observe the whole thing for a quarter. If it doesn't hang anymore, the ticket can be closed.

I will inform you in this issue.

HyP3r- avatar Apr 23 '21 18:04 HyP3r-

thanks!

On Fri, Apr 23, 2021, 20:00 HyP3r @.***> wrote:

After I called configure Debian conform the library could be compiled and used. I will now observe the whole thing for a quarter. If it doesn't hang anymore, the ticket can be closed.

I will inform you in this issue.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/networkupstools/nut/issues/1022#issuecomment-825825532, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMPTFHCF5A2V3WEJGDDEHTTKGYTVANCNFSM43LALQWA .

jimklimov avatar Apr 23 '21 20:04 jimklimov

OK, I hit this issue, a reboot "fixed" it but I gathered some info before the reboot , might help somebody, esp if the real fix is to raise a kernel defect. (BTW this happened about 10 days after a real power cut...that would have been annoying if it had failed):

The messages:

Feb 25 14:06:57 ybox usbhid-ups[727]: libusb_get_interrupt: could not claim interface 0: Device or resource busy                                                                                                              
Feb 25 14:07:07 ybox usbhid-ups[727]: libusb_get_interrupt: error sending control message: Protocol error                                                                                                                     
Feb 25 14:07:17 ybox usbhid-ups[727]: libusb_get_interrupt: error sending control message: Protocol error                                                                                                                     
... 33K of similar messages  

Stop and restart nut-server (debian bullseye)


root@ybox:/var/log# systemctl start  nut-server                                                                                                                                                                               
root@ybox:/var/log# systemctl status nut-server                                                                                                                                                                               
● nut-server.service - Network UPS Tools - power devices information server                                                                                                                                                   
     Loaded: loaded (/lib/systemd/system/nut-server.service; enabled; vendor preset: enabled)                                                                                                                                 
     Active: active (running) since Thu 2024-02-29 10:50:45 GMT; 38s ago                                                                                                                                                      
    Process: 1066411 ExecStart=/sbin/upsd (code=exited, status=0/SUCCESS)                                                                                                                                                     
   Main PID: 1066412 (upsd)                                                                                                                                                                                                   
      Tasks: 1 (limit: 4476)                                                                                                                                                                                                  
     Memory: 772.0K                                                                                                                                                                                                           
        CPU: 33ms                                                                                                                                                                                                             
     CGroup: /system.slice/nut-server.service                                                                                                                                                                                 
             └─1066412 /lib/nut/upsd                                      

Feb 29 10:50:45 ybox upsd[1066411]: listening on 127.0.0.1 port 3493                                                                                                                                                          
Feb 29 10:50:45 ybox upsd[1066411]: listening on 10.117.0.152 port 3493                                                                                                                                                       
Feb 29 10:50:45 ybox upsd[1066411]: listening on 127.0.0.1 port 3493                                                                                                                                                          
Feb 29 10:50:45 ybox upsd[1066411]: Can't connect to UPS [qnapups] (usbhid-ups-qnapups): No such file or directory                                                                                                            
Feb 29 10:50:45 ybox upsd[1066411]: Can't connect to UPS [qnapups] (usbhid-ups-qnapups): No such file or directory                                                                                                            
Feb 29 10:50:45 ybox upsd[1066411]: /etc/nut/upsd.users is world readable                                                                                                                                                     
Feb 29 10:50:45 ybox upsd[1066411]: /etc/nut/upsd.users is world readable                                                                                                                                                     
Feb 29 10:50:45 ybox upsd[1066412]: Startup successful                                                                                                                                                                        
Feb 29 10:50:45 ybox systemd[1]: Started Network UPS Tools - power devices information server.                                                                                                                                
Feb 29 10:50:50 ybox upsd[1066412]: User [email protected] logged into UPS [qnapups]  

The driver failed to start:

root@ybox:/var/log# /lib/nut/usbhid-ups -DDD -a qnapups -u nut   
 ....
0.032911        Checking device (051D/0002) (001/002)                                                                                                                                                                         
   0.034470     - VendorID: 051d                                                                                                                                                                                              
   0.034502     - ProductID: 0002                                                                                                                                                                                             
   0.034516     - Manufacturer: unknown                                                                                                                                                                                       
   0.034529     - Product: unknown                                                                                                                                                                                            
   0.034542     - Serial Number: unknown                                                                                                                                                                                      
   0.034554     - Bus: 001                                                                                                                                                                                                    
   0.034566     - Device release number: 0106                                                                                                                                                                                 
   0.034586     Trying to match device                                                                                                                                                                                        
   0.034605     device->Product is NULL so it is not possible to determine whether to activate max_report_size workaround                                                                                                     
   0.034641     Device matches                                                                                                                                                                                                
   0.034703     nut_usb_set_altinterface: skipped usb_set_altinterface(udev, 0)                                                                                                                                               
   0.035360     Unable to get HID descriptor (error sending control message: Protocol error)                                                                                                                                  
   0.035391     HID descriptor length (method 1) -1                                                                                                                                                                           
   0.035419     HID descriptor, method 2: (9 bytes) => 09 21 10 01 21 01 22 05 04                                                                                                                                             
   0.035440     HID descriptor length (method 2) 1029                                                                                                                                                                         
   0.035502     HID descriptor length 1029                                                                                                                                                                                    
   0.039396     Unable to get Report descriptor: Protocol error                                                                                                                                                               
...

So it thinks Product is NULL ...

root@ybox:/var/log# lsusb -d 051d:                                                                                                                                                                                            
Bus 001 Device 002: ID 051d:0002 American Power Conversion Uninterruptible Power Supply                                                                                                                                       
        
Bus 001 Device 002: ID 051d:0002 American Power Conversion Uninterruptible Power Supply                                                                                                                                       
Device Descriptor:                                                                                                                                                                                                            
  bLength                18                                                                                                                                                                                                   
  bDescriptorType         1                                                                                                                                                                                                   
  bcdUSB               1.10                                                                                                                                                                                                   
  bDeviceClass            0                                                                                                                                                                                                   
  bDeviceSubClass         0                                                                                                                                                                                                   
  bDeviceProtocol         0                                                                                                                                                                                                   
  bMaxPacketSize0         8                                                                                                                                                                                                   
  idVendor           0x051d American Power Conversion                                                                                                                                                                         
  idProduct          0x0002 Uninterruptible Power Supply                                                                                                                                                                      
  bcdDevice            1.06                                                                                                                                                                                                   
  iManufacturer           3 American Power Conversion                                                                                                                                                                         
  iProduct                1 Back-UPS XS 700U   FW:924.Z5 .I USB FW:Z5                                                                                                                                                         
  iSerial                 2 3B2041X34361                                                                                                                                                                                      
  bNumConfigurations      1                                                                                                                                                                                                   
  Configuration Descriptor:                                                                                                                                                                                                   
    bLength                 9                                                                                                                                                                                                 
    bDescriptorType         2                                                                                                                                                                                                 
    wTotalLength       0x0022                                                                                                                                                                                                 
    bNumInterfaces          1                                                                                                                                                                                                 
    bConfigurationValue     1                                                                                                                                                                                                 
    iConfiguration          0                                                                                                                                                                                                 
    bmAttributes         0xe0                                                                                                                                                                                                 
      Self Powered                                                                                                                                                                                                            
      Remote Wakeup                                                                                                                                                                                                           
    MaxPower               24mA                                                                                                                                                                                               
    Interface Descriptor:                                                                                                                                                                                                     
      bLength                 9                                                                                                                                                                                               
      bDescriptorType         4                                                                                                                                                                                               
      bInterfaceNumber        0                                                                                                                                                                                               
      bAlternateSetting       0                                                                                                                                                                                               
      bNumEndpoints           1                                                                                                                                                                                               
      bInterfaceClass         3 Human Interface Device                                                                                                                                                                        
      bInterfaceSubClass      0                                                                                                                                                                                               
      bInterfaceProtocol      0                                                                                                                                                                                               
      iInterface              0                                                                                                                                                                                               
        HID Device Descriptor:                                                                                                                                                                                                
          bLength                 9                                                                                                                                                                                           
          bDescriptorType        33                                                                                                                                                                                           
          bcdHID               1.10                                                                                                                                                                                           
          bCountryCode           33 US                                                                                                                                                                                        
          bNumDescriptors         1                                                                                                                                                                                           
          bDescriptorType        34 Report                                                                                                                                                                                    
          wDescriptorLength    1029                                                                                                                                                                                           
          Report Descriptor: (length is -1)                                                                                                                                                                                   
      Endpoint Descriptor:                                                                                                                                                                                                    
        bLength                 7                                                                                                                                                                                             
        bDescriptorType         5                                                                                                                                                                                             
        bEndpointAddress     0x81  EP 1 IN                                                                                                                                                                                    
        bmAttributes            3                                                                                                                                                                                             
          Transfer Type            Interrupt                                                                                                                                                                                  
          Synch Type               None                                                                                                                                                                                       
          Usage Type               Data                                                                                                                                                                                       
        wMaxPacketSize     0x0006  1x 6 bytes                                                                                                                                                                                 
        bInterval              10                                                                                                                                                                                             
can't get debug descriptor: Resource temporarily unavailable                                                                                                                                                                  
cannot read device status, Resource temporarily unavailable (11)   

(BTW that last line may be key)

Looking at:

root@ybox:/var/log# cat /sys/bus/usb/devices/1-1/product                                                                                                                                                                      
Back-UPS XS 700U   FW:924.Z5 .I USB FW:Z5                                                                                                                                                                                     
                                    

It appears to be there.

(also did this as user NUT also OK)

Will reboot ..

Following reboot ...all seemed to come up OK (UPS WAS NOT power cycled)

I noted the following diffs in the output of lsusb:

Before Reboot:                                                                                                                                                                                                                
          Report Descriptor: (length is -1)                                                                                                                                                                                   
          ...                                                                                                                                                                                                                 
          can't get debug descriptor: Resource temporarily unavailable                                                                                                                                                        
          cannot read device status, Resource temporarily unavailable (11)                                                                                                                                            
                                                                                                                                                                                                                              
After Reboot:                                                                                                                                                                                                                 
         Report Descriptors:                                                                                                                                                                                                  
           UNAVAILABLE                                                                                                                                                                                                  
                                                                                                                                                                                                                              
           can't get debug descriptor: Resource temporarily unavailable      
                                                                                                                                          
           **Device Status:     0x0000  **                                                                                                                                                                                       
             (Bus Powered)                                                                                                                                                                                  
   


graemev avatar Feb 29 '24 12:02 graemev

Same problem, it works initially and it stops reading the UPS after a while, I tried every possible parameter change, it still happens. Is there a way to tell NUT to restart after 15 minutes of activity as a workaround?

john8329 avatar Jan 27 '25 07:01 john8329

OK, the problem appears fixed for me at least:

Following my post I added a cronjob for root:

08 09 * * * chronic /usr/local/bin/check_ups

Where

$ cat /usr/local/bin/check_ups
#!/bin/bash

# Who   Date    What
# GPV   09Oct24 Inital version , greps lsusb(1) and nut-scanner(1) output


# See: https://unix.stackexchange.com/questions/340718/how-do-i-bring-heredoc-text-into-a-shell-script-variable
eat_here() {
    res=$(cat)
}


# Make sure these have blank lines at start and end


eat_here <<'HEREDOC'

root@zbox:/home/graeme# lsusb
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 005: ID 2148:7022  USB2.0 HUB
Bus 001 Device 004: ID 8087:07dc Intel Corp. Bluetooth wireless interface
Bus 001 Device 003: ID 0bda:0153 Realtek Semiconductor Corp. 3-in-1 (SD/SDHC/SDXC) Card Reader
Bus 001 Device 002: ID 1a40:0101 Terminus Technology Inc. Hub
Bus 001 Device 047: ID 051d:0002 American Power Conversion Uninterruptible Power Supply   <----- here
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

HEREDOC


expected_lsusb="${res}"



eat_here <<'HEREDOC'

graeme@zbox:~$ sudo nut-scanner -U
[sudo] password for graeme: 
Scanning USB bus.
[nutdev1]
        driver = "usbhid-ups"
        port = "auto"
        vendorid = "051D"
        productid = "0002"
        product = "Back-UPS XS 700U   FW:924.Z5 .I USB FW:Z5"
        serial = "3B2041X34361"
        vendor = "American Power Conversion"
        bus = "001"


HEREDOC

expected_nut_scanner="${res}"

actual_lsusb="$(lsusb)"

actual_nut_scanner="$(nut-scanner -U)"

if [[ -z "${actual_lsusb}" ]] ; then
    actual_lsusb="<NOTHING>"
fi

if [[ -z "${actual_nut_scanner}" ]] ; then
    actual_nut_scanner="<NOTHING>"
fi


declare -i OK
declare -i RC

OK=1

if [[ "${actual_lsusb}" =~ "051d:0002" ]] ; then
    :
else
    OK=0
fi


if [[ "${actual_nut_scanner}" =~ "3B2041X34361" ]] ; then
    :
else
    OK=0
fi


if [[ ${OK} -eq 1 ]]; then
    echo "check_ups: all is OK"
    RC=0
else
    msg="Error WRT UPS: lsusb reports\n ${actual_lsusb}\n we expect\n ${expected_lsusb}\n\n and nut-scanner reports \n ${actual_nut_scanner} \n and we expected \n ${expected_nut_scanner}"

    printf "$msg"

    printf "$msg" | mail -s "check_ups: Lost contact with UPS $(hostname)" graeme@home
    RC=1
fi

exit ${RC}

After a few weeks, the email triggered. The device had just disappeared. I was on a nearby computer at the time, nothing unusual had happened, no power blip, no noise from the UPS.

I did some investigation and it seemed ALL the devices on the USB hub where the UPS was attached had gone (IIRC , so had the hub itself). It was an external USB 2.0 hub. Unplugging and re-plugging "fixed" the problem ...but of no long term use.

So I bought a uGreen USB 3.0 USB hub (generally pleased with their kit) and the problem has not reoccurred in several months.

graemev avatar Jan 27 '25 09:01 graemev

@john8329 : which NUT version is it in your case? Also which UPS/driver?

There were some changes about auto-restart of connections with usbhid-ups in 2.8.0 (IIRC) and maybe more in later releases. Various appliances and distros often ship much older versions (including a decade-old NUT v2.7.4) - that we can not do much about.

I would suggest https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests to check if the newer code base has the problem already figured out and fixed. Even if not - it would give us a starting point to iterate on more investigation and possibly a fix.

jimklimov avatar Jan 27 '25 15:01 jimklimov

@jimklimov Using the "NUT" add-on for Home Assistant, v0.14.1, UPS driver controller 2.8.0. I'm reading that from logs. Building the new package in the raspberry it's running on is a little complicated (it communicates over USB, can't forward that to a dev machine), so for now it's not worth the time for me (I already spent at least 8 hours on trying to make it work, I'm done with this). If you need detailed logs or any debug info feel free to ask though.

john8329 avatar Mar 03 '25 17:03 john8329

FWIW, you can check the wiki, there were community-contributed notes on custom-building the HA plugin...

jimklimov avatar Mar 04 '25 00:03 jimklimov