zwave-js-ui icon indicating copy to clipboard operation
zwave-js-ui copied to clipboard

The serial port is not open! on 15.x

Open systems-monitor opened this issue 1 year ago • 47 comments

Checklist

  • [x] I am not using Home Assistant. Or: a developer has told me to come here.
  • [x] I have checked the troubleshooting section and my problem is not described there.
  • [x] I have read the changelog and my problem is not mentioned there.

Deploy method

Docker

Z-Wave JS UI version

10.1.5.c36c719

ZwaveJS version

15.0.5

Describe the bug

After 12-48 hours connectivity to controller seems to die and not self heal. ERROR Z-WAVE-SERVER: Message dropped because of an unexpected error: The serial port is not open! (ZW0202) ZWaveError: Message dropped because of an unexpected error: The serial port is not open! (ZW0202)

Using ZNET IP controllers. zwave-js-ui: 10.1.5.c36c719 zwave-js: 15.0.5 All dockers on the same Ubuntu 20.x

Of four devices/Zwave UI dockers, the two that have been upgraded to 15.0.5 are showing this behavior. (one wired, one wifi) Each of the two networks at issue are using a different zwave chip, while same chips on other two networks are unaffected. The two still running 14.x are not running without issues.

A reset from UI recovers with no issues.

Time: 2025-04-16 20:34:53.781

zwavejs_2025-04-16.log z-ui_2025-04-16.log

To Reproduce

12-48 hours normal running. No specific time of day or activity seems to correlate.

Expected behavior

Detect connection failure and attempt reconnect or reset.

Additional context

No response

systems-monitor avatar Apr 18 '25 15:04 systems-monitor

"The two still running 14.x are running without issues." Correction to above

systems-monitor avatar Apr 18 '25 16:04 systems-monitor

Here also. HAOS on a NUC with a aeotec Gen 5+ stick(ZW090). Here also everything on the latest versions.

Image

freemann avatar Apr 20 '25 08:04 freemann

Same issue on my RPi 4 HA installation with Aeotec Gen5+ Z Stick. Everything's at the latest version. Seems to fail randomly - anywhere between 4 and 18 hours.

lucidicious avatar Apr 21 '25 10:04 lucidicious

Sorry for the late reply guys, I have to ask @AlCalzone if he is aware of this issue because seems something on driver side

robertsLando avatar Apr 22 '25 07:04 robertsLando

Maybe this is related; https://github.com/home-assistant/addons/issues/3962

freemann avatar Apr 22 '25 08:04 freemann

There have been a few issues about this popping up everywhere. So far it's not clear if this is due to the changes in Z-Wave JS, or OS updates, or both.

AlCalzone avatar Apr 22 '25 09:04 AlCalzone

@AlCalzone Maybe changes in serialport module as well?

robertsLando avatar Apr 22 '25 09:04 robertsLando

At least one person has started seeing those issues on an older Z-Wave JS version too now 🤷🏻‍♂️

AlCalzone avatar Apr 22 '25 09:04 AlCalzone

I have driver issues on everything newer than

zwave-js-ui: 9.25.0.38e48a8 zwave-js: 13.10.1 home id: 3914736960 home hex: 0xe9562540

(using tcp://192.168.11.x:4001 over ser2net on a Pi3 with Razberry7 Head)

Oekel avatar Apr 22 '25 15:04 Oekel

I rolled back one of my two dockers having the issue to zwave-js-ui: 9.31.0.6c80945 zwave-js: 14.3.10 No issues on that one for four days. tonight I will roll it forward to the last pre 10 release and test for the rest of the week

systems-monitor avatar Apr 22 '25 16:04 systems-monitor

There have been a few issues about this popping up everywhere. So far it's not clear if this is due to the changes in Z-Wave JS, or OS updates, or both.

Seeing as I can reproduce in stand alone dockers on a different OS. I assume OS updates are not related. Also I have not updated the ubuntu they are running on or any other packages since Jan. My env has been stable since before this started occurring.

systems-monitor avatar Apr 22 '25 16:04 systems-monitor

Maybe this is related; home-assistant/addons#3962

possible, but I think a symptom or other problem with resiliency in the integration.

On my two ZWJS failing connection to the controller, I see the integration going down only 25-30% of the time

systems-monitor avatar Apr 22 '25 16:04 systems-monitor

This is horror..... The serial port closed again this afternoon and it took several restarts of the addon and eventually haos to get all devices back online again :| :(

freemann avatar Apr 27 '25 13:04 freemann

@AlCalzone what are the best logs we can collect to help debug this? I can't see any pattern to its occurrence. Restarting the add-on used to restore the serial port, however like @freemann has mentioned, it can take a few restart attempts and unplugging the stick to get it back online.

lucidicious avatar Apr 27 '25 13:04 lucidicious

what are the best logs we can collect to help debug this?

  • I suppose the relevant dmesg entries that should show a device disconnection:
    Fr Mai  2 16:52:42 2025] usb 3-6.3.1.2.1: USB disconnect, device number 44
    
    (use the -T option to show human readable timestamps)
  • The driver logs on loglevel debug from the same timeframe.
  • If using Linux, which kernel version you have, can be found with uname -a:
    inux X1 6.11.0-21-generic
            ^^^^^^^^^
    
  • Which specific Z-Wave controller you're using
  • Whether you're using a powered USB 2.0 hub between the stick and your host
  • Whether soft-reset is enabled or disabled in the Z-Wave JS options

AlCalzone avatar Apr 28 '25 07:04 AlCalzone

I assume OS updates are not related

I wouldn't rule that out. The changes to USB enumeration in the Linux kernel that can cause some of these problems with non-compliant devices have been backported to all stable kernel branches, the oldest AFAIK being 6.1.52.

AlCalzone avatar Apr 28 '25 07:04 AlCalzone

Kernel: 6.12.23-haos Zwave controller: AEOTEC Gen 5+ stick(ZW090) USB Connection: directly with extension cord of 30cm, no Powered USB host Soft Reset: enabled

Driver log enabled.

Yesterday at 13:58 zwave became unavailable; Image

I added some Supervisor logs from yesterday; supervisor_2025-04-28T08-57-33.168Z.log

Around that time the logs show that the USB device is disconnected.... I was not at home at the time so....

freemann avatar Apr 28 '25 09:04 freemann

I added some Supervisor logs from yesterday;

Those show the device being removed and reconnected:

[32m2025-04-27 13:57:49.110 INFO (MainThread) [supervisor.hardware.monitor] Detecting remove hardware /dev/ttyACM0 - /dev/serial/by-id/usb-0658_0200-if00[0m
[32m2025-04-27 13:57:49.116 INFO (MainThread) [supervisor.hardware.monitor] Detecting remove hardware /dev/bus/usb/001/011 - None[0m
[32m2025-04-27 13:57:51.484 INFO (MainThread) [supervisor.hardware.monitor] Detecting add hardware /dev/bus/usb/001/012 - None[0m
[32m2025-04-27 13:57:51.491 INFO (MainThread) [supervisor.hardware.monitor] Detecting add hardware /dev/ttyACM0 - /dev/serial/by-id/usb-0658_0200-if00[0m

Would be good to have the corresponding driver logs to compare what Z-Wave JS did there.

directly with extension cord of 30cm, no Powered USB host

Do you have a powered USB hub you could use to test whether this avoids the issue?

Soft Reset: enabled

Can you disable it and see if this avoids the issue?

AlCalzone avatar Apr 28 '25 09:04 AlCalzone

I disabled Soft Reset.

I don't have a powered USB hub at the moment, will see what i can do. For now I removed the usb cable and plugin the AEOTEC stick directly in the SS USB port of my NUC and rebooted the NUC(NUC8i3BEK).

freemann avatar Apr 28 '25 10:04 freemann

For now I removed the usb cable and plugin the AEOTEC stick directly in the SS USB port

Thats no bueno --> https://zwave-js.github.io/zwave-js/#/troubleshooting/first-steps?id=controller-placement

The point of the USB cable is to reduce interference from the USB port. The point of the USB hub is to work around hardware that's not USB3-compliant.

AlCalzone avatar Apr 28 '25 10:04 AlCalzone

@AlCalzone what are the best logs we can collect to help debug this? I can't see any pattern to its occurrence. Restarting the add-on used to restore the serial port, however like @freemann has mentioned, it can take a few restart attempts and unplugging the stick to get it back online.

This may be a separate issue than my initial.

  1. Soft reset always works in my case
  2. My controllers are connected over IP on a Rpi

systems-monitor avatar Apr 28 '25 14:04 systems-monitor

Also confirmed stable for 3 days on 9.33.1 If all goes well I will bump a network to the first 10.x

systems-monitor avatar Apr 28 '25 14:04 systems-monitor

@systems-monitor https://github.com/zwave-js/zwave-js/releases/tag/v15.1.1 contains a possible fix for your problems:

More resilient recovery from disconnected TCP serial ports (https://github.com/zwave-js/zwave-js/pull/7748)

This hasn't been released in the HA addon yet, but I've unplugged my network-based controller a lot while testing this.

AlCalzone avatar Apr 28 '25 14:04 AlCalzone

@systems-monitor https://github.com/zwave-js/zwave-js/releases/tag/v15.1.1 contains a possible fix for your problems:

More resilient recovery from disconnected TCP serial ports (zwave-js/zwave-js#7748)

This hasn't been released in the HA addon yet, but I've unplugged my network-based controller a lot while testing this.

On it TY.

Just spun up one zwave-js-ui: 10.3.3.3331be3 zwave-js: 15.1.3

systems-monitor avatar Apr 28 '25 15:04 systems-monitor

Connect the aeotec with a modded powered USB hub(lost power supply and soldered a USB A directly to it). Rebooted HAOS and waited.... Waited.... waited and after 10min my zwave network was still offline. Enabled soft reset, rebooted HAOS and the network was back online in several minutes. So disabling soft reset is not working for me.

freemann avatar Apr 28 '25 16:04 freemann

I have had two soft resets in the first 30 min. They did successfully recover, but the 14.3.12 JS was not going down but once or twice a week. Will continue to monitor.

2025-04-28T15:52:26.264Z DRIVER   all queues idle
2025-04-28T15:52:53.379Z DRIVER   one or more queues busy
2025-04-28T15:52:53.381Z DRIVER » [Node 021] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ callback id:      152
                                  └─[SupervisionCCGet]
                                    │ session id:      35
                                    │ request updates: true
                                    └─[MultilevelSwitchCCSet]
                                        target value: 0
                                        duration:     default
2025-04-28T15:52:53.382Z SERIAL » 0x011600a900010015086c01a304260100ff250000000098f3                  (24 bytes)
2025-04-28T15:52:54.400Z CNTRLR   The controller is unresponsive
2025-04-28 15:52:54.401 INFO Z-WAVE: Controller status: Controller is unresponsive
2025-04-28T15:52:54.402Z DRIVER   Attempting to recover unresponsive controller by restarting it...
2025-04-28T15:52:54.402Z CNTRLR   Performing soft reset...
2025-04-28T15:52:54.403Z DRIVER » [REQ] [SoftReset]
2025-04-28T15:52:54.403Z SERIAL » 0x01030008f4                                                         (5 bytes)
2025-04-28T15:52:55.439Z CNTRLR   Soft reset failed: Timeout while waiting for an ACK from the controller (ZW020
                                  0)
2025-04-28T15:52:55.440Z DRIVER   Attempting to recover unresponsive controller by reopening the serial port...
2025-04-28 15:52:55.442 ERROR Z-WAVE-SERVER: Timeout while waiting for an ACK from the controller (ZW0200)
ZWaveError: Timeout while waiting for an ACK from the controller (ZW0200)
    at Driver.sendMessage (/usr/src/app/node_modules/zwave-js/src/lib/driver/Driver.ts:6863:23)
    at Driver.sendCommandInternal (/usr/src/app/node_modules/zwave-js/src/lib/driver/Driver.ts:7059:28)
    at Driver.sendSupervisedCommand (/usr/src/app/node_modules/zwave-js/src/lib/driver/Driver.ts:7105:27)
    at Driver.sendCommand (/usr/src/app/node_modules/zwave-js/src/lib/driver/Driver.ts:7160:30)
    at Proxy.set (/usr/src/app/node_modules/@zwave-js/cc/src/cc/MultilevelSwitchCC.ts:246:20)
    at Proxy.set (/usr/src/app/node_modules/@zwave-js/cc/src/cc/MultilevelSwitchCC._validateArgs.ts:13:30)
    at Proxy.<anonymous> (/usr/src/app/node_modules/@zwave-js/cc/src/cc/MultilevelSwitchCC.ts:323:17)
    at ZWaveNode2.setValue (/usr/src/app/node_modules/zwave-js/src/lib/node/Node.ts:555:39)
    at NodeMessageHandler.handle (/usr/src/app/node_modules/@zwave-js/server/dist-esm/lib/node/message_handler.js:23:43)
    at Client.receiveMessage (/usr/src/app/node_modules/@zwave-js/server/dist-esm/lib/server.js:134:100)
2025-04-28T15:52:55.443Z DRIVER   all queues idle
2025-04-28T15:52:56.802Z DRIVER   Serial port reopened. Returning to normal operation and hoping for the best...
2025-04-28T15:52:56.802Z CNTRLR   The controller is no longer unresponsive
2025-04-28 15:52:56.803 INFO Z-WAVE: Controller status: Controller is Ready
2025-04-28T15:52:57.972Z SERIAL « 0x010401a90052                                                       (6 bytes)
2025-04-28T15:52:57.973Z DRIVER   Dropping message because it contains invalid data
2025-04-28T15:52:57.974Z SERIAL » [NAK]                                                                   (0x15)
2025-04-28T15:52:57.979Z SERIAL « 0x011d00a99800007601b1a97f7f7f02010303000000010200007f7f7f7f7fbc    (31 bytes)
2025-04-28T15:52:57.979Z SERIAL » [ACK]                                                                   (0x06)
2025-04-28T15:52:57.980Z DRIVER « [REQ] [SendDataBridge]

systems-monitor avatar Apr 28 '25 16:04 systems-monitor

Every serial command between the controller and Z-Wave JS must be acknowledged. This does not happen here:

2025-04-28T15:52:53.382Z SERIAL » 0x011600a900010015086c01a304260100ff250000000098f3                  (24 bytes)
2025-04-28T15:52:54.400Z CNTRLR   The controller is unresponsive

This looks like actual packet loss. Given that reopening the "serialport" recovers things, I'd say that the TCP connection dropped. It's a bit odd though that Z-Wave JS does not detect this in the 30s prior. In my tests, it would do so within 5-10 seconds of me unplugging the ethernet cable of the serial-to-TCP bridge.

What devices are you using as bridges?

AlCalzone avatar Apr 29 '25 07:04 AlCalzone

//

What devices are you using as bridges?

All ZNETs. The one I bumped to the latest ver is the only one on wired connection. and it is on the same router as the Dockers, Of the four, I expect it to have the best connection.

This ZNET has a Silicon Labs 700 controller on it.

I do see controller unresponsive frequently, 1-2 times a day is normal for this one. But ZWJS almost always recovers with no issue. I have been using znets for 10 years, but only since Jan have I been on ZWJS, and had access to much better logs for troubleshooting.

After watching logs this year my gut is its not a dropped packet, but the Rpi or controller is to busy.

Also on the original issue for this problem, I have observed 24hrs now with no soft reset required on the new ver. Fingers crossed. I am bumping a second ZNET with a 500 chip, on wifi, and is my least stable, to the newest ver now.

systems-monitor avatar Apr 29 '25 14:04 systems-monitor

i dont know if this will be helpful, but i think i'm seeing the same issue here. Unforunately i've done a LOT of changes recently to my home setup to know what is/isnt true. Maybe something sticks out as a wider problem based on distro or hardware choice though....

started off with JSUI via docker on host1. runs watchtower with monthly updates so i should've been 3-4weeks old version. have had this inplace and untouched for about a year.

I've been doing hardware changes so moved zwave stick to host2 about 2 weeks ago. Copied the docker cfg and rync'd the permanent files over (JSUI db etc). docker pull obtained latest version. i started seeing issues and felt very much like packet loss/interference problems but figured this was due to dongle placement so i started moving around the stick on a USB extension with no improvement. Iin hindsight, this deployment seemed ok initially with good reliable comms for 10-15mins and then i'd see failures after the first command was sent. a device would get the "turn on", but would then immediately drop off the network and turn to unavailable.

I gave up with host2 and went with a completely fresh standalone deployment on an rpi3 over the weekend. Again, docker but with a clean DB. Only thing i carried across (manual copy & paste) were the various keys in the JSUI zwave area.

There seemed to be much improvement where zwave devices were remaining online for longer and not about 10-15mins until failure. Untill i came back a few hours later to find the usb stick completely unresponsive and re-plugging would not bring it back. I'd have to fully restart the rpi3 for it to even detect the usb stick again. this has happened 3-4 times in maybe 48hours.

I've got no logs yet, but currently its been running ok for ~19hours.

I'm running a fairly basic simple deployment. host1: proxmox VM, running debian + docker-ce with passthru USB. Host2: same as host1, but running as LXC + docker-ce with passthru USB. rpi3: rpiOS 64bit lite, docker-ce as per their recommended "debian" install notes.

zwave bridge: Aeotec zstick 7 (700-series). iirc, FW 7.21.3 on host1 & 2, Upgraded to 7.21.6 on the rpi3 but still saw atleast 1 hard lock with new fw. All docker versions are ce versions, using the official JSUI container with minimal config. "out the box" defaults except for encryption keys and the port 3000 service (i forget the name; to allow Homeassistant to connect).

Zwave devices are only the Secure SRT321 (battery) & SSR302 (mains) heating controls so it's been hard to test other devices to see if i was suffering from a failing device - SSR302 comms have been ultra flaky but thats also been the easier quickest device to test with.

Tau512 avatar Apr 29 '25 15:04 Tau512

And another zwave crash... downloaded 10k of addon log lines, but the crash is not in the log file. The AEOTEC is still connected with the powered USB HUB.

How to get the dmesg from HAOS?

freemann avatar May 01 '25 06:05 freemann