addons icon indicating copy to clipboard operation
addons copied to clipboard

Silicon Labs Multiprotocol addon failing intermittently causing HAP, Thread and Zigbee issues

Open evakq8r opened this issue 1 year ago • 114 comments

Describe the issue you are experiencing

Every few days to a week, the Silicon Labs Multiprotocol addon will stop communicating and indicate 'resource temporarily unavailable'. yet does not make mention of which resource this is. When this addon stops, it breaks my light integration (Nanoleaf via Thread using HAP) and Zigbee sensors via ZHA.

Restarting the addon doesn't fix the issue, nor does restarting home assistant. Usually requires a complete reboot of the host, and even then sometimes it will repeatedly indicate 'resource temporarily unavailable'.

What type of installation are you running?

Home Assistant OS

Which operating system are you running on?

Home Assistant Operating System

Which add-on are you reporting an issue with?

Silicon Labs Multiprotocol

What is the version of the add-on?

2.3.2

Steps to reproduce the issue

Wish I knew, as it fails whenever it wants to (sometimes 3am in the morning, sometimes 5pm in the afternoon).

System Health information

System Information

version core-2023.8.4
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.11.4
os_name Linux
os_version 6.1.37
arch x86_64
timezone Australia/Adelaide
config_dir /config
Home Assistant Community Store
GitHub API ok
GitHub Content ok
GitHub Web ok
GitHub API Calls Remaining 5000
Installed Version 1.32.1
Stage running
Available Repositories 1335
Downloaded Repositories 37
AccuWeather
can_reach_server ok
remaining_requests 14
Home Assistant Cloud
logged_in false
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud failed to load: unreachable
Home Assistant Supervisor
host_os Home Assistant OS 11.0.dev20230705
update_channel beta
supervisor_version supervisor-2023.08.3
agent_version 1.5.1
docker_version 23.0.6
disk_total 30.8 GB
disk_used 23.3 GB
healthy true
supported true
board ova
supervisor_api ok
version_api ok
installed_addons Samba share (10.0.2), Network UPS Tools (0.12.0), Matter Server (4.9.0), Silicon Labs Multiprotocol (2.3.2), Mosquitto broker (6.2.1), Zigbee2MQTT (1.32.2-1), Custom deps deployment (1.3.3), Home Assistant Google Drive Backup (0.111.1), ESPHome (2023.8.2), Terminal & SSH (9.7.1), Studio Code Server (5.10.1), PS5 MQTT (1.3.1), Whisper (1.0.0), Piper (1.3.2), Advanced SSH & Web Terminal (15.0.7), Silicon Labs Flasher (0.2.0)
Dashboards
dashboards 3
resources 24
views 11
mode storage
Recorder
oldest_recorder_run 16 August 2023 at 08:32
current_recorder_run 25 August 2023 at 20:34
estimated_db_size 933.92 MiB
database_engine sqlite
database_version 3.41.2

Anything in the Supervisor logs that might be useful for us?

Nothing relating to these issues.

Anything in the add-on logs that might be useful for us?

Logger: bellows.zigbee.application
Source: /usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py:643
First occurred: 10:49:25 (4040 occurrences)
Last logged: 17:00:02

ControllerApplication reset unsuccessful: ConnectionRefusedError(111, "Connect call failed ('172.30.32.1', 9999)")
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 640, in _reset_controller_loop
    await self._reset_controller()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 661, in _reset_controller
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 133, in connect
    self._ezsp = await bellows.ezsp.EZSP.initialize(self.config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 164, in initialize
    await ezsp.connect(use_thread=zigpy_config[conf.CONF_USE_THREAD])
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 181, in connect
    self._gw = await bellows.uart.connect(self._config, self, use_thread=use_thread)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 414, in connect
    protocol, _ = await _connect(config, application)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 385, in _connect
    transport, protocol = await zigpy.serial.create_serial_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/serial.py", line 31, in create_serial_connection
    transport, protocol = await loop.create_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1085, in create_connection
    raise exceptions[0]
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1069, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 973, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 628, in sock_connect
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 668, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)

Additional information

HAOS is hosted in a VM on my Unraid server. My Unraid server is still able to see and interact with the USB device(s) when HAOS fails to. For context:

image

I have the SkyConnect USB as well as a Sonoff Zigbee 3.0 USB Dongle Plus V2” (model “ZBDongle-E”). I mainly use the Skyconnect for everything, and the Sonoff is a recent purchase. Both are flashed with the latest version of the MultiPAN firmware.

I have tried both the stable and beta version of HAOS, no change to the outcome. Sometimes it works for several days, sometimes it fails > 5 times a day.

evakq8r avatar Aug 26 '23 07:08 evakq8r

Update:

Only way to restore Home Assistant to a functional state was to unplug the SkyConnect from its USB port, replug, then reboot Unraid. Simply turning the VM for HAOS off and on nor just rebooting the host without the replug was not enough.

FWIW, I have a PCIe USB Hub card in the server, however the SkyConnect USB is in the motherboards USB2.0 port, rather than the hub, to try and limit the hub being a cause.

evakq8r avatar Aug 26 '23 15:08 evakq8r

@agners I would like to report some similar experience that may not be entirely related to the same error you are receiving. However, I am having similar symptoms. Recently I added my HA SkyConnect to the Apple Border Router thread network via using the TLC Dataset provided from Nanoleaf. I have my devices all on one mesh network and things work great when they are up. For some reason my SkyConnect will just shut off. I believe it is being overloaded somehow. The only solution to the problem is to unplug and replug the device. Below is what Silicon Labs Multiprotocol addon logs look like.

core_silabs_multiprotocol_2023-09-01T00-59-20.370Z.log

BEFORE CRASH OF SKYCONNECT image

AFTER CRASH OF SKYCONNECT image

spartandrew18 avatar Sep 01 '23 01:09 spartandrew18

@spartandrew18 it seems that the otbr-agent is continuing to restart in a loop?

Do you maybe have the logs from when things first failed? You can get more logs through the CLI:

ha host logs --boot 0 --lines 100000 -t addon_core_silabs_multiprotocol

You can also redirect into a location where you have access to so you can upload the logs:

ha host logs --boot 0 --lines 100000 -t addon_core_silabs_multiprotocol > /config/multiprotocol.log

agners avatar Sep 01 '23 07:09 agners

New Update:

Silicon Labs Multiprotocol addon just crashed:

WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #176 : Property-get::PROP_ENDPOINT_STATE aborted
WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #210 : Could not read endpoint state on the secondary
WARNING in function 'system_send_open_endpoint_ack' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #122 : Failed to acknowledge the open request for endpoint #12. Bad file descriptor

Repeats itself every 100ms, which caused my HA VM to hang in responsiveness. Had to force the VM to reboot from Unraid. The addon attempts to startup after the VM/HAOS reboot, but same thing occurs:

[23:30:56:630555] Info : New client connection using library v4.3.1.0
otbr-agent[212293]: 53d.04:20:11.933 [C] Platform------: CPC endpoint open failed
otbr-agent[212293]: 53d.04:20:11.933 [C] Platform------: mCpcBusSpeed = 115200
otbr-agent[212293]: 53d.04:20:11.933 [C] Platform------: Init() at radio.cpp:121: Failure
close cpc endpoint: Resource temporarily unavailable
[23:30:57] INFO: otbr-agent ended with exit code 1 (signal 0)...
OTBR_FORWARD_INGRESS  all opt    in * out wpan0  ::/0  -> ::/0  
Chain OTBR_FORWARD_INGRESS (0 references)
target     prot opt source               destination         
DROP       all      anywhere             anywhere             PKTTYPE = unicast
DROP       all      anywhere             anywhere             match-set otbr-ingress-deny-src src
ACCEPT     all      anywhere             anywhere             match-set otbr-ingress-allow-dst dst
DROP       all      anywhere             anywhere             PKTTYPE = unicast
ACCEPT     all      anywhere             anywhere            
otbr-ingress-deny-src
otbr-ingress-deny-src-swap
otbr-ingress-allow-dst
otbr-ingress-allow-dst-swap
OTBR_FORWARD_EGRESS  all opt    in wpan0 out *  ::/0  -> ::/0  
Chain OTBR_FORWARD_EGRESS (0 references)
target     prot opt source               destination         
ACCEPT     all      anywhere             anywhere            
[23:30:57] INFO: OTBR firewall teardown completed.
[23:30:57:782940] Info : Client disconnected

Thread and Zigbee (as expected) have failed:-

ZHA:

Logger: zigpy.application
Source: /usr/local/lib/python3.11/site-packages/zigpy/application.py:198
First occurred: 23:25:23 (8 occurrences)
Last logged: 23:30:49

Couldn't start application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 195, in startup
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 138, in connect
    await ezsp.connect(use_thread=self.config[CONF_USE_THREAD])
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 179, in connect
    self._gw = await bellows.uart.connect(self._config, self, use_thread=use_thread)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 414, in connect
    protocol, _ = await _connect(config, application)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 385, in _connect
    transport, protocol = await zigpy.serial.create_serial_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/serial.py", line 36, in create_serial_connection
    transport, protocol = await loop.create_connection(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1085, in create_connection
    raise exceptions[0]
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 1069, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 973, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 628, in sock_connect
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 668, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [Errno 111] Connect call failed ('172.30.32.1', 9999)

HomeKit Bridge (for the Thread lights via HAP) do not show any useful logs, however all 30 integrations are in constant reload mode.

Rebooting the VM did not resolve the issue. Unraid can still see the NabuCasa Skyconnect plugged in and responding:

image

lsusb -d 10c4:ea60 -v

Bus 005 Device 002: ID 10c4:ea60 Silicon Labs CP210x UART Bridge
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 
  bDeviceSubClass         0 
  bDeviceProtocol         0 
  bMaxPacketSize0        64
  idVendor           0x10c4 Silicon Labs
  idProduct          0xea60 CP210x UART Bridge
  bcdDevice            1.00
  iManufacturer           1 Nabu Casa
  iProduct                2 SkyConnect v1.0
  iSerial                 3 <redacted>
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength       0x0020
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0 
    bmAttributes         0x80
      (Bus Powered)
    MaxPower              100mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass       255 Vendor Specific Class
      bInterfaceSubClass      0 
      bInterfaceProtocol      0 
      iInterface              0 
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x82  EP 2 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0040  1x 64 bytes
        bInterval               0
Device Status:     0x0000
  (Bus Powered)

The fix for this at the moment (this time around anyway) was a full server reboot, but it took two attempts and several manual reboots of the addon to make it work and talk with the USB, which is less than ideal.

For what it's worth, the SkyConnect USB is plugged directly into the motherboards USB port (ASRock Rack X570-2L2T) using the included extension cable that come with the unit, to avoid interference.

Any help with this would be appreciated. Thanks.

evakq8r avatar Sep 02 '23 14:09 evakq8r

@spartandrew18 it seems that the otbr-agent is continuing to restart in a loop?

Do you maybe have the logs from when things first failed? You can get more logs through the CLI:

ha host logs --boot 0 --lines 100000 -t addon_core_silabs_multiprotocol

You can also redirect into a location where you have access to so you can upload the logs:

ha host logs --boot 0 --lines 100000 -t addon_core_silabs_multiprotocol > /config/multiprotocol.log

multiprotocol.log

spartandrew18 avatar Sep 02 '23 18:09 spartandrew18

Thanks for the logs. Some interesting snippets:

Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: radio tx timeout
Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: RCP failure detected
Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: Trying to recover (1/100)
Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: Wait for response timeout
Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: RCP failure detected
Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: Trying to recover (2/100)
Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: Wait for response timeout
Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: RCP failure detected
Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: Trying to recover (3/100)
Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: Wait for response timeout
Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: RCP failure detected
Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: Trying to recover (4/100)
Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: Wait for response timeout
Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: RCP failure detected
Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: Trying to recover (5/100)
Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: Wait for response timeout
Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: RCP failure detected
Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: Trying to recover (6/100)
Aug 30 04:22:55 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12
Aug 30 04:22:55 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE
Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:394752] WARNING : In function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12
Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:394768] WARNING : In function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE
Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:495051] Info : Client disconnected

It seems the OTBR detects that the radio isn't communicating anymore (it can't receive frames any longer). The CPC daemon similarly detects a "retransmit timeout"

WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12

The OTBR then ends up in a restart loop, presumably for the same reason (the radio stopped responding):

Aug 30 04:32:58 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:58] INFO: Starting otbr-agent...
...
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:58:732529] Info : New client connection using library v4.3.1.0
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #176 : Property-get::PROP_ENDPOINT_STATE aborted
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #210 : Could not read endpoint state on the secondary
...
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: close cpc endpoint: Success
Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:59] INFO: otbr-agent ended with exit code 1 (signal 0)...

The second error in the log looks very similar.

agners avatar Sep 04 '23 07:09 agners

Thanks for the logs. Some interesting snippets:


Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: radio tx timeout

Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: RCP failure detected

Aug 30 04:22:33 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:39.085 [W] Platform------: Trying to recover (1/100)

Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: Wait for response timeout

Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: RCP failure detected

Aug 30 04:22:37 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:43.186 [W] Platform------: Trying to recover (2/100)

Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: Wait for response timeout

Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: RCP failure detected

Aug 30 04:22:41 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:47.282 [W] Platform------: Trying to recover (3/100)

Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: Wait for response timeout

Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: RCP failure detected

Aug 30 04:22:45 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:51.378 [W] Platform------: Trying to recover (4/100)

Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: Wait for response timeout

Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: RCP failure detected

Aug 30 04:22:49 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:55.474 [W] Platform------: Trying to recover (5/100)

Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: Wait for response timeout

Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: RCP failure detected

Aug 30 04:22:53 homeassistant addon_core_silabs_multiprotocol[474]: otbr-agent[306]: 06:19:59.570 [W] Platform------: Trying to recover (6/100)

Aug 30 04:22:55 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12

Aug 30 04:22:55 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE

Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:394752] WARNING : In function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12

Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:394768] WARNING : In function 'core_set_endpoint_in_error' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1125 : Setting ep#12 in error, new state: SL_CPC_STATE_ERROR_DESTINATION_UNREACHABLE

Aug 30 04:22:56 homeassistant addon_core_silabs_multiprotocol[474]: [00:22:55:495051] Info : Client disconnected

It seems the OTBR detects that the radio isn't communicating anymore (it can't receive frames any longer). The CPC daemon similarly detects a "retransmit timeout"


WARNING in function 're_transmit_timeout' in file /usr/src/cpc-daemon/server_core/core/core.c at line #1801 : Retransmit limit reached on endpoint #12

The OTBR then ends up in a restart loop, presumably for the same reason (the radio stopped responding):


Aug 30 04:32:58 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:58] INFO: Starting otbr-agent...

...

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:58:732529] Info : New client connection using library v4.3.1.0

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #176 : Property-get::PROP_ENDPOINT_STATE aborted

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: WARNING in function 'property_get_single_endpoint_state_and_reply_to_pending_open_callback' in file /usr/src/cpc-daemon/server_core/system_endpoint/system_callbacks.c at line #210 : Could not read endpoint state on the secondary

...

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: close cpc endpoint: Success

Aug 30 04:32:59 homeassistant addon_core_silabs_multiprotocol[474]: [00:32:59] INFO: otbr-agent ended with exit code 1 (signal 0)...

The second error in the log looks very similar.

Is it something to do with my hardware setup? Or is it possible it can fixed via software?

spartandrew18 avatar Sep 04 '23 15:09 spartandrew18

Further update; new crash, same outcome but slight variation in the logs:

ZHA:-

Logger: bellows.zigbee.application
Source: runner.py:186
First occurred: 10:25:19 (5 occurrences)
Last logged: 10:26:19

Watchdog heartbeat timeout: TimeoutError()

Silicon Labs Multiprotocol:-

[10:54:39] INFO: Starting zigbeed...
[10:54:39] INFO: Setup OTBR firewall...
[10:54:39] INFO: Starting otbr-agent...
otbr-agent[95918]: [NOTE]-AGENT---: Running 0.3.0
otbr-agent[95918]: [NOTE]-AGENT---: Thread version: 1.3.0
otbr-agent[95918]: [NOTE]-AGENT---: Thread interface: wpan0
otbr-agent[95918]: [NOTE]-AGENT---: Radio URL: spinel+cpc://cpcd_0?iid=2&iid-list=0
otbr-agent[95918]: [NOTE]-ILS-----: Infra link selected: enp5s0
otbr-agent[95918]: [INFO]-NCP-----: OpenThread log level changed to 5
otbr-agent[95918]: 50d.15:48:26.792 [C] Platform------: CPC init failed. Ensure radio-url argument has the form 'spinel+cpc://cpcd_0?iid=<1..3>'
otbr-agent[95918]: 50d.15:48:26.792 [C] Platform------: Init() at cpc_interface.cpp:96: Failure
close cpc endpoint: Resource temporarily unavailable
[10:54:41] INFO: otbr-agent ended with exit code 1 (signal 0)...

Specifically the following line:

otbr-agent[95918]: 50d.15:48:26.792 [C] Platform------: CPC init failed. Ensure radio-url argument has the form 'spinel+cpc://cpcd_0?iid=<1..3>

I don't believe have seen this specific error, but everything else is the same.

Restarting the SLM addon shows the following logs:

s6-rc: info: service mdns: starting
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service mdns successfully started
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/check-cpcd-shm.sh
[11:20:14] INFO: Starting mDNS Responder...
Default: mDNSResponder (Engineering Build) (Aug 23 2023 18:27:14) starting
cont-init: info: /etc/cont-init.d/check-cpcd-shm.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service banner: starting
-----------------------------------------------------------
 Add-on: Silicon Labs Multiprotocol
 Zigbee and OpenThread multiprotocol add-on
-----------------------------------------------------------
 Add-on version: 2.3.2
 You are running the latest version of this add-on.
 System: Home Assistant OS 11.0.dev20230705  (amd64 / qemux86-64)
 Home Assistant Core: 2023.9.0
 Home Assistant Supervisor: 2023.08.3
-----------------------------------------------------------
 Please, share the above information when looking for help
 or support in, e.g., GitHub, forums or the Discord chat.
-----------------------------------------------------------
s6-rc: info: service banner successfully started
s6-rc: info: service universal-silabs-flasher: starting
[11:20:15] INFO: Checking /dev/ttyUSB0 identifying SkyConnect v1.0 from Nabu Casa.
[11:20:15] INFO: Starting universal-silabs-flasher with /dev/ttyUSB0
2023-09-07 11:20:15 homeassistant universal_silabs_flasher.flash[185] INFO Extracted GBL metadata: NabuCasaMetadata(metadata_version=1, sdk_version='4.3.1', ezsp_version=None, ot_rcp_version=None, fw_type=<FirmwareImageType.RCP_UART_802154: 'rcp-uart-802154'>, baudrate=460800)
2023-09-07 11:20:15 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.GECKO_BOOTLOADER at 115200 baud
2023-09-07 11:20:17 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.CPC at 460800 baud
2023-09-07 11:20:21 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.CPC at 115200 baud
2023-09-07 11:20:26 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.CPC at 230400 baud
2023-09-07 11:20:30 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.EZSP at 115200 baud
2023-09-07 11:20:35 homeassistant universal_silabs_flasher.flasher[185] INFO Probing ApplicationType.SPINEL at 460800 baud
Error: Failed to probe running application type
s6-rc: warning: unable to start service universal-silabs-flasher: command exited 1
/run/s6/basedir/scripts/rc.init: warning: s6-rc failed to properly bring all the services up! Check your logs (in /run/uncaught-logs/current if you have in-container logging) for more information.
/run/s6/basedir/scripts/rc.init: fatal: stopping the container.
s6-rc: info: service mdns: stopping
s6-rc: info: service banner: stopping
Default: mDNSResponder (Engineering Build) (Aug 23 2023 18:27:14) stopping
s6-rc: info: service banner successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
[11:20:40] INFO: mDNS ended with exit code 4 (signal 0)...
s6-rc: info: service mdns successfully stopped

The fix now is a physical unplug and replug of the USB into the server, then a full host restart. Simple reboot of the VM did not change the outcome.

It also looks like others are reporting similar issues (#3193) so it is not an isolated incident.

Devs, please advise what further information you require to resolve/investigate this. Physical host restarts to resolve a VM issue are far from ideal (in my case anyway) and we'd like to be able to assist to resolve this problem as soon as practical.

evakq8r avatar Sep 07 '23 01:09 evakq8r

Have some with this problem one ESP module and can try using it as serial port over Ethernet ? It was working before but have not testing latest versions but it shall being fixed if having one separate RCP module connected to the ESP.

MattWestb avatar Sep 08 '23 18:09 MattWestb

If you would like to try experimental firmware for the SkyConnect that should eliminate the need for physical resets (source https://github.com/NabuCasa/silabs-firmware-builder/pull/33), I've attached a pre-compiled copy here: skyconnect_rcp-uart-802154_4.3.1_wd.gbl.zip (zipped so it can be uploaded to GitHub, unzip it first).

If you have any step-by-step information for how to reproduce this issue, it would be very useful!

(CC @kylev, from #3193)

puddly avatar Sep 14 '23 15:09 puddly

Not much interesting about my setup (city setting, very busy 2.4Ghz band). I'm on 2023.9.2 with a mix of Wifi (Sengled, Cync, generic), zigbee (misc mfg), and 3 Nanoleaf bulbs.

Flashing with the web tool seemed to work, but the probed GBL metadata seems identical during boot. Should your build appear differently?

NabuCasaMetadata(metadata_version=1, sdk_version='4.3.1', ezsp_version=None, ot_rcp_version=None, fw_type=<FirmwareImageType.RCP_UART_802154: 'rcp-uart-802154'>, baudrate=460800)

simulti_firmware_3192.log

kylev avatar Sep 14 '23 21:09 kylev

Thanks!

The metadata will be identical, especially when probing. This also makes sure the addon doesn't re-flash the bundled firmware.

puddly avatar Sep 14 '23 21:09 puddly

If you would like to try experimental firmware for the SkyConnect that should eliminate the need for physical resets (source NabuCasa/silabs-firmware-builder#33), I've attached a pre-compiled copy here: skyconnect_rcp-uart-802154_4.3.1_wd.gbl.zip (zipped so it can be uploaded to GitHub, unzip it first).

If you have any step-by-step information for how to reproduce this issue, it would be very useful!

(CC @kylev, from #3193)

I've installed the experimental firmware and so far so good. Unfortunately I haven't had the failures again since my last batch of logs, but nothing has changed... so the inconsistencies aren't helping resolve the issue.

As for steps on replicating the issue, there-in lies the problem; I don't actually do anything to trigger the problems, they just happen. I've setup additional logging from the unRAID host in the event the registered USB ID disconnects or does something weird from its perspective, but HA doesn't give any more indication of what caused the failure other than what I provided on original report.

evakq8r avatar Sep 15 '23 08:09 evakq8r

Overnight, the container went into the cpcdcrash-loop with Property-get::PROP_ENDPOINT_STATE aborted and Property-get::PROP_ENDPOINT_STATE aborted errors. I restarted the container, and it entered the probing instead of detection stage and failed.

The new firmware didn't seem to have an effect.

kylev avatar Sep 15 '23 17:09 kylev

Perfect, thank you for the feedback. The watchdog isn't integrated tightly into CPC so it looks like just the CPC part is crashing, not the whole firmware. I'll post an updated one later next week.

puddly avatar Sep 15 '23 17:09 puddly

Just to ensure clarity: I had to un-plug and re-plug the SkyConnect. The new firmware didn't seem to have an effect.

Whatever initially breaks with the dongle triggers the cpcd crash loop but leaves the container up. Subsequently restarting the whole container results in the firmware flasher attempting to probe the device at multiple baudrates, failing, and crashing the container (and triggering watchdog restart).

After re-plugging the dongle, the container comes up on the next watchdog restart.

I suspect if you tweak the cpcd finish script to crash the container we'd see 1 cpcd crash followed by many probe failures.

kylev avatar Sep 15 '23 21:09 kylev

Seems reproducible: same symptoms this morning (crash loop cpcd, restarting container results in firmware flasher probe failures, replug SkyConnect fixes it).

I've just re-flashed the SkyConnect with your build again, just to ensure I didn't make an error last time. Will comment if anything changes.

kylev avatar Sep 16 '23 19:09 kylev

Any progress made on the fix?

spartandrew18 avatar Sep 25 '23 01:09 spartandrew18

To everyone having this problem, are you all using Nanoleaf devices? Because in my case, it's been a while since I had to restart the addon + unplug-replug my skyconnect. I don't know if it's pure luck because it's random, if it's the latest Nanoleaf firmware update (3.5.41) or the various HA updates I installed but it's stable a the moment.

theblackhole avatar Oct 10 '23 22:10 theblackhole

Nope. Not a single nanoleaf. Only hue, aqara, ikea and some thread devices.

DunklerPhoenix avatar Oct 11 '23 02:10 DunklerPhoenix

To everyone having this problem, are you all using Nanoleaf devices? Because in my case, it's been a while since I had to restart the addon + unplug-replug my skyconnect. I don't know if it's pure luck because it's random, if it's the latest Natoleaf firmware update (3.5.41) or the various HA updates I installed but it's stable a the moment.

Unfortunately that firmware only applies to the Matter over Thread Nanoleaf bulbs. The HomeKit over Thread (non-Matter) bulbs latest firmware is 1.6.49.

I've started to add Matter over Thread Nanoleaf bulbs to my mix of things, and out of the box they are 3.2.0, which the Android Nanoleaf indicates has a 'critical firmware upgrade', pointing to 3.5.37 as the FW in question. They all default to 3.5.41 once upgraded though.

evakq8r avatar Oct 11 '23 03:10 evakq8r

it's been a while since I had to restart the addon + unplug-replug my skyconnect.

I spoke too soon, I had to restart/unplug-replug today.

theblackhole avatar Oct 12 '23 23:10 theblackhole

it's been a while since I had to restart the addon + unplug-replug my skyconnect.

I spoke too soon, I had to restart/unplug-replug today.

Mine also failed yesterday as well. I've now split my Skyconnect to do just OTBR work (Sonoff ZBDongle-E to do the Zigbee work), as well as put both USB devices on extension cables to minimise interference.

Skyconnect crashed again about 6 hours after the split duties, and needed a replug to a different USB port to come back to life.

evakq8r avatar Oct 13 '23 05:10 evakq8r

Can you describe the Zigbee and Thread devices you have on your networks? Any Zigbee Green Power?

I've been able to replicate a crash and will try to get a firmware out that possibly mitigates it but there may be multiple concurrent bugs here causing issues.

puddly avatar Oct 13 '23 16:10 puddly

I have the same crashes and have:

Zigbee (values from ZHA): 1x FYRTUR block-out roller blind (IKEA of Sweden) 1x LCT012 (Hue) 3x LCT015 1x lumi.airmonitor.acn01 (aqara) 2x lumi.curtain.acn002 (aqara) 5x lumi.magnet.acn001 (aqara) 1x lumi.remote.b1acn01 (aqara) 1x lumi.sen_ill.mgl01 (xiaomi) 1x lumi.sensor_cube (aqara) 3x lumi.sensor_motion.aq2 (aqara) 2x lumi.vibration.aq1 (aqara) 6x lumi.weather (aqara) 1x Remote Control N2 (IKEA of Sweden) 1x RWL021 (hue) 4x TRADFRI bulb GU10 CWS 345lm (IKEA of Sweden) 2x TRADFRI on/off switch (IKEA of Sweden) 1x TRADFRI open/close remote (IKEA of Sweden) 1x TRADFRI remote control (IKEA of Sweden) 4x TRADFRI Signal Repeater (IKEA of Sweden) 12x TRADFRIbulbE14WWclear250lm (since 10 months not at power, but paired) (IKEA of Sweden) 1x TRADFRIbulbE27WSglobeopal1055lm (IKEA of Sweden) 1x TRADFRIbulbE27WWclear250lm (IKEA of Sweden)

Thread (via Homekit Controller): 8x Eve Thermo 20EBP1701 (eve systems) -> Takes sometimes hours to reconnect after restart of sky connect

Wifi (via Homekit Controller): 1x Netatmo Main Module 1x Indoor Module 1x Outdoor Module

DunklerPhoenix avatar Oct 13 '23 16:10 DunklerPhoenix

Think I'll need to close this issue out as my network topology has changed substantially since first logging. I've now split OTBR and Zigbee duties into two devices (OTBR being Skyconnect, Zigbee being a Sonoff ZBDongle-E until my second Skyconnect turns up). Skyconnect is now OpenThreadRCP only, Sonoff is now Zigbee only.

To answer your question though;

Zigbee: 18x Aqara motion sensors 1x Door lock 2x Zigbee magnetic door sensors 5x Zigbee USB Repeaters

Thread: 5x Nanoleaf Matter over Thread bulbs 26x Nanoleaf Homekit over Thread bulbs

I'm in the midst of replacing the HK bulbs with MoT, for various reasons.

I've removed the multiprotocol addon and reconfigured my Zigbee channel to be elsewhere (multiprotocol required the channels to be the same), and reinstalled the standalone OTBR addon in HACS.

evakq8r avatar Oct 13 '23 17:10 evakq8r

@DunklerPhoenix Id the Eve Thermo 20EBP1701 one sleeper / reduced thread device that can only being one child or one full future thread device that can being router / leader ?

MattWestb avatar Oct 13 '23 19:10 MattWestb

I'm not sure if I understand you correctly.

The Eve Thermo is just a child and cant be used as a router or leader. It's battery powered and so I think a reduced thread device. But I'm not sure. sry

https://www.evehome.com/de/eve-thermo

DunklerPhoenix avatar Oct 13 '23 20:10 DunklerPhoenix

@DunklerPhoenix OK do you have FTD device that can "holding" the network alive (leader) then the RCP is not online ? If not the RCP / addon must being the leader but its taking time for the network being OK after it was offline and getting the child getting back in it.

MattWestb avatar Oct 14 '23 05:10 MattWestb

Ahhh. No they cant hold the network alive. And I dont have any other leader than sky Connect

DunklerPhoenix avatar Oct 14 '23 07:10 DunklerPhoenix