core icon indicating copy to clipboard operation
core copied to clipboard

ZHA keep reinitializing every few minutes - all zigbee devices going unavailable

Open AnghusMcleod opened this issue 1 year ago • 24 comments

The problem

All zigbee devices keep going unavailable every 5-10 minutes since some days. They stay unavailable for some time each time. Looking at the ZHA integration, it is initializing each time this occurs.

Debug is enabled and a watchdog failure occurs each time claiming connectivity to the radio has been lost.

What version of Home Assistant Core has the issue?

core-2024.8.2

What was the last working version of Home Assistant Core?

core-2024.7.4

What type of installation are you running?

Home Assistant OS

Integration causing the issue

ZHA

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/

Diagnostics information

home-assistant_2024-08-27T19-48-02.258Z.log

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2024-08-27 15:37:52.907 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 1065, in request
    response = await response_future
               ^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 665, in _watchdog_loop
    await self.watchdog_feed()
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 651, in watchdog_feed
    await self._watchdog_feed()
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/zigbee/application.py", line 628, in _watchdog_feed
    await self._znp.request(c.SYS.Ping.Req())
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 1061, in request
    async with async_timeout.timeout(
  File "/usr/local/lib/python3.12/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.12/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
TimeoutError
2024-08-27 15:37:52.922 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: TimeoutError()
2024-08-27 15:37:52.924 DEBUG (MainThread) [homeassistant.components.zha.helpers] Connection to the radio was lost: ConnectionLostEvent(event_type='zha_gateway_message', event='connection_lost', exception=TimeoutError())
2024-08-27 15:37:52.933 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
2024-08-27 15:37:52.938 DEBUG (MainThread) [zigpy_znp.api] Sending request: SYS.ResetReq.Req(Type=<ResetType.Soft: 1>)
2024-08-27 15:37:52.939 DEBUG (MainThread) [zigpy_znp.api] Request has no response, not waiting for one.
2024-08-27 15:37:52.939 DEBUG (MainThread) [zigpy_znp.uart] Closing serial port

Additional information

Running on a RPi4b, using a SONOFF Zigbee 3.0 USB Dongle Plus, TI CC2652P + CP2102N Zigbee USB Stick, flashed to the latest Koenkk launchpad firmware (20230507). Stick is connected to a usb 2.0 port at the end of a 1.5m cable.

Zigbee network is on channel 11, and nearest WiFi AP is 2m away on on WiFi channel 11 (other end of 2.4Ghz spectrum).

Unifi spectrum analysis shows no noise on WiFi channel 1 (-96dbm) overlapping with Zigbee channel 11. Network is propagated by 4x Tradfri control outlets + 2x Aqara smart plugs (with monitoring) + 1xInovelli Blue 2-1, all of which were present with the network stable.

Tried rolling back to older version of core (7.4 and even 4.4 with no luck), rolling back forward changed nothing. Tried changing usb port and usb cable, no luck.

AnghusMcleod avatar Aug 27 '24 20:08 AnghusMcleod

Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of zha can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign zha Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


zha documentation zha source (message by IssueLinks)

home-assistant[bot] avatar Aug 27 '24 20:08 home-assistant[bot]

As far as ZHA is concerned, the radio is unresponsive:

2024-08-27 15:37:37.903 DEBUG (MainThread) [zigpy.application] Feeding watchdog
2024-08-27 15:37:37.904 DEBUG (MainThread) [zigpy_znp.api] Sending request: SYS.Ping.Req()

# No response for 25s
2024-08-27 15:37:52.907 WARNING (MainThread) [zigpy.application] Watchdog failure

If ZHA can't communicate with the radio, the firmware on the stick probably crashes. It's a firmware or hardware issue, which is why rolling back to an earlier Core release doesn't change anything.

puddly avatar Aug 27 '24 20:08 puddly

Understood and agreed; I was doing as much elimination testing as I could. But I also re-flashed the firmware using ZigStar TI CC2652P/P7 FW Flasher to the same Koenkk launchpad version to no avail.

AnghusMcleod avatar Aug 27 '24 20:08 AnghusMcleod

When re-flashing, do you do a complete chip erase? I would also try downgrading the firmware to an earlier release to see if it helps.

puddly avatar Aug 27 '24 20:08 puddly

I don't see an erase option in the ZigStar add-on; I guess I would need to do it off the Pi - but I could also try an older firmware at the same time. Will give that a go and report back.

AnghusMcleod avatar Aug 27 '24 20:08 AnghusMcleod

Erased and re-flashed with 2538-bsl, and sadly the same behaviour; safe to assume the radio is dead I guess... ordered a new one.

AnghusMcleod avatar Aug 28 '24 12:08 AnghusMcleod

@puddly Replaced radio, migrated and restored backup, and ZHA is still continually restarting exactly the same way.

AnghusMcleod avatar Aug 31 '24 16:08 AnghusMcleod

Same issue with RP4 and sonoff zigbee usb dongle but it happens about 2 or 3 times per day and started in May.

Enregistreur: homeassistant
Source: /usr/src/homeassistant/homeassistant/runner.py:148
S'est produit pour la première fois: 28 août 2024 à 20:47:11 (25 occurrences)
Dernier enregistrement: 18:46:59

Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 1065, in request
    response = await response_future
               ^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 1104, in request_callback_rsp
    await self.request(request, timeout=timeout, **response_params)
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 1061, in request
    async with async_timeout.timeout(
  File "/usr/local/lib/python3.12/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.12/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
TimeoutError

and at same time, I also have this error message:

Enregistreur: homeassistant
Source: /usr/src/homeassistant/homeassistant/runner.py:148
S'est produit pour la première fois: 28 août 2024 à 20:47:26 (35 occurrences)
Dernier enregistrement: 18:47:00

Error doing job: Task exception was never retrieved (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 1104, in request_callback_rsp
    await self.request(request, timeout=timeout, **response_params)
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 1058, in request
    self._uart.send(frame)
    ^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'send'

tof92130 avatar Sep 01 '24 14:09 tof92130

I'm getting the exact same symptoms .. had the same issues earlier in the year and I ended up replacing my coordinator on advice to try to sort it (it didn't , it was a zha issue). I've posted my logs in a thread I made about it on the zigpy page https://github.com/zigpy/zigpy/issues/1465

thefunkygibbon avatar Sep 04 '24 21:09 thefunkygibbon

At the beginning the zha restart occurred every 10 minutes and I also found several ikea trafri plugs blocked.

I reinitialize the plugs several time and as I thought it could be an attempt to enter in my home network, I look the hacs extensions I used and their security.

WebRTC (use for frigate) seemed to be the weaker one so I decided to stop and remove it. Don't know if it is linked with but since my ZHA still reboots but there are several hours.

tof92130 avatar Sep 05 '24 05:09 tof92130

seems a little unlikely that all of ours (i can see a few others complaining about similar issues) are having the same issues, with different coordinators (even if they were the same, its a bit of a stretch that more than one person is having a hardware issue at the same time).

I'm running on a SLZB-06M connected via ethernet. whilst i'm not 100% that there isn't some kind of connection to the zigbee radio communication issue, I've been constantly checking with pings and i've not seen any dropped packets or even slightly elevated ping times.

below are the logs from my coordinator telling me all the times since early this morning where HA has reconnected to the coordinator (implying that zha crashed/restarted)

[03:11:19] taskZB | New client: 192.168.0.2 id: 0 [03:33:26] taskZB | Client disconnected, id: 0 [03:33:29] taskZB | New client: 192.168.0.2 id: 0 [04:02:26] taskZB | Client disconnected, id: 0 [04:03:04] taskZB | New client: 192.168.0.2 id: 0 [04:14:19] taskZB | Client disconnected, id: 0 [04:14:21] taskZB | New client: 192.168.0.2 id: 0 [05:50:34] taskZB | Client disconnected, id: 0 [05:50:36] taskZB | New client: 192.168.0.2 id: 0 [05:50:44] taskZB | Client disconnected, id: 0 [05:50:52] taskZB | New client: 192.168.0.2 id: 0 [05:51:00] taskZB | Client disconnected, id: 0 [05:51:13] taskZB | New client: 192.168.0.2 id: 0 [06:04:19] taskZB | Client disconnected, id: 0 [06:04:21] taskZB | New client: 192.168.0.2 id: 0 [06:30:43] taskZB | Client disconnected, id: 0 [06:30:45] taskZB | New client: 192.168.0.2 id: 0 [06:30:54] taskZB | Client disconnected, id: 0 [06:31:01] taskZB | New client: 192.168.0.2 id: 0 [06:38:28] taskZB | Client disconnected, id: 0 [06:38:30] taskZB | New client: 192.168.0.2 id: 0 [06:57:21] taskZB | Client disconnected, id: 0 [06:57:23] taskZB | New client: 192.168.0.2 id: 0 [06:57:28] taskZB | Client disconnected, id: 0 [06:57:40] taskZB | New client: 192.168.0.2 id: 0 [06:57:50] taskZB | Client disconnected, id: 0 [06:57:52] taskZB | New client: 192.168.0.2 id: 0 [07:53:54] taskZB | Client disconnected, id: 0 [07:53:56] taskZB | New client: 192.168.0.2 id: 0 [07:54:02] taskZB | Client disconnected, id: 0 [07:54:09] taskZB | New client: 192.168.0.2 id: 0 [07:54:20] taskZB | Client disconnected, id: 0 [07:54:22] taskZB | New client: 192.168.0.2 id: 0 [08:00:47] taskZB | Client disconnected, id: 0 [08:00:49] taskZB | New client: 192.168.0.2 id: 0 [08:00:59] taskZB | Client disconnected, id: 0 [08:01:02] taskZB | New client: 192.168.0.2 id: 0 [08:24:07] taskZB | Client disconnected, id: 0 [08:24:10] taskZB | New client: 192.168.0.2 id: 0 [08:24:18] taskZB | Client disconnected, id: 0 [08:24:26] taskZB | New client: 192.168.0.2 id: 0 [09:10:41] taskZB | Client disconnected, id: 0 [09:10:44] taskZB | New client: 192.168.0.2 id: 0 [09:10:53] taskZB | Client disconnected, id: 0 [09:11:00] taskZB | New client: 192.168.0.2 id: 0 [09:39:40] taskZB | Client disconnected, id: 0 [09:39:42] taskZB | New client: 192.168.0.2 id: 0 [09:40:26] taskZB | Client disconnected, id: 0 [09:40:28] taskZB | New client: 192.168.0.2 id: 0 [09:40:34] taskZB | Client disconnected, id: 0 [09:40:42] taskZB | New client: 192.168.0.2 id: 0 [09:47:16] taskZB | Client disconnected, id: 0 [09:47:19] taskZB | New client: 192.168.0.2 id: 0 [09:47:29] taskZB | Client disconnected, id: 0 [09:47:31] taskZB | New client: 192.168.0.2 id: 0 [09:47:41] taskZB | Client disconnected, id: 0 [09:47:44] taskZB | New client: 192.168.0.2 id: 0 [10:08:43] taskZB | Client disconnected, id: 0 [10:08:46] taskZB | New client: 192.168.0.2 id: 0 [10:08:56] taskZB | Client disconnected, id: 0 [10:08:58] taskZB | New client: 192.168.0.2 id: 0 [10:39:00] taskZB | Client disconnected, id: 0 [10:39:03] taskZB | New client: 192.168.0.2 id: 0 [10:39:09] taskZB | Client disconnected, id: 0 [10:39:16] taskZB | New client: 192.168.0.2 id: 0 [10:55:43] taskZB | Client disconnected, id: 0 [10:55:45] taskZB | New client: 192.168.0.2 id: 0 [11:23:54] taskZB | Client disconnected, id: 0 [11:23:56] taskZB | New client: 192.168.0.2 id: 0 [11:34:58] taskZB | Client disconnected, id: 0 [11:35:00] taskZB | New client: 192.168.0.2 id: 0 [11:35:08] taskZB | Client disconnected, id: 0 [11:35:15] taskZB | New client: 192.168.0.2 id: 0 [11:37:20] taskZB | Client disconnected, id: 0 [11:37:22] taskZB | New client: 192.168.0.2 id: 0 [11:44:27] taskZB | Client disconnected, id: 0 [11:44:29] taskZB | New client: 192.168.0.2 id: 0 [11:44:37] taskZB | Client disconnected, id: 0 [11:44:45] taskZB | New client: 192.168.0.2 id: 0 [11:46:18] taskZB | Client disconnected, id: 0 [11:46:21] taskZB | New client: 192.168.0.2 id: 0 [11:46:30] taskZB | Client disconnected, id: 0 [11:46:37] taskZB | New client: 192.168.0.2 id: 0

thefunkygibbon avatar Sep 05 '24 11:09 thefunkygibbon

At the beginning the zha restart occurred every 10 minutes and I also found several ikea trafri plugs blocked.

I reinitialize the plugs several time and as I thought it could be an attempt to enter in my home network, I look the hacs extensions I used and their security.

WebRTC (use for frigate) seemed to be the weaker one so I decided to stop and remove it. Don't know if it is linked with but since my ZHA still reboots but there are several hours.

So interesting that I had recently installed Alarmo and WebRTC from HACS and was still running <v2 there - I disabled and removed both those integrations, and updated HACS to 2.0.1, but no change. ZHA keeps restarting and insisting that the brand new radio is timing out - testing that radio on my Windows PC with the same usb cable, and I can find nothing wrong.

If the zha and zigpy folks have other / more debugs we can run, I would be glad to.

AnghusMcleod avatar Sep 05 '24 16:09 AnghusMcleod

ditto on doing anything i can do to help get to the bottom of this issue. not sure how having non-zigbee related integrations installed is relevant to zha restarting constantly. If a ZHA/zigpy expert can shed any light on that then for sure I will delete/disable any of my little or unused integrations to try to fix.

for what its worth, I too have Alarmo installed, and WebRTC from HACS and I have a bunch of Ikea tradfri bulbs and a couple of motion sensors (although they're no longer in use as I use Sonoff Motion sensors now). But everything above has been in use for a couple of years at least..... soooooo ???!

thefunkygibbon avatar Sep 06 '24 10:09 thefunkygibbon

Can you restart HA in safe mode and see if the issue persists?

dmulcahey avatar Sep 06 '24 11:09 dmulcahey

Can you restart HA in safe mode and see if the issue persists?

i''ve just downgraded firmware and radio on my coordinator to see if it makes any difference. so far its only been 20 mins so i'll wait a couple of hours at least before i try safe mode (didn't know it existed until now!) thanks

thefunkygibbon avatar Sep 06 '24 11:09 thefunkygibbon

ok so its been 1.5 hours now, with no disconnects. going by the history of many disconnects every hour, I'm going to assume that the firmware downgrade (note that I had already downgraded the zigbee radio firmware, it seems that the latest main firmware for the device is buggy) !!

thefunkygibbon avatar Sep 06 '24 12:09 thefunkygibbon

Does it still work ? Where did you find the firmware you mention ?

tof92130 avatar Sep 06 '24 14:09 tof92130

it wouldn't be relevant for you, I use a SMLight ethernet based coordinator

thefunkygibbon avatar Sep 06 '24 18:09 thefunkygibbon

My ZHA would not initialize anymore after a sudden power outage to my server. It would just keep switching between initializing and failed setup, or something like that.

I fixed it by just rebooting my entire server, even just HA reboots or replugging my SONOFF Zigbee 3.0 USB Dongle Plus didn't work.

Sjorsa avatar Sep 29 '24 10:09 Sjorsa

Seeing this issue on an HA Yellow using the built-in Zigbee hardware as well. Added logs to #124755.

asayler avatar Oct 03 '24 05:10 asayler

In my home, I removed one zigbee device and it stopped reinitializing zigbee for 2 days.

That device was a tradfri switch and visualisation showed it was a hub.

May be the device is not correctly working, when I try to reappear it, it was unabble to identify itself correctly.

=> problem solved: one device was responsable of the problem. I removed it and evertthing works.

tof92130 avatar Oct 03 '24 06:10 tof92130

Just rebooted my server again, and having the same issue. Another reboot does not work this time. Are there any tests I could do?

EDIT: It worked again after I replugged the Zigbee stick, and then restarted the server. Seems like the stick maybe crashes?

2024-10-07 20:36:16.611 DEBUG (MainThread) [homeassistant.components.zha] Failed to set up ZHA
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 1065, in request
    response = await response_future
               ^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 151, in async_setup_entry
    await zha_gateway.async_initialize()
  File "/usr/local/lib/python3.12/site-packages/zha/application/gateway.py", line 272, in async_initialize
    await self._async_initialize()
  File "/usr/local/lib/python3.12/site-packages/zha/application/gateway.py", line 255, in _async_initialize
    await self.application_controller.startup(auto_form=True)
  File "/usr/local/lib/python3.12/site-packages/zigpy/application.py", line 234, in startup
    await self.connect()
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/zigbee/application.py", line 103, in connect
    await znp.connect()
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 738, in connect
    self.capabilities = (await self._skip_bootloader()).Capabilities
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 717, in _skip_bootloader
    return await self.request(c.SYS.Ping.Req())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zigpy_znp/api.py", line 1061, in request
    async with async_timeout.timeout(
  File "/usr/local/lib/python3.12/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.12/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
TimeoutError

Sjorsa avatar Oct 07 '24 18:10 Sjorsa

In my home, I removed one zigbee device and it stopped reinitializing zigbee for 2 days.

That device was a tradfri switch and visualisation showed it was a hub.

May be the device is not correctly working, when I try to reappear it, it was unabble to identify itself correctly.

=> problem solved: one device was responsable of the problem. I removed it and evertthing works.

Is it possible that ZHA's handling of devices not correctly working is not quite right? Devices will die out all the time, but I'd hope that ZHA wouldn't just fall over if this is the case. A log indicating which device(s) is/are broken would be much more helpful!

GrumpyMetalGuy avatar Oct 08 '24 12:10 GrumpyMetalGuy

In my home, I removed one zigbee device and it stopped reinitializing zigbee for 2 days. That device was a tradfri switch and visualisation showed it was a hub. May be the device is not correctly working, when I try to reappear it, it was unabble to identify itself correctly. => problem solved: one device was responsable of the problem. I removed it and evertthing works.

Is it possible that ZHA's handling of devices not correctly working is not quite right? Devices will die out all the time, but I'd hope that ZHA wouldn't just fall over if this is the case. A log indicating which device(s) is/are broken would be much more helpful!

My ZHA network is now working perfectly. This device was not dead but I think to was badly transmitting information. I tried to reappeared it, ZHA found the device but the appearing did not succeed.

tof92130 avatar Oct 08 '24 14:10 tof92130

In my home, I removed one zigbee device and it stopped reinitializing zigbee for 2 days. That device was a tradfri switch and visualisation showed it was a hub. => problem solved: one device was responsable of the problem. I removed it and evertthing works.

Is it possible that ZHA's handling of devices not correctly working is not quite right?

My ZHA network is now working perfectly. This device was not dead but I think to was badly transmitting information.

My zigbee network has worked fine for several years, but since I switched from deconz to zha two weeks ago, I've had the same issue described here.

And, I also have an IKEA 10w Driver that functions somewhat as a router, so I've just unplugged it this morning to see if it makes a difference. Fingers crossed.

teleksterling avatar Oct 18 '24 02:10 teleksterling

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.