ZHA Disconnects from devices frequently and needs to be reset

Open grainsoflight opened this issue 1 year ago • 12 comments

The problem

Several times a day, Zigbee stops controlling devices and has to be reset to resume functionality. This issue seems to have appeared with the most recent OS update.

What version of Home Assistant Core has the issue?

core-2023.12.3

What was the last working version of Home Assistant Core?

core-2023.11

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Zigbee

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha

Diagnostics information

config_entry-zha-81b947f1e0f590e0cd175fead6f19fde.json.txt

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

grainsoflight avatar Dec 22 '23 05:12 grainsoflight

Hey there @dmulcahey, @adminiuga, @puddly, @thejulianjes, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

home-assistant[bot] avatar Dec 22 '23 05:12 home-assistant[bot]

Can you enable debug logging for the ZHA integration, let it run until it crashes and you have to reload, and then upload the log? You will probably have to compress it first.

puddly avatar Dec 22 '23 06:12 puddly

home-assistant_zha_2023-12-22T05-41-39.977Z.log.zip

Last 24 hours of debug logs. I don't know exactly when it failed, but it did so at least twice.

grainsoflight avatar Dec 22 '23 06:12 grainsoflight
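
When it is unclear at what time ZHA failed, the debug log can be searched for the watchdog warning that bellows emits (the message text is the one quoted in the log excerpts in this thread). The following is only a minimal sketch: the regular expression assumes the default Home Assistant log line layout, and `find_watchdog_timeouts` is a hypothetical helper name, not part of Home Assistant or zigpy.

```python
import re
from datetime import datetime

# Assumed layout of a Home Assistant log entry, e.g.
# "2023-12-22 12:47:04.192 WARNING (MainThread) [logger.name] message"
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\w+) \(([^)]*)\) \[([^\]]+)\] (.*)"
)

# Warning text as it appears in the bellows log lines quoted in this thread.
WATCHDOG_MARKER = "Watchdog heartbeat timeout"


def find_watchdog_timeouts(lines):
    """Return the timestamp of every watchdog heartbeat timeout in the log."""
    hits = []
    for line in lines:
        match = ENTRY.match(line)
        if match and WATCHDOG_MARKER in match.group(5):
            hits.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return hits


sample = [
    "2023-12-22 12:47:04.192 WARNING (MainThread) [bellows.zigbee.application] "
    "Watchdog heartbeat timeout: TimeoutError()",
    "2023-12-22 12:47:04.194 WARNING (MainThread) [zigpy.application] Watchdog failure",
    "2023-12-22 12:47:04.281 DEBUG (MainThread) [zigpy.application] "
    "Connection to the radio has been lost: TimeoutError()",
]

for stamp in find_watchdog_timeouts(sample):
    print(stamp.isoformat(sep=" ", timespec="milliseconds"))
```

For a real log, the list of sample lines would be replaced by the lines of the unzipped log file, for example by iterating over an open file object.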

Hi,

Seems to be related to https://github.com/home-assistant/core/issues/105445 and https://github.com/home-assistant/core/issues/105449.

In my case, updating to 2023.12.3 seems to have solved the problem... (which appeared with one of the 2023.11 versions)

cvermot avatar Dec 22 '23 11:12 cvermot

I have the same problem: ZHA re-initializes over and over again and loses devices while this reinitialization loop repeats.

As mentioned, it might be connected to #105445, #105449 and #105506.

My logs show a watchdog timeout:

2023-12-22 12:47:04.192 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: TimeoutError()
2023-12-22 12:47:04.194 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 68, in command
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 661, in _watchdog_loop
    await self._watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 893, in _watchdog_feed
    (res,) = await self._ezsp.readCounters()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 67, in command
    async with asyncio_timeout(EZSP_CMD_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
2023-12-22 12:47:04.281 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: TimeoutError()
2023-12-22 12:47:04.285 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: TimeoutError()
2023-12-22 12:47:04.285 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
2023-12-22 12:47:04.286 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Shutting down ZHA ControllerApplication
2023-12-22 12:47:04.291 DEBUG (Thread-209) [aiosqlite] executing functools.partial(<function PersistingListener._set_isolation_level.. at 0x7f5c4667a0>)
2023-12-22 12:47:04.292 DEBUG (Thread-209) [aiosqlite] operation functools.partial(<function PersistingListener._set_isolation_level.. at 0x7f5c4667a0>) completed
2023-12-22 12:47:04.300 DEBUG (Thread-209) [aiosqlite] executing functools.partial(<built-in method execute of sqlite3.Connection object at 0x7f627a96c0>, 'PRAGMA wal_checkpoint;', [])
2023-12-22 12:47:04.304 DEBUG (MainThread) [bellows.uart] Connection lost: None
2023-12-22 12:47:04.304 DEBUG (MainThread) [bellows.uart] Closed serial connection

markbeee avatar Dec 22 '23 12:12 markbeee
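
The traceback above shows a command to the radio being awaited under a timeout, the wait being cancelled when the timeout expires, and the resulting `TimeoutError` being treated as a lost connection. The sketch below reproduces that pattern with plain asyncio. It is not the bellows or zigpy code: `send_command`, `watchdog_feed_once` and the value of `CMD_TIMEOUT` are hypothetical stand-ins, and `asyncio.wait_for` is used in place of the `asyncio_timeout` context manager seen in the traceback.

```python
import asyncio

CMD_TIMEOUT = 0.05  # hypothetical demo value, not the real EZSP_CMD_TIMEOUT


async def send_command(response: asyncio.Future):
    """Wait for the radio's reply, giving up after CMD_TIMEOUT seconds."""
    # On expiry, wait_for cancels the pending wait and raises TimeoutError.
    return await asyncio.wait_for(response, timeout=CMD_TIMEOUT)


async def watchdog_feed_once(response: asyncio.Future) -> str:
    """One watchdog iteration: a missed reply is reported as a lost connection."""
    try:
        await send_command(response)
    except asyncio.TimeoutError as exc:
        return f"connection lost: {type(exc).__name__}"
    return "ok"


async def main():
    loop = asyncio.get_running_loop()

    healthy_radio = loop.create_future()
    healthy_radio.set_result({"counters": []})  # the reply has already arrived

    silent_radio = loop.create_future()  # never resolved: the radio stays silent

    return (
        await watchdog_feed_once(healthy_radio),
        await watchdog_feed_once(silent_radio),
    )


results = asyncio.run(main())
print(results)
```

The point of the sketch is that the timeout fires whenever the reply is not processed in time. That can happen because the radio is silent, but also because the event loop was too busy to handle a reply that did arrive.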

I am having the same issue since 2023.12.3. Zigbee keeps breaking down and not recovering on its own.

CommanderROR9 avatar Dec 23 '23 07:12 CommanderROR9

When mine fails, the integration does not show an error; it just doesn't work. I have noticed that other people have been having issues with ZHA, but they tend to get errors shown in the GUI.

grainsoflight avatar Dec 23 '23 07:12 grainsoflight

Hey, exactly the same problems since version 2023.12.0. This is starting to seriously annoy me...

sidwin9 avatar Dec 25 '23 10:12 sidwin9

For me it appears that the issue gets triggered if I use Matter/Thread functionality. If I don't, then the initialisation loop doesn't seem to happen, or at least is rarer.

CommanderROR9 avatar Dec 25 '23 10:12 CommanderROR9

After the last update I have to keep reloading Zigbee for anything to work. Then it stops working after some time, and I have to reload again.

pringleprng avatar Dec 25 '23 11:12 pringleprng

Any news?

khzd avatar Dec 25 '23 17:12 khzd

This will be fixed by https://github.com/home-assistant/core/pull/106147, which is currently in dev and will be included in the next Core bugfix release.

puddly avatar Dec 25 '23 17:12 puddly

Hello, the problem is still present despite the 2023.12.4 update and, in addition, some devices no longer work. Please revert to the ZHA version present in core 2023.11.0.

sidwin9 avatar Dec 28 '23 05:12 sidwin9

ZHA is still reinitializing with 2023.12.4, devices still lost/getting lost.

From the logs:

2023-12-28 07:42:23.111 WARNING (MainThread) [bellows.zigbee.application] Watchdog heartbeat timeout: TimeoutError()
2023-12-28 07:42:23.112 WARNING (MainThread) [zigpy.application] Watchdog failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 68, in command
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 657, in _watchdog_loop
    await self._watchdog_feed()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 893, in _watchdog_feed
    (res,) = await self._ezsp.readCounters()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 215, in _command
    return await self._protocol.command(name, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 67, in command
    async with asyncio_timeout(EZSP_CMD_TIMEOUT):
  File "/usr/local/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
    raise TimeoutError from exc_val
TimeoutError
2023-12-28 07:42:23.198 DEBUG (MainThread) [zigpy.application] Connection to the radio has been lost: TimeoutError()
2023-12-28 07:42:23.204 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Connection to the radio was lost: TimeoutError()
2023-12-28 07:42:23.204 DEBUG (MainThread) [zigpy.application] Stopping watchdog loop
2023-12-28 07:42:23.206 DEBUG (MainThread) [homeassistant.components.zha.core.gateway] Shutting down ZHA ControllerApplication

markbeee avatar Dec 28 '23 07:12 markbeee

Still an issue, but not as bad? Devices seem to work, but some plugs take time to switch and show an offline message. It works, but slowly.

pringleprng avatar Dec 28 '23 08:12 pringleprng

Well, if 2023.12.4 doesn't fix it, then I guess my decision to move over to Zigbee2mqtt again (although the process is nerve-wracking every time) was the right one.

CommanderROR9 avatar Dec 28 '23 08:12 CommanderROR9

Still present despite the 2023.12.4 update. New to HA in the last six months, and questioning that decision.

JFBruns avatar Dec 28 '23 13:12 JFBruns

For Zigbee, nothing works correctly since version 2023.12.0. I carried out tests with zigbee2mqtt and everything works perfectly; the problem is ZHA. I use a Sonoff Zigbee 3.0 version P dongle. It is no longer even possible to detect new hardware.

sidwin9 avatar Dec 28 '23 14:12 sidwin9

For Zigbee, nothing works correctly since version 2023.12.0. I carried out tests with zigbee2mqtt and everything works perfectly; the problem is ZHA. I use a Sonoff Zigbee 3.0 version P dongle. It is no longer even possible to detect new hardware.

Still present despite the 2023.12.4 update. New to HA in the last six months, and questioning that decision.

Guys, we are trying to figure out what the issue is. These comments aren’t helpful without full debug logs. We understand you are frustrated (we are too). Without the debug output we aren’t going to be able to track this down.

dmulcahey avatar Dec 28 '23 15:12 dmulcahey

Errors on devices are random. Looks like no more than 3 devices can be controlled.

sidwin9 avatar Dec 28 '23 15:12 sidwin9

ZHA is still reinitializing with 2023.12.4, devices still lost/getting lost.

Any chance you’d be willing to disable all other integrations (core and custom) and restart? We’re trying to determine if we’re having delay issues in the event loop…

dmulcahey avatar Dec 28 '23 16:12 dmulcahey
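
To illustrate what a delay in the event loop means here: if any code blocks the loop synchronously, every other coroutine, including one waiting on a radio reply with a timeout, is resumed late. The sketch below measures that lag in isolation. It is a generic asyncio demonstration under assumed names (`measure_loop_lag`, `blocking_task`), not a Home Assistant diagnostic tool.

```python
import asyncio
import time


async def measure_loop_lag(interval: float, samples: int) -> float:
    """Return the worst observed delay (seconds) beyond the requested sleep."""
    worst = 0.0
    for _ in range(samples):
        start = time.monotonic()
        await asyncio.sleep(interval)
        # Any time beyond `interval` was spent waiting for the loop to be free.
        lag = time.monotonic() - start - interval
        worst = max(worst, lag)
    return worst


async def blocking_task():
    """Stand-in for code that blocks the event loop."""
    await asyncio.sleep(0.01)  # let the monitor start its first sleep
    time.sleep(0.2)  # synchronous sleep: nothing else can run meanwhile


async def main():
    monitor = asyncio.create_task(measure_loop_lag(interval=0.05, samples=5))
    await blocking_task()
    return await monitor


worst_lag = asyncio.run(main())
print(f"worst event-loop lag: {worst_lag:.3f}s")
```

Disabling other integrations, as requested above, is the practical way to test whether one of them is responsible for this kind of blocking.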

Hello, following a restart of HA, all my Zigbee devices became unavailable. Can you deploy a previous working version of ZHA so that we can have functional home automation again? Home Assistant has become unusable because of ZHA.

sidwin9 avatar Dec 29 '23 05:12 sidwin9

ZHA is still reinitializing with 2023.12.4, devices still lost/getting lost.

Any chance you’d be willing to disable all other integrations (core and custom) and restart? We’re trying to determine if we’re having delay issues in the event loop…

In the meantime ALL my ZHA-devices went unavailable (=> it will be a lot of work to reintegrate many of the devices).

ZHA keeps reinitializing. Whatever changed with the latest release made it even worse. I would suggest rolling back to the last working version of ZHA and then taking a thorough look at the diffs from 2023.11 to 2023.12. Anything else is like looking for a needle in a haystack. Just my two cents. In my experience and humble opinion, watchdog and timeout events are serious and not something for quick workarounds, tests, or any kind of patchwork. But of course that is up to you as code owners.

EDIT: Downgraded to 2023.11 as ZHA and HA became unusable. Many ZHA devices came back online quickly; some seem to be lost in the ZigBee-H(e)aven ;)

EDIT2: GOOD NEWS: Many of my ZHA devices came back (~70%). IKEA end devices (with not-so-good/poor RSSI/LQI?) are not yet willing to do the same. But I had the same problems with RaspBee/ConBee in previous years. Might be an isolated IKEA firmware case.

markbeee avatar Dec 29 '23 07:12 markbeee

+1, HA loses the ability to communicate with Zigbee devices 1-2 times per day. It needs a restart, then everything works again. I have a lot of devices from different vendors, and before ~2023.12 I had a very stable experience with HA for several months (except for the suggested intro Raspberry Pi setup trashing the SD card after a few months, killing the system completely).

Had hoped the .4 release would fix it, but no. If I knew how to revert to pre .12 version I would.

dlinq avatar Dec 29 '23 14:12 dlinq

  • Make a backup
  • Install Terminal & SSH add-on
  • In the terminal type: ha core update --version 2023.11.3

Disclaimer: I just switched to HA from Node-RED (> 7 years) as my smart home system, and I consider myself an HA noob.

So there might be much more to think about when updating or downgrading.

markbeee avatar Dec 29 '23 15:12 markbeee

Hello, I did a test by installing a new HA 2023.12.4 platform with ZHA from scratch, and everything works correctly (adding new devices, checking all added devices). On my regular HA instance, nothing works even after resetting ZHA with a new network. There may be another element or configuration that disrupts the proper functioning of ZHA.

sidwin9 avatar Dec 29 '23 19:12 sidwin9

After downgrading to version 2023.11.3, the problems persist; nothing works.

sidwin9 avatar Dec 29 '23 19:12 sidwin9

How can ZHA be completely and cleanly reset? Even after removing the integration, traces of ZHA remain.

sidwin9 avatar Dec 29 '23 19:12 sidwin9