Devices stop communicating randomly

Open Alfy1080 opened this issue 1 year ago • 201 comments

What happened?

Hello. I am not a Zigbee expert, so apologies if I provide incomplete information. I will try my best to mention everything.

For the past two days, my Z2M instance in Home Assistant has been acting up at random.

At random times, one or more Zigbee devices stop executing commands.
ex1: An Aqara TRV does not change its target temperature when I try to change it from either Home Assistant or the Z2M interface. Its last seen value keeps increasing, as if there were no communication between the TRV and the coordinator.
ex2: A Philips Hue Lightstrip does not switch on or off when I try to toggle it from either Home Assistant or Z2M.

I have set the logging level to debug, and when I send a command to a stuck device, the command shows up in the logs without any error whatsoever.

Power cycling the stuck device or pressing the pairing button (where applicable) does nothing. The only thing that gets my Zigbee network back up and running temporarily is restarting Z2M, which makes me think that this is caused by something in Z2M and not by the devices that fail. Right after the Z2M restart, all stuck devices start communicating again for a while, until at some point either the same devices or others show the same behaviour as before.

My setup:
Zigbee dongle: Home Assistant SkyConnect flashed with the latest firmware available through the web flasher here https://skyconnect.home-assistant.io/firmware-update/
Zigbee2Mqtt: Latest addon version available for Home Assistant (1.33.2-1)
Home Assistant Core version: 2023.11.2
Home Assistant Supervisor version: 2023.11.3
OS: Debian 12
Server: Dell OptiPlex 9020 Micro, Core i7-4790t 3.90GHz, 16GB DDR3, SSD

What did you expect to happen?

I expected that no device would get into a frozen state where I cannot issue commands to it or receive state changes from it. At least not as often as once every few minutes or hours.

How to reproduce it (minimal and precise)

There are no reproduction steps that I can think of. This issue can happen even when no Zigbee device receives any command at all. I even left home for a few hours, and just after leaving I restarted zigbee2mqtt to make sure everything was in working order. Nobody was home, so no commands were sent by anyone. When I returned home, one of the Aqara TRVs was stuck.

Zigbee2MQTT version

1.33.2

Adapter firmware version

7.2.2.0 build 190

Adapter

Home Assistant SkyConnect

Debug log

log.txt

Alfy1080 avatar Nov 17 '23 16:11 Alfy1080

Here you can see that I have updated the target temperature on 3 Aqara TRVs:

Living_Room_TRV: 21.5
Kids_Room_TRV: 17.5
Bedroom_TRV: 22.5

debug 17-11-2023 18:24:29: Received MQTT message on 'zigbee2mqtt/Living_Room_TRV/set/occupied_heating_setpoint' with data '21.5'
debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Living_Room_TRV'
debug 17-11-2023 18:24:29: Received MQTT message on 'zigbee2mqtt/Kids_Room_TRV/set/occupied_heating_setpoint' with data '17.5'
debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Kids_Room_TRV'
debug 17-11-2023 18:24:29: Received MQTT message on 'zigbee2mqtt/Bedroom_TRV/set/occupied_heating_setpoint' with data '22.5'
debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Bedroom_TRV'
info 17-11-2023 18:24:29: MQTT publish: topic 'zigbee2mqtt/Bedroom_TRV', payload '{"away_preset_temperature":null,"battery":100,"calibrate":null,"calibrated":null,"child_lock":"UNLOCK","device_temperature":27,"internal_heating_setpoint":30,"last_seen":"2023-11-17T18:24:29+02:00","linkquality":164,"local_temperature":23.5,"occupied_heating_setpoint":23,"power_outage_count":0,"preset":"manual","schedule":null,"schedule_settings":null,"sensor":"external","setup":false,"system_mode":"heat","update":{"installed_version":2590,"latest_version":2590,"state":"idle"},"update_available":null,"valve_alarm":false,"valve_detection":"ON","voltage":3300,"window_detection":"OFF","window_open":null}'
info 17-11-2023 18:24:29: MQTT publish: topic 'zigbee2mqtt/Bedroom_TRV', payload '{"away_preset_temperature":null,"battery":100,"calibrate":null,"calibrated":null,"child_lock":"UNLOCK","device_temperature":27,"internal_heating_setpoint":30,"last_seen":"2023-11-17T18:24:29+02:00","linkquality":164,"local_temperature":23.5,"occupied_heating_setpoint":22.5,"power_outage_count":0,"preset":"manual","schedule":null,"schedule_settings":null,"sensor":"external","setup":false,"system_mode":"heat","update":{"installed_version":2590,"latest_version":2590,"state":"idle"},"update_available":null,"valve_alarm":false,"valve_detection":"ON","voltage":3300,"window_detection":"OFF","window_open":null}'
debug 17-11-2023 18:24:30: Received Zigbee message from 'Bedroom_TRV', type 'attributeReport', cluster 'hvacThermostat', data '{"occupiedHeatingSetpoint":2250}' from endpoint 1 with groupID 0
info 17-11-2023 18:24:30: MQTT publish: topic 'zigbee2mqtt/Bedroom_TRV', payload '{"away_preset_temperature":null,"battery":100,"calibrate":null,"calibrated":null,"child_lock":"UNLOCK","device_temperature":27,"internal_heating_setpoint":30,"last_seen":"2023-11-17T18:24:30+02:00","linkquality":168,"local_temperature":23.5,"occupied_heating_setpoint":22.5,"power_outage_count":0,"preset":"manual","schedule":null,"schedule_settings":null,"sensor":"external","setup":false,"system_mode":"heat","update":{"installed_version":2590,"latest_version":2590,"state":"idle"},"update_available":null,"valve_alarm":false,"valve_detection":"ON","voltage":3300,"window_detection":"OFF","window_open":null}'

Out of these 3 TRVs, only the Bedroom_TRV actually executed the command. The other two completely ignored it, and their last seen status did not update when I changed the target temperature:
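For anyone who wants to spot this pattern in a longer debug log, here is a minimal sketch in Python. It assumes the two debug-line shapes quoted above ("Publishing 'set' ..." and "Received Zigbee message from ...") and treats any later Zigbee message from a device as a sign of life, which is a simplification. The function name `unanswered_devices` is made up for this example and is not part of Z2M.

```python
import re

# These patterns only cover the two debug-line shapes quoted in the log above.
PUBLISH_RE = re.compile(r"Publishing 'set' '[^']+' to '([^']+)'")
RECEIVED_RE = re.compile(r"Received Zigbee message from '([^']+)'")


def unanswered_devices(log_lines):
    """Return devices that were sent a 'set' but sent no Zigbee message afterwards."""
    pending = []  # devices still waiting for a sign of life, in command order
    for line in log_lines:
        sent = PUBLISH_RE.search(line)
        if sent:
            if sent.group(1) not in pending:
                pending.append(sent.group(1))
            continue
        heard = RECEIVED_RE.search(line)
        if heard and heard.group(1) in pending:
            pending.remove(heard.group(1))
    return pending


# Shortened copies of the relevant lines from the log excerpt above.
sample = [
    "debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Living_Room_TRV'",
    "debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Kids_Room_TRV'",
    "debug 17-11-2023 18:24:29: Publishing 'set' 'occupied_heating_setpoint' to 'Bedroom_TRV'",
    "debug 17-11-2023 18:24:30: Received Zigbee message from 'Bedroom_TRV', type 'attributeReport'",
]

print(unanswered_devices(sample))  # ['Living_Room_TRV', 'Kids_Room_TRV']
```

A device showing up in this list is only a hint, not proof: a device may simply not have reported yet when the log was captured.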

image

Attached another full log after changing the target temperatures: log.txt

Alfy1080 avatar Nov 17 '23 16:11 Alfy1080

I have a similar problem with the Aqara thermostats (#19342). Sometimes the communication seems to fail, however Z2M does not notice that and pretends everything is right.

Ra72xx avatar Nov 18 '23 04:11 Ra72xx

Forgot to mention that I have fully erased zigbee2mqtt from my system, reinstalled and reconfigured Z2M from scratch, and re-paired all my devices to my coordinator, and the exact same issue is still happening randomly. As a workaround I have set up an automation in Home Assistant to restart the Z2M add-on every 30 minutes, just to make sure my TRVs don't stay stuck for too long and cause my heating to run forever. I have also ordered a Sonoff Dongle-P to rule out the possibility that the SkyConnect is broken, but it will only arrive in one or two days, so I'm still waiting to test that. I also replaced my USB extension cable to make sure I'm not using a faulty cable, but that didn't improve the situation either.
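A more targeted alternative to a blind restart could be to check which devices have gone quiet. The sketch below is only an illustration in Python: it assumes `last_seen` is published as an ISO 8601 string with a UTC offset (as in the log above), and the 30-minute threshold is simply borrowed from the restart interval, not a recommended value. Battery devices may legitimately stay silent for longer. `stale_devices` is a hypothetical helper, and the Living_Room_TRV timestamp is invented for the example.

```python
from datetime import datetime, timedelta


def stale_devices(last_seen_by_device, now, max_silence=timedelta(minutes=30)):
    """Return the names of devices whose last_seen is older than max_silence."""
    stale = []
    for name, last_seen in last_seen_by_device.items():
        seen_at = datetime.fromisoformat(last_seen)  # accepts '+02:00' offsets
        if now - seen_at > max_silence:
            stale.append(name)
    return sorted(stale)


last_seen = {
    "Bedroom_TRV": "2023-11-17T18:24:30+02:00",      # value from the log above
    "Living_Room_TRV": "2023-11-17T17:10:00+02:00",  # invented value for illustration
}
now = datetime.fromisoformat("2023-11-17T18:40:00+02:00")

print(stale_devices(last_seen, now))  # ['Living_Room_TRV']
```

The result could drive a notification or a restart only when something is actually stuck, rather than restarting on a timer.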

Alfy1080 avatar Nov 19 '23 01:11 Alfy1080

I have reflashed my SkyConnect dongle and re-paired everything from scratch, just to rule out the possibility that the last firmware update was flashed improperly and corrupted my dongle, but the issue is still there. I have attached the latest debug log here. z2mlog.txt

Alfy1080 avatar Nov 20 '23 10:11 Alfy1080

EDIT3: 03/24/2024: still stable! 👍


EDIT2 03/15/2024: This seems to be getting solved. I'm now testing with Z2M 1.36.0-dev commit: 56feb77 'edge', and I updated the Sonoff Dongle-E's firmware to revision 7.4.1.0; results are here. With the EZSP driver it has been pretty stable for the first day.


EDIT: This seems to go unnoticed? I couldn't find a 'normal' way to downgrade.
So, I just restored HA from before Oct. 1st, but I had days of work to re-pair all zigbee devices, most of them needed to be re-paired multiple times.

It's now running Z2M v1.33.0 again, and as I expected, without any issue.


Initial issue post:

What happened?

Similar issue here, but with a "Sonoff Dongle Plus E". I'm aware of its experimental state, but it has run just fine without any issue for the last six months.

It started when the add-on was updated to v1.33.2, but it took a while before I noticed and realized that this v1.33.2 update might contain bugs affecting my setup. I checked my Wi-Fi and Zigbee channels: Wi-Fi 1 and Zigbee 20 should be fine. I can't find any useful information in the warnings or errors in the logs. I updated to Edge, but it results in exactly the same behaviour.

Sometimes restarting Z2M resolves the unresponsiveness; sometimes one device stays unresponsive. Quite often, 5 to 8 devices suddenly show an Offline status, but when I turn each of them on/off in the frontend, the status suddenly becomes online. I can't operate them as entities in the normal HA interface, only via the switches in the Z2M frontend, until the status is online again.

I hope my herdsman log will reveal something.

What did you expect to happen?

I expected my zigbee devices to respond as they should

How to reproduce it (minimal and precise)

Install update v1.33.2 should do the trick

Zigbee2MQTT version

v.1.33.2 and since nov. 18th: 1.33.2-dev commit: ad4bed8

Adapter firmware version

6.10.3.0 build 297

Adapter

Sonoff Zigbee 3.0 USB Dongle Plus ZBDongle-E (with 50cm extension cable, USB2)

Debug log

herdsman-log.txt

Device log

error-log.txt

Devices

TS011F: 12
TS0501A: 4
lumi.sensor_wleak.aq1: 4
lumi.sensor_magnet.aq2: 4
lumi.weather: 4
lumi.sensor_motion.aq2: 3
TRADFRIbulbE14WWclear250lm: 3
01MINIZB: 1
TS0201: 1
TS0601: 1
TRADFRI control outlet: 1
(Router devices: 21, End-devices: 17)

PeterKawa avatar Nov 20 '23 19:11 PeterKawa

I only recently switched from Conbee II / Deconz to Skyconnect / Z2M, so I can't judge whether the problem appeared only with recent versions of Z2M. The setup was more stable with Deconz (which I didn't expect)! I have also improved my Zigbee situation by putting the stick on a USB 2.0 hub with a 2m cable, away from the rest of the setup. Nevertheless, I have quite a lot of such occurrences: devices getting "out of sync" with the state Z2M thinks they are in, devices going offline, or devices silently dropping out of the network. Sometimes operating such devices from the frontend makes them available again, but this is not always recognized by Home Assistant. Sometimes they have to be re-paired. Some types of devices seem to be prone to this problem, e.g. the Aqara thermostats (or the problem is just more obvious with a heating schedule in autumn than with random lights or sensors). If this problem was, as you said, not present in previous versions, I hope that it is not a fundamental problem...

Ra72xx avatar Nov 21 '23 06:11 Ra72xx

UPDATE: My new Sonoff Zigbee Dongle-P arrived yesterday and I have replaced my Skyconnect with it and re-paired everything to it. So far since yesterday I had no devices that got stuck the same way as they did on the Skyconnect dongle. I only had an issue with some roller shade motors which was fixed by flashing the latest firmware on the sonoff dongle and re-pairing the motors in z2m.

So it seems to me that, at least in my case, the issue is caused by the combination of Skyconnect and this zigbee2mqtt version. As I've said previously, this issue only started happening recently, even though I have been running the same setup in terms of dongle, routers and client devices for over a year now.

Alfy1080 avatar Nov 21 '23 11:11 Alfy1080

That would be extremely annoying, as I only recently switched 70 devices from Deconz to Z2M/Skyconnect to get rid of intermittent instability, and now I have seemingly traded that in for persistent instability. (Yes, I know, Skyconnect is not yet officially supported...). EDIT: Something similar here: #19648

Ra72xx avatar Nov 21 '23 12:11 Ra72xx

I have the same issue, using Sonoff Zigbee Dongle-E

Devices affected:
TS110E_1gang_1
TS110E_2gang_1
TS0001_power
TS0001_switch_module

Randomly stops responding to commands but seems to be reporting state. A restart of zigbee2mqtt temporarily resolves it

Aleborg avatar Nov 21 '23 20:11 Aleborg

I tried to change QoS, but this doesn't help (however, as QoS is only between MQTT and Z2M, I admit I did not really expect anything from this setting).

Ra72xx avatar Nov 22 '23 04:11 Ra72xx

Maybe related discussion: https://github.com/Koenkk/zigbee2mqtt/discussions/19763

Ra72xx avatar Nov 22 '23 04:11 Ra72xx

I experience the exact same problem after updating to the same Z2M version. I also use a Skyconnect as coordinator, and I am also having troubles with my Aqara TRV.

A restart of Z2M fixes the issue.

JonasSL avatar Nov 23 '23 09:11 JonasSL

Having a similar issue, randomly one of my devices would not respond to commands coming from Z2M, results in a timeout error: https://github.com/Koenkk/zigbee2mqtt/issues/13993#issuecomment-1824743866 - coordinator is the highly praised SLZB-0 (based on stable chipset CC2652P recommended by Z2M, not the experimental EFR32MG21 chipset).

helgek avatar Nov 23 '23 17:11 helgek

Hey everyone! I have the same issue. The devices (in my case only IKEA LED2005R5 (3x)) randomly disconnect and show up as unavailable - and offline in z2m. However those devices still react to commands to a zigbee group that they are placed in. So, while I cannot control the devices directly, I can control them via a zigbee group, which is super weird to me.

Can anyone recreate this?

Thanks & Cheers

Zigbee2MQTT version
1.33.2 commit: unknown
Coordinator type
EZSP v9
Coordinator revision
7.1.1.0 build 273
Frontend version
0.6.142
Zigbee-herdsman-converters version
15.106.0
Zigbee-herdsman version
0.21.0

Home Assistant Sky Connect as a coordinator.

lennartgrunau avatar Nov 26 '23 18:11 lennartgrunau

Is there any way to push issues like this? The random communication loss is something which is a complete deal breaker for me, as I can no longer rely on my Zigbee network for lights, radiators and switches of all kind. Seemingly every day I have to count my devices and check the thermostats if they are really operating etc.

However this and quite a few other bug reports concerning similar problems I've been monitoring on this list for the last few days don't seem to get any developer attention and are probably simply getting lost because of newer issues.

IMHO a network problem that results in losing communication with network members without even notifying the user, while pretending everything is fine, is one of the worst things that can happen.

Ra72xx avatar Nov 27 '23 05:11 Ra72xx

I experienced the same with my setup (Aqara thermostats & Sonoff Dongle Plus) running on 1.33.2.

So a downgrade to 1.33.0 fixes the random unresponsiveness to commands? I'm mostly a "fail forward" person, so is there anything I can do except downgrading? Looking into the logs, I see some warnings. I don't know if they are related:

warn  2023-11-27 08:56:15: zigbee-herdsman-converters:aqara_trv: Unknown key 641 =4��44
warn  2023-11-27 08:56:15: zigbee-herdsman-converters:aqara_trv: Unknown key 643 = 0
warn  2023-11-27 08:56:15: zigbee-herdsman-converters:aqara_trv: Unknown key 644 = 0

digitalkaoz avatar Nov 27 '23 08:11 digitalkaoz

I also see those messages with the strange keys sometimes, but not regularly.

Ra72xx avatar Nov 27 '23 15:11 Ra72xx

I am in on this too. I just moved from separate Docker HA / Z2M containers to HAOS with the Z2M add-on under Proxmox.

I also switched from ConbeeII to ZBDongle (EZSP v12) with 7.3.1.0 build 176.

An Ikea lamp which worked fine for at least one year now randomly becomes unavailable. Toggling its state in the Z2M GUI revives it, though. Sometimes it also comes back on its own.

trackhacs avatar Nov 30 '23 20:11 trackhacs

Unfortunately, nobody's interested in the problem :-(

Ra72xx avatar Dec 01 '23 04:12 Ra72xx

@digitalkaoz wrote:

So a downgrade to 1.33.0 fixes the random unresponsiveness to commands?

In my case it did, Robert. It works flawlessly, as it did before, with no issues at all, running v1.33.0.

@Ra72xx Hmmm... The update's log v1.34.0-1 of today Dec. 1st, shows nothing about this issue or a fix for it (yet). I'm not updating for sure.

PeterKawa avatar Dec 01 '23 15:12 PeterKawa

I just updated. I will report back on whether the behavior has improved.

trackhacs avatar Dec 01 '23 18:12 trackhacs

I updated earlier today, no difference, would say that it got a little worse...

Aleborg avatar Dec 01 '23 18:12 Aleborg

Had the same problem when I updated to v1.34.0-1 from 1.33.0-1. My Aqara TRVs would not respond to anything, while my Ikea Tradfri lights would still work. I'm using the Sonoff Dongle Plus V2. Restoring my backup of v1.33.0-1 makes them respond again.

xEcho1 avatar Dec 02 '23 11:12 xEcho1

Yes, I can unfortunately confirm: it is better with the update, but it still happens and is still a problem.

Zigbee2MQTT version
[1.34.0](https://github.com/Koenkk/zigbee2mqtt/releases/tag/1.34.0) commit: [unknown](https://github.com/Koenkk/zigbee2mqtt/commit/unknown)
Coordinator type
EZSP v12
Coordinator revision
7.3.1.0 build 176
Coordinator IEEE Address
0xe0798dfffe741678
Frontend version
0.6.147
Zigbee-herdsman-converters version
15.130.1
Zigbee-herdsman version
0.25.0

trackhacs avatar Dec 04 '23 10:12 trackhacs

Has anybody tried simply downgrading to the Docker container 1.33.0 and keeping the database? I have no backup to restore...

Ra72xx avatar Dec 04 '23 10:12 Ra72xx

@Ra72xx that didn't work for me, I had to re-pair all devices

digitalkaoz avatar Dec 04 '23 11:12 digitalkaoz

So we have to wait until somebody takes note of this problem (which seemingly doesn't happen) and fixes it going forward. If I have to re-pair everything, there is some temptation to give ZHA a chance... I started using Z2M only a few weeks ago instead of Deconz, and this problem is the famous first impression :-(.

Ra72xx avatar Dec 04 '23 11:12 Ra72xx

@Ra72xx Bad timing, I guess... As I wrote in my original post, Z2M has run rock solid here on v1.33.0 and older versions since last April. v1.33.2 is the first crap update I have encountered. Since the add-on is maintained by volunteers, we can't demand anything, but I agree it is pretty odd that there has been no response whatsoever yet. Cheers.

PeterKawa avatar Dec 07 '23 04:12 PeterKawa

I'd like to provide logs of any kind in order to help. However, the problem occurs too often, but not often enough to pinpoint the exact time when to look at the logs... Unfortunately, this bug report is by now buried far back in the Github issue list, and similar newer issues don't get feedback, too.

Ra72xx avatar Dec 07 '23 17:12 Ra72xx

JFYI: I was able to downgrade zigbee2mqtt from 1.34.0-1 to 1.33.0-1 by

  1. Stop the add-on, disable autostart, auto-update & watchdog
  2. In terminal, delete the current zigbee2mqtt container
  3. Remove the current 1.34.* image
  4. Pull the 1.33.0-1 image
  5. Retag the 1.33 image to 1.34
  6. Start the addon via home assistant

Z2M reports 1.33 in its about tab.

With my Nabu Casa Sky Connect, it even reconnected all devices and I didn't need to re-pair anything.

Of course: if you attempt this super dirty hack, please make a backup first and try it at your own risk. All complaints about a failed downgrade may be sent to /dev/null :)
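For reference, steps 2 to 5 above map onto four standard Docker CLI commands (`docker rm`, `docker rmi`, `docker pull`, `docker tag`). The Python sketch below only builds and prints them as a dry run; it executes nothing. The container name and image repository are placeholders, since the real ones were not given in this thread and depend on your installation, and it is an assumption that the image tags match the add-on versions 1.34.0-1 and 1.33.0-1.

```python
def downgrade_commands(container, image_repo, current_tag, target_tag):
    """Build the docker commands for steps 2-5 of the list above (dry run only)."""
    current = f"{image_repo}:{current_tag}"
    target = f"{image_repo}:{target_tag}"
    return [
        ["docker", "rm", container],         # step 2: delete the current container
        ["docker", "rmi", current],          # step 3: remove the current image
        ["docker", "pull", target],          # step 4: pull the older image
        ["docker", "tag", target, current],  # step 5: retag old image as current
    ]


# Placeholder names: look up the real container and image on your own system.
commands = downgrade_commands(
    container="addon_example_zigbee2mqtt",
    image_repo="example.invalid/zigbee2mqtt-addon",
    current_tag="1.34.0-1",
    target_tag="1.33.0-1",
)

for cmd in commands:
    print(" ".join(cmd))
```

Steps 1 and 6 (stopping and starting the add-on, disabling auto-update and watchdog) are done in the Home Assistant interface and are not covered here.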

lennartgrunau avatar Dec 07 '23 20:12 lennartgrunau