zigbee2mqtt icon indicating copy to clipboard operation
zigbee2mqtt copied to clipboard

All devices gone after power outage

Open Kodemikkel opened this issue 3 years ago • 54 comments

What happened?

After waking up from a night's sleep, I noticed in HA that some devices were unavailable. After some more looking around, I found all my Z2M devices were unavailable.

Heading over to my Z2M docker container running on an Unraid host, I saw none of my devices were listed anywhere, and "Devices", "Dashboard" and "Map" were all blank, although my settings were still the same. I do suspect that there was a power outage while I was sleeping, as all the lights had turned on and after booting my PC, there were some indications that it had lost power. After power cycling a device I had nearby and setting Z2M to allow joining, the device appeared as previously configured without needing to change anything manually.

Now I have power cycled/reset all my devices, and they are all fully functional as they were before. What is weird is that I have, on several occasions, unintentionally, removed power from my running Unraid host without any issues at all. This leads me to my questions:

  • Does anyone know why this happened, especially when the host has lost power on several occasions earlier without any issues?
  • Has anyone else experienced this issue before?
  • And if so, how did you fix it?

I am lucky to only have about 30 devices on my network, and resetting them doesn't take that long, although it is still boring and tedious. I could only imagine how it would be for someone with 100s of devices on their network.

What did you expect to happen?

Not losing all my devices after a power outage.

How to reproduce it (minimal and precise)

No idea. As mentioned my host has lost power several times earlier without issues.

Zigbee2MQTT version

1.23.0-dev commit: afe94a7

Adapter firmware version

0x26720700

Adapter

ConBee2

Debug log

07MAR22 11:20:18.txt 08MAR22 08:30:21.txt

The first log is dated 07MAR22 11:20:18 and I assume the last line is right before the power is lost. (I can't find any useful information in this) The second log is dated 08MAR22 08:30:21 and I assume it would be from when the power came back. The shutdown at the end of this log is me trying to restart it.

Kodemikkel avatar Mar 08 '22 11:03 Kodemikkel

If having one system in production with light then user real Zigbee light switches and binding them to the Zigbee Light groups so they is always working if the host system or internal internet is having problem. Implanting light "HA way" you can always getting problem and all is not working and with Light groups its only one device that is falling and not 100% of the system.

The reason you need repower the device and having joining enable is that the coordinator is have its frame counter for the network key is out off sync or the system have restored one old backup of its after coming back after the power problem => all devices in the network is blocking then they thing its one replay attack (normal Zigbee security).

MattWestb avatar Mar 08 '22 19:03 MattWestb

But how come if I manually cut the power to the host, everything works fine? I've never had that issue before when my host loses and restores power.

And also, HA does not really have anything to do with this, as the Z2M docker is running completely separately and it was the Z2M docker that had the issue.

Edit: Added some more information in the reply.

Kodemikkel avatar Mar 08 '22 20:03 Kodemikkel

I had something similar happen to my docker-based instance 3 days ago after rebooting my server. Zigbee2MQTT version 1.24.0-dev (commit #c49f546) zigbee-herdsman (0.14.20)

twsl avatar Mar 11 '22 10:03 twsl

Had the same issue today :(

Really a pitty that the system is not self-healing.

Zigbee2MQTT version
1.24.0-dev commit: [f7c6207](https://github.com/Koenkk/zigbee2mqtt/commit/f7c6207)
Coordinator type
ConBee2/RaspBee2
Coordinator revision
0x26720700
Coordinator IEEE Address
0x00212effff0656b9
Frontend version
0.6.77

eloo avatar Mar 14 '22 12:03 eloo

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] avatar Apr 14 '22 00:04 github-actions[bot]

Same issue today. Power loss on both Z2M and lights / devices and a lot of devices gone when Z2M reboots. I have manually cut the power before without consecuences.

adelaiglesia avatar Apr 14 '22 01:04 adelaiglesia

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] avatar May 15 '22 00:05 github-actions[bot]

is this really stale? afaik there was no fix yet?

eloo avatar May 16 '22 15:05 eloo

just had the same issue. Almost all devices gone after power outage, only 2 were present in z2m, both Hue motion sensors, the rest of 20 devices gone. I managed to get most of them back by resetting them but I have 4 Philips Hue outdoor lights that don't want to rejoin by themselves and I can't reset them. I'll need to remove them from the wall, get the serial number, add them to Philips Hue app and I think I'll keep them there, don't want to get the ladder out if a power outage happens again.

Using zzh stick. Also tried updating to latest coordinator firmware and keeping "Allow join" on all the time with the hope that some devices will rejoin by themselves but those 4 lights never did. Some IKEA buttons did rejoin when I pressed them but not all.

mihaiblaga89 avatar Jun 03 '22 22:06 mihaiblaga89

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] avatar Jul 04 '22 00:07 github-actions[bot]

AFAIK not stale

eloo avatar Jul 04 '22 08:07 eloo

What do you mean with "gone"?

  • Are the devices missing from the z2m frontend OR
  • Are non of the devices controllable?

Koenkk avatar Jul 04 '22 15:07 Koenkk

What do you mean with "gone"?

  • Are the devices missing from the z2m frontend OR
  • Are non of the devices controllable?

Both

Kodemikkel avatar Jul 04 '22 15:07 Kodemikkel

Yep for me it was also the same.. nearly all devices were not visible in Z2M frontend nor controllable in Homeassistent. I had also need to repair them.

Maybe there is some unwritten state in Z2M which creates an inconsistency when the Z2M system is killed hardly?

eloo avatar Jul 05 '22 07:07 eloo

Did something special happen around during crash? Z2M won't just empty its data/database.db file by itself

Koenkk avatar Jul 05 '22 14:07 Koenkk

hmm i can not remember any special expect the power outage..

just checked what the database.db is.. and afaik this is just a json file? maybe this file just gets corrupted from time to time?

as no proper database is used the "database.db" is lacking corruption prevention. Maybe it would make sense to use something more robust here like a sqlite database?

further the last_seen is also stored in this file. so this file is going to have a lot write operations which could lead to a corruption while a power outage

eloo avatar Jul 05 '22 18:07 eloo

What do you mean with "gone"?

  • Are the devices missing from the z2m frontend OR
  • Are non of the devices controllable?

Hi, sorry for the lack of clarity in my last response. The devices were present in z2m frontend but unreachable (all of them, 91 devices). None of the devices or groups were controllable. Coordinator were reflashed with same version but with no effect. Repairing all devices was necessary.

Just for your info, 80/91 devices are power line operated (not battery devices). Just in case that helps. I'm going to search if i have Database.db to share it in this conversation.

The workaround i have deployed is to connect Zigbee2mqtt machine to an UPS 🤣. At this time i think that z2m was writting just in the right moment and got corrupted. Only if i find the file we will know.

Thank you for your time

adelaiglesia avatar Jul 05 '22 18:07 adelaiglesia

@eloo

as no proper database is used the "database.db" is lacking corruption prevention.

there is some corruption prevention, the db is first written to a temp path and then renamed (https://github.com/Koenkk/zigbee-herdsman/blob/f1c6a3887e9d7a763e9ec981543881716c75c5ff/src/controller/database.ts#L75). I agree that sqlite may be a better option but its also more complicated (and we are not sure yet this causes the issue).

further the last_seen is also stored in this file. so this file is going to have a lot write operations which could lead to a corruption while a power outage

not every last_seen state will rewrite the db, this is done occasionally

@adelaiglesia what did you see in the log when sending messages to the devices?

Koenkk avatar Jul 06 '22 15:07 Koenkk

I don't know if it's the same but I describe my situation: stop ( maybe ungraceful ) of the container then all sensor are still paired but they receive no signal. I have only aquara devices: windows, temperature and water. Only the windows sensors when toggled works again; for the temperature/water sensors to work again I need to press the button ( not re-pair )

hitokiri8x avatar Jul 26 '22 20:07 hitokiri8x

I just had the same problem after a short power outage Z2M no longer showed any devices in the web interface, however the log looked normal. The new database.db only contained 3 lines, 1 for the coordinator and 2 empty groups:

{"id":1,"type":"Coordinator","ieeeAddr":"0x...." ... }
{"id":2,"type":"Group","groupID":1,"members":[],"meta":{}}
{"id":3,"type":"Group","groupID":2,"members":[],"meta":{}}

All devices were gone. Luckily I was able to restore a backup from Home Assistant which contained the database.db. After restoring and restarting the addon all worked again no need for a lengthy repairing of all devices.

tripplet avatar Aug 17 '22 12:08 tripplet

@tripplet okay.. that makes is more clear that the problem seems to be related to the database.db as restoring will fix it.

@Koenkk maybe as a quickfix the database.db can be duplicated every time? so maybe the old version will be just renamed with .bak or something like this? so we can easily restore every time

eloo avatar Aug 17 '22 13:08 eloo

I will check if I can come up with an easy recovery solution. Something like:

on save of db:

  • copy old db to something like database.db.bak as you suggested
  • write db to temp file with a closing mark at the end
  • copy db from temp file to database.db (if this only completes partially we get this issue)
  • if z2m starts next time it will check if it can find the closing mark such that it knows the db is complete, if not it will use the database.db.bak file if present

Koenkk avatar Aug 17 '22 16:08 Koenkk

@Koenkk sounds like a good solution. i also like the idea of the self healing check šŸ‘

eloo avatar Aug 17 '22 19:08 eloo

Checking the code again, I do not think corruption is the issue here.

  • Z2M writes the new database to database.db.tmp (this is kinda the backup already)
  • Z2M then does a rename call (database.db.tmp -> database.db)

Code: https://github.com/Koenkk/zigbee-herdsman/blob/6d43a67eefc5dd97641014c11a5295c3b6e518be/src/controller/database.ts#L82

If I read https://stackoverflow.com/questions/7054844/is-rename-atomic:

But rename() is still atomic in a very important sense: if you use it to overwrite a file, then you will end up with either the old or the new version and nothing else.

So either the old db is there or the new one, at least not a partial one like in https://github.com/Koenkk/zigbee2mqtt/issues/11759?notification_referrer_id=NT_kwDOACwkNbIzMjMzOTcyNjQ0OjI4OTI4NTM&notifications_after=Y3Vyc29yOjI1#issuecomment-1217976444 .

I think the bug here is that z2m clears the database for some reason. But to debug this I need the herdsman debug logging of the run where the db is cleared

See https://www.zigbee2mqtt.io/guide/usage/debug.html on how to enable the herdsman debug logging. Note that this is only logged to STDOUT and not to log files.

Koenkk avatar Aug 18 '22 16:08 Koenkk

Interesting seems to be a very rare problem as I have been running Z2M for years now without problems on two system and this was the first time it appeared. Admittedly there were only a handful times where there was a power outage or I had to hard reboot the system.

Could it be that the previous write in https://github.com/Koenkk/zigbee-herdsman/blob/6d43a67eefc5dd97641014c11a5295c3b6e518be/src/controller/database.ts#L82 was not synced to disk (still in some write cache) but the rename made it to the disk journal before the power outage as it is atomic. See https://stackoverflow.com/a/22823144 Not quite sure how this is handled in filesystems :shrug: (For reference I’m using XFS).

Maybe a https://nodejs.org/api/fs.html#fsfsyncfd-callback is necessary before the rename is executed? Not sure if this would slow down some execution?

So either the old db is there or the new one, at least not a partial one like in ...

The new database.db looked syntactically Ok It was like Z2M decided for some reason start from scratch, but no idea why, could be because of a corrupted file but might have other causes.

With a large enough user base even the rares errors start to appear at some point :smile: Thanks for looking into the problem

tripplet avatar Aug 18 '22 19:08 tripplet

@Koenkk okay.. but maybe having a persistent backup would also be a good a idea so we could restore our instances if the bug happens agains. at least renaming a file is much easier than repair all devices :D

Note that this is only logged to STDOUT and not to log files.

i guess is going to be problem.. as we see the issue when an power outage happens. then the stdout is also lost

eloo avatar Aug 19 '22 06:08 eloo

@eloo in that case there should be the database.db.tmp file

i guess is going to be problem.. as we see the issue when an power outage happens. then the stdout is also lost

With this I mean the logging of the startup after the power outage (not before)

Koenkk avatar Aug 19 '22 18:08 Koenkk

Same issue just happened to me. running unraid with z2m docker. I had to hard reboot my server and lost all my devices. DB is empty and z2m is showing 0 devices connected.

markmghali avatar Aug 24 '22 19:08 markmghali

I just ran into the same issue. I'm also running the zigbee2mqtt docker container on UnRaid as the person above me and a couple other people (don't want to unnecessary tag anyone) have also mentioned.

I actually only recently seen this behavior (a couple times now), since I started to use the CC2652RB instead of the CC2531 which I switched to about a month ago. I'm not sure if that is related at all.

The last line in my log before the outage is the following (so nothing out of the ordinary):

debug 2022-09-10 04:18:55: Saving state to file /app/data/state.json

After the outage I noticed that zigbee2mqtt wasn't running at all, because I had renamed a device in the UI yesterday, but forgot to adjust its name in the availability_passlist, so it didn't run for about 5 hours. Now obviously this isn't related to the issue at all, this was just my mistake, but maybe the length of the outage could be related (I honestly have no idea, just putting this out there).

After I changed its name in the availability_passlist in my configuration.yaml zigbee2mqtt booted up as expected, but no devices were to be found. There is something interesting in the log though:

Using '/app/data' as data directory
Zigbee2MQTT:debug 2022-09-10 09:35:53: Loaded state from file /app/data/state.json
Zigbee2MQTT:info  2022-09-10 09:35:53: Logging to console and directory: '/app/data/log/2022-09-10.09-35-53' filename: log.txt
Zigbee2MQTT:debug 2022-09-10 09:35:53: Removing old log directory '/app/data/log/2022-09-03.15-12-32'
Zigbee2MQTT:info  2022-09-10 09:35:54: Starting Zigbee2MQTT version 1.27.2 (commit #7dc48fb)
Zigbee2MQTT:info  2022-09-10 09:35:54: Starting zigbee-herdsman (0.14.53)
Zigbee2MQTT:debug 2022-09-10 09:35:54: Using zigbee-herdsman with settings: '{"adapter":{"concurrent":null,"delay":null,"disableLED":false},"backupPath":"/app/data/coordinator_backup.json","databaseBackupPath":"/app/data/database.db.backup","databasePath":"/app/data/database.db","network":{"channelList":[11],"extendedPanID":FOO,"networkKey":"HIDDEN","panID":BAR},"serialPort":{"path":"/dev/serial/by-id/usb-Silicon_Labs_slae.sh_cc2652rb_stick_-_slaesh_s_iot_stuff_00_12_4B_00_25_9B_66_57-if00-port0"}}'
Zigbee2MQTT:info  2022-09-10 09:36:04: zigbee-herdsman started (resumed)
Zigbee2MQTT:info  2022-09-10 09:36:04: Coordinator firmware version: '{"meta":{"maintrel":1,"majorrel":2,"minorrel":7,"product":1,"revision":20220219,"transportrev":2},"type":"zStack3x0"}'
Zigbee2MQTT:debug 2022-09-10 09:36:04: Zigbee network parameters: {"channel":11,"extendedPanID":"FOO","panID":BAR}
Zigbee2MQTT:info  2022-09-10 09:36:04: Currently 0 devices are joined:
Zigbee2MQTT:warn  2022-09-10 09:36:04: `permit_join` set to  `true` in configuration.yaml.
Zigbee2MQTT:warn  2022-09-10 09:36:04: Allowing new devices to join.
Zigbee2MQTT:warn  2022-09-10 09:36:04: Set `permit_join` to `false` once you joined all devices.
Zigbee2MQTT:info  2022-09-10 09:36:04: Zigbee: allowing new devices to join.
Zigbee2MQTT:info  2022-09-10 09:36:05: Connecting to MQTT server at mqtt://192.168.1.10:1883
Zigbee2MQTT:debug 2022-09-10 09:36:05: Using MQTT anonymous login
Zigbee2MQTT:info  2022-09-10 09:36:05: Connected to MQTT server
Zigbee2MQTT:info  2022-09-10 09:36:05: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"online"}'
Zigbee2MQTT:info  2022-09-10 09:36:05: Started frontend on port 0.0.0.0:9442
Zigbee2MQTT:info  2022-09-10 09:36:05: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"online"}'
Zigbee2MQTT:debug 2022-09-10 09:40:53: Saving state to file /app/data/state.json

It mentions this file: /app/data/database.db.backup as the databaseBackupPath which isn't there at all. The coordinator_backup.json also mentioned in the log does exist however.

After the startup my database.db looked like the following:

{
  "id": 1,
  "type": "Coordinator",
  "ieeeAddr": "0x00124b00259b6657",
  "nwkAddr": 0,
  "manufId": 0,
  "epList": [
    2,
    1
  ],
  "endpoints": {
    "1": {
      "profId": 260,
      "epId": 1,
      "devId": 5,
      "inClusterList": [],
      "outClusterList": [],
      "clusters": {},
      "binds": [],
      "configuredReportings": [],
      "meta": {}
    },
    "2": {
      "profId": 257,
      "epId": 2,
      "devId": 5,
      "inClusterList": [],
      "outClusterList": [],
      "clusters": {},
      "binds": [],
      "configuredReportings": [],
      "meta": {}
    }
  },
  "interviewCompleted": true,
  "meta": {},
  "lastSeen": null,
  "defaultSendRequestWhen": "immediate"
}

Note: the epList and endpoints contain more, but I removed them to not make it too long.

Also... there is no database.db.tmp at all. It would've been helpful if there was an actual backup file (e.g. database.db.backup as mentioned in the log or the database.db.tmp file as mentioned in this thread), which I could simply rename as some people have mentioned previously.

I now have to pair all of my devices again which is the third time in about a month (since I've gotten the CC2652RB). I might even wait for a possible fix to do that as it is quite annoying to get to some of the devices.

kevintuhumury avatar Sep 10 '22 08:09 kevintuhumury

Actually, this just happened to me after updating from 1.26 to 1.27.2, using Docker containers and no power outage or reboots/hangs/anything else.

In fact, server is connected to UPS so this isn't even a possibility.

Somehow, it just happened with two devices at this home (Tuya WHD02/TS0001)

But it happened to another home where I have more Zigbee devices when updating from 1.25 to 1.26, one dimmer was working just perfect and right after the update, it was just gone, as said, with no power cuts or anything else.

I was thinking about some kind of syncing/changes made on the DB at the moment of updating the Docker container. :shrug:

joaquinvacas avatar Sep 16 '22 21:09 joaquinvacas