zigbee2mqtt icon indicating copy to clipboard operation
zigbee2mqtt copied to clipboard

Zigbee2MQTT service stops working - FATAL ERROR

Open pauloon opened this issue 2 years ago • 99 comments

What happened?

After last update I've noticed Z2M is stopping service with a fatal error, out of the blue.

I'm using HASS.IO all updated, in a i5 machine with 8 Gb do memory and 128 GB SSD.

What did you expect to happen?

No response

How to reproduce it (minimal and precise)

Just leave it running. After one or two days it stops working (service drops).

Zigbee2MQTT version

1.28.2 commit: unknown

Adapter firmware version

20220219

Adapter

SONOFF USB Dongle

Debug log

I did not have the debug log active when this happened. Posting normal log. ... Zigbee2MQTT:info 2022-11-07 11:45:15: MQTT publish: topic 'zigbee2mqtt/Smart Plug 15', payload '{"child_lock":"UNLOCK","current":0.04,"energy":6.45,"indicator_mode":"off/on","last_seen":"2022-11-07T11:45:13-03:00","linkquality":102,"power":0,"power_outage_memory":"restore","state":"ON","update":{"state":"idle"},"update_available":false}' <--- Last few GCs ---> [7:0x7fa21c7993c0] 61013533 ms: Mark-sweep 2044.2 (2085.3) -> 2042.2 (2085.3) MB, 2087.8 / 0.0 ms (average mu = 0.133, current mu = 0.010) allocation failure scavenge might not succeed [7:0x7fa21c7993c0] 61015639 ms: Mark-sweep 2044.3 (2085.3) -> 2042.2 (2085.3) MB, 2082.8 / 0.0 ms (average mu = 0.074, current mu = 0.011) allocation failure scavenge might not succeed <--- JS stacktrace ---> FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

pauloon avatar Nov 07 '22 15:11 pauloon

Hi,

Just had the same problem again. Attached the log file: log.txt

Please, help. Thanks, Paulo.

pauloon avatar Nov 08 '22 12:11 pauloon

I reinstalled Z2M all from zero and today it happened again:

2022-11-09T17:11:10.996Z zigbee-herdsman:adapter:zStack:znp:AREQ <-- AF - incomingMsg - {"groupid":0,"clusterid":61184,"srcaddr":55137,"srcendpoint":1,"dstendpoint":1,"wasbroadcast":0,"linkquality":61,"securityuse":0,"timestamp":11657102,"transseqnumber":0,"len":10,"data":{"type":"Buffer","data":[9,72,2,0,193,1,4,0,1,0]}} 2022-11-09T17:11:10.996Z zigbee-herdsman:controller:log Received 'zcl' data '{"frame":{"Header":{"frameControl":{"frameType":1,"manufacturerSpecific":false,"direction":1,"disableDefaultResponse":false,"reservedBits":0},"transactionSequenceNumber":72,"manufacturerCode":null,"commandIdentifier":2},"Payload":{"seq":49408,"dpValues":[{"dp":1,"datatype":4,"data":{"type":"Buffer","data":[0]}}]},"Command":{"ID":2,"parameters":[{"name":"seq","type":33},{"name":"dpValues","type":1011}],"name":"dataReport"}},"address":55137,"endpoint":1,"linkquality":61,"groupID":0,"wasBroadcast":false,"destinationEndpoint":1}' Zigbee2MQTT:debug 2022-11-09 14:11:10: Received Zigbee message from 'Presença 2', type 'commandDataReport', cluster 'manuSpecificTuya', data '{"dpValues":[{"data":{"data":[0],"type":"Buffer"},"datatype":4,"dp":1}],"seq":49408}' from endpoint 1 with groupID 0 <--- Last few GCs ---> [7:0x7f45b57b73c0] 84373002 ms: Scavenge (reduce) 2042.8 (2083.1) -> 2041.9 (2083.1) MB, 2.5 / 0.0 ms (average mu = 0.105, current mu = 0.019) allocation failure [7:0x7f45b57b73c0] 84373008 ms: Scavenge (reduce) 2043.0 (2083.1) -> 2042.1 (2083.1) MB, 2.7 / 0.0 ms (average mu = 0.105, current mu = 0.019) allocation failure [7:0x7f45b57b73c0] 84373015 ms: Scavenge (reduce) 2043.0 (2083.1) -> 2042.1 (2083.3) MB, 2.6 / 0.0 ms (average mu = 0.105, current mu = 0.019) allocation failure <--- JS stacktrace ---> FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

@Koenkk ... can you, please, help?

Thank you. Paulo.

pauloon avatar Nov 09 '22 17:11 pauloon

What node version are you using?

Koenkk avatar Nov 09 '22 18:11 Koenkk

What node version are you using?

Thanks for helping me. Not sure what you mean by "node". Sorry for my ignorance. Are you talking about NodeRed ? I'm using version 13.5.2.

Follows attached a diagnostic file with some info about my system. config_entry-mqtt-5f9bbf81bfa29cb6923246ba774884c4.json.txt

Please, let me know.

Thanks, Paulo.

pauloon avatar Nov 09 '22 18:11 pauloon

I mean node.js (try running node --version on your system)

Koenkk avatar Nov 10 '22 16:11 Koenkk

I mean node.js (try running node --version on your system)

Hi @Koenkk , I tried and it does not work: image

Maybe because I'm using Hass.io and NodeRed?

Thanks, Paulo.

pauloon avatar Nov 11 '22 13:11 pauloon

how are you running z2m, via the addon?

Koenkk avatar Nov 11 '22 16:11 Koenkk

how are you running z2m, via the addon?

Dear @Koenkk , Yes, I've installed it via Add-On. image

See if this helps... image

Thanks, Paulo.

pauloon avatar Nov 11 '22 16:11 pauloon

If I turn off the watchdog it stops after a few hours. Just happened again... We are in a group of more than 2.000 HA users and more people are getting this too. Looks like Hass.io users...

pauloon avatar Nov 11 '22 17:11 pauloon

@Koenkk .... Look! With help from some friends, i think I got the node.js version: image

Will wait for your instructions.

Thanks, Paulo.

pauloon avatar Nov 11 '22 18:11 pauloon

Can you provide the herdsman debug logging, from starting z2m till the error?

Koenkk avatar Nov 12 '22 08:11 Koenkk

Hi @Koenkk ,

Of course. Is that the one I get from Settings > System > Logs > Z2M, copy and save? image

Is there a way to access former logs? I'm waiting for it to crash again to get the log.

Thanks, Paulo.

pauloon avatar Nov 12 '22 16:11 pauloon

I have not seen this screen before, how big is the log? (does it include everything from start)

Koenkk avatar Nov 13 '22 08:11 Koenkk

I have not seen this screen before, how big is the log? (does it include everything from start)

Dear @Koenkk , This is the screen I mentioned before. I'm not sure if it includes all of it. I just grabbed all of its contents so you can see: log.txt Please let me know if there are any other ways to get the logs. I could get the logs from the former crashes.

I'm still waiting so it crashes again but since I'm not home these days maybe it "cooled down" a little.

Thanks, Paulo.

pauloon avatar Nov 13 '22 13:11 pauloon

These are not the herdsman debug logging. (they contain zigbee-herdsman like in https://github.com/Koenkk/zigbee2mqtt/issues/14853#issuecomment-1312524794).

See https://www.zigbee2mqtt.io/guide/usage/debug.html#home-assistant-os-supervised-addon how to get the logging.

Koenkk avatar Nov 13 '22 14:11 Koenkk

These are not the herdsman debug logging. (they contain zigbee-herdsman like in #14853 (comment)).

See https://www.zigbee2mqtt.io/guide/usage/debug.html#home-assistant-os-supervised-addon how to get the logging.

Dear @Koenkk ,

I GOT ANOTHER CRASH! I copied the entire log: log2.txt

Please, let me know.

Thanks, Paulo.

pauloon avatar Nov 13 '22 18:11 pauloon

can you turn off Espelho 2 4gang and see if the error still occurs?

Koenkk avatar Nov 14 '22 19:11 Koenkk

can you turn off Espelho 2 4gang and see if the error still occurs?

I will remove it and restart Z2M, ok?

Thanks, Paulo.

pauloon avatar Nov 14 '22 20:11 pauloon

Ok 👍

Koenkk avatar Nov 15 '22 18:11 Koenkk

I seem to have the same issue here (I opened an issue at https://github.com/Koenkk/zigbee2mqtt/issues/14992 before finding this one). I'm also using node version v16.17.1, so perhaps it has something to do with that indeed? I'm still trying to find the logs.

By the way, for me it happens much more often, sometimes 4 to 5 times on one evening.

XanderTenBoden avatar Nov 15 '22 20:11 XanderTenBoden

Ok 👍

I got another crash today, so I guess it is not "Espelho 2 4gang" fault. Now I reseted it again and I'm testing one more.

pauloon avatar Nov 15 '22 21:11 pauloon

Mine crashed as well. The Z2M logs gave me the following error:

Error: SRSP - AF - dataRequestExt after 6000ms
    at Timeout._onTimeout (/app/node_modules/zigbee-herdsman/src/utils/waitress.ts:64:35)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7)

These are not the herdsman debug logging. (they contain zigbee-herdsman like in #14853 (comment)).

See https://www.zigbee2mqtt.io/guide/usage/debug.html#home-assistant-os-supervised-addon how to get the logging.

I am not able to retrieve the herdsman logs using this method, because the docker container just seems to crash in my case. So when I run docker ps, there isn't a z2m one running after the crash.

XanderTenBoden avatar Nov 16 '22 08:11 XanderTenBoden

Dear @Koenkk ,

After I removed the Espelho 2 4gang, it is still happening. Follows the last log extracted: log4.txt

Please, let me know.

Thanks, Paulo.

pauloon avatar Nov 16 '22 21:11 pauloon

@pauloon does the docker container also completely stop in your case?

XanderTenBoden avatar Nov 16 '22 21:11 XanderTenBoden

@pauloon does the docker container also completely stop in your case?

Hi @XanderTenBoden I think the docker is crashing too. One time that I ran docker ps and didn't see it. I'm going to let it happen again, and check to make sure.

Thanks, Paulo.

pauloon avatar Nov 16 '22 23:11 pauloon

@pauloon What I did is turning off the watchdog and then wait for it to stop working. As soon as that did, all my zigbee devices in lovelace turned to "unavailable", and the zigbee2mqtt add-on wasn't running anymore.

It seems like the issue has something to do with the z2m docker eating too much RAM. I've enabled the memory sensors and this is what I'm seeing in the history:

https://tenboden.com/ram.png

I do have a lot of devices connected, about 90. How many do you have? It might explain why the crashes are more frequently for me if you have less than that connected? 🤔

XanderTenBoden avatar Nov 17 '22 08:11 XanderTenBoden

@pauloon What I did is turning off the watchdog and then wait for it to stop working. As soon as that did, all my zigbee devices in lovelace turned to "unavailable", and the zigbee2mqtt add-on wasn't running anymore.

It seems like the issue has something to do with the z2m docker eating too much RAM. I've enabled the memory sensors and this is what I'm seeing in the history:

https://tenboden.com/ram.png

I do have a lot of devices connected, about 90. How many do you have? It might explain why the crashes are more frequently for me if you have less than that connected? 🤔

@XanderTenBoden , I have 51 devices. I'm also keeping watchdog disabled, otherwise it restarts Z2M. I'll try to check the memory history, same as you did.

Thanks, Paulo.

pauloon avatar Nov 17 '22 11:11 pauloon

@pauloon I also see Presença 4 and Presença 2 are spamming a lot, can you also try to remove those?

Koenkk avatar Nov 17 '22 16:11 Koenkk

@pauloon I also see Presença 4 and Presença 2 are spamming a lot, can you also try to remove those?

Hi @Koenkk ,

I will try turn them off, to stop the traffic. These are mmWave presence sensors, which I have four. I know they push a lot of data from one expose called "target_distance" that pushes every second, but ONLY when there is someone being detected. As far as I understood it is not possible to turn off this reporting. I added this expose in the "filters" so it reduces MQTT data (I don't use it). I'm guessing these are not guilty, because the crashes that happened a few days ago I was traveling, and they were reporting much less data.

I'll keep you posted.

Thanks, Paulo.

pauloon avatar Nov 17 '22 19:11 pauloon

@Koenkk ,

Just got another crash. Very fast, this time.

Follows the log: log5.txt

I just had restarted HA and Z2M after turning off the sensors.

@XanderTenBoden , Indeed, the contained crashed too. But I did not get the same memory spique that you got. image

Please, let me know, Thanks, Paulo.

pauloon avatar Nov 17 '22 21:11 pauloon