zigbee2mqtt icon indicating copy to clipboard operation
zigbee2mqtt copied to clipboard

SkyConnect with new "ember" driver: Network extremely slow. OOM error in logs

Open Ra72xx opened this issue 10 months ago β€’ 8 comments

What happened?

I try to switch my SkyConnect from standard 'ezsp' to 'ember' driver. Therefore I updated to the current 7.4.1.0 firmware as recommended in the respective thread. Everything works as before as long as I use the 'ezsp' driver, but as soon as I switch to 'ember', the network is extremely slow. E.g. switching a light on takes about 1 to 2 minutes until it reacts. The startup log shows an "OUT OF MEMORY" error.

Apr 18 16:56:41 Obelix zigbee2mqtt[1460139]: ======== Ember Adapter Starting ========
Apr 18 16:56:41 Obelix zigbee2mqtt[1460139]: ======== EZSP starting ========
Apr 18 16:56:41 Obelix zigbee2mqtt[1460139]: ======== ASH NCP reset ========
Apr 18 16:56:41 Obelix zigbee2mqtt[1460139]: ======== ASH starting ========
Apr 18 16:56:42 Obelix zigbee2mqtt[1460139]: ======== ASH connected ========
Apr 18 16:56:42 Obelix zigbee2mqtt[1460139]: ======== ASH started ========
Apr 18 16:56:42 Obelix zigbee2mqtt[1460139]: ======== EZSP started ========
Apr 18 16:56:42 Obelix zigbee2mqtt[1460139]: [EzspConfigId] Failed to SET "APS_UNICAST_MESSAGE_COUNT" TO "32" with status=ERROR_OUT_OF_MEMORY.
Apr 18 16:56:42 Obelix zigbee2mqtt[1460139]: [STACK STATUS] Network up.
Apr 18 16:56:42 Obelix zigbee2mqtt[1460139]: [INIT TC] NCP network matches config.
Apr 18 16:56:42 Obelix zigbee2mqtt[1460139]: [CONCENTRATOR] Started source route discovery. 1248ms until next broadcast.
Apr 18 16:56:43 Obelix zigbee2mqtt[1460139]: Zigbee2MQTT:error 2024-04-18 16:56:43: Entity 'homeassistant/binary_sensor' is unknown

The "homeassistant/binary_sensor" error also happens with the 'ezsp' driver (don't know why).

What did you expect to happen?

No response

How to reproduce it (minimal and precise)

No response

Zigbee2MQTT version

1.36.1 commit: ffc2ff1

Adapter firmware version

7.4.1.0 build 0

Adapter

Skyconnect EZSP v13

Setup

Z2M in Podman container

Debug log

No response

Ra72xx avatar Apr 18 '24 15:04 Ra72xx

The config that fails to set is not a problem, it just uses the in-firmware value instead.

Did you let it run for a few hours before retrying after the switch? It could just be the network that needs to rebuild itself for some reason. Do you have any offline devices? You will have to provide debug logs of the actual issue otherwise.

Nerivec avatar Apr 20 '24 12:04 Nerivec

I tried it several times in the last two weeks and it was always the same: The network was extremely slow after the update, switching lights took minutes. Window sensors, however, reacted at once in Z2M. When switching back to ezsp after several hours, everything was fine at once. However, I don't have the time for logs right now, will look into it some other time.

Ra72xx avatar Apr 30 '24 10:04 Ra72xx

@Nerivec As the new 1.37 release finally offers the possibility to get herdsman logs easily, I captured a few minutes debug log during which I switched random lights on and off, both from HA and from the Z2M ui. Switching lights was once again extremely slow (minutes later). Strange enough, window and temperature sensors seem to work OK. Current firmware is 7.4.2.

log.zip

EDIT: Just for the fun of it, I checked how long it takes from switching the light on to reaction: 3min 30 secs! I did not even know that stuff can take so long without a timeout.

Ra72xx avatar May 02 '24 09:05 Ra72xx

I noticed that when starting with the ember driver, the Z2M container often fails with the following error at the first attempt, however works after the second (automatic) restart:

Mai 02 12:14:13 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:13] error:         zh:ember:uart:ash: Received ERROR from NCP while connecting, with code=RESET_WATCHDOG.
Mai 02 12:14:13 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:13] error:         zh:ember:uart:ash: ASH disconnected | NCP status: ASH_NCP_FATAL_ERROR
Mai 02 12:14:16 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:16] warning:         zh:ember:uart:ash: Frame(s) in progress cancelled in [1ac1020b0a527e]
Mai 02 12:14:16 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:16] error:         zh:ember:uart:ash: Received unexpected reset from NCP, with reason=RESET_SOFTWARE.
Mai 02 12:14:16 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:16] error:         zh:ember:uart:ash: ASH disconnected: ASH_ERROR_NCP_RESET | NCP status: ASH_NCP_FATAL_ERROR
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:29] error:         zh:ember:ezsp: ERROR Transaction failure; status=undefined. Last: [FRAME: ID=0:"VERSION" Seq=0 Len=4].
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:29] error:         zh:ember: !!! NCP FATAL ERROR reason=undefined. ATTEMPTING RESET... !!!
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:29] error:         z2m: Error while starting zigbee-herdsman
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:29] error:         z2m: Failed to start zigbee
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:29] error:         z2m: Check https://www.zigbee2mqtt.io/guide/installation/20_zigbee2mqtt-fails-to-start.html for possible solutions
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:29] error:         z2m: Exiting...
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]: [2024-05-02 12:14:29] error:         z2m: Error: 
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]:     at Ezsp.ezspVersion (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:1301:19)
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]:     at EmberAdapter.emberVersion (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:1642:66)
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]:     at EmberAdapter.initEzsp (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:802:9)
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]:     at EmberAdapter.start (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:2660:24)
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]:     at Controller.start (/app/node_modules/zigbee-herdsman/src/controller/controller.ts:124:29)
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]:     at Zigbee.start (/app/lib/zigbee.ts:62:27)
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]:     at Controller.start (/app/lib/controller.ts:108:27)
Mai 02 12:14:29 Obelix zigbee2mqtt[521643]:     at start (/app/index.js:107:5)

Ra72xx avatar May 02 '24 10:05 Ra72xx

I think something is not going right during startup sequence. The send queue is way too high (117 in your first debug line, 211 by the end). Seems something is sending far too many commands, and the driver can't catch up fast enough.


Since you mentioned using 7.4.2, did you use one from here to avoid hardware flow control issues? https://github.com/darkxst/silabs-firmware-builder/tree/ember-nohw/firmware_builds/skyconnect (use with rtscts: false)


Not strictly related (or possibly entirely related πŸ˜„), but looks like you have these devices in a non-zigbee type group (group that simply sends X commands for X number of devices in it). I'd advise to use a Zigbee group, that will send only 1 command, should lighten the load quite a bit. (You can do the same for any other groups you have like that one.)

0xd85def11a10049b0
0xd85def11a1006a50
0xd85def11a100466a
0xd85def11a100469e
0xd85def11a10015b8
0xd85def11a100458f
0xd85def11a1003268
0xd85def11a1002561

Nerivec avatar May 02 '24 12:05 Nerivec

I think something is not going right during startup sequence. The send queue is way too high (117 in your first debug line, 211 by the end). Seems something is sending far too many commands, and the driver can't catch up fast enough.

I don't know what could go wrong here. I use Z2M in a Podman container on x86-64 hardware, no fancy stuff, system load <10% most of the time, USB2.0 port and long cable to put the stick away from interference, etc.

As soon as I replace ember with ezsp in the config and restart, everything is fine again - at once.

Since you mentioned using 7.4.2, did you use one from here to avoid hardware flow control issues? https://github.com/darkxst/silabs-firmware-builder/tree/ember-nohw/firmware_builds/skyconnect (use with rtscts: false)

Yes, I use the file ncp-uart-hw-v7.4.2.0-skyconnect-115200.gbl from this location. I assume that this is with "hw" flow control enabled, however disabling hardware rts/cts in Z2M didn't seem not to help in my case (tried both variants).

EDIT: Clicking on the link you gave does redirect to the "normal" firmware folder, the folder path seems to be different than the one given in the text?!

Not strictly related (or possibly entirely related πŸ˜„), but looks like you have these devices in a non-zigbee type group (group that simply sends X commands for X number of devices in it). I'd advise to use a Zigbee group,

How did you find out about that group? I never intentionally created a non-Zigbee-group, however it might be some vendor specific stuff. Can you pinpoint which device is the culprit or point out to me for what log entry to watch? Maybe this could explain some of the problems.

Thanks for your support and your work on a better SkyConnect driver!

Ra72xx avatar May 02 '24 12:05 Ra72xx

I don't have that redirect. The gbl file you linked is the right one though. Make sure to set rtscts: false or just remove the line. (PS: 'hw' in the name is just the automatic naming from the firmware build, these don't have it to avoid issues described here.)

Search for this line in your logs: zh:controller:endpoint: ZCL command 0xd85def11a10049b0 You will see they are often followed with the same type line with the IEEE addresses I mentioned before. This is firing 8 commands instantly every time (seems to be every few seconds too!). It could just be an automation that triggers for each of these at the same time though, not necessarily a group (it just is often the case with HA users). In that case, you will want to create a Zigbee group, and trigger that group in your automation instead; should result in the same thing, except you will trigger 1 command instead of 8 each time.

Device 61960 (0xf208) is also triggering a lot of delivery failures. You might want to check that one more closely.

Nerivec avatar May 02 '24 13:05 Nerivec

I just did one more try with the new driver and rtscts disabled:

serial:
  port: /dev/ttyACM0
  adapter: ember
  rtscts: false

Doesn't seem to help, first launch of Z2M failed nevertheless, second one was OK. Slow reaction again...

Mai 02 15:03:24 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:24] error:         zh:ember:uart:ash: Received ERROR from NCP while connectin>
Mai 02 15:03:24 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:24] error:         zh:ember:uart:ash: ASH disconnected | NCP status: ASH_NCP_>
Mai 02 15:03:26 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:26] warning:         zh:ember:uart:ash: Frame(s) in progress cancelled in [1a>
Mai 02 15:03:26 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:26] error:         zh:ember:uart:ash: Received unexpected reset from NCP, wit>
Mai 02 15:03:26 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:26] error:         zh:ember:uart:ash: ASH disconnected: ASH_ERROR_NCP_RESET |>
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:39] error:         zh:ember:ezsp: ERROR Transaction failure; status=undefined>
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:39] error:         zh:ember: !!! NCP FATAL ERROR reason=undefined. ATTEMPTING>
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:39] error:         z2m: Error while starting zigbee-herdsman
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:39] error:         z2m: Failed to start zigbee
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:39] error:         z2m: Check https://www.zigbee2mqtt.io/guide/installation/2>
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:39] error:         z2m: Exiting...
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]: [2024-05-02 15:03:39] error:         z2m: Error: 
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]:     at Ezsp.ezspVersion (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:1301:>
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]:     at EmberAdapter.emberVersion (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/e>
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]:     at EmberAdapter.initEzsp (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/ember>
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]:     at EmberAdapter.start (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAda>
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]:     at Controller.start (/app/node_modules/zigbee-herdsman/src/controller/controller.ts:124:29)
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]:     at Zigbee.start (/app/lib/zigbee.ts:62:27)
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]:     at Controller.start (/app/lib/controller.ts:108:27)
Mai 02 15:03:39 Obelix zigbee2mqtt[591802]:     at start (/app/index.js:107:5)

Ra72xx avatar May 02 '24 13:05 Ra72xx

having the same issue. Most of my devices are working as expected (battery or not) but one remote (icasa) only was sending linkquality and lastseen.

I manually bound the endpoints, now i get a lot more (including action) BUT i cant get action_group, so basically i have the action but i dont know which button i pressed πŸ˜“

image image

nicandris avatar Jun 10 '24 08:06 nicandris

Same here. As @Ra72xx writes "As soon as I replace ember with ezsp in the config and restart, everything is fine again" When I do that everything works.

MKlitgaard2 avatar Jun 13 '24 08:06 MKlitgaard2

I routinely try to switch to the new driver when new Z2M versions appear, and it is always the same. It also seems not to get better when giving the network a few hours to "calm down" and/or unplugging the USB dongle after the change for some time. I understand that the old driver is deprecated, because it has lots of problems of its own, but I'm a bit concerned as I'm nevertheless still forced to use it as it at least works most of the time.

Ra72xx avatar Jun 14 '24 03:06 Ra72xx

Seems like other comments may not be related to this issue at all. @Ra72xx I'm not sure what's going on in your scenario, I don't have anyone else reporting these delays. Did you figure out what triggers those massive amounts of messages sent on start? I'm pretty sure fixing that, will fix the rest.

Nerivec avatar Jun 14 '24 11:06 Nerivec

Seems like other comments may not be related to this issue at all. @Ra72xx I'm not sure what's going on in your scenario, I don't have anyone else reporting these delays. Did you figure out what triggers those massive amounts of messages sent on start? I'm pretty sure fixing that, will fix the rest.

As you guessed, probably an Homeassistant automation which pushed - for every room and every thermostat on every sensor change - temperature values to the respective thermostats. I've modified the automation sincde (and it's no longer active currently). Moreover, I have nine of those devices: https://www.zigbee2mqtt.io/devices/6735_6736_6737.html . These seem to do some kind of polling in order to get the state of the relais (this directly operates the device and seemingly has to be polled to get the state if the switch has been manually operated). So I can't get rid of this Zigbee traffic.

Ra72xx avatar Jun 14 '24 17:06 Ra72xx

To get to over 100 in-flight messages right from the start, something is definitely wrong. Although, on its own, ember should be able to deal with that traffic (it did when I stress-tested it during dev, of course, delays become visible with so many), the problem is, more traffic of equivalent proportion seems to continue happening, which means ember can never catch up and get to an "idle" state where things can go back to "normal". I know you had some troubles with ezsp in the past, but this behavior, I don't think it could handle without regular crashes (lack of order coupled with way too much traffic). It raises the question of, is that behavior happening in ezsp at all (and if it is, how come it does not crash). Any chance you can provide a debug log of ezsp from the start of Z2M to at least 5-10min after that, so I can see how it behaves? (also keep one next time you try ember, I'll check it again for your changes)

Nerivec avatar Jun 14 '24 18:06 Nerivec

Ember at home, skyconnect with raspberry pi works beautifully. Ember at work, skyconnect with an Intel NUC, completely unusable and constant NCP restarts and timeouts.

Not sure if the different platforms are significant here but it’s weird that they behave so differently.

Both networks with over 100 nodes.

lawrencedudley avatar Jun 22 '24 15:06 lawrencedudley

@lawrencedudley Interesting. One of my setups is on Intel NUC (Dongle-E though), no issue at all. NUCs usually are far more stable than PIs too. Always the same error that triggers the restarts? Maybe an interference issue from the different location? Did you check the environment? I integrated scanner functions (channels/existing networks) in https://github.com/Nerivec/ember-zli (I haven't had reports of usage with SkyConnect yet, you can create an issue there if you encounter any).

Nerivec avatar Jun 22 '24 15:06 Nerivec

I’ll give it another go on the next release, I don’t think external factors are significant here as reverting to the EZSP driver on the same Skyconnect firmware makes everything happy again.

On Ember, doing an OTA update was a way of repeatedly causing restarts on the driver.

Happy to jump on a quick screen share if it helps you with tracking down what’s causing it or I can share logs.

lawrencedudley avatar Jun 22 '24 15:06 lawrencedudley

Looking into the logs, when using the old ezsp driver, I get the message "warning: zh:ezsp: Deprecated driver 'ezsp' currently in use...." at least every hour or so. Is this sign of a restart, too? I'll look into the new ember driver once again soon, however I'm a bit reluctant to break the at least somehow running setup as long as I don't have enough spare time to setup everything again from scratch if everything goes wrong ;-)

Ra72xx avatar Jun 22 '24 15:06 Ra72xx

@lawrencedudley Weirder still, OTA should be far more stable on ember. I actually had some recent feedback on a network of over 300 devices, where every device was updated over the course of a couple of weeks (half of it after the converters OTA refactoring in dev); that person had no issue. I'd have to say something is interfering in your case (could be device-specific). ezsp may bypass them due to the lack of implementation (which can cause other issues down the road). It would be great if you could share a debug log that shows what error specifically is thrown.

@Ra72xx No, Koenkk added a warning with last release that logs every hour. You can make an NVM backup with ember-zli or bellows before trying anything. That should allow you to restore the adapter if anything goes "too wrong".

Nerivec avatar Jun 22 '24 16:06 Nerivec

I’ll grab a debug log when I get home and stick it in here!

Like I said, at home, works great. OTA seems much happier running multiple OTA updates at the same time (3/4) where EZSP would have a bad time of it with more than one running concurrently.

lawrencedudley avatar Jun 22 '24 16:06 lawrencedudley

@Nerivec what's the easiest way for me to retain logs from Z2M in Home Assistant? It doesn't seem to respect the max_entries setting from system log so I'm only getting 50 entries and that's like 3 seconds worth.

lawrencedudley avatar Jun 22 '24 19:06 lawrencedudley

You can grab the file directly, it's inside the z2m config folder (use the file explorer add-on, or studio code server).

Nerivec avatar Jun 22 '24 20:06 Nerivec

Cool, got it. Not sure how I missed that!

I've attached a few logs.

log1:

There's a lot of `Delivery of MULTICAST failed for "65533" which feels like it might have some kind of retry logic on it in the ember driver, which may then be causing the network to be so busy it falls over?

log2:

zh:ember:uart:ash: NOT READY - Signaling NCP feels like a bad thing as well.

log3:

This also a bit worrying:

[2024-06-20 23:54:39] error: 	zh:ember:uart:ash: Received ERROR from NCP, with code=ERROR_EXCEEDED_MAXIMUM_ACK_TIMEOUT_COUNT.
[2024-06-20 23:54:39] error: 	zh:ember:uart:ash: ASH disconnected | NCP status: ASH_NCP_FATAL_ERROR
[2024-06-20 23:54:39] error: 	zh:ember:uart:ash: Error while parsing received frame, status=ASH_NCP_FATAL_ERROR.
[2024-06-20 23:54:39] error: 	zh:ember: !!! NCP FATAL ERROR reason=HOST_FATAL_ERROR. ATTEMPTING RESET... !!!

log.log log1.log log2.log

Switching it back to EZSP works like a dream though.

lawrencedudley avatar Jun 24 '24 09:06 lawrencedudley

Try the new release of Z2M - it's been way more stable for me and this issue has pretty much disappeared unless you OTA more than one device at once.

lawrencedudley avatar Jul 02 '24 06:07 lawrencedudley

You were testing with the latest dev (not release) I believe @lawrencedudley That has a lot more refactoring that came along with the v8 support. It will go under test in dev for a month before it's in release though, as it does change a lot of stuff :wink:

Nerivec avatar Jul 02 '24 10:07 Nerivec

Yes, the latest stable release does not help me with my problem, unfortunately.

Ra72xx avatar Jul 02 '24 14:07 Ra72xx

Here my logfile with the currenct stable version and the "ember" driver. ember.log As a sidenote, I was no longer able to access the web ui when the ember driver and debug log were enabled.

Ra72xx avatar Jul 02 '24 15:07 Ra72xx

It seems to sometimes take a surprisingly long time for the web UI to come back alive.

Have you tried edge @Ra72xx?

lawrencedudley avatar Jul 02 '24 15:07 lawrencedudley

I'm just experimenting with the latest-dev docker image and there is no real difference, everything is delayed. Web ui seems not to be accessible. And the first restart of Zigbee2MQTT always ends up with errors like "zh:ember: !!! ADAPTER FATAL ERROR reason=HOST_FATAL_ERROR.", only the second restart works. Back to ezsp and at once it works again.

Ra72xx avatar Jul 02 '24 15:07 Ra72xx

Hate to say it but I'm now seeing similar!

lawrencedudley avatar Jul 02 '24 19:07 lawrencedudley