node-zwave-js icon indicating copy to clipboard operation
node-zwave-js copied to clipboard

Zen14 seems to lose track of end points above 0

Open TheKorn2 opened this issue 10 months ago • 15 comments

Checklist

  • [X] I am not using Home Assistant. Or: a developer has told me to come here.
  • [X] I have checked the troubleshooting section and my problem is not described there.
  • [X] I have read the changelog and my problem is not mentioned there.

Deploy method

Snap

Z-Wave JS UI version

9.9.0.09b107a

ZwaveJS version

12.4.4

Describe the bug

This seems to be a very latent bug. (happy happy joy joy)

Occasionally zwave-js / zwave-js-ui seems to lose track of Zen14 end points above 0. (I'm not sure if it happens with other items, I've just noticed it with my Zen14s.)

See attached screenshot. Device was interviewed correctly a couple of months ago, and was working perfectly for those couple months using mqtt commands to any and all end points (0, 1, 2). All of a sudden, zwave-js/zwave-js-ui insist that end points 1 and 2 don't exist, despite clearly being there for the interview and also working previously.

Note that while in the failure state, end point 0 continues to function normally. In this device's case, a command to end point 0 makes that change to end points 1 and 2. So if I send a command to turn on end point 0, the device itself turns on end points 1 and 2 and then update their states in the UI! But if I then go into the UI and try to issue a command directly to end point 1 or 2, I'll get the "end point 1 (or 2) doesn't exist" message.

Moreso, this is not the first time I've run into this bug. Previously I've "fixed" it by re-interviewing the device, and the zen14 will come back exactly the same and it'll work again for a few weeks using end points 1 and 2. But re-interviewing is not fixing the bug it's just papering over it, so I'm raising it at this point as this has the hallmarks of being an ugly one to track down and fix.

lost_track_of_end_points

To Reproduce

  • Add Zen14 to zwave-js
  • Use for a few weeks/months
  • Notice end points 1 or 2 are not working, while end point 0 continues to work

Expected behavior

Commands to end points 1 and 2 should continue to function as in the past.

Additional context

Given the long times between occurrences of this bug, I'm reticent to restart zwave-js/zwave-js-ui because I'm unsure if it'll come back in the failed state and we'll lose the opportunity for useful debugging.

Again, this can be "fixed" by re-interviewing the device, but considering I've encountered this same bug a few times now I'm willing to forego this working for a while to try and debug, just not exactly sure where/how to proceed from here and what would be useful.

TheKorn2 avatar Mar 26 '24 07:03 TheKorn2

Please make a driver log, loglevel debug and attach it here as a file (drag & drop into the text field).

robertsLando avatar Mar 26 '24 08:03 robertsLando

OK so fun times!

The act of turning on debugging caused end points 1 and 2 to disappear.

its_GONE

Tried turning on end point 0, which then caused end points 1 and 2 to re-appear in the UI.

and now its back!

Fortunately (well, kind-of), it's still in the failure state so I was able to grab a driver log. It's 512KB raw, so zipping it up to drop here.

still_broken

zwavejs_2024-03-26.zip

TheKorn2 avatar Mar 26 '24 08:03 TheKorn2

(more info... turned on driver log. saw end points 1 & 2 missing. turned on end point 0, saw end points 1 & 2 reappear in on state. Then went to end points 1 and 2 and tried to turn them off, got end points not found messages as shown. Then copied log, and returned debugging to normal settings.)

TheKorn2 avatar Mar 26 '24 08:03 TheKorn2

@AlCalzone Any ideas? I can't keep holding off on resetting this thing forever. (And it's still happening.)

TheKorn2 avatar Apr 24 '24 13:04 TheKorn2

@AlCalzone @robertsLando What do I have to do to even get someone to take a look at this? There's been no eyes on this since it's been moved off of UI.

TheKorn2 avatar May 19 '24 10:05 TheKorn2

I'll try to find some time next week. Have been pretty busy since I came back from parental leave.

AlCalzone avatar May 19 '24 11:05 AlCalzone

@TheKorn2 I'm sorry for that but there is nothing I can do and @AlCalzone has been busy as he said 🙏🏻

robertsLando avatar May 19 '24 14:05 robertsLando

I'll try to find some time next week. Have been pretty busy since I came back from parental leave.

Thanks Al!

TheKorn2 avatar May 19 '24 15:05 TheKorn2

So I finally had some time to look at this. According to your log, after controlling endpoint 0, endpoints 1 and 2 have their value updated

2024-03-26T08:14:01.128Z CNTRLR   [Node 089] [~] [Binary Switch] currentValue: false => true        [Endpoint 1]
2024-03-26T08:14:01.172Z CNTRLR   [Node 089] [~] [Binary Switch] targetValue: false => true         [Endpoint 1]
...
2024-03-26T08:14:01.199Z CNTRLR   [Node 089] [~] [Binary Switch] currentValue: false => true        [Endpoint 2]
2024-03-26T08:14:01.224Z CNTRLR   [Node 089] [~] [Binary Switch] targetValue: false => true         [Endpoint 2]

but that also means the values were there before. I'm afraid to diagnose this I need to see logs spanning from everything is fine until endpoints are missing.

AlCalzone avatar May 29 '24 10:05 AlCalzone

@AlCalzone Well I can get back to everything is fine by re-interviewing the node, that's easy enough. But I'm unclear which log files should be left on for logging. (It's a reasonably beefy x86, so leaving logging on shouldn't be too detrimental, I hope!)

TheKorn2 avatar May 29 '24 12:05 TheKorn2

Just the driver logs on level debug. You may need to increase the no. of days they are kept though.

AlCalzone avatar May 29 '24 18:05 AlCalzone

A watched pot never boils...

I turned on debugging but restricted it to node 89 only.

Wouldn't you know it, node 88 (also a Zen 14) decided to show symptoms instead.

So back to square 1, restricted the debug log for every zen14 and reinterviewed every one. One is going to boil sometime!

TheKorn2 avatar Jun 23 '24 21:06 TheKorn2

Don't filter the logs please. They become large but they compress well.

AlCalzone avatar Jun 24 '24 05:06 AlCalzone

@AlCalzone You want an unfiltered log for a busy 150-ish node network for two weeks? You sure about that?

TheKorn2 avatar Jun 24 '24 19:06 TheKorn2

Yep 😇

AlCalzone avatar Jun 24 '24 21:06 AlCalzone

@TheKorn2 any updates? If you need a place to upload the logs, you can contact me on Discord and I'll share a link to my NAS.

AlCalzone avatar Jul 12 '24 06:07 AlCalzone

Well a watched pot never boils.... Until you get pissed off enough and break out a blowtorch. :D

As of this morning, nada. So I decided to write a stress test to force some breakage -- turn on all zen14s, wait 2m, turn off all zen14s, wait 3m, repeat ad infinitum.

Long story short, I broke it, but oddly not in the way it was broken before. Now instead of dropping end points, it's somehow duplicated an extra! Can't help but think this has to be related to my original problem.

I'll have to figure out where the discord is and contact you there. I have all the logs from beginning to now tar'd up, it's 156MB compressed. That'll be fun to go through. Though I'd just start with July 12th and work backwards, since that's when I broke out the blowtorch. ;)

node89-what

TheKorn2 avatar Jul 13 '24 02:07 TheKorn2

(I'm going to let that routine run overnight and see if there's any additional breakage. Everything connected to those is turned off so it's not hurting anything.)

TheKorn2 avatar Jul 13 '24 02:07 TheKorn2

Nothing else went haywire overnight.

TheKorn2 avatar Jul 13 '24 13:07 TheKorn2

I believe your main issue is too much traffic in the network caused by reporting, see also https://zwave-js.github.io/node-zwave-js/#/troubleshooting/network-health?id=optimizing-the-reporting-configuration

In the logs from July 12th, I count 102592 incoming meter readings, that's 1.2 per second on average - far too much for a healthy Z-Wave network. For example, do you really need kVarh and kVar meter readings every 60 seconds?

Endpoint 3 on node 089 appears in a single command after a stretch of incoming reports that are repeated several times. Since you're not using encryption for that node, it's very likely that this frame was corrupted in just the right way so it is still considered valid.

This kind of repetition from several nodes at once usually means that those nodes or a common repeater is not receiving the acknowledgements from the controller due to too much noise, and then re-transmit a few times to try and get the command to its destination.

AlCalzone avatar Jul 17 '24 07:07 AlCalzone

I'D LOVE TO turn that off! (I can't think of a single place where I need/want kvar!)

Unfortunately, there's literally no knobs at all to turn with respect to reporting nor reporting frequency to turn on the Zen14.

image

Think you could possibly put a bug in Zooz' ear on this? I've tried before and the short version is they waste time with mega-less-than-helpful L1 suggestions for a month ("try excluding it from the network and having it re-join. Didn't work? Try it again!"), then abruptly terminate the conversation with "you're out of warranty, so sad too bad" once you manage to break through the brick wall and get it kicked upstairs. (Been down this road six times with them for previous issues.)

TheKorn2 avatar Jul 17 '24 11:07 TheKorn2

Also note that I'm super confused how those are reporting kvar at all, since the spec sheets don't mention that they do power monitoring at all! (I wouldn't mind a watt report every ten minutes, for example!)

(That may be the true root cause of the problem -- Zooz copying and pasting code from some other device that does do power monitoring, and not commenting out the PM code!)

TheKorn2 avatar Jul 17 '24 11:07 TheKorn2

Also also, where are you seeing these reports? If I look via mqtt, all I see are the messages I'd expect to see. (I kicked on endpoint 1 of node 89 here as an example to see if I could see what you're seeing.)

2024-07-17T06:34:52-0500 zwave/89/lastActive:{"time":1721216092478,"value":1720959805032}
2024-07-17T06:34:52-0500 zwave/89/37/1/targetValue:{"time":1721216092550,"value":true}
2024-07-17T06:34:52-0500 zwave/89/37/1/currentValue:{"time":1721216092552,"value":true}
2024-07-17T06:34:52-0500 zwave/89/lastActive:{"time":1721216092728,"value":1721216092643}
2024-07-17T06:34:53-0500 zwave/89/lastActive:{"time":1721216093053,"value":1721216093053}
2024-07-17T06:34:53-0500 zwave/89/37/1/currentValue:{"time":1721216093058,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/1/targetValue:{"time":1721216093063,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/1/duration:{"time":1721216093068,"value":{"value":0,"unit":"seconds"}}
2024-07-17T06:34:53-0500 zwave/89/37/0/currentValue:{"time":1721216093141,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/0/targetValue:{"time":1721216093146,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/0/duration:{"time":1721216093150,"value":{"value":0,"unit":"seconds"}}
2024-07-17T06:34:53-0500 zwave/89/37/0/currentValue:{"time":1721216093167,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/0/targetValue:{"time":1721216093172,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/0/duration:{"time":1721216093176,"value":{"value":0,"unit":"seconds"}}
2024-07-17T06:34:53-0500 zwave/89/lastActive:{"time":1721216093304,"value":1721216093164}

TheKorn2 avatar Jul 17 '24 11:07 TheKorn2

kVarh and kVar come from Node 122, not sure which device that is. Node 173 has around 36k reports alone, on average one every 2.3s

I'd look at the node statistics (incoming commands) and start with the ones with the highest numbers.

AlCalzone avatar Jul 17 '24 12:07 AlCalzone

Where do you see the statistics? I can't seem to find them. (Wouldn't be the first time I missed something so happy to be pointed in the right direction!)

node 122 is an aeotec gen5 home energy monitor. I'm surprised you're seeing that many reports as those numbers don't agree with what I'm seeing coming over my zniffer. I'm only showing a report sent once a minute. (Which is what I intended.) Each report has about twenty packets in it, but even if every packet was counted as a report that'd be 28.8k packets / day. About the only real problem I see with 122 is for some reason the controller occasionally ack floods it with a period of < 0.1s between retries, but don't think there's a dial I can twist for that.

image

node 173 is a zen25. It's a temporary plug for my 4th of july lights that I haven't gotten around to unplugging and taking out of the network. Weird thing is it's configured to report every 300s, not every 3s! But that didn't even exist in the network prior to about june 28th or so. (And won't exist in a few minutes just to cover the base.)

image

So I guess where do I find the statistics, so I can compare with what my zniffer is seeing? Thanks.

TheKorn2 avatar Jul 17 '24 13:07 TheKorn2

@TheKorn2 Just for info: Z-UI has it's own Zniffer UI now, try it :)

robertsLando avatar Jul 17 '24 13:07 robertsLando

Hover over the arrows in the node list or click the nodes in the network map

AlCalzone avatar Jul 17 '24 13:07 AlCalzone

@robertsLando I noticed that a while back, but I froze the snap when I opened this ticket so we wouldn't be dealing with changing code versions on top of everything else. (...and that version doesn't have it.)

@AlCalzone If you're cool with it, I'll unfreeze the package and update to the latest one.

TheKorn2 avatar Jul 17 '24 15:07 TheKorn2

Not sure if you've seen the reaction, but go for it.

AlCalzone avatar Jul 26 '24 12:07 AlCalzone

Oh, no, totally missed it. I'm on vacation now but I'll upgrade when I get back.

TheKorn2 avatar Jul 26 '24 13:07 TheKorn2