node-zwave-js
node-zwave-js copied to clipboard
Zen14 seems to lose track of end points above 0
Checklist
- [X] I am not using Home Assistant. Or: a developer has told me to come here.
- [X] I have checked the troubleshooting section and my problem is not described there.
- [X] I have read the changelog and my problem is not mentioned there.
Deploy method
Snap
Z-Wave JS UI version
9.9.0.09b107a
ZwaveJS version
12.4.4
Describe the bug
This seems to be a very latent bug. (happy happy joy joy)
Occasionally zwave-js / zwave-js-ui seems to lose track of Zen14 end points above 0. (I'm not sure if it happens with other items, I've just noticed it with my Zen14s.)
See attached screenshot. Device was interviewed correctly a couple of months ago, and was working perfectly for those couple months using mqtt commands to any and all end points (0, 1, 2). All of a sudden, zwave-js/zwave-js-ui insist that end points 1 and 2 don't exist, despite clearly being there for the interview and also working previously.
Note that while in the failure state, end point 0 continues to function normally. In this device's case, a command to end point 0 makes that change to end points 1 and 2. So if I send a command to turn on end point 0, the device itself turns on end points 1 and 2 and then update their states in the UI! But if I then go into the UI and try to issue a command directly to end point 1 or 2, I'll get the "end point 1 (or 2) doesn't exist" message.
Moreso, this is not the first time I've run into this bug. Previously I've "fixed" it by re-interviewing the device, and the zen14 will come back exactly the same and it'll work again for a few weeks using end points 1 and 2. But re-interviewing is not fixing the bug it's just papering over it, so I'm raising it at this point as this has the hallmarks of being an ugly one to track down and fix.
To Reproduce
- Add Zen14 to zwave-js
- Use for a few weeks/months
- Notice end points 1 or 2 are not working, while end point 0 continues to work
Expected behavior
Commands to end points 1 and 2 should continue to function as in the past.
Additional context
Given the long times between occurrences of this bug, I'm reticent to restart zwave-js/zwave-js-ui because I'm unsure if it'll come back in the failed state and we'll lose the opportunity for useful debugging.
Again, this can be "fixed" by re-interviewing the device, but considering I've encountered this same bug a few times now I'm willing to forego this working for a while to try and debug, just not exactly sure where/how to proceed from here and what would be useful.
Please make a driver log, loglevel debug
and attach it here as a file (drag & drop into the text field).
OK so fun times!
The act of turning on debugging caused end points 1 and 2 to disappear.
Tried turning on end point 0, which then caused end points 1 and 2 to re-appear in the UI.
Fortunately (well, kind-of), it's still in the failure state so I was able to grab a driver log. It's 512KB raw, so zipping it up to drop here.
(more info... turned on driver log. saw end points 1 & 2 missing. turned on end point 0, saw end points 1 & 2 reappear in on state. Then went to end points 1 and 2 and tried to turn them off, got end points not found messages as shown. Then copied log, and returned debugging to normal settings.)
@AlCalzone Any ideas? I can't keep holding off on resetting this thing forever. (And it's still happening.)
@AlCalzone @robertsLando What do I have to do to even get someone to take a look at this? There's been no eyes on this since it's been moved off of UI.
I'll try to find some time next week. Have been pretty busy since I came back from parental leave.
@TheKorn2 I'm sorry for that but there is nothing I can do and @AlCalzone has been busy as he said 🙏🏻
I'll try to find some time next week. Have been pretty busy since I came back from parental leave.
Thanks Al!
So I finally had some time to look at this. According to your log, after controlling endpoint 0, endpoints 1 and 2 have their value updated
2024-03-26T08:14:01.128Z CNTRLR [Node 089] [~] [Binary Switch] currentValue: false => true [Endpoint 1]
2024-03-26T08:14:01.172Z CNTRLR [Node 089] [~] [Binary Switch] targetValue: false => true [Endpoint 1]
...
2024-03-26T08:14:01.199Z CNTRLR [Node 089] [~] [Binary Switch] currentValue: false => true [Endpoint 2]
2024-03-26T08:14:01.224Z CNTRLR [Node 089] [~] [Binary Switch] targetValue: false => true [Endpoint 2]
but that also means the values were there before. I'm afraid to diagnose this I need to see logs spanning from everything is fine until endpoints are missing.
@AlCalzone Well I can get back to everything is fine by re-interviewing the node, that's easy enough. But I'm unclear which log files should be left on for logging. (It's a reasonably beefy x86, so leaving logging on shouldn't be too detrimental, I hope!)
Just the driver logs on level debug. You may need to increase the no. of days they are kept though.
A watched pot never boils...
I turned on debugging but restricted it to node 89 only.
Wouldn't you know it, node 88 (also a Zen 14) decided to show symptoms instead.
So back to square 1, restricted the debug log for every zen14 and reinterviewed every one. One is going to boil sometime!
Don't filter the logs please. They become large but they compress well.
@AlCalzone You want an unfiltered log for a busy 150-ish node network for two weeks? You sure about that?
Yep 😇
@TheKorn2 any updates? If you need a place to upload the logs, you can contact me on Discord and I'll share a link to my NAS.
Well a watched pot never boils.... Until you get pissed off enough and break out a blowtorch. :D
As of this morning, nada. So I decided to write a stress test to force some breakage -- turn on all zen14s, wait 2m, turn off all zen14s, wait 3m, repeat ad infinitum.
Long story short, I broke it, but oddly not in the way it was broken before. Now instead of dropping end points, it's somehow duplicated an extra! Can't help but think this has to be related to my original problem.
I'll have to figure out where the discord is and contact you there. I have all the logs from beginning to now tar'd up, it's 156MB compressed. That'll be fun to go through. Though I'd just start with July 12th and work backwards, since that's when I broke out the blowtorch. ;)
(I'm going to let that routine run overnight and see if there's any additional breakage. Everything connected to those is turned off so it's not hurting anything.)
Nothing else went haywire overnight.
I believe your main issue is too much traffic in the network caused by reporting, see also https://zwave-js.github.io/node-zwave-js/#/troubleshooting/network-health?id=optimizing-the-reporting-configuration
In the logs from July 12th, I count 102592 incoming meter readings, that's 1.2 per second on average - far too much for a healthy Z-Wave network. For example, do you really need kVarh and kVar meter readings every 60 seconds?
Endpoint 3 on node 089 appears in a single command after a stretch of incoming reports that are repeated several times. Since you're not using encryption for that node, it's very likely that this frame was corrupted in just the right way so it is still considered valid.
This kind of repetition from several nodes at once usually means that those nodes or a common repeater is not receiving the acknowledgements from the controller due to too much noise, and then re-transmit a few times to try and get the command to its destination.
I'D LOVE TO turn that off! (I can't think of a single place where I need/want kvar!)
Unfortunately, there's literally no knobs at all to turn with respect to reporting nor reporting frequency to turn on the Zen14.
Think you could possibly put a bug in Zooz' ear on this? I've tried before and the short version is they waste time with mega-less-than-helpful L1 suggestions for a month ("try excluding it from the network and having it re-join. Didn't work? Try it again!"), then abruptly terminate the conversation with "you're out of warranty, so sad too bad" once you manage to break through the brick wall and get it kicked upstairs. (Been down this road six times with them for previous issues.)
Also note that I'm super confused how those are reporting kvar at all, since the spec sheets don't mention that they do power monitoring at all! (I wouldn't mind a watt report every ten minutes, for example!)
(That may be the true root cause of the problem -- Zooz copying and pasting code from some other device that does do power monitoring, and not commenting out the PM code!)
Also also, where are you seeing these reports? If I look via mqtt, all I see are the messages I'd expect to see. (I kicked on endpoint 1 of node 89 here as an example to see if I could see what you're seeing.)
2024-07-17T06:34:52-0500 zwave/89/lastActive:{"time":1721216092478,"value":1720959805032}
2024-07-17T06:34:52-0500 zwave/89/37/1/targetValue:{"time":1721216092550,"value":true}
2024-07-17T06:34:52-0500 zwave/89/37/1/currentValue:{"time":1721216092552,"value":true}
2024-07-17T06:34:52-0500 zwave/89/lastActive:{"time":1721216092728,"value":1721216092643}
2024-07-17T06:34:53-0500 zwave/89/lastActive:{"time":1721216093053,"value":1721216093053}
2024-07-17T06:34:53-0500 zwave/89/37/1/currentValue:{"time":1721216093058,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/1/targetValue:{"time":1721216093063,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/1/duration:{"time":1721216093068,"value":{"value":0,"unit":"seconds"}}
2024-07-17T06:34:53-0500 zwave/89/37/0/currentValue:{"time":1721216093141,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/0/targetValue:{"time":1721216093146,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/0/duration:{"time":1721216093150,"value":{"value":0,"unit":"seconds"}}
2024-07-17T06:34:53-0500 zwave/89/37/0/currentValue:{"time":1721216093167,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/0/targetValue:{"time":1721216093172,"value":true}
2024-07-17T06:34:53-0500 zwave/89/37/0/duration:{"time":1721216093176,"value":{"value":0,"unit":"seconds"}}
2024-07-17T06:34:53-0500 zwave/89/lastActive:{"time":1721216093304,"value":1721216093164}
kVarh and kVar come from Node 122, not sure which device that is. Node 173 has around 36k reports alone, on average one every 2.3s
I'd look at the node statistics (incoming commands) and start with the ones with the highest numbers.
Where do you see the statistics? I can't seem to find them. (Wouldn't be the first time I missed something so happy to be pointed in the right direction!)
node 122 is an aeotec gen5 home energy monitor. I'm surprised you're seeing that many reports as those numbers don't agree with what I'm seeing coming over my zniffer. I'm only showing a report sent once a minute. (Which is what I intended.) Each report has about twenty packets in it, but even if every packet was counted as a report that'd be 28.8k packets / day. About the only real problem I see with 122 is for some reason the controller occasionally ack floods it with a period of < 0.1s between retries, but don't think there's a dial I can twist for that.
node 173 is a zen25. It's a temporary plug for my 4th of july lights that I haven't gotten around to unplugging and taking out of the network. Weird thing is it's configured to report every 300s, not every 3s! But that didn't even exist in the network prior to about june 28th or so. (And won't exist in a few minutes just to cover the base.)
So I guess where do I find the statistics, so I can compare with what my zniffer is seeing? Thanks.
@TheKorn2 Just for info: Z-UI has it's own Zniffer UI now, try it :)
Hover over the arrows in the node list or click the nodes in the network map
@robertsLando I noticed that a while back, but I froze the snap when I opened this ticket so we wouldn't be dealing with changing code versions on top of everything else. (...and that version doesn't have it.)
@AlCalzone If you're cool with it, I'll unfreeze the package and update to the latest one.
Not sure if you've seen the reaction, but go for it.
Oh, no, totally missed it. I'm on vacation now but I'll upgrade when I get back.