node-zwave-js icon indicating copy to clipboard operation
node-zwave-js copied to clipboard

S2 message decryption errors when setting config params on multiple devices

Open mattstocum opened this issue 3 years ago • 17 comments

Is your problem within Home Assistant (Core or Z-Wave JS Integration)?

NO, my problem is NOT within Home Assistant or the ZWave JS integration

Is your problem within ZWaveJS2MQTT?

NO, my problem is NOT within ZWaveJS2MQTT

Checklist

Describe the bug

What causes the bug?

Changing a configuration parameter in two S2 secured devices in short succession.

What do you observe?

Frequent messages in the logs about Duplicate commands and "Message authentication failed" for about a minute, slowing down the entire z-wave network.

What did you expect to happen?

Both devices should have the config parameter update and report back to the controller.

Steps to reproduce the behavior:

  1. Add two devices with S2 Security (in my case both are Access Control)
  2. Set a configuration parameter on one device, then immediately set a configuration parameter on the second.
  3. Wait about 10 seconds
  4. The logs will be flooded for about the next minute with error messages from both nodes.

Device information

Manufacturer: Allegion/Schlage Model name: BE469ZP Node ID in your network: 2 and 66

How are you using node-zwave-js?

  • [ ] zwavejs2mqtt Docker image (latest)
  • [ ] zwavejs2mqtt Docker image (dev)
  • [X] zwavejs2mqtt Docker manually built (please specify branches)
  • [ ] ioBroker.zwave2 adapter (please specify version)
  • [ ] HomeAssistant zwave_js integration (please specify version)
  • [ ] pkg
  • [ ] node-red-contrib-zwave-js (please specify version, double click node to find out)
  • [ ] Manually built from GitHub (please specify branch)
  • [ ] Other (please describe)

Which branches or versions?

version: node-zwave-js branch: 8.5.0 zwavejs2mqtt branch: 5.8.0.36f1b6b

Did you change anything?

yes (please describe)

If yes, what did you change?

Upgraded both locks from S0 to S2 security

Did this work before?

Yes (please describe)

If yes, where did it work?

If both locks where at S0 security there were no issues.

Attach Driver Logfile

zwavejs_2021-10-12.log

Around 18:02:29Z you can see an example of the behavior I'm seeing. Around 18:05:05Z you can see the same exact call, but with a 10 second delay between setting the config param on node 2 and node 66.

mattstocum avatar Oct 12 '21 18:10 mattstocum

I don't think this is related to the quick succession of the commands, might just be a coincidence.

For Node 66, the following happens:

  1. The response comes before the acknowledgement that the request was received by the node. I'm not sure why that causes decoding the response to fail - will need to investigate.
  2. Because the decoding failed, the driver requests a new nonce, to which the device incorrectly responds with its own nonce instead of re-transmitting the command including the nonce. This causes a bunch of weirdness, because the driver can't handle that incorrect response yet. https://github.com/zwave-js/node-zwave-js/issues/3305 tracks this.

From Node 2, the stick actually receives duplicate commands. These seem to be caused because the node sends another configuration report with Supervision encapsulation a couple of seconds after responding to the verification poll:

# poll
2021-10-12T18:02:32.094Z DRIVER » [Node 002] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ route:            0, 0, 0, 0
                                  │ callback id:      80
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 242
                                    └─[ConfigurationCCGet]
                                        parameter #: 4
...
# response
2021-10-12T18:02:37.381Z DRIVER « [Node 002] [REQ] [BridgeApplicationCommand]
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 253
                                    └─[ConfigurationCCReport]
                                        parameter #: 4
                                        value size:  1
                                        value:       255
... pause ...
# 🤷🏻‍♂️
2021-10-12T18:02:40.005Z SERIAL « 0x011e00a8000102159f03fe00f4f797011499061ed6580d70dcf69c2e0100c493  (32 bytes)
2021-10-12T18:02:40.012Z DRIVER « [Node 002] [REQ] [BridgeApplicationCommand]
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 254
                                    └─[SupervisionCCGet]
                                      │ session id:      22
                                      │ request updates: false
                                      └─[ConfigurationCCReport]
                                          parameter #: 4
                                          value size:  1
                                          value:       255

Then it gets impatient because it doesn't get a response within 1s, but the driver cannot respond yet, because it is waiting for a callback from the controller to the message it just sent to node 66. Notice how the serial data is exactly the same as the first one - although the 3rd is more likely a re-transmission of the 2nd due to routing issues:

2021-10-12T18:02:41.158Z SERIAL « 0x011e00a8000102159f03fe00f4f797011499061ed6580d70dcf69c2e0100c493  (32 bytes)
2021-10-12T18:02:41.160Z DRIVER   Dropping message with invalid payload
2021-10-12T18:02:41.160Z DRIVER « [Node 002] [REQ] [BridgeApplicationCommand]
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command
...
2021-10-12T18:02:41.437Z SERIAL « 0x011e00a8000102159f03fe00f4f797011499061ed6580d70dcf69c2e0100c493  (32 bytes)
2021-10-12T18:02:41.438Z DRIVER   Dropping message with invalid payload
2021-10-12T18:02:41.439Z DRIVER « [Node 002] [REQ] [BridgeApplicationCommand]
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command
2021-10-12T18:02:41.439Z SERIAL » [ACK]                                                                   (0x06)
2021-10-12T18:02:41.469Z SERIAL « 0x011e00a8000102159f03fe00f4f797011499061ed6580d70dcf69c2e0100c493  (32 bytes)
2021-10-12T18:02:41.471Z DRIVER   Dropping message with invalid payload
2021-10-12T18:02:41.471Z DRIVER « [Node 002] [REQ] [BridgeApplicationCommand]
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Duplicate command

Part of the reason is that the ACK from Node 66 takes very long

2021-10-12T18:02:37.387Z DRIVER » [Node 066] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ route:            0, 0, 0, 0
                                  │ callback id:      81
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 144
                                    └─[SupervisionCCReport]
                                        session id:          29
                                        more updates follow: false
                                        status:              Success
                                        duration:            0s
...
2021-10-12T18:02:37.400Z DRIVER « [RES] [SendDataBridge]
                                    was sent: true
... 4 seconds later ...
2021-10-12T18:02:41.237Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:     81
                                    transmit status: OK

IMO there are multiple things to be done:

  1. You need to figure out whats blocking the nodes so long - this will need a Zniffer to identify potential routing issues
  2. You should check the device configuration if it is possible to turn off these supervised Configuration CC Reports
  3. I need to make the driver recover better from the incorrect responses to nonce requests
  4. And I need to make more use of Supervision for SET commands, avoiding the verification polls.

AlCalzone avatar Oct 12 '21 21:10 AlCalzone

  1. I don’t have access to a zniffer, and am having trouble figuring out how to get one. Is it just a specific firmware flashed on a normal controller, or do you need specific hardware? Regardless, I’m fairly confident the delay is pretty consistent. The response to the get seems to come about 5 seconds after the controller sends it just about every time. I’ll do some more testing tomorrow, but swapping the order in which I send commands to the devices results in similar behavior. The key seems to be that I need to send a command to another device in that 5 second window, hence why adding a 10 second delay seems to clear everything up. I can hopefully generate logs for a few different scenarios tomorrow.

  2. If there is, it’s not documented. The lock will repeat sending the command forever until it gets a response. I’ll have to double check my old logs, but IIRC the delay between retransmissions was similar to the delay I’m seeing here.

3 & 4 I’d love to help here, but would probably be in over my head. I at least ca build custom zwavejs2mqtt containers off specific commits to test. Especially 4, I’m assuming that’s going to involve tracking which commands are in flight while waiting for supervision acknowledgments to avoid blocking the driver.

mattstocum avatar Oct 12 '21 23:10 mattstocum

Just looked at #3410. The retransmit time on the lock seems to be about 5 seconds while waiting for a supervision acknowledgment. Not sure if that’s a coincidence, or if that points to something else going on.

mattstocum avatar Oct 12 '21 23:10 mattstocum

I don’t have access to a zniffer, and am having trouble figuring out how to get one. Is it just a specific firmware flashed on a normal controller, or do you need specific hardware

You can use a 500-series stick like the UZB3 and use Silabs PC Programmer to flash the Zniffer firmware on there (this is not reversible though, so make sure you don't need the stick for something else). The firmware can be tricky to find and you'll need to download Simplicity Studio to get access to the software.

The response to the get seems to come about 5 seconds

I'm not talking about the response per se, I'm talking about the acknowledgement that the outgoing command was received. This cycle usually takes milliseconds like in the very first command sent in your log:

# driver tells stick to send command
2021-10-12T04:00:28.409Z DRIVER » [Node 058] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ route:            0, 0, 0, 0
                                  │ callback id:      134
                                  └─[MultilevelSwitchCCGet]

# stick tells driver that it sent the command
2021-10-12T04:00:28.418Z DRIVER « [RES] [SendDataBridge]
                                    was sent: true

# stick tells driver that the node acknowledged receiving the command
2021-10-12T04:00:28.524Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:     134
                                    transmit status: OK

Only after this cycle is done, the driver may send another message. The fact that this takes over 4 seconds either means there is a connectivity problem with the node(s) or a routing issue or a firmware bug on the node(s).

AlCalzone avatar Oct 13 '21 08:10 AlCalzone

I have an Aeotec Z-Stick Gen5 that I don't particularly need if you think that would work. The US version of the UZB3 seems a bit difficult to find right now. I think I have all of the required software and firmware to do this.

mattstocum avatar Oct 13 '21 14:10 mattstocum

I'm not certain if it will work, but a bit of googling should be able to tell you.

AlCalzone avatar Oct 14 '21 13:10 AlCalzone

I decided to buy a cheap stick that google claims will work as a zniffer, rather than sacrificing the Aeotec, so I won't be able to test this more until next week.

mattstocum avatar Oct 14 '21 16:10 mattstocum

I think I actually managed to get the stick flashed with the zniffer firmware. Is there anything specific I should be looking for, or should I just have it running while attempting to recreate the issue and then upload the trace here?

mattstocum avatar Oct 18 '21 13:10 mattstocum

just have it running while attempting to recreate the issue

AlCalzone avatar Oct 18 '21 14:10 AlCalzone

errors.zip lotsoferrors.txt

I'm honestly not sure what to make of these first two.

These two seem like they might be a bit more helpful: lockroutederrors.txt

lockroutederrors.zip

I really wasn't able to perfectly recreate the issue though.

mattstocum avatar Oct 19 '21 00:10 mattstocum

Ok, first things first - the file in errors.zip already tells a good story...

This is how communication over a hop should look like: grafik

Node 1 wants to send a command to 56 via node 4. Line 1: 1 -> 4 Line 2: 4 forwards to 56 Line 3: 56 sends 4 an acknowledgement Line 4: 4 knows 56 has received the command, sends 1 an acknowledgement.

Now the response (56 -> 1 via 4) already has problems: grafik Line 6: 56 -> 4 Line 7-9: 4 tries 3 times to reach 1, doesn't get an ACK Line 10: 4 tells 56 that it couldn't reach 1 Line 11: 56 tells 4 "ok, got it"

Then 56 tries again a couple of times via different routes, each with retransmission attempts, until it finally tries the direct connection: grafik Line 75: 56 -> 1 Line 76: 1 to 56 "got it" Line 77: 56 -> 1 Line 78: 1 to 56 "got it"

I'm pretty sure that some of those messages did reach the controller, since you're having these duplicated commands all over your log.

At the end you can see that the missing ACKs also cause retransmission of the S2 commands. Does not get retransmitted: grafik

Does get retransmitted due to missing ACK: grafik


I'm a little worried by the large amount of Routed Errors where the controller is involved (and the huge amount of CRC_ERRORs for the Zniffer, which might indicate that the Zniffer itself has interference problems). Maybe your controller placement is suboptimal?

Z-Wave sticks in particular are prone to interference by USB ports, especially by USB3 ports. We recommend putting the stick in a suitable location:

  • on an USB extension cord (this works wonders!)
  • away from other USB ports
  • away from metallic surfaces
  • and especially not in the back of a server rack

Once you've done that, heal the network as the routes may differ significantly afte such a change. The Zniffer should ideally be next to the controller (or next to a problematic node to see what it sees).

AlCalzone avatar Oct 19 '21 06:10 AlCalzone

Okay, I moved the controller a bit, and healed the two locks a few times. I had it on an extension cable already, but I think my HVAC ducts might have been causing some interference. The routed errors are gone now, but I'm still getting the decode errors if I change config settings on both locks quickly:

aftermove.log aftermove.zip

One of the locks also got into the state where it keeps sending the ConfigurationCCReport endlessly, but that really feels like it has to be a device issue.

mattstocum avatar Oct 20 '21 01:10 mattstocum

The Zniffer logs after moving look much better. I believe the decode errors are due to collisions where both the controller and the node try to send a command at nearly the same time, each invalidating the other's nonces.

I'm not sure why the command after agreeing on a nonce is detected as duplicate. Can you send me your S2 keys via e-mail so I can decrypt the Zniffer logs and figure it out?

Anyways this should get better when we use Supervision in more situations, removing the need for a followup verification GET.

One of the locks also got into the state where it keeps sending the ConfigurationCCReport endlessly, but that really feels like it has to be a device issue.

I agree. The commands use separate S2 sequence numbers, so they aren't retransmissions from the mesh. It seems that it really wants to know that the controller received the update, so it uses Supervision:

2021-10-20T00:59:10.153Z DRIVER « [Node 002] [REQ] [BridgeApplicationCommand]
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 71
                                    └─[SupervisionCCGet]
                                      │ session id:      26
                                      │ request updates: false
                                      └─[ConfigurationCCReport]
                                          parameter #: 4
                                          value size:  1
                                          value:       255

And the controller does respond:

2021-10-20T00:59:10.164Z DRIVER » [Node 002] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ route:            0, 0, 0, 0
                                  │ callback id:      207
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 213
                                    └─[SupervisionCCReport]
                                        session id:          26
                                        more updates follow: false
                                        status:              Success
                                        duration:            0s
...
2021-10-20T00:59:10.174Z DRIVER « [RES] [SendDataBridge]
                                    was sent: true

and the command is received by the node:

2021-10-20T00:59:10.194Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:     207
                                    transmit status: OK

and it seems that it decodes correctly, or we'd get a NonceGet in return. But for some reason that doesn't stop the node from retrying.

AlCalzone avatar Oct 21 '21 09:10 AlCalzone

Thanks, just sent the email, unfortunately, line 63 is the first line I'm able to decrypt using the key. Before that it's complaining that the trace doesn't include the Nonce frame, or, starting at line 55, that the decryption key is incorrect (presumably due to it using a different Nonce).

I only seem to be seeing the issue with the nonces when talking to two different devices using S2 encryption in quick succession. I left Zniffer running all day yesterday and didn't see a single NonceGet from anything until I tried my test script that updates both locks.

Command to Node 002:

2021-10-20T00:57:23.572Z DRIVER » [Node 002] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ route:            0, 0, 0, 0
                                  │ callback id:      160
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 185
                                    └─[DoorLockCCOperationSet]
                                        target mode: Secured

Next command out from the stick is to Node 066:

2021-10-20T00:57:23.613Z DRIVER » [Node 066] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ route:            0, 0, 0, 0
                                  │ callback id:      161
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 118
                                    └─[ConfigurationCCGet]
                                        parameter #: 4

Now Node 002 is sending something, but the driver can't decode it:

2021-10-20T00:57:23.661Z DRIVER « [Node 002] [REQ] [BridgeApplicationCommand]
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Security2CC_CannotDecode

This almost feels to me like the nonces are getting confused somehow. Almost as if the driver is trying to use the nonce for node 066 to decode the message from 002. I'm trying to step through the logs the Zniffer output but am mostly just confusing myself.

mattstocum avatar Oct 21 '21 14:10 mattstocum

The nonces are managed per node, so confusing them shouldn't be possible. In fact I think it is just a coincidence (or caused by the network traffic) that this happens when you control two nodes at once.

If you look at the Zniffer logs, you can see that the Zniffer cannot decode some of the commands either. Example:

2021-10-20T00:57:56.068Z DRIVER » [Node 066] [REQ] [SendDataBridge]
                                  │ source node id:   1
                                  │ transmit options: 0x25
                                  │ route:            0, 0, 0, 0
                                  │ callback id:      176
                                  └─[Security2CCMessageEncapsulation]
                                    │ sequence number: 128
                                    └─[ConfigurationCCGet]
                                        parameter #: 4
2021-10-20T00:57:56.075Z SERIAL « [ACK]                                                                   (0x06)
2021-10-20T00:57:56.076Z SERIAL « 0x010401a90152                                                       (6 bytes)
2021-10-20T00:57:56.076Z SERIAL » [ACK]                                                                   (0x06)
2021-10-20T00:57:56.077Z DRIVER « [RES] [SendDataBridge]
                                    was sent: true
2021-10-20T00:57:56.102Z SERIAL « 0x011d00a9b000000200bd7f7f7f7f01010300000000020100007f7f7f7f7f3b    (31 bytes)
2021-10-20T00:57:56.103Z SERIAL » [ACK]                                                                   (0x06)
2021-10-20T00:57:56.103Z DRIVER « [REQ] [SendDataBridge]
                                    callback id:     176
                                    transmit status: OK
2021-10-20T00:57:56.137Z SERIAL « 0x011e00a8000142159f0311009c1c629ea1eb3473bdb21b6449175076de00bd88  (32 bytes)
2021-10-20T00:57:56.139Z SERIAL » [ACK]                                                                   (0x06)
2021-10-20T00:57:56.140Z DRIVER « [Node 066] [REQ] [BridgeApplicationCommand]
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Security2CC_CannotDecode
2021-10-20T00:57:56.140Z CNTRLR » [Node 066] Message authentication failed, cannot decode command. Requesting a 
                                  nonce...

These are lines 141..143 in the Zniffer log (line 143 is opened): grafik

And I think I know what's happening... At 2021-10-20T00:57:51.023Z in the driver log, a SupervisionCCGet is received with a S2 sequence number of 16 (0x10). This is line 126 in the Zniffer log. Line 137 (and repeated in 140) is another SupervisionCCGet, this time it has sequence number 17 (0x11), which appears to be sent just before the driver sends the ConfigurationCCGet in the above log. However the controller never received it until this:

2021-10-20T00:57:56.140Z DRIVER « [Node 066] [REQ] [BridgeApplicationCommand]
                                  └─[Security2CCMessageEncapsulation] [INVALID]
                                      error: Security2CC_CannotDecode

The Zniffer logs reveal that this again has sequence number 0x11, so this is the "lost" command. Because it is received after an outgoing command but was created before it, its nonce is now outdated, so the driver can no longer decode it (and it shouldn't).


So in short: These nodes repeatedly request verification from the controller that their reports were received, but they don't stop when they receive that verification. Meanwhile the driver sends a verification GET to make sure the SET had an effect. This is compounded by the fact that the SupervisionCCGets from the nodes seem to be re-transmitted, but only reach the controller after it has sent its command, causing a collision in S2 nonce state. And this is further compounded by the nodes not properly responding to the driver's Security2CCNonceReport. They should resend their last command with an SPAN extension, but just reply with their own Security2CCNonceReport.


Or even shorter: Really buggy device firmware fights with not yet battle-hardened driver. If I had a dollar for every time someone complained about these locks not working properly...

AlCalzone avatar Oct 21 '21 15:10 AlCalzone

This issue has not seen any recent activity and was marked as "stale 💤". Closing for housekeeping purposes... 🧹

Feel free to reopen if the issue persists.

zwave-js-assistant[bot] avatar Nov 08 '21 00:11 zwave-js-assistant[bot]

FYI, part of this will be fixed in the v10 release. We now briefly remember the nonce used for the last outgoing S2 encapsulated message and attempt decryption with that first if certain criteria are met. As a result you should no longer (or much less frequently) see the described collisions and the resulting re-synchronization weirdness. I think this situation (both parties sending at the same time) were never considered in the specs.

AlCalzone avatar Aug 16 '22 09:08 AlCalzone

Recent updates should have improved the reliability of S2 communication. I'm going to close this for now. Feel free to open a new issue with updated logs if this issue still persists.

AlCalzone avatar Mar 14 '23 20:03 AlCalzone