core icon indicating copy to clipboard operation
core copied to clipboard

Updating Eve Motion (Matter) Sensor to 3.2.1 fails

Open evakq8r opened this issue 1 year ago • 27 comments

The problem

Updating a Thread device fails with: "Target node did not process the update file"

Almost identical to https://github.com/home-assistant/core/issues/123086 that @agners looked into, however this time for the Eve Motion Sensors instead. I am unable to update from a new firmware available on the DCL.

What version of Home Assistant Core has the issue?

core-2024.8.1

What was the last working version of Home Assistant Core?

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Matter

Link to integration documentation on our website

https://www.home-assistant.io/integrations/matter/

Diagnostics information

matter-627ecc72d3e7436d2a8cfbc91b185267-Eve Motion 20EBY9901-602231a5f35f746243fc2ebdded8ab72.json

Example YAML snippet

-

Anything in the logs that might be useful for us?

2024-08-11 16:38:03.072 (MainThread) INFO [matter_server.server.device_controller] <Node:179> Update to software version 6650
2024-08-11 16:38:03.072 (MainThread) DEBUG [matter_server.server.device_controller] <Node:179> Check for updates.
2024-08-11 16:38:03.785 (MainThread) INFO [matter_server.server.device_controller] <Node:179> New software update found: 3.2.1 on UpdateSource.MAIN_NET_DCL (current 3.2.0).
2024-08-11 16:38:03.785 (MainThread) INFO [matter_server.server.device_controller] <Node:179> Downloading update from 'https://eve-updates.evehome.com/matter/eve_motion-3.2.1-matter-benc-enc-CF2853C8-8EA6-478C-87FA-3A5EAB5CFF30.bin'
2024-08-11 16:38:03.785 (MainThread) DEBUG [matter_server.server.ota.provider] Download update from 'https://eve-updates.evehome.com/matter/eve_motion-3.2.1-matter-benc-enc-CF2853C8-8EA6-478C-87FA-3A5EAB5CFF30.bin'.
2024-08-11 16:38:04.282 (MainThread) INFO [matter_server.server.ota.provider] Update file 'eve_motion-3.2.1-matter-benc-enc-CF2853C8-8EA6-478C-87FA-3A5EAB5CFF30.bin' downloaded to '/config/updates/179'
2024-08-11 16:38:04.283 (MainThread) INFO [matter_server.server.device_controller] <Node:179> Starting update using OTA Provider.
2024-08-11 16:38:04.283 (MainThread) INFO [matter_server.server.ota.provider] Starting OTA Provider
2024-08-11 16:38:04.283 (MainThread) INFO [matter_server.server.ota.provider] Commission and initialize OTA Provider
2024-08-11 16:38:04.399 (Dummy-2) INFO [chip.ChipDeviceCtrl] Established secure session with Device
2024-08-11 16:38:04.764 (Dummy-2) DEBUG [chip.storage] SetSdkKey: f/1/s/00000000000F1BE3 = b'\x150\x03\x10\xb8\x11\x8eC\xe6SQIH&\x01\xe5r\xd6\xf4\xdb0\x04 \x9b\xdf)Yj\x07M\xdf \xc1\xde\xa3v9\xd8\x98\xb45\xf9\x00\x84\xadc+R\x0c_\xa4\rB\xffz0\x05\x0c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x18'
2024-08-11 16:38:04.764 (Dummy-2) DEBUG [chip.storage] SetSdkKey: g/s/uBGOQ+ZTUUlIJgHlctb02w== = b'\x15$\x01\x01&\x02\xe3\x1b\x0f\x00\x18'
2024-08-11 16:38:04.764 (Dummy-2) DEBUG [chip.storage] SetSdkKey: g/sri = b'\x16\x15$\x01\x01$\x02(\x18\x15$\x01\x01$\x02*\x18\x15$\x01\x01$\x02\x9a\x18\x15$\x01\x01$\x02\x9f\x18\x15$\x01\x01$\x02\xa7\x18\x15$\x01\x01$\x028\x18\x15$\x01\x01$\x02%\x18\x15$\x01\x01$\x02\xa6\x18\x15$\x01\x01$\x02A\x18\x15$\x01\x01$\x02<\x18\x15$\x01\x01$\x02\x8f\x18\x15$\x01\x01$\x02\x90\x18\x15$\x01\x01$\x02\xa5\x18\x15$\x01\x01$\x02\x80\x18\x15$\x01\x01$\x02\x8e\x18\x15$\x01\x01$\x02\xa0\x18\x15$\x01\x01$\x02\x81\x18\x15$\x01\x01$\x02S\x18\x15$\x01\x01$\x02N\x18\x15$\x01\x01$\x02\x83\x18\x15$\x01\x01$\x02\x8b\x18\x15$\x01\x01$\x02\x82\x18\x15$\x01\x01$\x02D\x18\x15$\x01\x01$\x02O\x18\x15$\x01\x01$\x02\x9b\x18\x15$\x01\x01$\x02\x84\x18\x15$\x01\x01$\x02\x8d\x18\x15$\x01\x01$\x02\xa8\x18\x15$\x01\x01$\x02\xa9\x18\x15$\x01\x01$\x02\x85\x18\x15$\x01\x01$\x02\xaa\x18\x15$\x01\x01$\x02\xab\x18\x15$\x01\x01$\x02\xad\x18\x15$\x01\x01$\x02\xae\x18\x15$\x01\x01$\x02\xaf\x18\x15$\x01\x01$\x02\xb0\x18\x15$\x01\x01$\x02\xb1\x18\x15$\x01\x01&\x02\xe1\x1b\x0f\x00\x18\x15$\x01\x01$\x02\xb2\x18\x15$\x01\x01$\x02\xb3\x18\x15$\x01\x01&\x02\xe3\x1b\x0f\x00\x18\x18'
2024-08-11 16:38:04.767 (Dummy-2) INFO [chip.ChipDeviceCtrl] Commissioning complete
2024-08-11 16:38:04.767 (MainThread) INFO [matter_server.server.ota.provider] OTA Provider App commissioned with node id 990179.
2024-08-11 16:38:04.904 (MainThread) DEBUG [matter_server.server.device_controller.mdns] Commissionable Matter node disappeared: AsyncServiceInfo(type='_matterc._udp.local.', name='9A186D80FED0BB57._matterc._udp.local.', addresses=[], port=None, weight=0, priority=0, server=None, properties={}, interface_index=None)
2024-08-11 16:38:05.782 (MainThread) DEBUG [matter_server.server.device_controller.mdns] Commissionable Matter node disappeared: AsyncServiceInfo(type='_matterc._udp.local.', name='87DAADCA4D0ED212._matterc._udp.local.', addresses=[], port=None, weight=0, priority=0, server=None, properties={}, interface_index=None)
2024-08-11 16:38:07.085 (MainThread) INFO [matter_server.server.ota.provider] Waiting for target node update state change
2024-08-11 16:38:07.247 (MainThread) INFO [matter_server.server.ota.provider] Update state changed from <UpdateStateEnum.kIdle: 1> to <UpdateStateEnum.kQuerying: 2>
2024-08-11 16:38:52.345 (MainThread) INFO [matter_server.server.ota.provider] Update state changed from <UpdateStateEnum.kQuerying: 2> to <UpdateStateEnum.kIdle: 1>
2024-08-11 16:38:52.345 (MainThread) INFO [matter_server.server.ota.provider] Cleaning up OTA provider
2024-08-11 16:38:52.345 (MainThread) INFO [matter_server.server.ota.provider] Terminating OTA Provider
2024-08-11 16:38:52.351 (MainThread) ERROR [matter_server.server.client_handler] [140534249185552] Error while handling: update_node (node 179): Target node did not process the update file
2024-08-11 16:39:02.353 (MainThread) DEBUG [matter_server.server.device_controller] <Node:179> Check for updates.
2024-08-11 16:39:03.039 (MainThread) INFO [matter_server.server.device_controller] <Node:179> New software update found: 3.2.1 on UpdateSource.MAIN_NET_DCL (current 3.2.0).

OTA Provider log: https://dpaste.org/thECy

Additional information

The previous report was on Matter Server 6.3.0, however am now on 6.4.1. The Aqara DCL update was dismissed as being invalid, I would really like to hope this is not the case here as I have just installed an additional 12 Eve Motion (Matter) sensors all throwing the 3.2.1 update prompt, and all failing for this issue.

Other Info:-

Matter/Thread Controllers:

  • 5x Google Hub Gen 2s

Network Setup:-

  • Router: OpnSense
  • Core Switch: TL-SG3428XMP
  • Other Switches: 1x TL-SG2008P and 3x TL-SG2008 (Standalone Mode)
  • APs: 2x EAP670s (Standalone Mode)
  • MDNS Configuration: Following https://www.derekseaman.com/2023/10/configuring-tp-link-igmp-mld-multicast-snooping.html as a source.

evakq8r avatar Aug 11 '24 07:08 evakq8r

Hey there @home-assistant/matter, mind taking a look at this issue as it has been labeled with an integration (matter) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of matter can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign matter Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


matter documentation matter source (message by IssueLinks)

home-assistant[bot] avatar Aug 11 '24 07:08 home-assistant[bot]

Hm, this confirms what I suspected in https://github.com/home-assistant/core/issues/123086#issuecomment-2268656315: There is an underlying problem in your setup which prevents updates. Since this time it's really another vendor (Eve) gives an important datapoint: This is definitely infrastructure related somehow, it really seems a Thread Border Router issue.

I actually intended to try out a Thread device update with a 2nd Gen Google Hub only on my end, but I don't have a updateable Thread device currently :cry:

At this point it really points to the 2nd Gen Google Hub, since a test device on your WiFi seems to receive the mDNS announcements. The next step would be to really look into the Google Hub to see if it processes/forwards the mDNS announcements correctly. I haven't debugged Google Hub Thread border router functionality so far. I'll try to figure out how we can gain more insight here, stay tuned.

agners avatar Aug 12 '24 07:08 agners

Not sure if this is helpful.

I did get my Eve motion and Eve Energy devices to all update. I have 5 Energy and 3 Motion. They updated from 3.2.0 to 3.2.1. My thread border routers are a Google Nest Hub 2, Google Nest Hub Max and a Home Assistant SkyConnect running Thread Only firmware and the Eve devices are the only Thread devices present. I will say it did take 3 or 4 attempts to get the Motion's (battery operated) devices to work and I had to wait about 5 mins from one finishing before starting the next one. I have no more updateable devices to test anything more though. That waiting 5 mins after one finishing before trying the next was a key for my setup.

Not sure if setup is important, but as I said mine is a Google Nest Hub 2 in the Bedroom, a Nest Hub Max in the Kitchen, the Home Assistant Yellow with the SkyConnect in the utility room connected to Ubuquiti Unifi switches and Unifi U6-Pro AP's . My main gateway/router is a Unifi USG-Pro4

austwhite avatar Aug 12 '24 08:08 austwhite

Hmm, I do have 4x OTBR devices I could technically try to use with the update (Sonoff ZBDongle-E flashed with OTBR only, 2x GL-iNET S20s and 1x GL-iNet S200). The problem I've found though is if I introduce any OTBR devices into the network, they cause a lot of instability. Removing them puts stability back into the network, but it takes several hours/days.

For what it's worth, I have 46 Matter over Thread devices in my network, broken down by:

  • 2x Aqara P2 Door Sensors
  • 5x Eve Energy Power Plugs (with another 12 on the way)
  • 12x Eve Motion Sensors (with another 4 on the way)
  • 18x Nanoleaf NL67 A19 Lights
  • 4x Nanoleaf NL54 GU10 Downlights
  • 1x Nanoleaf NL65 3.5" Downlight
  • 4x Nanoleaf NL68 Lightstrips

The Nanoleafs lights are (for the most part) stable but they do drop out once or twice a week. They do self-recover though (the newer publicly available firmware for these lights has been a significant improvement as I know Nanoleaf were/still are a point of contention for HA and Matter).

TL;DR: I can reinstate the OTBR devices for a short time if it is deemed necessary to further this issues development, however my MoT network is somewhat sizeable so I'm unsure if that is also a contributing factor.

evakq8r avatar Aug 12 '24 09:08 evakq8r

@agners Something I've just noticed on another update attempt of one of my motion sensors:

2024-08-12 20:39:46.941 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:149973922 on exchange 51884i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-08-12 20:39:49.785 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-08-12 20:39:49.787 (Dummy-2) CHIP_ERROR [chip.native.CTL] Session establishment failed for <00000000000F1BE4, 1>, error: src/app/OperationalSessionSetup.cpp:257: CHIP Error 0x00000032: Timeout.  Next retry expected to get a response to Sigma1 or fail within 10 seconds
2024-08-12 20:39:55.732 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:149973923 on exchange 51885i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-08-12 20:39:58.829 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-08-12 20:39:58.832 (Dummy-2) CHIP_ERROR [chip.native.CTL] Session establishment failed for <00000000000F1BE4, 1>, error: src/app/OperationalSessionSetup.cpp:257: CHIP Error 0x00000032: Timeout.  Next retry expected to get a response to Sigma1 or fail within 10 seconds
2024-08-12 20:40:04.962 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:149973924 on exchange 51888i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-08-12 20:40:07.873 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-08-12 20:40:07.876 (Dummy-2) CHIP_ERROR [chip.native.CTL] Session establishment failed for <00000000000F1BE4, 1>, error: src/app/OperationalSessionSetup.cpp:257: CHIP Error 0x00000032: Timeout.  Next retry expected to get a response to Sigma1 or fail within 10 seconds

These entries above directly followed after the OTA provider established a session with the sensor and only after attempting to update firmware, though I haven't seen them on every update attempt.

Full log: https://dpaste.org/6EM1k

evakq8r avatar Aug 12 '24 11:08 evakq8r

@agners as referenced here https://github.com/home-assistant/core/issues/123401#issuecomment-2283338235 this is my thread border configuration: image I am using a Skyconnect with thread-only firmware and an Apple 4kTV.

All devices are in one VLAN - my router is a Unifi Dreammachine Pro, and Unifi Switches. MDNS and IPv6 is disabled on the unifi side - enhanced multicast is enabled.

Here are the logs of the matter addon, during the update process:

2024-08-12 14:23:09.794 (MainThread) INFO [matter_server.server.device_controller] <Node:28> New software update found: 3.2.1 on UpdateSource.MAIN_NET_DCL (current 3.2.0).
2024-08-12 14:23:11.420 (MainThread) INFO [matter_server.server.device_controller] <Node:28> Update to software version 6650
2024-08-12 14:23:11.574 (MainThread) INFO [matter_server.server.device_controller] <Node:28> New software update found: 3.2.1 on UpdateSource.MAIN_NET_DCL (current 3.2.0).
2024-08-12 14:23:11.576 (MainThread) INFO [matter_server.server.device_controller] <Node:28> Downloading update from 'https://eve-updates.evehome.com/matter/eve_energy_eu-3.2.1-matter-benc-enc-4E71C921-3218-4E56-9928-527DA7A3BE88.bin'
2024-08-12 14:23:11.721 (MainThread) INFO [matter_server.server.ota.provider] Update file 'eve_energy_eu-3.2.1-matter-benc-enc-4E71C921-3218-4E56-9928-527DA7A3BE88.bin' downloaded to '/config/updates/28'
2024-08-12 14:23:11.722 (MainThread) INFO [matter_server.server.device_controller] <Node:28> Starting update using OTA Provider.
2024-08-12 14:23:11.722 (MainThread) INFO [matter_server.server.ota.provider] Starting OTA Provider
2024-08-12 14:23:11.724 (MainThread) INFO [matter_server.server.ota.provider] Commission and initialize OTA Provider
2024-08-12 14:23:11.913 (Dummy-2) INFO [chip.ChipDeviceCtrl] Established secure session with Device
2024-08-12 14:23:12.341 (Dummy-2) INFO [chip.ChipDeviceCtrl] Commissioning complete
2024-08-12 14:23:12.341 (MainThread) INFO [matter_server.server.ota.provider] OTA Provider App commissioned with node id 990028.
2024-08-12 14:23:12.619 (MainThread) INFO [matter_server.server.ota.provider] Waiting for target node update state change
2024-08-12 14:23:12.846 (MainThread) INFO [matter_server.server.ota.provider] Update state changed from <UpdateStateEnum.kIdle: 1> to <UpdateStateEnum.kQuerying: 2>
2024-08-12 14:23:57.908 (MainThread) INFO [matter_server.server.ota.provider] Update state changed from <UpdateStateEnum.kQuerying: 2> to <UpdateStateEnum.kIdle: 1>
2024-08-12 14:23:57.908 (MainThread) INFO [matter_server.server.ota.provider] Cleaning up OTA provider
2024-08-12 14:23:57.908 (MainThread) INFO [matter_server.server.ota.provider] Terminating OTA Provider
2024-08-12 14:23:57.924 (MainThread) ERROR [matter_server.server.client_handler] [139764281573840] Error while handling: update_node (node 28): Target node did not process the update file

Beside that, I am getting a lot of these error messages: 2024-08-12 14:22:35.116 (Dummy-2) CHIP_ERROR [chip.native.DIS] DNSSD packet parsing failed (for SRV records) 2024-08-12 14:22:35.116 (Dummy-2) CHIP_ERROR [chip.native.DIS] DNSSD packet parsing failed (for non-srv records)

pietro4711 avatar Aug 12 '24 12:08 pietro4711

@pietro4711 yeah the same symptom, the device changes from kQuerying to kIdle after exactly 25s, likely a timeout in finding the OTA requestor node.

Can you shutdown your Apple TV and try again? I did see similar issues with Apple BRs in the past, the OTBR worked for me. Make sure the device stays in range of your mesh.

agners avatar Aug 12 '24 13:08 agners

These entries above directly followed after the OTA provider established a session with the sensor and only after attempting to update firmware, though I haven't seen them on every update attempt.

This is most likely because we reuse the same node ID for the OTA provider. It seems that we have some lingering CASE sessions from previous update attempts :thinking: I don't think those are really interfering, but ideally we should clean those up before a new attempt.

I guess you don't see those when you restart the Matter Server and only run a single attempt (per device)? :thinking:

agners avatar Aug 12 '24 13:08 agners

Yep, correct - only one update attempt per device.

evakq8r avatar Aug 12 '24 13:08 evakq8r

Yep, correct - only one update attempt per device.

Well, my theory is, at least for Node ID 180 (resp. the corresponding OTA requestor 990180) there was a previous attempt. This previous attempt is the reason for the errors. This could be days ago (in case you haven't restarted the Matter Server).

The same error is not present for Node ID 187, so for this one it seems to be the first attempt.

To be sure, can you check if this happens after a Matter Server restart, so you can be sure there was only a single attempt.

agners avatar Aug 12 '24 13:08 agners

Should the update folders within addon_configs be deleted first? Or not a factor here?

evakq8r avatar Aug 12 '24 13:08 evakq8r

Can you shutdown your Apple TV and try again? I did see similar issues with Apple BRs in the past, the OTBR worked for me. Make sure the device stays in range of your mesh.

I reboot the Apple TV so many times - I also relocated the plug - update always failed - any other clue?

pietro4711 avatar Aug 12 '24 17:08 pietro4711

I reboot the Apple TV so many times - I also relocated the plug - update always failed - any other clue?

I didn't meant to reboot, but to disconnect/remove it entirely. Make the OTBR the only BR. That is a setup I've definitely had working on my end.

agners avatar Aug 12 '24 17:08 agners

Should the update folders within addon_configs be deleted first? Or not a factor here?

Shouldn't matter as the relevant files (storage file) is timestamp when the for each update start.

agners avatar Aug 12 '24 17:08 agners

Should the update folders within addon_configs be deleted first? Or not a factor here?

Shouldn't matter as the relevant files (storage file) is timestamp when the for each update start.

Have restarted the Matter Server and tried to update 3 of the Eve Motion Sensors, and they all failed without the quoted entry from here: https://github.com/home-assistant/core/issues/123568#issuecomment-2283728778

evakq8r avatar Aug 13 '24 11:08 evakq8r

@agners did some more testing today. Without any real direction, I decided to try and update multiple devices at once and run into these error:

2024-08-14 22:01:53.621 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-08-14 22:01:53.623 (Dummy-2) CHIP_ERROR [chip.native.CTL] Session establishment failed for <00000000000F1BE4, 1>, error: src/app/OperationalSessionSetup.cpp:257: CHIP Error 0x00000032: Timeout.  Next retry expected to get a response to Sigma1 or fail within 10 seconds
2024-08-14 22:01:53.623 (Dummy-2) CHIP_ERROR [chip.native.SC] Received error (protocol code 1) during pairing process: src/protocols/secure_channel/CASESession.cpp:1984: CHIP Error 0x000000C9: No shared trusted root
2024-08-14 22:01:53.623 (Dummy-2) CHIP_ERROR [chip.native.CTL] Error on commissioning step 'kFindOperationalForStayActive': 'src/protocols/secure_channel/CASESession.cpp:1984: CHIP Error 0x000000C9: No shared trusted root'
2024-08-14 22:01:53.624 (Dummy-2) WARNING [chip.ChipDeviceCtrl] Failed to commission: src/protocols/secure_channel/CASESession.cpp:1984: CHIP Error 0x000000C9: No shared trusted root
2024-08-14 22:01:53.624 (MainThread) ERROR [root] Failed setting up OTA Provider.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/ota/provider.py", line 96, in _commission_ota_provider
    commissioned_node_id = await chip_device_controller.commission_on_network(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/sdk.py", line 150, in commission_on_network
    await self._chip_controller.CommissionOnNetwork(
  File "/usr/local/lib/python3.11/site-packages/chip/ChipDeviceCtrl.py", line 1872, in CommissionOnNetwork
    return await asyncio.futures.wrap_future(ctx.future)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
chip.exceptions.ChipStackError: src/protocols/secure_channel/CASESession.cpp:1984: CHIP Error 0x000000C9: No shared trusted root
2024-08-14 22:01:53.626 (MainThread) INFO [matter_server.server.ota.provider] Cleaning up OTA provider
2024-08-14 22:01:53.626 (MainThread) INFO [matter_server.server.ota.provider] Terminating OTA Provider
2024-08-14 22:01:53.629 (Dummy-2) CHIP_ERROR [chip.native.CTL] Found unconnected device, removing
2024-08-14 22:01:53.629 (Dummy-2) CHIP_ERROR [chip.native.CTL] Failed to extend fail-safe for CASE retry: src/app/CommandSender.cpp:328: CHIP Error 0x00000032: Timeout
2024-08-14 22:01:53.629 (Dummy-2) CHIP_ERROR [chip.native.CTL] Failed to extend fail-safe for CASE retry: src/app/CommandSender.cpp:328: CHIP Error 0x00000032: Timeout
2024-08-14 22:01:53.633 (MainThread) ERROR [matter_server.server.client_handler] [139719429632080] Error while handling: update_node (node 180): Error while setting up OTA Provider.
2024-08-14 22:01:59.854 (Dummy-2) CHIP_ERROR [chip.native.SC] PASESession timed out while waiting for a response from the peer. Expected message type was 33
2024-08-14 22:01:59.854 (Dummy-2) CHIP_ERROR [chip.native.ZCL] Secure Pairing Failed
2024-08-14 22:01:59.854 (Dummy-2) WARNING [chip.ChipDeviceCtrl] Failed to establish secure session to device: src/controller/python/ChipDeviceController-ScriptDevicePairingDelegate.cpp:89: CHIP Error 0x00000003: Incorrect state
2024-08-14 22:01:59.854 (MainThread) ERROR [root] Failed setting up OTA Provider.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/ota/provider.py", line 96, in _commission_ota_provider
    commissioned_node_id = await chip_device_controller.commission_on_network(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/sdk.py", line 150, in commission_on_network
    await self._chip_controller.CommissionOnNetwork(
  File "/usr/local/lib/python3.11/site-packages/chip/ChipDeviceCtrl.py", line 1872, in CommissionOnNetwork
    return await asyncio.futures.wrap_future(ctx.future)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
chip.exceptions.ChipStackError: src/controller/python/ChipDeviceController-ScriptDevicePairingDelegate.cpp:89: CHIP Error 0x00000003: Incorrect state
2024-08-14 22:01:59.855 (MainThread) INFO [matter_server.server.ota.provider] Cleaning up OTA provider
2024-08-14 22:01:59.855 (MainThread) INFO [matter_server.server.ota.provider] Terminating OTA Provider
2024-08-14 22:01:59.864 (MainThread) ERROR [matter_server.server.client_handler] [139719429632080] Error while handling: update_node (node 184): Error while setting up OTA Provider.

I'm not sure whether it's because I introduced a race condition when attempting to update a bunch at once, or if this is expected behaviour when someone decides to spam updates.

Full Matter log (after a fresh restart): https://dpaste.org/EPK4i

evakq8r avatar Aug 14 '24 12:08 evakq8r

You will probably never be able to update more than one device on the same border router at the same time stably, just because thread bandwidth is very limited. I am barely able to toggle more than 8 thread lights in a stable fashion at the same time. (That is also why thread group is working on reducing SRP messages sizes on the thread mesh, for instance...)

So what the matter integration could do is queue-up the accepted updates (possibly skipping failing updates and possibly maybe also managing grace periods between updates).

3oris avatar Aug 15 '24 08:08 3oris

One other observation I made while trying to update eve energys is an error message like "cannot update node because it is already processing an update".

My assumption here is as follows:

I use a nest thread network, and android phone to commission my devices to home assistant. The result is that devices are always being commissioned to the home assistant and google home matter fabric (even though the devices don't show up in the Google Home App). At least Home Assistant will always show both fabrics in device info > manage matter fabric.

Now, could it be that there is some interference between both fabrics?

3oris avatar Aug 15 '24 08:08 3oris

I reboot the Apple TV so many times - I also relocated the plug - update always failed - any other clue?

I didn't meant to reboot, but to disconnect/remove it entirely. Make the OTBR the only BR. That is a setup I've definitely had working on my end.

Ok, the trick was: shutdown Apple TV - wait for 15min till the plug reconnects to the skyconnect. And now the update is working! Thanks for the help.

pietro4711 avatar Aug 15 '24 10:08 pietro4711

I reboot the Apple TV so many times - I also relocated the plug - update always failed - any other clue?

I didn't meant to reboot, but to disconnect/remove it entirely. Make the OTBR the only BR. That is a setup I've definitely had working on my end.

Ok, the trick was: shutdown Apple TV - wait for 15min till the plug reconnects to the skyconnect. And now the update is working! Thanks for the help.

Interesting. Two notes here:

  • I am pretty sure that I updated my eve motions via a nest hub.
  • For plugs it would probably be a work-around to move them near the otbr since they are movable. For the rest of my 80+ thread devices (mostly GU10s) probably not. I also don't want to think of what happens when these devices all try to connect to the otbr while the nest hubs being turned off... 😀

3oris avatar Aug 15 '24 11:08 3oris

So, ultimately the eve energy plug did the update last night o_O

Since I removed it from the google home fabric the day before I actually wonder why the update happened automatically then?

Ntl, its on 3.2.1 now...

(another one that I had commissioned to the google fabric only did the update as well last night btw...)

3oris avatar Aug 16 '24 09:08 3oris

Ok, the trick was: shutdown Apple TV - wait for 15min till the plug reconnects to the skyconnect. And now the update is working! Thanks for the help.

Thanks for confirming! That is what I've observed too. It seems that the Apple TV BR somehow does not forward the mDNS announcement from our OTA provider or causes some other networking issue.

agners avatar Aug 16 '24 12:08 agners

  • I am pretty sure that I updated my eve motions via a nest hub.

Technically, any Thread BR should work. I've Google Nest Hubs in my network and I was successfully updating devices, so I think the Google Nest Hubs are fine. There are reports though from folks using Google Nest Hubs and having issues with the HA update. :cry:

  • For plugs it would probably be a work-around to move them near the otbr since they are movable. For the rest of my 80+ thread devices (mostly GU10s) probably not. I also don't want to think of what happens when these devices all try to connect to the otbr while the nest hubs being turned off... 😀

In theory, it is a mesh, and traffic gets routed through Thread routers (your GU10 are acting as Thread routers since they are mains powered), technically, a single exit point should be sufficient. That is how Z-Wave/Zigbee operates... :man_shrugging: But yeah, with Thread and having multiple BRs (+TREL) can lower the traffic going through the mesh via RF by quite a bit.

So, ultimately the eve energy plug did the update last night o_O

So it worked through Home Assistant update :thinking:

agners avatar Aug 16 '24 12:08 agners

  • I am pretty sure that I updated my eve motions via a nest hub.

Technically, any Thread BR should work. I've Google Nest Hubs in my network and I was successfully updating devices, so I think the Google Nest Hubs are fine. There are reports though from folks using Google Nest Hubs and having issues with the HA update. 😢

  • For plugs it would probably be a work-around to move them near the otbr since they are movable. For the rest of my 80+ thread devices (mostly GU10s) probably not. I also don't want to think of what happens when these devices all try to connect to the otbr while the nest hubs being turned off... 😀

In theory, it is a mesh, and traffic gets routed through Thread routers (your GU10 are acting as Thread routers since they are mains powered), technically, a single exit point should be sufficient. That is how Z-Wave/Zigbee operates... 🤷‍♂️ But yeah, with Thread and having multiple BRs (+TREL) can lower the traffic going through the mesh via RF by quite a bit.

You're absolutely right. I guess I wanted to state something more like: Removing 5/6th of my border routers will cause recalculation of RLOCs of quite a chunk of the network devices (that alone can flood the network quite a bit) yielding changes in topology with changes in link quality and changed usable bandwidth in upstream paths of leaf nodes, nodes popping out to search for new routers, new RLOCs again, and so on. What I meant to say is, that after turning off all TBRs but the OTBR one I probably should wait a bit for devices to settle down before thinking of using my thread network for a device update. :)

So, ultimately the eve energy plug did the update last night o_O

So it worked through Home Assistant update 🤔

Yes it worked eventually.

3oris avatar Aug 17 '24 12:08 3oris

@agners Just an update on this; unfortunately my VM for HAOS required a complete reinstall from a backup. Backup restored successfully with the exception of Matter.

Not a single Matter device was rediscovered by mDNS but the entities remained, which was further proven by factory resetting a motion sensor. The sensor added successfully, but took the same slot as the previous integration, just with _2 against the entity IDs. Is this expected behaviour? Even my 35 Zigbee devices reintegrated into the restored VM without any problems.

With that said, HA prompted a DCL update to the motion sensor and to my surprise, it worked!

Matter Log: https://dpaste.org/YY9XV OTA Log (it's 13k lines): ota_provider_20240826_171409.log

So it does at least prove the Google Hubs can do the updates (which others have said have been successful, eventually), I just don't understand why it only did them now. The only change that occurred was my VM needed to be restored from a backup, otherwise the hardware/software/network conditions are the same.

evakq8r avatar Aug 26 '24 17:08 evakq8r

Were the Matter devices online previously? What is crucial is that the Matter Server is restored too, since it has all the certificates for the devices. I definitely transferred setups before from one installation to another and things came back online...

Do you happen to use Docker on your VM host machine? I just recently discovered that it blocks IPv6 multicast on all bridges, which is problematic when using the network interface of your VM in bridged mode.

agners avatar Aug 26 '24 21:08 agners

Were the Matter devices online previously?

90% of them were. I shut down and restarted HA to try and bring the other devices that weren't working back into the fold, but that's where my problems began.

Do you happen to use Docker on your VM host machine?

I do. The host machine is Unraid, and HAOS runs in a VM using the qcow2 image.

The network setup isn't standard though (but has worked fine, with a few tweaks). I have a VLAN (40) configured in the Network Settings section of Unraid with a fixed IP to allow my VMs to obtain IP addresses from my router within VLAN. I then configure IPv6 RA (Router Advertisements) on the VLAN in Unraid using the following sysctl commands and have the same commands run at Array Startup (ie. when the hard drive array starts after a host reboot):

https://forums.unraid.net/topic/166559-unraid-with-matter-thread-and-unifi-ipv6-connectivity-issues/page/2/

sysctl -w net.ipv6.conf.eth0/40.accept_ra=2
sysctl -w net.ipv6.conf.eth0/40.accept_ra_rt_info_max_plen=64

I then verify the routes are visible within Terminal of HA with ip -6 r and confirm the IPv6 addresses visible on the VLAN 40 vNIC.

evakq8r avatar Aug 27 '24 01:08 evakq8r

Yeah IPv6 RIO processing is crucial to get the routes of Thread border router. This is also documented in the Python Matter server readme. Also note that NetworkManager does it's own IPv6 RIO processing, and unfortunately it doesn't do it that well :cry:

I do. The host machine is Unraid, and HAOS runs in a VM using the qcow2 image.

What version of Docker are you using? Afaik, the problem only starts wtih recent versions (seemingly newer than 27.0.3, see https://github.com/moby/moby/issues/48365).

Definitely check your firewall rules, try sysctl -w net.bridge.bridge-nf-call-ip6tables=0 to disable firewall processing for bridges (especially if you use a bridge to expose the network to your VM).

agners avatar Aug 27 '24 06:08 agners

Yeah IPv6 RIO processing is crucial to get the routes of Thread border router.

Ah, that's what they are called. Good to know.

What version of Docker are you using? Afaik, the problem only starts wtih recent versions (seemingly newer than 27.0.3, see moby/moby#48365).

Unraid version is 6.12.13, Docker version is 24.0.9:

image

Definitely check your firewall rules, try sysctl -w net.bridge.bridge-nf-call-ip6tables=0 to disable firewall processing for bridges (especially if you use a bridge to expose the network to your VM).

Made the change. Should there be anything expected in HA/Matter behaviour with this rule disabled?

evakq8r avatar Aug 27 '24 06:08 evakq8r

Made the change. Should there be anything expected in HA/Matter behaviour with this rule disabled?

In the end, it depends on the firewall rules present :sweat_smile:

I am a bit confused on what the problem is currently. From what I understand after restoring it seemed like a complete communication breakdown, which could be related to a bogous firewall rules/bridge configuration and probably a thousand other issues.

The most likely case that somehow you've borked the Matter Server data store (e.g. by not restoring it). But it is hard to tell at this point.

Given that the update of your Thread device worked finally, I don't think there is a Core/code issue here. This is a network connectivity issue, or intermittend network connection problems. This is the wrong place to debug such issues, especially since your network seems rather complicated etc. If you'd like to further debug/discuss the issue the community forum's Thread/Matter section is probably the better place.

agners avatar Aug 27 '24 06:08 agners