Matter Server: All devices offline all of a sudden

Open 3oris opened this issue 1 year ago • 16 comments

The problem

After about 5 days of operation, all Matter devices become unavailable. The devices are still online in the other (Google Home) fabric, though.

The devices are still pingable from the device info page, and pinging a device brings that specific device back online.

Doing this manually is not feasible, though, with over 90 Matter devices in the system.
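
For installations with this many nodes, the per-device ping could in principle be scripted against the Matter Server's WebSocket API instead of clicking through each device page. The sketch below only builds the JSON request messages and does not open a connection. The `ping_node` command name, the `node_id` argument, and the message envelope (`message_id`, `command`, `args`) are assumptions based on the python-matter-server API and should be checked against the version in use; `build_ping_requests` is a hypothetical helper.

```python
import json
from itertools import count


def build_ping_requests(node_ids, command="ping_node"):
    """Return one JSON request string per node ID.

    The command name and envelope fields are assumptions; verify them
    against the Matter Server version you are running.
    """
    message_ids = count(1)
    return [
        json.dumps(
            {
                "message_id": str(next(message_ids)),
                "command": command,
                "args": {"node_id": node_id},
            }
        )
        for node_id in node_ids
    ]


# Example: build requests for three hypothetical node IDs.
payloads = build_ping_requests([1, 2, 3])
for payload in payloads:
    print(payload)
```

Each string would then be sent as one text frame over the server's WebSocket endpoint.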

Matter devices

  • eve sensors and plugs (thread)
  • nanoleaf lights (thread)
  • onvis s4 plugs (thread)
  • innovation matters (wifi)
  • ledvance lights (wifi)
  • wiz lights (wifi)

Border routers

  • 1 OTBR hosted on RPI 3
  • 5 Nest Hubs 2nd gen (updated to F20)

What version of Home Assistant Core has the issue?

core-2024.9.1

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Matter

Link to integration documentation on our website

No response

Diagnostics information

core_matter_server_2024-09-17T15-42-59.844Z.log
matter-c921cb8346a353e6865401775d822fe4-Essentials GU10-80fecbd596935ee1f84171a5c0aac88b.json

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

3oris avatar Sep 17 '24 15:09 3oris

Restarting the matter server fixes it (after some time).

3oris avatar Sep 17 '24 16:09 3oris

Same problem for me. Using Home Assistant OS and core-2024.9.2

tornenen avatar Sep 17 '24 18:09 tornenen

Verify your network settings in Home Assistant. Mine had changed to something completely different. Setting a static address solved the issue.

deveylder avatar Sep 17 '24 18:09 deveylder

Verify your network settings in Home Assistant. Mine had changed to something completely different. Setting a static address solved the issue.

No, still the same for me.

tornenen avatar Sep 17 '24 18:09 tornenen

@3oris (and others) when the devices go unavailable, does reloading the integration help? Settings -> Devices & services -> Matter -> Three dot menu -> Reload.

What Home Assistant OS and Matter Server add-on version are you using?

agners avatar Sep 18 '24 10:09 agners

@3oris (and others) when the devices go unavailable, does reloading the integration help? Settings -> Devices & services -> Matter -> Three dot menu -> Reload.

@agners : will check as soon as it happens again (probably tomorrow or Saturday). Restarting the add-on does help, at least.

What Home Assistant OS and Matter Server add-on version are you using?

  • HAOS: 13.1
  • Supervisor: 2024.09.1
  • Home Assistant Core: 2024.9.1 (last time it happened, now .2)
  • Matter Server: 6.5.1

3oris avatar Sep 19 '24 11:09 3oris

@agners -- Also, I was wondering if it might be a regression in 6.5.1 https://github.com/home-assistant-libs/python-matter-server/pull/882 , but you probably will know anyways.

3oris avatar Sep 19 '24 11:09 3oris

@agners -- Also, I was wondering if it might be a regression in 6.5.1 home-assistant-libs/python-matter-server#882 , but you probably will know anyways.

You would have a SEVERE issue with mdns if that cleanup is causing your nodes now to be offline.

What is the state of the nodes within the Matter Server's own UI ?

marcelveldt avatar Sep 19 '24 19:09 marcelveldt

Hello, I think I have a similar issue with the 6.5.1 matter server. I don't see any nodes in the Web UI.

[Screenshots: "Bildschirmfoto 2024-09-19 um 22 02 45", "Bildschirmfoto 2024-09-19 um 22 02 29"]

ThomasKoppensteiner avatar Sep 19 '24 20:09 ThomasKoppensteiner

Hello, I think I have a similar issue with the 6.5.1 matter server. I don't see any nodes in the Web UI.

Well, that is another issue. Maybe you (accidentally) reinstalled the whole Matter integration? You need to restore a backup to get your nodes back as the data is stored in the matter addon data.

marcelveldt avatar Sep 19 '24 21:09 marcelveldt

@3oris (and others) when the devices go unavailable, does reloading the integration help? Settings -> Devices & services -> Matter -> Three dot menu -> Reload.

@agners -- So, it happened again. I restarted the integration, and the devices came back very, very slowly. Only a few minutes after they were all back, they all disappeared again, and the Matter server was once again in the state of https://github.com/home-assistant/core/issues/124647 which I hadn't seen since the upgrade to 6.5.0b2.

Before I restarted the Matter server I took the logs: matter-server.log

3oris avatar Sep 20 '24 09:09 3oris

I guess my problem just went away. After having this issue three times and restarting the Matter server each time, it has now been running for 2 days without problems.

tornenen avatar Sep 20 '24 09:09 tornenen

@agners -- Also, I was wondering if it might be a regression in 6.5.1 home-assistant-libs/python-matter-server#882 , but you probably will know anyways.

You would have a SEVERE issue with mdns if that cleanup is causing your nodes now to be offline.

What is the state of the nodes within the Matter Server's own UI ?

@marcelveldt -- Will tell next time it happens.

3oris avatar Sep 20 '24 09:09 3oris

You need to restore a backup to get your nodes back as the data is stored in the matter addon data.

@marcelveldt yes, I reinstalled the matter integration, but why does a reinstall not create a new node? Isn't this an issue?

If so should I create a new github issue?

ThomasKoppensteiner avatar Sep 20 '24 16:09 ThomasKoppensteiner

Resetting my Home Assistant VM to a previous state fixed the problem for me. Now I see the nodes again. Running version 6.4.1 now.

ThomasKoppensteiner avatar Sep 21 '24 08:09 ThomasKoppensteiner

@marcelveldt yes, I reinstalled the matter integration, but why does a reinstall not create a new node? Isn't this an issue?

If you reinstall the Matter integration, all data gets reset. So you basically destroyed your Matter network by uninstalling Matter from HA.

marcelveldt avatar Sep 23 '24 07:09 marcelveldt

Resetting my Home Assistant VM to a previous state fixed the problem for me. Now I see the nodes again. Running version 6.4.1 now.

If you do a regular update, the nodes should not get lost. Can you try updating the add-on (again)? Worst case you should be able to restore 6.4.1.

That said, while the outcome of your issue is similar to the original poster, I don't think you suffer the same problem: In your case the store on the Matter Server lost all devices. If this happens with the second update attempt again, can you open a separate issue for this? This would be some type of add-on update issue :thinking:

agners avatar Sep 23 '24 15:09 agners

@agners -- So, it happened again. I restarted the integration, and the devices came back very, very slowly. Only a few minutes after they were all back, they all disappeared again, and the Matter server was once again in the state of #124647 which I hadn't seen since the upgrade to 6.5.0b2.

Hm, that sounds like your whole system is completely overwhelmed somehow. I guess the Matter Server doesn't respond in time for the Core, so the Core gives up communicating. I wonder if the Matter Server gets itself into a state where things just go awry.

There are some messages I haven't seen before. This one sounds as if the message got corrupted :thinking:

2024-09-20 05:56:18.928 (Dummy-2) CHIP_ERROR [chip.native.EM] Dropping unexpected message of type 0x5 with protocolId (0, 1) and MessageCounter:141254017 on exchange 44431i with Node: <00000000000000E2, 1>

From what I can tell you run this on a Raspberry Pi 3? :thinking: Maybe this is just a bit too much for it to handle :cry:
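
Matter Server logs saved from a terminal often carry ANSI color codes around each field (sequences such as `[32m` and `[0m`, sometimes with the escape character lost or replaced during copy and paste), which makes lines like the one above hard to read or grep. A minimal sketch for stripping them, assuming only color (SGR) sequences are present; `strip_ansi` is a hypothetical helper name:

```python
import re

# Matches color sequences like ESC[32m or ESC[1;30m. The leading escape
# character is optional because it is often dropped, or replaced by
# U+FFFD, when logs are pasted into a web form.
ANSI_SGR = re.compile(r"[\x1b\ufffd]?\[[0-9;]+m")


def strip_ansi(line: str) -> str:
    """Remove ANSI color sequences from a single log line."""
    return ANSI_SGR.sub("", line)


sample = (
    "[32m2024-09-20 05:56:18.928[0m (Dummy-2) [1;30mCHIP_ERROR[0m "
    "[34m[chip.native.EM][0m [31mDropping unexpected message of type 0x5[0m"
)
print(strip_ansi(sample))
```

Logger names in square brackets such as `[chip.native.EM]` are left alone because the pattern requires digits or semicolons followed by `m` right after the bracket.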

agners avatar Sep 23 '24 15:09 agners

Resetting my Home Assistant VM to a previous state fixed the problem for me. Now I see the nodes again. Running version 6.4.1 now.

If you do a regular update, the nodes should not get lost. Can you try updating the add-on (again)? Worst case you should be able to restore 6.4.1.

That said, while the outcome of your issue is similar to the original poster, I don't think you suffer the same problem: In your case the store on the Matter Server lost all devices. If this happens with the second update attempt again, can you open a separate issue for this? This would be some type of add-on update issue 🤔

He removed the Matter integration (to reinstall) but that also removed the matter add-on with its configuration. So that is what got his nodes lost. It reminds me that we should probably add a confirmation to HA when trying to remove Matter, Z-Wave or Zigbee that this may lead to loss of data without a backup.

marcelveldt avatar Sep 23 '24 20:09 marcelveldt

@agners -- Also, I was wondering if it might be a regression in 6.5.1 home-assistant-libs/python-matter-server#882 , but you probably will know anyways.

You would have a SEVERE issue with mdns if that cleanup is causing your nodes now to be offline. What is the state of the nodes within the Matter Server's own UI ?

@marcelveldt -- Will tell next time it happens.

@marcelveldt -- they just all show offline in the Matter server add-on UI

3oris avatar Sep 25 '24 16:09 3oris

@agners -- So, it happened again. I restarted the integration, and the devices came back very, very slowly. Only a few minutes after they were all back, they all disappeared again, and the Matter server was once again in the state of #124647 which I hadn't seen since the upgrade to 6.5.0b2.

Hm, that sounds like your whole system is completely overwhelmed somehow. I guess the Matter Server doesn't respond in time for the Core, so the Core gives up communicating. I wonder if the Matter Server gets itself into a state where things just go awry.

There are some messages I haven't seen before. This one sounds as if the message got corrupted 🤔

2024-09-20 05:56:18.928 (Dummy-2) CHIP_ERROR [chip.native.EM] Dropping unexpected message of type 0x5 with protocolId (0, 1) and MessageCounter:141254017 on exchange 44431i with Node: <00000000000000E2, 1>

From what I can tell you run this on a Raspberry Pi 3? 🤔 Maybe this is just a bit too much for it to handle 😢

@agners -- No, this is Home Assistant running on an HA Green. What I run on the RPi 3 is the OTBR, which I run isolated from HA and compile myself in order to have some observability into the Thread network via the CLI: channel monitor, TREL connectivity, child node distribution, link quality and so on. This also let me choose a Thread channel with practically no Wi-Fi interference (as far as I can tell). Also, it makes no difference to the Matter fabric if I take the OTBR or any one of the Nest Hubs out of the Thread network. (I cannot take two or more TBRs out of the network, though, because then total coverage is too low and the Thread network gets overloaded.)

The points I am trying to make here:

  • The matter server should be good in terms of resources (HA Green, with reasonable CPU usage of the matter server)
  • The thread network should also be good by and large.

3oris avatar Sep 25 '24 17:09 3oris

If you do a regular update, the nodes should not get lost. Can you try updating the add-on (again)? Worst case you should be able to restore 6.4.1.

That said, while the outcome of your issue is similar to the original poster, I don't think you suffer the same problem: In your case the store on the Matter Server lost all devices. If this happens with the second update attempt again, can you open a separate issue for this? This would be some type of add-on update issue 🤔

Hey, I did another upgrade to 6.5.1 and this time it works as expected. The old nodes were visible right after the update and were also available soon afterwards. Additionally, I was able to add new Matter devices as well (this was also not working before).

My issue is fixed. Thank you for the support.

ThomasKoppensteiner avatar Sep 27 '24 11:09 ThomasKoppensteiner

I have had the same issue with my Eve Matter devices (motion, door, energy) since the exact moment I updated my iPhone to iOS 18 and my HomePod to the latest version. Matter Server is also 6.5.1, I have no pending updates on anything in HA, and HA is also on the latest version. My Eve devices work in the Eve app and in the Home app. I also cannot re-add them; for some reason it keeps failing. Here are my logs:

2024-09-27 18:45:23.435 (MainThread) WARNING [matter_server.server.device_controller] <Node:2> Setup for node failed: Unable to establish CASE session with Node 2
2024-09-27 18:45:23.435 (MainThread) INFO [matter_server.server.device_controller] <Node:2> Retrying node setup in 60 seconds...
2024-09-27 18:45:27.963 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964488 on exchange 28264i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:45:34.630 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:45:37.635 (MainThread) INFO [matter_server.server.sdk] <Node:3> Attempting to establish CASE session... (attempt 2 of 2)
2024-09-27 18:46:19.261 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964489 on exchange 28265i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:46:23.438 (MainThread) INFO [matter_server.server.device_controller] <Node:2> Setting-up node...
2024-09-27 18:46:26.609 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:46:26.611 (MainThread) WARNING [matter_server.server.device_controller] <Node:3> Setup for node failed: Unable to establish CASE session with Node 3
2024-09-27 18:46:26.611 (MainThread) INFO [matter_server.server.device_controller] <Node:3> Retrying node setup in 60 seconds...
2024-09-27 18:47:04.691 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964490 on exchange 28266i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:47:12.163 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:47:26.613 (MainThread) INFO [matter_server.server.device_controller] <Node:3> Setting-up node...
2024-09-27 18:47:54.767 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964491 on exchange 28267i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:48:00.684 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:48:03.689 (MainThread) INFO [matter_server.server.sdk] <Node:2> Attempting to establish CASE session... (attempt 2 of 2)
2024-09-27 18:48:07.901 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964492 on exchange 28268i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:48:15.340 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:48:45.746 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964493 on exchange 28269i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:48:52.418 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:48:56.840 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964494 on exchange 28270i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:49:03.867 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:49:06.872 (MainThread) INFO [matter_server.server.sdk] <Node:3> Attempting to establish CASE session... (attempt 2 of 2)
2024-09-27 18:49:32.548 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964495 on exchange 28271i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:49:40.942 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:49:40.944 (MainThread) WARNING [matter_server.server.device_controller] <Node:2> Setup for node failed: Unable to establish CASE session with Node 2
2024-09-27 18:49:40.945 (MainThread) INFO [matter_server.server.device_controller] <Node:2> Retrying node setup in 60 seconds...
2024-09-27 18:49:47.108 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964496 on exchange 28272i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:49:55.714 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:49:55.716 (MainThread) WARNING [matter_server.server.device_controller] <Node:3> Setup for node failed: Unable to establish CASE session with Node 3
2024-09-27 18:49:55.716 (MainThread) INFO [matter_server.server.device_controller] <Node:3> Retrying node setup in 60 seconds...
2024-09-27 18:50:40.947 (MainThread) INFO [matter_server.server.device_controller] <Node:2> Setting-up node...
2024-09-27 18:50:55.719 (MainThread) INFO [matter_server.server.device_controller] <Node:3> Setting-up node...
2024-09-27 18:51:21.878 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964497 on exchange 28273i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:51:29.677 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:51:38.724 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964498 on exchange 28274i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:51:44.442 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:52:12.310 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964499 on exchange 28275i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:52:18.203 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:52:21.207 (MainThread) INFO [matter_server.server.sdk] <Node:2> Attempting to establish CASE session... (attempt 2 of 2)
2024-09-27 18:52:26.424 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964500 on exchange 28276i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:52:32.970 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:52:35.976 (MainThread) INFO [matter_server.server.sdk] <Node:3> Attempting to establish CASE session... (attempt 2 of 2)
2024-09-27 18:53:01.039 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964501 on exchange 28277i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:53:09.931 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:53:17.511 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964502 on exchange 28278i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:53:24.817 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:53:24.819 (MainThread) WARNING [matter_server.server.device_controller] <Node:3> Setup for node failed: Unable to establish CASE session with Node 3
2024-09-27 18:53:24.820 (MainThread) WARNING [matter_server.server.device_controller] <Node:3> Node setup not completed after 30 minutes, giving up.
2024-09-27 18:53:50.917 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964503 on exchange 28279i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:53:58.447 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:53:58.449 (MainThread) WARNING [matter_server.server.device_controller] <Node:2> Setup for node failed: Unable to establish CASE session with Node 2
2024-09-27 18:53:58.450 (MainThread) INFO [matter_server.server.device_controller] <Node:2> Retrying node setup in 60 seconds...
2024-09-27 18:54:58.457 (MainThread) INFO [matter_server.server.device_controller] <Node:2> Setting-up node...
2024-09-27 18:55:42.408 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964504 on exchange 28280i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:55:47.179 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:56:30.409 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964505 on exchange 28281i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:56:35.708 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:56:38.713 (MainThread) INFO [matter_server.server.sdk] <Node:2> Attempting to establish CASE session... (attempt 2 of 2)
2024-09-27 18:57:22.536 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964506 on exchange 28282i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:57:27.442 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:58:09.472 (Dummy-2) CHIP_ERROR [chip.native.EM] Failed to Send CHIP MessageCounter:217964507 on exchange 28283i with Node: <0000000000000000, 0> sendCount: 4 max retries: 4
2024-09-27 18:58:15.960 (Dummy-2) CHIP_ERROR [chip.native.SC] CASESession timed out while waiting for a response from the peer. Current state was 1
2024-09-27 18:58:15.962 (MainThread) WARNING [matter_server.server.device_controller] <Node:2> Setup for node failed: Unable to establish CASE session with Node 2
2024-09-27 18:58:15.962 (MainThread) WARNING [matter_server.server.device_controller] <Node:2> Node setup not completed after 30 minutes, giving up.
s6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service matter-server: stopping
2024-09-27 19:15:30.842 (MainThread) WARNING [aiorun] Stopping the loop
2024-09-27 19:15:30.842 (MainThread) INFO [aiorun] Entering shutdown phase.
2024-09-27 19:15:30.842 (MainThread) INFO [aiorun] Executing provided shutdown_callback.
2024-09-27 19:15:30.842 (MainThread) INFO [matter_server.server.server] Stopping the Matter Server...
2024-09-27 19:15:30.843 (MainThread) INFO [matter_server.server.client_handler] [139977044284496] Connection closed by client
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
2024-09-27 19:15:30.848 (MainThread) INFO [matter_server.server.stack] Shutting down the Matter stack...
2024-09-27 19:15:30.848 (MainThread) CHIP_ERROR [chip.native.CTL] Shutting down the stack...
2024-09-27 19:15:30.850 (MainThread) CHIP_ERROR [chip.native.DIS] Failed to advertise records: src/inet/UDPEndPointImplSockets.cpp:416: OS Error 0x02000065: Network is unreachable
2024-09-27 19:15:30.853 (MainThread) CHIP_ERROR [chip.native.DIS] Failed to advertise records: src/lib/dnssd/minimal_mdns/Server.cpp:344: CHIP Error 0x00000046: No endpoint was available to send the message
2024-09-27 19:15:30.854 (MainThread) CHIP_ERROR [chip.native.DL] Inet Layer shutdown
2024-09-27 19:15:30.854 (MainThread) CHIP_ERROR [chip.native.DL] BLE shutdown
2024-09-27 19:15:30.854 (MainThread) CHIP_ERROR [chip.native.DL] System Layer shutdown
2024-09-27 19:15:30.855 (MainThread) INFO [aiorun] Waiting for executor shutdown.
2024-09-27 19:15:30.855 (MainThread) INFO [aiorun] Shutting down async generators
2024-09-27 19:15:30.855 (MainThread) INFO [aiorun] Closing the loop.
2024-09-27 19:15:30.855 (MainThread) INFO [aiorun] Leaving. Bye!
[16:15:31] INFO: matter-server service exited with code 0 (by signal 0).
s6-rc: info: service matter-server successfully stopped
s6-rc: info: service banner: stopping
s6-rc: info: service banner successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service banner: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
s6-rc: info: service legacy-cont-init successfully started 
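
A log this repetitive is easier to read as a per-node summary. The sketch below counts the "Setup for node failed" warnings per node and records which nodes the server eventually gave up on. The regular expressions are written against the message wording visible in the log above and are an assumption about other Matter Server versions; `summarize` is a hypothetical helper.

```python
import re
from collections import Counter

# Patterns follow the device_controller messages shown in the log above.
FAILED = re.compile(r"<Node:(\d+)> Setup for node failed")
GAVE_UP = re.compile(r"<Node:(\d+)> Node setup not completed")


def summarize(lines):
    """Return (failure count per node ID, set of node IDs given up on)."""
    failures = Counter()
    gave_up = set()
    for line in lines:
        if match := FAILED.search(line):
            failures[int(match.group(1))] += 1
        elif match := GAVE_UP.search(line):
            gave_up.add(int(match.group(1)))
    return failures, gave_up


sample = [
    "2024-09-27 18:53:24.819 (MainThread) WARNING [matter_server.server.device_controller] <Node:3> Setup for node failed: Unable to establish CASE session with Node 3",
    "2024-09-27 18:53:24.820 (MainThread) WARNING [matter_server.server.device_controller] <Node:3> Node setup not completed after 30 minutes, giving up.",
    "2024-09-27 18:53:58.449 (MainThread) WARNING [matter_server.server.device_controller] <Node:2> Setup for node failed: Unable to establish CASE session with Node 2",
]
failures, gave_up = summarize(sample)
print(dict(failures), gave_up)
```

To run it over a saved log file, pass the open file object instead of `sample`, since iterating a file yields its lines.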

Sporbillis avatar Sep 27 '24 16:09 Sporbillis

I've had this happen, but I concluded that the issue wasn't HA (or at least that the issue also involved other equipment). I found that to bring devices back online, I needed to reboot my Google Wifi Pro 6e WiFi routers (which also include my OTBRs).

Also, I have both Nest OTBRs and 3 Apple TV OTBRs and have found that if I leave the Nest enabled (and unplug the Apple TVs), all seems OK and stable, but if I add more than 1 Apple OTBR, it can cause instability.

I'm thinking there may be something going on when you have a mix of OTBRs from different vendors. In my case, it seems to happen particularly when Apple OTBRs and Nest OTBRs try to join into a single Thread network. But as long as Apple and Google Nest maintain separate Thread networks, it's more stable. None of this really makes much sense, but it points to issues that may be beyond HA. Also, the entire setup destabilizes if I use Matter 1.0 devices (hello Eve!).

jvmahon avatar Sep 30 '24 15:09 jvmahon

I've had this happen, but I concluded that the issue wasn't HA (or at least that the issue also involved other equipment). I found that to bring devices back online, I needed to reboot my Google Wifi Pro 6e WiFi routers (which also include my OTBRs).

Also, I have both Nest OTBRs and 3 Apple TV OTBRs and have found that if I leave the Nest enabled (and unplug the Apple TVs), all seems OK and stable, but if I add more than 1 Apple OTBR, it can cause instability.

I'm thinking there may be something going on when you have a mix of OTBRs from different vendors. In my case, it seems to happen particularly when Apple OTBRs and Nest OTBRs try to join into a single Thread network. But as long as Apple and Google Nest maintain separate Thread networks, it's more stable. None of this really makes much sense, but it points to issues that may be beyond HA. Also, the entire setup destabilizes if I use Matter 1.0 devices (hello Eve!).

Maybe your case is different because, as I mentioned, everything was fine for 1 year until I upgraded to HomePod OS 18 and iOS 18. The devices work in all my other apps except Home Assistant. I am also not able to re-add them anymore; it keeps failing.

Sporbillis avatar Sep 30 '24 17:09 Sporbillis

@agners @marcelveldt -- an update:

I have been running on 6.5.2b0 with way less trouble over the last 1.5 weeks. I also see there is a 6.5.2 release but I don't seem to receive it.

Anyhow, with 6.5.2b0 only a few devices at a time go offline in the HA fabric while still being pingable from the device info page. So, not all devices any more. These devices are then also reported as unavailable in the Matter add-on UI, and as before I can ping them back online into the HA Matter fabric.

Also, but this is guessing now, the devices that go offline in chunks seem to be connected to the same TBR (Nest Hub G2 F20) at that time, which is a bit contradictory to the fact that they are pingable. Continuing to keep an eye on it...

3oris avatar Oct 02 '24 19:10 3oris

Let's try to prevent duplicate issues. We're tracking the availability issue in this report: https://github.com/home-assistant/core/issues/123835

In general, using multiple border routers is simply broken atm. Use just one and it will be stable. There is a lot of back-and-forth about whether this is an Apple issue, a general Thread issue, a TREL issue or a combination. In any case, the issue is not unique to HA.

marcelveldt avatar Oct 29 '24 15:10 marcelveldt

This is not an Apple issue, it's all Nest Hubs and one OTBR (on a dedicated RPi3b).

Lowering the amount of BRs is not an option, since the node count is already 135 and one Nest Hub BR is only able to handle about 20 nodes max (be it due to hardware capacity or thread channel congestion). So 7 BRs seems to be a reasonable amount of BRs.
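
The count of 7 follows from rounding up the node count divided by the per-router capacity. A quick check, where `border_routers_needed` is a hypothetical helper and the capacity of about 20 nodes per border router is the estimate stated above, not a measured limit:

```python
import math


def border_routers_needed(node_count: int, nodes_per_router: int) -> int:
    """Smallest number of border routers that covers all nodes."""
    return math.ceil(node_count / nodes_per_router)


print(border_routers_needed(135, 20))  # 135 / 20 = 6.75, rounded up to 7
```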

With the recent update to Fuchsia 20.1, things really started to become more stable. I cannot tell why, though. Also, TREL actually seems to work well with Nest Hubs. E.g. it makes a huge difference if I disable TREL in the OTBR.

So, in general, I would not follow your statement that using multiple BRs is broken.

But I am also fine with closing this issue here since I feel that the same issue is now popping up every other week, and I see that you guys are actually on the topic.

3oris avatar Oct 30 '24 17:10 3oris

I had the same issue! I restored my Home Assistant VM with the backup from before the last update, and all Eve / Matter devices are back! So there must be something wrong with the latest update!

Puller avatar Nov 12 '24 16:11 Puller