
All Thread HomeKit devices drop regularly and a restart of HA is required.

Open holblin opened this issue 5 months ago • 13 comments

The problem

My blinds, configured through Thread/HomeKit, become unavailable until I restart HA.

Here are some screenshots of my HA interface when the problem occurs: [Screenshot 2024-01-24 at 1 56 08 PM] [Screenshot 2024-01-24 at 1 56 49 PM]

I also noticed some HomeKit errors in the log: [Screenshot 2024-01-24 at 1 58 23 PM]

Note that when a device becomes unavailable, not all the blinds are shown as unavailable, but as soon as you visit their page/details, they become unavailable.

The only way I have found to temporarily fix the issue is to restart HA. After the restart, the devices work for some time, but they eventually fail again. I will report each failure so the frequency is known, but in my experience it happens pretty often (multiple times per month).

What version of Home Assistant Core has the issue?

core-2024.1.3

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

HomeKit Device

Link to integration documentation on our website

https://www.home-assistant.io/integrations/homekit_controller/

Diagnostics information

Here is the diagnostic of the device:

homekit_controller-514baf1422af40e0e6972880cbb662bc-Eve MotionBlinds 7448-d7332f9cc788103decbb2606a2bad9b9.json.txt

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Logger: aiohomekit.controller.coap.connection
Source: components/homekit_controller/connection.py:891
First occurred: January 16, 2024 at 8:37:12 AM (118 occurrences)
Last logged: 1:47:32 PM

Decryption failed, desynchronized? Counter=6053/6058
Failed flailing attempts to resynchronize, self-destructing in 3, 2, 1...
Decryption failed, desynchronized? Counter=1204/1209
Decryption failed, desynchronized? Counter=11/20
Decryption failed, desynchronized? Counter=11/28

holblin avatar Jan 24 '24 22:01 holblin

Hey there @jc2k, @bdraco, mind taking a look at this issue as it has been labeled with an integration (homekit_controller) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of homekit_controller can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign homekit_controller Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


homekit_controller documentation homekit_controller source (message by IssueLinks)

home-assistant[bot] avatar Jan 24 '24 22:01 home-assistant[bot]

Same issue, but restart doesn't help (for me), only reboot does. I only use BT devices so far, so it might not be Thread only.

jnnsrctr avatar Jan 29 '24 15:01 jnnsrctr

If you don't use thread it's not this issue, thanks.

Jc2k avatar Jan 29 '24 15:01 Jc2k

@holblin those errors indicate that packet loss is occurring. Obviously we can't stop packet loss from occurring (you need more router-capable devices, or the devices need to be closer to the border router, BR). Packet loss causes the encryption session to become invalid: each packet is encrypted with a key derived from the number of packets exchanged, so if we lose enough packets we can't derive the correct key. When that happens we have to start again from the beginning. Those numbers in the logs indicate when that packet loss has happened.
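As a rough illustration of the desynchronization described above, here is a toy sketch in Python. It is not aiohomekit's code: `KEY`, `Sender`, `Receiver`, `demo`, and the `window` parameter are all hypothetical names, and the sketch only authenticates packets with a counter-dependent HMAC tag rather than encrypting them. It shows why a receiver that expects counter N cannot accept a packet sealed with counter N+k unless it is willing to search ahead.

```python
import hashlib
import hmac

KEY = b"shared-session-key"  # hypothetical; a real session key comes from pairing


def _tag(key: bytes, counter: int, payload: bytes) -> bytes:
    """Toy per-packet tag that depends on the packet counter."""
    nonce = counter.to_bytes(8, "little")
    return hmac.new(key, nonce + payload, hashlib.sha256).digest()[:16]


class Sender:
    def __init__(self, key: bytes) -> None:
        self.key = key
        self.counter = 0

    def seal(self, payload: bytes) -> bytes:
        packet = payload + _tag(self.key, self.counter, payload)
        self.counter += 1  # advances even if the packet is later lost
        return packet


class Receiver:
    def __init__(self, key: bytes, window: int = 0) -> None:
        self.key = key
        self.counter = 0
        self.window = window  # how many counters ahead we are willing to try

    def open(self, packet: bytes) -> bytes:
        payload, tag = packet[:-16], packet[-16:]
        for offset in range(self.window + 1):
            candidate = self.counter + offset
            if hmac.compare_digest(tag, _tag(self.key, candidate, payload)):
                self.counter = candidate + 1  # resynchronized
                return payload
        raise ValueError(f"verification failed, desynchronized? counter={self.counter}")


def demo(lost: int, window: int):
    """Drop `lost` packets, then try to deliver one; return payload or None."""
    sender, receiver = Sender(KEY), Receiver(KEY, window)
    for _ in range(lost):
        sender.seal(b"dropped")  # sent but never delivered
    try:
        return receiver.open(sender.seal(b"position=40"))
    except ValueError:
        return None
```

In this sketch, `demo(3, 0)` fails because the receiver only tries its own counter, `demo(3, 5)` recovers by searching a few counters ahead, and `demo(9, 5)` fails again because more packets were lost than the window covers. At that point the only option is a fresh session.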

Off the top of my head, Thread is more reliant on streaming events to update the state in HA than the other HomeKit protocols, which is why it can sometimes be bad at switching to "unavailable".

Unfortunately the expert on the thread support is not around these days. And it is reverse engineered.

I can see problems like this with my own Nanoleaf device - about once a month (if that) here. Up until very recently the problem was (for myself and many other discord users I've helped) normally traceable to thread networking itself (bad routing, bad mesh topology, bad prosumer network gear).

Now that side is getting better, we might have better luck tracking it down.

What I need are full debug logs captured while it is happening (either enabled through the UI or, if you configure logging manually in the config file, make sure to turn on debugging for aiohomekit). Ideally, I also need to see the device's mDNS record when it's working, and again when it stops working.

Jc2k avatar Jan 29 '24 15:01 Jc2k

homekit_controller-514baf1422af40e0e6972880cbb662bc-Eve MotionBlinds 7448-d7332f9cc788103decbb2606a2bad9b9.json.txt

It stopped working last night. I had enabled debug logging earlier, so I will see how to send that file (I will edit my message and attach it). I have one HomePod and multiple blinds, some close by and some further away. Can a bad connection to the ones further away make everything drop, even the ones close by?

holblin avatar Jan 30 '24 06:01 holblin

That's the diagnostic download, not the log file.

Jc2k avatar Jan 30 '24 08:01 Jc2k

Same here. HomeKit becomes unavailable until I restart HA.

transalpia avatar Feb 01 '24 06:02 transalpia

@transalpia please read my messages earlier in the ticket and if possible provide the information requested. We can't move this issue forward without debug logs.

Please provide as much detail about your environment as possible. For example, not using HAOS will likely cause frequent outages that look like this problem. Some brands of switches can interfere with multicast. Some brands of BR are also just less good. The number and placement of BRs relative to the device could be a factor. And of course, we know some vendors just have devices that crash a lot. It could even be a Matter device from a different vendor to the ones you are struggling with: if a mains-powered Thread device crashes, any connected devices will be affected.

All of these factors are in addition to the problem in the original post, but they need to be ruled out to make sure any data gathered is valid for this issue.

Jc2k avatar Feb 01 '24 10:02 Jc2k

@Jc2k this morning (the last time it happened) I activated "debug logging". The issue appears approximately once a week; this was the 4th time. All iCloud components become unavailable. After restarting HA, everything is OK again.

transalpia avatar Feb 01 '24 11:02 transalpia

Hi @Jc2k, sorry for the delay, but I wanted to make sure I cleaned the log and removed a few auth tokens before sending the file :-) And the file is BIG! I will clear my logs and restart HA so the file is smaller the next time the error occurs.

https://drive.google.com/file/d/1sEXtP79m83zu1lIIkhcZkjA4AhipGQr-/view?usp=sharing

For my setup, I run HA on an Intel machine using HA OS: https://www.amazon.com/gp/product/B09H5961YN/

For now, my only Thread devices are my blinds from Eve. They are rechargeable (USB-C) and the battery lasts a long time, probably > 6 months.

The Thread gateway is my HomePod Mini. The HomePod Mini is connected to my network via a UniFi AP which is very close, and there is another AP not far away in case the first one drops.

I currently have 7 blinds configured on my system: 2 very close (same room), 2 across one wall, 2 across 2 walls and 1 on the other side of the house (range limit). I planned to buy more HomePod Minis to get better coverage and to connect more of my blinds, but I am holding off on purchasing more until it's working properly.

Would an approximate map of the house, with the blind positions, HomePod Mini and Wi-Fi APs, be helpful?

Also, how can I capture the mDNS record easily? (You probably want to see it from HA?)

holblin avatar Feb 02 '24 23:02 holblin

@Jc2k: My setup is HA on a Raspberry Pi 4, which has been running for 6 months. The devices affected are 6 x Eve Energy, 5 x Eve Thermo, 1 x Eve Weather and 2 x Eve Door and Window. All are Thread-capable. No others via HomeKit! Everything is connected via Thread to a HomePod Mini. HACS is installed, and the Wi-Fi connection is via UniFi. There have been no changes in arrangement or configuration. The error first occurred about 3 weeks ago: all HomeKit entities were no longer available, but could be reactivated via an HA restart. That first time I didn't think anything of it. The second time was about 10 days ago. The last time it happened was 2 days ago. Everything is fine at the moment. I currently have the debug log enabled, and the next time the error occurs I will post it here.

By the way: After disconnecting the network or the HomePod, all devices are usually automatically reconnected.

transalpia avatar Feb 03 '24 07:02 transalpia

Are you able to try the beta? Otherwise, try the February release when it's out on Wednesday. I found a case where packet loss can induce an irrecoverable connection.

As above, packet loss causes encryption-related counters on both sides to get out of sync. Some devices stop responding when that happens; some send a CoAP error. We were handling the CoAP error and resetting the encryption state. We were not doing the same when the device started ignoring us.

Note that if the change in 2024.2.0 helps, that means you are experiencing either crashing devices or packet loss. The fix just helps with recovery from that; your devices are still having issues.
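The recovery idea described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual aiohomekit change: `ToySession`, `SilentDevice`, `SessionError`, and the timeout value are invented for the example. The point is only that a timeout (the device ignoring us) is treated the same way as an explicit error reply: both trigger a reset of the session state before retrying.

```python
import asyncio


class SessionError(Exception):
    """Stands in for an explicit error reply from the device."""


class SilentDevice:
    """Fake device that ignores requests until the session is reset."""

    def __init__(self) -> None:
        self.in_sync = False

    async def reset(self) -> None:
        self.in_sync = True

    async def send(self, payload: bytes) -> bytes:
        if not self.in_sync:
            await asyncio.sleep(3600)  # never answers while desynchronized
        return b"ok:" + payload


class ToySession:
    def __init__(self, transport, timeout: float = 0.05, retries: int = 1) -> None:
        self.transport = transport
        self.timeout = timeout
        self.retries = retries
        self.handshakes = 0

    async def _handshake(self) -> None:
        # Start the encryption session again from the beginning.
        self.handshakes += 1
        await self.transport.reset()

    async def request(self, payload: bytes) -> bytes:
        for _ in range(self.retries + 1):
            try:
                return await asyncio.wait_for(
                    self.transport.send(payload), self.timeout
                )
            except (SessionError, asyncio.TimeoutError):
                # Error reply or silence: both mean the session is unusable.
                await self._handshake()
        raise ConnectionError("device did not recover after session reset")


def run_demo():
    session = ToySession(SilentDevice())
    reply = asyncio.run(session.request(b"ping"))
    return reply, session.handshakes
```

Calling `run_demo()` exercises the silent case: the first request times out, the session is reset once, and the retry succeeds. Without the `asyncio.TimeoutError` branch, the same fake device would leave the caller stuck with a dead session.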

Jc2k avatar Feb 04 '24 13:02 Jc2k

Installing 2024.2.0 now; I will report if I have further disconnections 👍

holblin avatar Feb 07 '24 21:02 holblin

I haven't had any issues since. Closing the issue for now. If I encounter any further issues, I will report them here.

holblin avatar Mar 16 '24 04:03 holblin