
Updating heatmiserneo climate took longer than the scheduled update interval

knyar opened this issue 3 years ago · 6 comments

After upgrading the Heatmiser integration to the latest version (with neohubapi now being a separate library), it seems to occasionally get "stuck", with status not being updated and commands not being passed to the Neohub.

When this happened, I observed the following symptoms:

  1. An "Updating heatmiserneo climate took longer than the scheduled update interval" error message is logged every minute:
Feb  6 11:09:24 lizard hass[9304]: 2021-02-06 11:09:24 WARNING (MainThread) [homeassistant.components.climate] Updating heatmiserneo climate took longer than the scheduled update interval 0:01:00
Feb  6 11:10:24 lizard hass[9304]: 2021-02-06 11:10:24 WARNING (MainThread) [homeassistant.components.climate] Updating heatmiserneo climate took longer than the scheduled update interval 0:01:00
Feb  6 11:11:24 lizard hass[9304]: 2021-02-06 11:11:24 WARNING (MainThread) [homeassistant.components.climate] Updating heatmiserneo climate took longer than the scheduled update interval 0:01:00
  2. No network traffic is being sent from Home Assistant to the Neohub (checked using tcpdump).
  3. netstat shows a few established TCP connections from Home Assistant to the Neohub:
netstat -n | grep 10.11.12.67
tcp        0      0 10.11.12.2:44380        10.11.12.67:4242        ESTABLISHED
tcp        0      0 10.11.12.2:44378        10.11.12.67:4242        ESTABLISHED
tcp        0      0 10.11.12.2:57284        10.11.12.67:4242        TIME_WAIT
tcp        0      0 10.11.12.2:44382        10.11.12.67:4242        ESTABLISHED
tcp        0      0 10.11.12.2:44384        10.11.12.67:4242        ESTABLISHED
  4. Stopping Home Assistant shows that these TCP connections were not actually active on the Neohub side (it sends back an RST in response to each FIN):
12:19:02.932468 IP 10.11.12.2.44378 > 10.11.12.67.4242: Flags [F.], seq 999241392, ack 2491317285, win 502, options [nop,nop,TS val 4228902972 ecr 6392437], length 0
12:19:02.932671 IP 10.11.12.2.44380 > 10.11.12.67.4242: Flags [F.], seq 2142873974, ack 1140983017, win 502, options [nop,nop,TS val 4228902972 ecr 6392437], length 0
12:19:02.932740 IP 10.11.12.2.44382 > 10.11.12.67.4242: Flags [F.], seq 3942683128, ack 60268326, win 502, options [nop,nop,TS val 4228902972 ecr 6392437], length 0
12:19:02.932811 IP 10.11.12.2.44384 > 10.11.12.67.4242: Flags [F.], seq 3296834747, ack 1860058818, win 502, options [nop,nop,TS val 4228902972 ecr 6392437], length 0
12:19:02.932807 IP 10.11.12.67.4242 > 10.11.12.2.44378: Flags [R], seq 2491317285, win 0, length 0
12:19:02.932866 IP 10.11.12.67.4242 > 10.11.12.2.44380: Flags [R], seq 1140983017, win 0, length 0
12:19:02.932917 IP 10.11.12.67.4242 > 10.11.12.2.44382: Flags [R], seq 60268326, win 0, length 0
12:19:02.945824 IP 10.11.12.2.44378 > 10.11.12.67.4242: Flags [F.], seq 0, ack 1, win 502, options [nop,nop,TS val 4228902972 ecr 6392437], length 0
12:19:02.945927 IP 10.11.12.2.44380 > 10.11.12.67.4242: Flags [F.], seq 0, ack 1, win 502, options [nop,nop,TS val 4228902972 ecr 6392437], length 0
12:19:02.945989 IP 10.11.12.2.44382 > 10.11.12.67.4242: Flags [F.], seq 0, ack 1, win 502, options [nop,nop,TS val 4228902972 ecr 6392437], length 0
12:19:02.946050 IP 10.11.12.2.44384 > 10.11.12.67.4242: Flags [F.], seq 0, ack 1, win 502, options [nop,nop,TS val 4228902972 ecr 6392437], length 0
12:19:02.946135 IP 10.11.12.67.4242 > 10.11.12.2.44378: Flags [R], seq 2491317285, win 0, length 0
12:19:02.946211 IP 10.11.12.67.4242 > 10.11.12.2.44380: Flags [R], seq 1140983017, win 0, length 0
12:19:02.946384 IP 10.11.12.67.4242 > 10.11.12.2.44382: Flags [R], seq 60268326, win 0, length 0
12:19:02.946417 IP 10.11.12.67.4242 > 10.11.12.2.44384: Flags [R], seq 1860058818, win 0, length 0
  5. The Neostat itself works fine when this happens (checked by manually sending an {"INFO": 0} command).

I took a quick look at the neohubapi source code and observed this integration during normal operation; I believe it usually establishes a new TCP connection for every command. This suggests that this weird "stuck" state happened while one of the commands was actively running (e.g. in the middle of Neohub._send) -- perhaps the hub froze while processing the command and restarted, but the integration was still waiting for a read or write operation to succeed.

A common way to handle such cases is to abort an operation if it takes too long (more than a given timeout value). I am not super familiar with asyncio, but it seems that wait_for can be used to implement such a timeout. However, I am not sure where it would be best to add such timeouts:

  1. This library can have a timeout while calling neohubapi (wrap all calls to neohubapi in wait_for).
  2. neohubapi can have a timeout for all requests (wrap all calls to _send in wait_for).
  3. neohubapi._send can have a timeout for connection opening and read/write operations.

Perhaps doing this as high in the stack as possible (option 1) would make more sense, but I am not sure whether asyncio would cleanly close the open connections in that case.
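For illustration, option (1) could be sketched with asyncio.wait_for roughly as below. This is a hypothetical wrapper, not the integration's actual code; the helper name and the 10-second budget are assumptions:

```python
import asyncio

COMMAND_TIMEOUT = 10.0  # seconds; an arbitrary budget, not a value from neohubapi

async def call_hub(coro, timeout: float = COMMAND_TIMEOUT):
    """Await one hub coroutine with a deadline.

    asyncio.wait_for cancels the wrapped coroutine on timeout, so a command
    that is stuck mid-read (e.g. inside Neohub._send while the hub reboots)
    raises asyncio.TimeoutError instead of blocking the update loop forever.
    """
    return await asyncio.wait_for(coro, timeout)
```

The caller would catch asyncio.TimeoutError and mark the entity unavailable (or retry), rather than waiting indefinitely.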

knyar avatar Feb 06 '21 12:02 knyar

I would tentatively concur that (1) is most correct. Can you open an issue over at neohubapi and point here?

@stikonas FYI

gerrowadat avatar Feb 06 '21 15:02 gerrowadat

We have also just released neohubapi 0.5, which includes some fixes and a command line client. (Maybe the command line client makes it easier to reproduce this problem.)

And I think I observed that the Neohub was more stable when opening a new connection for each operation instead of keeping the connection open.

But please open an issue on the neohubapi project; it will be easier to track there.
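The connection-per-operation pattern mentioned above can be sketched with asyncio streams. This is a hypothetical helper, not neohubapi's actual implementation; the 0-byte message terminator and the JSON framing are assumptions about the hub protocol:

```python
import asyncio
import json

async def send_command(host: str, payload: dict, port: int = 4242,
                       timeout: float = 5.0) -> dict:
    """Open a fresh TCP connection, send one JSON command, read the reply, close.

    Opening a new connection per operation avoids reusing a connection the hub
    may have silently dropped (the RST-on-FIN behaviour seen in the traces above).
    """
    reader, writer = await asyncio.open_connection(host, port)
    try:
        # Assumed framing: JSON payload terminated by a 0 byte.
        writer.write(json.dumps(payload).encode() + b"\0")
        await writer.drain()
        raw = await asyncio.wait_for(reader.readuntil(b"\0"), timeout)
        return json.loads(raw.rstrip(b"\0"))
    finally:
        writer.close()
        await writer.wait_closed()
```

The read is also wrapped in wait_for, so a hub that accepts the connection but never replies cannot hang the caller.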

stikonas avatar Feb 06 '21 15:02 stikonas

Sorry, after re-reading and parsing your issue, I think (3) is most correct :-)

And, thank you so much for the detailed report. I have opened a neohubapi issue: https://gitlab.com/neohubapi/neohubapi/-/issues/6

gerrowadat avatar Feb 06 '21 16:02 gerrowadat

Yes, I agree. (2) doesn't really make sense, because it would mean more duplicated code. And it's better to solve this in the underlying library than to add workarounds in library users.

stikonas avatar Feb 06 '21 16:02 stikonas

I've released version 0.6 of https://pypi.org/project/neohubapi/. Could anybody affected by this bug test it? If it helps, we can bump the minimum version requirement again so that everybody gets the update.

stikonas avatar Feb 19 '21 17:02 stikonas

I have been using the master branch of neohubapi and it mitigated the issue for me.

What I wanted to explore before we switch to 0.6 is whether we can make retries configurable in the custom component. It's now just a matter of passing request_attempts to NeoHub(), but I am not sure what the best way to let users do that is. Any thoughts?
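One possible shape, purely as a sketch: read the value from the config entry's data and fall back to a default. The helper name, the config keys, and the default value are all hypothetical, not the component's actual code:

```python
DEFAULT_REQUEST_ATTEMPTS = 1  # assumed default; check neohubapi's own default

def neohub_kwargs(config: dict) -> dict:
    """Build NeoHub() keyword arguments from a config entry's data.

    Exposing request_attempts as a configurable option would let users tune
    retries without editing the component's code.
    """
    return {
        "host": config["host"],
        "port": config.get("port", 4242),
        "request_attempts": config.get("request_attempts", DEFAULT_REQUEST_ATTEMPTS),
    }
```

An options flow (so the value is editable from the UI after setup) might be friendlier than a YAML-only setting.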

knyar avatar Feb 19 '21 18:02 knyar