ha-edgeos icon indicating copy to clipboard operation
ha-edgeos copied to clipboard

daily disconnects

Open MagieMalone opened this issue 2 years ago • 21 comments

I get daily disconnects , reload integration fixes the issue.

running on v2.0.12

This error originated from a custom integration.

Logger: custom_components.edgeos.component.api.websocket
Source: custom_components/edgeos/component/api/websocket.py:133
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: October 18, 2022 at 19:21:02 (1 occurrences)
Last logged: October 18, 2022 at 19:21:02

Failed to connect WS, Error: 503, message='Invalid response status', url=URL('wss://192.168.3.1/ws/stats'), Line: 109
October 19, 2022
ROUTER Interface ETH0 Status was connected
08:46:50 - 14 minutes ago
ROUTER Interface ETH0 Status was connected
08:34:09 - 27 minutes ago
ROUTER Interface ETH0 Status became unavailable
08:33:59 - 27 minutes ago
October 18, 2022
ROUTER Interface ETH0 Status was connected
07:21:45 - Yesterday
ROUTER Interface ETH0 Status became unavailable
07:21:35 - Yesterday
October 17, 2022
ROUTER Interface ETH0 Status was connected
07:55:17 - 2 days ago

at the same time, ping integration to google dns keeps running during the events.

I'll add debug as well to see what causes this.

MagieMalone avatar Oct 19 '22 07:10 MagieMalone

this happens due to service disconnections by the router, the component connectivity recovers from it after 30 seconds, in old versions of component, that warning message was not written unless it failed 3 times in row, now it logs every time it happens.

it also happened for me twice over the last week, again, this is not related to the component (can happen also with the UI of EdgeOS, but again, it recovers from it as well)

elad-bar avatar Oct 19 '22 07:10 elad-bar

btw, you don't need to reload as this is part of the recovery mechanism

elad-bar avatar Oct 19 '22 07:10 elad-bar

everythings stays disconnected, until I reload. for instance eth0 graph stays at 0 for hours. so this is not the only sy,mptom of the issue I face.

MagieMalone avatar Oct 19 '22 07:10 MagieMalone

image

MagieMalone avatar Oct 19 '22 07:10 MagieMalone

please provide debug log level for that, I will need to understand why it's not recovering (without additional warnings)

elad-bar avatar Oct 19 '22 07:10 elad-bar

I will, once it happens again, so I expect tomorrow.

MagieMalone avatar Oct 19 '22 07:10 MagieMalone

thanks

elad-bar avatar Oct 19 '22 07:10 elad-bar

@poonab mentioned there is a log message of Update in progress, will skip the request which can cause it, please let me know if it happens to you as well, it can be a great lead

elad-bar avatar Oct 19 '22 08:10 elad-bar

@poonab, are there any log messages with the following pattern before / after it happens (under the assumption log level is set to INFO / DEBUG): WS Status changed to ***, API Status: ***?

elad-bar avatar Oct 19 '22 08:10 elad-bar

released new version with improved connectivity, set the message to e displayed only when the connection was not created, in case of disconnection, based on the error, in addition, fixed the ping message which could of cause it since it was not working.

elad-bar avatar Oct 19 '22 16:10 elad-bar

image data is missing again from the graph but in the log I see it is still being collected. Btw, the state in Developer tools is also not changing since from around 0:00, however not much around that time that would indicate an issue.

some logs: https://pastebin.com/1AAKNK1Y

struggling to paste bigger log, as debug creates a TON :-)

MagieMalone avatar Oct 20 '22 05:10 MagieMalone

the log you have sent is from 07:43:02, i cannot understand anything with log of one second :) please upload the entire log begins before it stopped working

elad-bar avatar Oct 20 '22 06:10 elad-bar

in addition, I added 3 new sensors of messages: received, ignored and error, can you please check the received messages sensor's graph?

thanks

elad-bar avatar Oct 20 '22 06:10 elad-bar

log file:

https://wetransfer.com/downloads/24ee030959579cfb47bb74f7263d9ae220221020061726/7d89821307cdde029044ff7737bd339820221020061758/b7b95a

MagieMalone avatar Oct 20 '22 06:10 MagieMalone

I don't have those new sensors yet, as I still ran on 0.12 when the issue occured. I will update to it now.

MagieMalone avatar Oct 20 '22 06:10 MagieMalone

I will try getting to it by evening, with the latest version (v2.0.13) - enough to set the logs for info and above to understnad the issue

elad-bar avatar Oct 20 '22 06:10 elad-bar

Happened again last night at around 0:00, currently on 2.0.14 home-assistant.log.gz

MagieMalone avatar Oct 21 '22 08:10 MagieMalone

Is it happening at the same time every night?

elad-bar avatar Oct 21 '22 10:10 elad-bar

the last 2 nights, yes, at midnight, and the night before at 19:20

MagieMalone avatar Oct 21 '22 10:10 MagieMalone

Mine also failed again @ 21:54:40, not happening at the same time. Errors on HA are

This error originated from a custom integration.

Logger: homeassistant
Source: custom_components/edgeos/component/api/websocket.py:158
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: 22 October 2022 at 21:54:40 (1101 occurrences)
Last logged: 13:11:36

Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/config/custom_components/edgeos/component/managers/home_assistant.py", line 83, in async_send_heartbeat
    await self.ws.async_send_heartbeat()
  File "/config/custom_components/edgeos/component/api/websocket.py", line 158, in async_send_heartbeat
    await self._ws.send_str(data)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_ws.py", line 151, in send_str
    await self._writer.send(data, binary=False, compress=compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 690, in send
    await self._send_frame(message, WSMsgType.TEXT, compress)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 601, in _send_frame
    raise ConnectionResetError("Cannot write to closing transport")
ConnectionResetError: Cannot write to closing transport

This error originated from a custom integration.

Logger: custom_components.edgeos.core.managers.home_assistant
Source: custom_components/edgeos/core/managers/home_assistant.py:244
Integration: Ubiquiti EdgeOS Routers (documentation, issues)
First occurred: 20 October 2022 at 04:12:41 (55 occurrences)
Last logged: 01:17:16

Update in progress, will skip the request

Skipping through the logs I cant see anything weird included but at 21:54 all the data that continues to be logged going forward is no longer updating the values

poonab avatar Oct 23 '22 14:10 poonab

That is not the same error is reported in this issue, another issue, saw it as well and this one already has a fix

elad-bar avatar Oct 23 '22 14:10 elad-bar

please check v2.0.15, should improve dramtically the stability, for me 3 days without a single warning / error

thanks

elad-bar avatar Oct 28 '22 12:10 elad-bar

I upgraded to v2.0.16 yesterday, happened again last night. INFO level log doesn't show much, just that after 1s. past midnight no values change anymore. 192.168.3.9-20221029-095831.log

MagieMalone avatar Oct 29 '22 08:10 MagieMalone

According to the log you have sent, there are updates up until the last log line: 2022-10-29 00:00:03.101 INFO (SyncWorker_30) [custom_components.edgeos.core.managers.entity_manager] ROUTER Interface ETH0 Received Traffic (sensor) updated, state: 527423484.475 | updated

in addition, i don't see a single disconnection within that log, please use debug log level to get more details for the same period, the fact it happens on dialy basis around the same time is weird, happens to me in the past because lots of load on the DNS server I ran (pi-hole), could it be that you are using HASS.IO / AdGuard / PiHole?

thanks

elad-bar avatar Oct 30 '22 08:10 elad-bar

I enabled debug yesterday, included the log from 23:58-00:02 192.168.3.9-20221030-102935.zip

I run pi-hole but not on HA, it's on a seperate VM.

MagieMalone avatar Oct 30 '22 10:10 MagieMalone

Will review the log file later today, thanks.

Additional direction, you might have the same issue I had around a year ago when the pihole was overloaded at specific time which caused many unexplained issues related to networking, Please check how to increase the number of requests in parallel to the pihole.

elad-bar avatar Oct 30 '22 11:10 elad-bar

I added some monitoring to esxi yesterday, it seems it's not busy on the cpu's at the time: image also not on HA: image though it is currious to see that it drop cpu cycles right at that time...

MagieMalone avatar Oct 31 '22 07:10 MagieMalone

is the second screen shot is from the component of EdgeOS CPU?

thanks

elad-bar avatar Oct 31 '22 07:10 elad-bar

no, overal HA processor usage. I'm putting an snmp probe in for the edgeos as we speak. and I'm automating the reloading for now at 00:30.

MagieMalone avatar Oct 31 '22 08:10 MagieMalone

Can you share the screen shot of the EdgeOS CPU / Memory sensors?

Thanks

elad-bar avatar Oct 31 '22 09:10 elad-bar