
Gateway stops working after a few days of not being used

Open rotilho opened this issue 1 year ago • 3 comments

I'm facing a bug that takes a long time to reproduce. I'm using the OTGW in stand-alone mode where I control the setpoint based on multiple sensors.

I also have an automation that lowers the setpoint of each room based on how far I am from home, down to a minimum of 16 degrees. So when I travel, the house is normally set to 16 degrees, and the boiler does not kick in for several days. After around 3 or 4 days, by the time the temperature is low enough, the OTGW is no longer responsive and I have to restart it remotely. There are no errors in the logs; it's just stuck.

Edit: hot water pre-heat is disabled.

rotilho avatar Jan 26 '24 19:01 rotilho

Do you have any logs to share? If not could you try to capture logs?

rvdbreemen avatar Jan 26 '24 21:01 rvdbreemen

Which part is stuck? The Wemos or the PIC? Can you still access the web interface on the Wemos? Can you connect to port 25238? If so, do things start working again if you send a GW=R command?
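
A minimal sketch of the first two checks, assuming the web interface listens on the standard HTTP port 80 (an assumption; only port 25238 is named here). The helper names `port_open` and `probe_gateway` are made up for illustration. It only tests whether a TCP connection can be opened, which helps separate "the Wemos is unreachable" from "the Wemos is up but shows no data".

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def probe_gateway(host: str) -> dict:
    """Check the two TCP services on the Wemos that the questions above refer to."""
    return {
        "web interface (port 80, assumed)": port_open(host, 80),
        "serial bridge (port 25238)": port_open(host, 25238),
    }
```

Calling `probe_gateway("<address of your OTGW>")` returns a small dictionary of True/False results. A reachable port only shows that the Wemos accepts connections; it says nothing about whether the PIC is still answering.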

hvxl avatar Jan 26 '24 22:01 hvxl

Sorry guys, I already restarted it, and my place was very cold, so I didn't spend much time debugging. This is the fourth time it has happened. I suspected that upgrading to the latest version would fix it, but this time I already had the latest version.

The web interface was working. I connected to the serial port, but nothing was happening. Any command to set the temperature was accepted, but after a few seconds the value went back to zero. Everything was reporting zero or no information.

Next time I'll try to collect more meaningful information before being on my way home.

rotilho avatar Jan 27 '24 08:01 rotilho

I might have had a similar experience yesterday.

After months of stable operation, I restarted Home Assistant, which then wasn't able to connect over the socket anymore. The web page was still up, but not displaying any info about the OTGW status anymore. OTGW itself was working, communication between boiler and thermostat still worked. I was able to reboot using the web page, after which functionality was restored. Logs showed no new entries since the previous reboot.

I'm using the latest ESP firmware 0.10.2 and PIC firmware 6.5.

JvHummel avatar Feb 13 '24 13:02 JvHummel

Since I recently started pushing the outside temperature, I have seen the same problem twice (empty UI and no connection with HA, even after reloading the OTGW socket plugin).

On the empty page there was the message: "PS=1 mode; No UI updates."

How can I obtain useful debug info next time?

dwar avatar Feb 25 '24 18:02 dwar

Hi @dwar

That message means you are somehow using the serial connection and have put the OTGW in PS=1 mode. That mode prevents the web UI from getting updates.

If you want this to work, I recommend using the MQTT integration for OTGW in Home Assistant.

Check out the wiki on how to set up MQTT with Auto Discovery:

https://github.com/rvdbreemen/OTGW-firmware/wiki/How-to-setup-another-OTGW-using-the-WebUI

rvdbreemen avatar Feb 26 '24 18:02 rvdbreemen

Okay guys. After a few days without using it, the problem appeared again.

Here is everything I collected:

Trying 192.168.2.40...
Connected to 192.168.2.40.
Escape character is '^]'.
18:04:07.363651 (  12392| 11880) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:07.502457 (  13928| 12528) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:07.639094 (  13928| 12528) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:07.784869 (  14600| 12528) handleOTGW  (1771): Net2Ser: Sending to OTGW: [PR=I] (4)
18:04:08.014900 (  11800| 11232) checkOTGWcmd(1293): CmdQueue: Checking if command is in in queue [PR: I=11] (8)
18:04:08.636180 (  13144| 12664) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:09.633980 (  13472| 12368) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:10.631923 (  13144| 12144) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:11.630608 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:12.627204 (  14480| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:13.625744 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:14.624155 (  15400| 14504) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:15.621442 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:16.619026 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:17.617316 (  15400| 14504) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:18.163610 (  14400| 13688) handleOTGW  (1771): Net2Ser: Sending to OTGW: [PR=I] (4)
18:04:18.178821 (  11792| 11096) checkOTGWcmd(1293): CmdQueue: Checking if command is in in queue [PR: I=11] (8)
18:04:18.616931 (  14480| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:19.613016 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:20.612575 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:21.608490 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:22.606201 (  13808| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:23.604771 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:24.602882 (  15344| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:25.600831 (  13808| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:26.597438 (  13808| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:27.595410 (  13808| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:28.317772 (  15824| 14728) handleOTGW  (1771): Net2Ser: Sending to OTGW: [PR=I] (4)
18:04:28.564279 (  13808| 13208) checkOTGWcmd(1293): CmdQueue: Checking if command is in in queue [PR: I=11] (8)
18:04:28.727358 (  15344| 14504) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:29.626360 (  13808| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:30.588597 (  15400| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:31.587418 (  13808| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:32.584916 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:33.583013 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:34.579991 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:35.577788 (  14480| 13856) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:36.575655 (  13808| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]
18:04:37.573828 (  13808| 13208) processOT   (1676): Request Boiler    R00000000 (9)[MsgID=  0][READ_DATA       ]>Status = Master [-----W--]

Attached screenshots: Screenshot_20240301-232539, Screenshot_20240301-232629, Screenshot_20240301-232657, Screenshot_20240301-232736-EDIT

rotilho avatar Mar 02 '24 17:03 rotilho

I'm starting to suspect that it may be my boiler. A restart didn't work this time, just after using HW.

rotilho avatar Mar 05 '24 16:03 rotilho

I have the same issue. I checked several things, but somehow I cannot determine what's wrong. Could the PIC be faulty, or is it the OTGW board? I encountered the same behaviour after reflashing and returning to default parameters, with no MQTT connection. I am close to the point of tossing the thing in a corner and trying the DIYLESS device…

htca avatar Mar 11 '24 21:03 htca

As you didn't provide any logs or other information about the things you checked, it's not really possible to answer your questions.

hvxl avatar Mar 11 '24 23:03 hvxl

otdata.txt otlog-20240312.txt

Sorry for my frustration... I am really thankful for any help. I just made these logs: I rebooted the OTGW and logged until I needed to go to the office.

htca avatar Mar 12 '24 06:03 htca

A few things stand out:

  • The log stops when a message from the thermostat is expected. This could be caused by an inappropriate reference voltage.
  • Something is sending a whole bunch of serial commands. There are 41 of them within just a few seconds near the end of the log. The OTGW should be able to handle that, but it's not something I routinely test. It also doesn't seem useful to request the same information 3 times within a few milliseconds.
  • There seems to be a problem with the WiFi connectivity. On average, one message exchange is reported every second. However, I sometimes see no message for a few seconds, followed by multiple messages being reported with less time between them than should be possible at the 9600 baud communication speed used by the PIC. Most times there is only a gap of 3 seconds or less. But just before the log stops, there is a gap of 12 seconds.
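
The timing argument in the last point can be sketched as follows. This is an illustration under stated assumptions, not the analysis that was actually run: it assumes 8N1 serial framing (10 bits per character) and that each reported message is 9 characters plus CR LF, and it assumes every log line starts with an `HH:MM:SS.ffffff` timestamp, as in the telnet log earlier in this thread. The names `MIN_SPACING`, `timestamps` and `classify_gaps` are made up for the example.

```python
import re
from datetime import datetime

BAUD = 9600
BITS_PER_CHAR = 10      # assumption: 8 data bits + start bit + stop bit (8N1)
CHARS_PER_REPORT = 11   # assumption: 9-character message followed by CR LF

# Shortest time in seconds in which the PIC could send one report line.
MIN_SPACING = CHARS_PER_REPORT * BITS_PER_CHAR / BAUD

TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s")


def timestamps(lines):
    """Yield the leading timestamp of each line that has one."""
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            yield datetime.strptime(match.group(1), "%H:%M:%S.%f")


def classify_gaps(lines, long_gap=3.0):
    """Return (long_gaps, short_gaps) in seconds between consecutive lines.

    long_gaps  : gaps above `long_gap`, which hint at delayed delivery.
    short_gaps : gaps below MIN_SPACING, faster than the serial link allows,
                 which hint at buffered messages arriving in a burst.
    """
    long_gaps, short_gaps = [], []
    previous = None
    for current in timestamps(lines):
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap < 0:
                gap += 24 * 3600  # the log rolled over midnight
            if gap > long_gap:
                long_gaps.append(gap)
            elif gap < MIN_SPACING:
                short_gaps.append(gap)
        previous = current
    return long_gaps, short_gaps
```

Under these assumptions `MIN_SPACING` works out to roughly 11.5 ms, so two reports logged only a few milliseconds apart cannot have left the PIC at those moments. The timestamps then reflect when the data arrived over the network, not when it was sent.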

Can you confirm that the reason the log stops is not due to the WiFi connection dropping? Can you switch off whatever is sending all those serial commands to see if that makes any difference? When the problem happens again, can you monitor TCP port 23 while doing the following:

  • Try sending some harmless command from OTmonitor, like PR=V. If that works, try different reference voltage settings to see if that gets communication going again.
  • If you get no response, try GW=R. The Wemos firmware will normally intercept that command and reset the PIC using the reset pin. This can help to determine if the problem is in the firmware of the Wemos or the PIC.
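
If OTmonitor is not at hand, the same two steps could be scripted roughly like this. It is only a sketch: it assumes the serial bridge on port 25238 mentioned earlier in the thread, assumes commands are terminated with CR LF, and assumes a reply starts with the two-letter command code followed by a colon (as in the `PR: I=11` line in the telnet log above). The function names `send_command` and `got_reply` are hypothetical.

```python
import socket
import time


def send_command(host, command, port=25238, wait=3.0):
    """Send one command and return the non-empty lines received within `wait` seconds."""
    buffer = b""
    with socket.create_connection((host, port), timeout=wait) as sock:
        sock.sendall(command.encode("ascii") + b"\r\n")  # assumption: CR LF terminator
        deadline = time.monotonic() + wait
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                chunk = sock.recv(1024)
            except socket.timeout:
                break
            if not chunk:  # the other side closed the connection
                break
            buffer += chunk
    text = buffer.decode("ascii", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def got_reply(lines, command):
    """True if any line looks like a reply to `command`, e.g. 'PR: ...' for 'PR=V'."""
    prefix = command[:2] + ":"
    return any(line.startswith(prefix) for line in lines)
```

A possible way to use it: call `send_command(host, "PR=V")` and check `got_reply(...)`. If there is no reply, send `GW=R` and repeat the `PR=V` check a few seconds later. Regular OpenTherm traffic may be mixed in with the returned lines, which is why the reply is matched by its prefix.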

hvxl avatar Mar 12 '24 20:03 hvxl

OK, I used a brand-new Wemos and flashed it, and improved the WiFi by moving the device a bit... I will keep it running for a while and report later.

htca avatar Mar 13 '24 16:03 htca

Just to add: I also reconnected the OTGW integration in Home Assistant, and it still works as before. I suspect that the WiFi was the issue (it is OK at the moment, but I need some time to expand the WiFi range properly).

htca avatar Mar 15 '24 09:03 htca

@htca what kind of "integration" are you using? Are you using the MQTT way of integrating? If you use the native component in HA, then you use serial over WiFi to control the OTGW. This is not what the ESP8266 firmware was built for; the MQTT integration is the preferred way (imho).

So I'm wondering how your new setup is working, and how you have set up your integration with the OTGW from the HA perspective: MQTT or serial-over-network?

rvdbreemen avatar Mar 18 '24 19:03 rvdbreemen

@htca I will close the issue, as the WiFi now seems to be fixed, which solved the issue. I'm still interested in your answers, so you can reopen the issue.

rvdbreemen avatar Mar 18 '24 19:03 rvdbreemen

Actually, I used both: MQTT to get the status into HA, and the HA integration as an easy way to update the outside temperature. I think something changed in HA in one of the updates. Until a few months ago it worked as it should, although I regularly had a freeze (once every few weeks), but then the freezes became more frequent, until it ran for a few hours at most. I assumed I could use both interfaces simultaneously (MQTT and serial), but I now use only MQTT, with an automation that publishes the external temperature. Thanks for all your good work and effort! Fine if you close the issue, of course.

htca avatar Mar 18 '24 19:03 htca

@htca thanks for the response. Combining both integrations has worked for others before, not sure what has changed. But glad to know that the MQTT integration works as designed for you.

Will keep topic closed then.

rvdbreemen avatar Mar 18 '24 20:03 rvdbreemen