1.5.0 Stability issues when boiler is not supporting enabled sensors

Open SanFable opened this issue 1 year ago • 57 comments

Hello,

I’ve been using OTGateway with my Beretta Ciao Green 25 C.S.I. boiler for about a year. It works fine with smart TRVs and Home Assistant, although the functionality is somewhat limited.

Recently, I updated to version 1.5.0 and switched to an ESP32-S3 (I also tried the ESP32-C3 with the same results). Previously, I was using version 1.4.5 with an ESP8266.

The Issue

After the update, the system became unstable:

  • After about a minute, the OpenTherm Gateway status shows "problematic," and everything becomes unavailable.

  • The ESP32-S3 seems to struggle:

      • The web interface is very slow and unresponsive.

      • Logs over Telnet are delayed and limited.

  • With the ESP32-C3, the system was completely unreachable.

What I Found

I think the issue is caused by new sensors/IDs, such as:

  • Return temperature
  • Flow rate
  • Exhaust temperature
  • Pressure
  • Minimum modulation
  • Maximum power

In the logs, I saw warnings like: [WARN] Failed to receive ...

It looks like the boiler doesn’t support these IDs, and they might be overloading the OpenTherm communication, causing it to crash.
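If unsupported IDs really do load the bus, one possible mitigation is to count consecutive failures per data ID and stop polling an ID once it crosses a threshold. The sketch below is a hypothetical illustration in Python, not OTGateway's actual code: the class name `IdPoller`, the `query` callback and the threshold of 3 are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional


class IdPoller:
    """Hypothetical poller that stops asking for data IDs that keep failing."""

    def __init__(self, query: Callable[[int], Optional[int]], max_failures: int = 3):
        # query(data_id) returns a value, or None when no valid answer arrives
        self.query = query
        self.max_failures = max_failures
        self.failures: Dict[int, int] = {}

    def is_suppressed(self, data_id: int) -> bool:
        return self.failures.get(data_id, 0) >= self.max_failures

    def poll(self, data_id: int) -> Optional[int]:
        if self.is_suppressed(data_id):
            return None  # skip the bus transaction entirely
        value = self.query(data_id)
        if value is None:
            self.failures[data_id] = self.failures.get(data_id, 0) + 1
        else:
            self.failures[data_id] = 0  # a good answer resets the counter
        return value
```

With a scheme like this, an ID the boiler never answers costs only `max_failures` bus transactions instead of one per polling cycle. A real implementation would probably also retry suppressed IDs occasionally, in case the failure was temporary.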

What Worked

I turned off these new sensors and reconnected the OTGateway to the boiler. Since then, it’s been running stable for over 30 minutes, and the OpenTherm Gateway status is "OK."

Attaching logs_1.txt from a setup where I didn’t disable all the mentioned sensors (minimum modulation and maximum power were still enabled). In the end, the ESP32 crashed, and the Telnet connection was lost.

I just realized that even after turning off the mentioned sensors in my successful run, I’m still seeing invalid request IDs 15 and 14 (minimum modulation, maximum power, and maximum modulation).
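For anyone reading raw frames while chasing these IDs, here is a minimal Python sketch of how a 32-bit OpenTherm frame breaks down, based on my understanding of the OpenTherm 2.2 frame layout (parity bit, 3-bit message type, 4 spare bits, 8-bit data ID, 16-bit value). The function name `decode_frame` and the two example frames are made up for illustration and are not taken from the attached logs.

```python
# Message types as listed in the OpenTherm 2.2 specification:
# 0-3 are sent by the master (thermostat/gateway), 4-7 by the slave (boiler).
MSG_TYPES = {
    0: "READ-DATA", 1: "WRITE-DATA", 2: "INVALID-DATA", 3: "RESERVED",
    4: "READ-ACK", 5: "WRITE-ACK", 6: "DATA-INVALID", 7: "UNKNOWN-DATAID",
}


def decode_frame(frame: int):
    """Split a 32-bit OpenTherm frame into (type name, data ID, raw value, parity ok)."""
    frame &= 0xFFFFFFFF
    msg_type = (frame >> 28) & 0x7   # bits 30..28
    data_id = (frame >> 16) & 0xFF   # bits 23..16
    value = frame & 0xFFFF           # bits 15..0
    # The parity bit (bit 31) makes the total number of 1 bits even
    parity_ok = bin(frame).count("1") % 2 == 0
    return MSG_TYPES[msg_type], data_id, value, parity_ok


# Illustrative frames built by hand (assumptions, not captured traffic)
print(decode_frame(0xF00F0000))  # boiler rejecting data ID 15
print(decode_frame(0xC00E6400))  # boiler acknowledging data ID 14, raw value 0x6400
```

An UNKNOWN-DATAID reply is how a boiler says it does not implement an ID at all, which is different from a timeout. Values such as temperatures and percentages use the f8.8 fixed-point format, so a raw `0x6400` divided by 256 gives 100.0.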

Attaching logs_2.txt from over 30 minutes of stable operation. However, there are still warnings in the logs from sensors that should be disabled. Could it be that something is overlapping when these sensors are enabled?

EDIT: After about 6 hours I had a few reconnections: image

PS: Is there anything I could do to improve support for my boiler?

SanFable avatar Dec 28 '24 16:12 SanFable

I believe I'm having the same issue. The symptoms are the same, and I have also recently upgraded to 1.5.0 but have not been able to get it running in any stable or reliable way.

I think the issue might have something to do with MQTT as I get better results with it turned OFF but have not absolutely verified this yet.

tincanpete avatar Dec 29 '24 15:12 tincanpete

Hi both, quick question: did you delete the OpenTherm Gateway device from MQTT in HA before updating the firmware or changing the chip? Yuri warned of potential errors if this was not done.

Daveblanche avatar Dec 29 '24 18:12 Daveblanche

Thanks @Daveblanche for the tip, but yes I definitely did do that as recommended. I think the issue is on the OT Gateway end not the HA end though. I'm going to continue to try to figure this out and will post more here as I make progress.

tincanpete avatar Dec 29 '24 18:12 tincanpete

@Daveblanche Before setup I went to HA Settings -> Devices & services -> MQTT -> Devices, selected OpenTherm and removed it. Then, after installing the new version, I removed the old cards from the dashboard and added new ones.

Regarding stability, I had a few reconnections yesterday, but at 23:00 I disabled logging (serial and telnet), and since then it has failed only once in a whole day, for 40 seconds.

SanFable avatar Dec 29 '24 23:12 SanFable

Hi guys,

I use an S3 myself and tested the project on a C3, but I can't reproduce the problem. Indeed, the web interface works slower when there is no connection to the boiler via OpenTherm, and I will try to fix this.

As for the fact that polling some IDs breaks the bus - I don't know why this could be. Perhaps there is some kind of bug in the boiler firmware. In the logs I did not see a poll of these IDs and loss of connection via OT.

If you have more information it will help.

Laxilef avatar Dec 30 '24 13:12 Laxilef

Anyone who has problems with losing the connection, please test this build: 1.5.1-testing.zip

And happy holidays!

Laxilef avatar Dec 31 '24 11:12 Laxilef

Thank you @Laxilef I will give it a try!

tincanpete avatar Dec 31 '24 13:12 tincanpete

@Laxilef @tincanpete

I have tested 1.5.1. I set it up as usual and kept telnet logs enabled. For 20 minutes I stayed with the default settings (I only filled in the MQTT section), with the warnings caused by my boiler's limited support, and everything was stable. After that I disabled the sensors that do not exist on my boiler (it's a shame it lacks some useful sensors...).

No reconnections for 2 hours; it works now. The page is responsive. No complaints. This state was not achievable on 1.5.0. Good job :)

I'm interested in @tincanpete's feedback.

Attaching logs.txt. I have a few warnings, maybe something minor or expected.

Edit:

I hit Ctrl+C in the telnet terminal (my bad lol), then closed PuTTY and wanted to turn telnet off in the settings, and now the page is laggy and HA reports a problem. RIP lol.

After turning off telnet and restarting the ESP, everything is laggy again, so the issue is still open :/

About 2 minutes after boot it might have gone back to normal (but I'm not sure, maybe 70% of the usual responsiveness?). I will see if I get any reconnections.

I think we need more detailed logs.

I swapped to the C3 and it is almost unusable, so it looks like the S3 barely handles the problem, while the C3 does not. I then swapped back to the S3 in the same way and now it's as snappy as it should be; no idea what's going on. I will leave it in this state and watch for reconnections.

SanFable avatar Dec 31 '24 14:12 SanFable

Hello, I have also just tested 1.5.1, and while I thought there was some initial success, it seems not. Having MQTT enabled definitely makes things significantly worse.

I have attached a file showing repeated 'ping' to the board. When it's working well, I always see about 10-50ms. However as you can see it quickly becomes unstable and unresponsive; and will eventually sort-of come back to life but it seems quite random.

During the time when ping is very slow or dropped, the UI is also unresponsive, and if connected, the MQTT server will report the device is off-line. Sometimes my boiler will also report an OpenTherm communication error on its display.

When the problem goes away, everything goes back to normal and works OK, but often not long enough to be useful.

I'm using an S2 mini board, and this ping trace was done with the gateway serial port, telnet, and logging all turned OFF, to make sure that heavy logging wasn't causing high load and contributing to the problem.

ping example.txt
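To compare ping traces between firmware versions or boards, a saved trace can be reduced to a few numbers. Below is a rough Python sketch that does this. The function name `summarize_ping`, the 50 ms "slow" threshold (taken from the 10-50ms range mentioned above) and the sample lines are assumptions; the sample is made up and is not the content of the attached file, and the regexes only cover the common Linux, macOS and Windows ping output styles.

```python
import re
import statistics

TIME_RE = re.compile(r"time[=<]\s*([\d.]+)\s*ms", re.IGNORECASE)
LOSS_RE = re.compile(r"timed out|timeout|unreachable", re.IGNORECASE)


def summarize_ping(lines, slow_ms=50.0):
    """Count replies, losses and slow replies in saved ping output."""
    times = []
    lost = 0
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            times.append(float(match.group(1)))
        elif LOSS_RE.search(line):
            lost += 1
    total = len(times) + lost
    return {
        "replies": len(times),
        "lost": lost,
        "loss_pct": 100.0 * lost / total if total else 0.0,
        "median_ms": statistics.median(times) if times else None,
        "max_ms": max(times) if times else None,
        "slow": sum(1 for t in times if t > slow_ms),
    }


# Made-up sample lines for illustration only
sample = [
    "64 bytes from 192.0.2.10: icmp_seq=1 ttl=255 time=12.4 ms",
    "64 bytes from 192.0.2.10: icmp_seq=2 ttl=255 time=830.1 ms",
    "Request timeout for icmp_seq 3",
    "64 bytes from 192.0.2.10: icmp_seq=4 ttl=255 time=48.0 ms",
]
print(summarize_ping(sample))
```

To use it on a real trace, pass the lines of the saved file, for example `summarize_ping(open("ping example.txt"))`.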

Thanks for your help!

tincanpete avatar Dec 31 '24 17:12 tincanpete

@tincanpete try the following:

  • Download a backup of the current settings.
  • Reflash the whole ESP32 from a PC (flash the factory image and filesystem, as for a new device).
  • When connecting to the ESP32 for the first time, restore the settings.
  • Connect the ESP32 to the boiler.

After that I configured the settings I wanted and it's working OK (2 hours now).

In my case the ESP32-S3 seems fine, but when I changed to the weaker ESP32-C3 it was a nightmare, just not working. I guess I would get similar results to yours.

If you still have problems, maybe try disabling the sensors that are not available on your boiler (to minimize warnings in the logs).

I believe my pings in pings.txt are fine; it's Wi-Fi with a signal of 68-76% as reported on the OTG page.

SanFable avatar Dec 31 '24 17:12 SanFable

@SanFable I will try that. I did not use the "factory" image, just the normal one, and upgraded via the UI. I will report back on my progress! Thanks

tincanpete avatar Dec 31 '24 17:12 tincanpete

Still no joy unfortunately even after fully erasing and re-flashing the S2 and using the factory bin.

I have attached a larger log but just look at this extract from the end:

[00:02:56][OT][DHW][NOTICE] Received flow rate: 0.00 (converted: 0.00)
[00:02:56][SENSORS][NOTICE] #6 'DHW flow rate' new value 0: 0.00, compensated: 0.00, raw: 0.00
[00:02:56][OT][HEATING][NOTICE] Received temp: 10.00
[00:02:56][SENSORS][NOTICE] #2 'Heating temp' new value 0: 10.00, compensated: 10.00, raw: 10.00
[00:02:57][OT][HEATING][NOTICE] Received return temp: 10.60 (converted: 10.60)
[00:02:57][SENSORS][NOTICE] #3 'Heating return temp' new value 0: 10.60, compensated: 10.60, raw: 10.60
[00:02:58][OT][NOTICE] Received exhaust temp: 10.00 (converted: 10.00)
[00:02:58][SENSORS][NOTICE] #7 'Exhaust temp' new value 0: 10.00, compensated: 10.00, raw: 10.00
[00:02:58][OT][NOTICE] Received pressure: 0.80 (converted: 0.80)
[00:02:58][SENSORS][NOTICE] #8 'Pressure' new value 0: 0.80, compensated: 0.80, raw: 0.80
[00:02:59][OT][NOTICE] Received boiler status. Heating: 0; DHW: 0; flame: 0; cooling: 0; fault: 0; diag: 0
[00:02:59][SENSORS][NOTICE] #9 'Modulation level' new value 0: 0.00, compensated: 0.00, raw: 0.00
[00:02:59][SENSORS] #10 'Power' new value 0: 0.00, compensated: 0.00, raw: 0.00
[00:03:11][SENSORS][NOTICE] #4 'Heating setpoint temp' new value 0: 30.00, compensated: 30.00, raw: 30.00
[00:03:11][OT][DHW][NOTICE] Received temp: 11.00 (converted: 11.00)
[00:03:18][SENSORS][NOTICE] #5 'DHW temp' new value 0: 11.00, compensated: 11.00, raw: 11.00
[00:03:28][SENSORS][NOTICE] #4 'Heating setpoint temp' new value 0: 30.00, compensated: 30.00, raw: 30.00
[00:03:37][OT][DHW][NOTICE] Received flow rate: 0.00 (converted: 0.00)
[00:03:38][SENSORS][NOTICE] #6 'DHW flow rate' new value 0: 0.00, compensated: 0.00, raw: 0.00
[00:03:44][OT][HEATING][NOTICE] Received temp: 10.20
[00:03:53][SENSORS][NOTICE] #2 'Heating temp' new value 0: 10.20, compensated: 10.20, raw: 10.20
[00:03:55][SENSORS][NOTICE] #4 'Heating setpoint temp' new value 0: 30.00, compensated: 30.00, raw: 30.00
[00:03:55][OT][HEATING][NOTICE] Received return temp: 10.80 (converted: 10.80)
Connection closed by foreign host.

Note the timestamps from 00:02:59 onwards: there's a big delay between each one.
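Stalls like this can be picked out of a long log automatically by comparing consecutive timestamps. The following is a small Python sketch of that idea; the function name `find_gaps` and the 5-second threshold are assumptions, and the script only relies on the `[HH:MM:SS]` prefix visible in the excerpt.

```python
import re

STAMP_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")


def find_gaps(lines, threshold_s=5):
    """Return (previous, current, delta) tuples, in seconds, for pauses >= threshold_s."""
    gaps = []
    prev = None
    for line in lines:
        match = STAMP_RE.match(line)
        if match is None:
            continue  # e.g. "Connection closed by foreign host."
        hours, minutes, seconds = (int(part) for part in match.groups())
        now = hours * 3600 + minutes * 60 + seconds
        if prev is not None and now - prev >= threshold_s:
            gaps.append((prev, now, now - prev))
        prev = now
    return gaps


# Four lines copied from the log excerpt above
excerpt = [
    "[00:02:59][OT][NOTICE] Received boiler status. Heating: 0; DHW: 0; flame: 0; cooling: 0; fault: 0; diag: 0",
    "[00:03:11][SENSORS][NOTICE] #4 'Heating setpoint temp' new value 0: 30.00, compensated: 30.00, raw: 30.00",
    "[00:03:11][OT][DHW][NOTICE] Received temp: 11.00 (converted: 11.00)",
    "[00:03:18][SENSORS][NOTICE] #5 'DHW temp' new value 0: 11.00, compensated: 11.00, raw: 11.00",
]
for prev, now, delta in find_gaps(excerpt):
    print(f"{delta}s pause between log second {prev} and {now}")
```

On these four lines it reports a 12-second and a 7-second pause. Running it over the full log file would show whether the stalls start at a particular message.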

At the same time, the pings to the board are looking like this screenshot:

image

Full log file below: 1.5.1 log.txt

All of this was with MQTT turned OFF by the way, which I thought would be better, but did not help actually.

thanks! Pete

tincanpete avatar Jan 01 '25 14:01 tincanpete

To add to my previous comment, I did try disabling the "power" sensor which my boiler does not support, but it didn't seem to make any difference.

tincanpete avatar Jan 01 '25 14:01 tincanpete

From my side, I had 2 reconnections every 24h using the ESP32-S3. I assume the ESP32-S3 is powerful enough to cope with whatever is going wrong in the background. The C3 was not usable.

SanFable avatar Jan 01 '25 16:01 SanFable

Guys, I suspect the problem may be in the router. Let's check it: see how the web interface works when you are connected to the ESP access point, i.e. when the network is not yet configured on the ESP.

P.S. What routers do you use? Can you disable telnet and check the web? To view the logs at this moment, you can use the serial port.

Laxilef avatar Jan 01 '25 18:01 Laxilef

I'm using a Ubiquiti U7 Pro. Everything was fine on previous versions (1.4.5 in my case with mini D1).

When configuring for the first time, connected to the ESP access point, the web interface is blazing fast. I will try to determine whether it slows down after connecting to the boiler or not.

The UniFi controller doesn't show me any problems; it reports the Wi-Fi experience as excellent (95%+), with some dips to good (88%).

SanFable avatar Jan 01 '25 19:01 SanFable

Everything was fine on previous versions (1.4.5 in my case with mini D1).

Test 1.5.0 on your D1 mini. Now you are testing on ESP32, these are different boards and there is a different SDK.

upd: If you are using mesh and multiple access points, this may not work correctly with ESP. I don't know why, but sometimes it happens. And I don't recommend using 2.4 GHz and 5 GHz APs with the same SSID.

Laxilef avatar Jan 01 '25 20:01 Laxilef

Interesting idea, I had considered it might be a wifi/router issue, however the network is stable and has been for a long time, running Home Assistant and many other ESP-based devices (Shelly Relays and similar) without a problem. Do you think there's a chance my Wemos S2-mini board just "doesn't like" the wifi network? The problem was present, although not as severe, with software 1.4.5.

tincanpete avatar Jan 01 '25 20:01 tincanpete

I’m on UniFi, too. Check your retry rates on the front page of UniFi network app. I had an issue a few weeks ago, with high retries, and it was down to channel choice/availability.

My network is incredibly stable, too.

Daveblanche avatar Jan 01 '25 20:01 Daveblanche

Something similar, and there was UniFi there too: https://community.home-assistant.io/t/opentherm-gateway-thermostat-with-full-integration-for-home-assistant/617228/128

Laxilef avatar Jan 01 '25 20:01 Laxilef

I have only one U7 Pro, no meshing. image As for TX retries, I would say the OpenTherm device is in the very middle of my devices. My 2.4 GHz band is just noisy. I'm on channel 1 at 20 MHz, which UniFi chose automatically. I will experiment with other channels and check the TX rate.

I will try the D1 mini tomorrow.

SanFable avatar Jan 01 '25 21:01 SanFable

ESP32 C3 connected to mikrotik, OT not connected

esp c3.webm

Laxilef avatar Jan 02 '25 20:01 Laxilef

I've not had a chance to try with Wireless AP only as the hardware is running in a shop we own and I've not been there for a couple of days. However, remotely monitoring it I have just noticed the "Uptime" on the UI homepage has been reset and the "Last Reset Reason" is showing "Reset due to other watchdogs". Does this offer any clues to you?

tincanpete avatar Jan 04 '25 18:01 tincanpete

Does this offer any clues to you?

Screenshot_9

Laxilef avatar Jan 04 '25 19:01 Laxilef

No, "save debug data" just gives me this:


{
  "build": {
    "version": "1.5.1-testing",
    "date": "Dec 31 2024 02:02:27",
    "env": "s2_mini",
    "core": "3.1.0",
    "sdk": "v5.3.2-174-g083aad99cf-dirty"
  },
  "heap": {
    "total": 188452,
    "free": 55288,
    "minFree": 48052,
    "maxFreeBlock": 30708,
    "minMaxFreeBlock": 25588
  },
  "chip": {
    "model": "ESP32-S2",
    "rev": 100,
    "cores": 1,
    "freq": 240
  },
  "flash": {
    "size": 4194304,
    "realSize": 4194304
  },
  "crash": {
    "reason": "Reset due to other watchdogs",
    "core": 0,
    "heap": 58416,
    "uptime": 680780475
  }
}
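The heap figures in this dump can be turned into a couple of percentages that are easier to compare between boards. The Python sketch below does that with the `heap` section copied from the dump. The function name `heap_summary` is made up, the reading of `maxFreeBlock` as the largest contiguous free block is an assumption based on the field name, and the fragmentation figure is only a rough heuristic (1 minus largest block over free heap), not a value the firmware reports.

```python
import json

# "heap" section copied from the debug dump above
DEBUG_JSON = """
{
  "heap": {
    "total": 188452,
    "free": 55288,
    "minFree": 48052,
    "maxFreeBlock": 30708,
    "minMaxFreeBlock": 25588
  }
}
"""


def heap_summary(debug: dict) -> dict:
    """Reduce the heap counters to percentages and a rough fragmentation estimate."""
    heap = debug["heap"]
    free_pct = 100.0 * heap["free"] / heap["total"]
    min_free_pct = 100.0 * heap["minFree"] / heap["total"]
    # 0.0: all free memory is one block; values near 1.0: free memory is scattered
    fragmentation = 1.0 - heap["maxFreeBlock"] / heap["free"]
    return {
        "free_pct": round(free_pct, 1),
        "min_free_pct": round(min_free_pct, 1),
        "fragmentation": round(fragmentation, 2),
    }


print(heap_summary(json.loads(DEBUG_JSON)))
```

For this dump, roughly 29% of the heap is free, the low-water mark is about 25%, and the largest free block is a little over half of the free memory. On their own these numbers do not point to heap exhaustion, though they say nothing about what happened just before the watchdog reset.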

Under what circumstances will Watchdog cause a reboot?

tincanpete avatar Jan 04 '25 19:01 tincanpete

Hmm, strange, there is no backtrace in the debug data. Without a backtrace it is impossible to find out the reason. There may be many reasons, sometimes it is related to poor power supply of the ESP.

Laxilef avatar Jan 04 '25 20:01 Laxilef

Hi, I have a similar issue with system instability (frequent disconnections). The system seems to get saturated. This happens exactly after modifying the values related to Emergency mode.

By default, I have these parameters:

Target temperature: 40
Threshold time: 120

My system is configured with a minimum flow temperature of 50 degrees. If I change the target temperature to 50 degrees in Emergency mode and set the threshold to 120, the system stops working, becomes unstable, connects and disconnects continuously, and constantly activates Emergency mode.

I’ve tried this three times (always with firmware 1.5.0), and the issue has replicated every time. The only way to restore the system is to erase the firmware and flash it again.

If I leave the Emergency mode values unchanged (40°C and 120 seconds), everything works correctly.

I hope this can help. Thank you so much for the amazing work!

Symon84 avatar Jan 05 '25 11:01 Symon84

Never mind… the problem has now reappeared even without modifying the parameters. I’ll try replacing the ESP8266 (D1 Mini) to see if it’s a hardware issue. I’ll keep you updated. For the record, it worked perfectly for four consecutive days.

The disconnection issue occurs even if the device is not connected to the boiler.

Symon84 avatar Jan 05 '25 12:01 Symon84

If your ESP is powered via USB, try replacing the power supply with a different one.

Laxilef avatar Jan 05 '25 15:01 Laxilef

If your ESP is powered via USB, try replacing the power supply with a different one.

Initially, the D1 mini was powered by an external stabilized power supply (via the 5V pin). I tried replacing it with a USB power supply, but the problem persists. I also tried replacing the D1 mini with a new one, but I still have the same problem. Today I will try disabling the 5 GHz Wi-Fi network, but I have many other devices (including D1 minis) connected, and they never have connection problems.

Symon84 avatar Jan 07 '25 08:01 Symon84