espurna icon indicating copy to clipboard operation
espurna copied to clipboard

MQTT Broker reconnection

Open xoseperez opened this issue 7 years ago • 28 comments

Originally reported by: chryss (Bitbucket: chryss, GitHub: chryss)


I have a big problem with MQTT :-(.

Sometmes, my broker MQTT (Jeedom) lose ethernet connection or crash, i don't kwow why but the problem is not here.

The problem is when my broker is up again, all module Sonoff not reconnect to MQTT and are disconnect on the status. It's necessary to set MQTT to OFF, save, reboot and set MQTT to ON for all module :-(((

There is no reconnection process if it is lost ?

xoseperez avatar Jan 18 '18 09:01 xoseperez

Cannot reproduce it, moving it ahead to avoid blocking next release.

xoseperez avatar Jan 27 '18 09:01 xoseperez

I have the same problem, any solution?

fmolero avatar Mar 18 '18 21:03 fmolero

I'm running Hassio and Home Assistant on a RasPi 2 using the prebuilt image and it sometimes drops off the network, I can ping it but the web portal doesn't respond at all. This then causes two known working ESP8266 modules to just hang after a connect to broker message and then a status MQTT message is sent, they just don't do anything else at that point. It seems that the connect didn't work and should have failed which should stop the ESP's from attempting the MQTT publish but as it doesn't cause a failure the MQTT publish is attempted and is possibly a blocking call. This to me seems why this is happening.

Can someone else confirm that they also see this?

Also is there a way to stop the MQTT publish from blocking and return after a timeout, this would at least cure the problem in the short term and get us all up and running again?

I'm more than willing to run any tests or change code etc to help out, I just really need this working as my home heating is totally manual at the moment and is annoying the hell out of me, simply because it's annoying my wife, arg! lol!

Cheers, Dan

danriches avatar Apr 06 '18 11:04 danriches

I have come to the conclusion that when a sonoff can not connect for a while to the MQTT server it does not try to connect again, so it is quite logical what the comrade says.

I am also willing to do tests.

fmolero avatar Apr 06 '18 11:04 fmolero

Well I tried a ton of stuff and added debug output and I've come to the conclusion that this is not MQTT library related. At one point I pressed the button connected to GPIO0 when it was in a failed state and all of a sudden it output a "Button #0 event 1" message and carried on working as per normal. So there is an issue somewhere else possibly regarding button events perhaps or a loop that gets stuck somehow. I've reached the limit of my debug with this code so I'm going to try to roll my own very very basic pubsubclient button / relay and temp sensor modules so I can get my heating working again. I really hope this gets fixed and I'll keep tabs on the development of this otherwise really nice project. Keep up the good work @xoseperez ....

danriches avatar Apr 06 '18 14:04 danriches

I have now 4 modules that do not respond to mqtt commands, I have not yet deactivated mqtt and reactivated to do tests. I just tried to press the button connected in gpio0 in a basic sonoff and still does not respond to mqtt commands

fmolero avatar Apr 06 '18 16:04 fmolero

In the last 3 hours I threw together an AysncMqtt sketch that subscribes and responds to commands from my Home Assistant RPi 2 and it's been working now for far longer than Espurna ever has for me. That shows that the ESP8266 is fine, the MQTT library is fine, TCPAsync Lib is fine too and so is my Home Assistant. I hope this helps someone in the debugging and fixing of this, it was driving me crazy over the last week of tinkering with it. I'm now happy to report that I have my own fix for the moment though. ;)

I really hope you get some results @fmolero

danriches avatar Apr 06 '18 17:04 danriches

Sorry but my English is pretty bad, I can not understand what the solution you have found was. Mount asyncmqtt? I'm using the sonoff in jeedom, I guess I can not mount anything there, right?

fmolero avatar Apr 06 '18 20:04 fmolero

Hi @fmolero, I wrote my own quick sketch using the esp8266 asyncmqtt examples as a starting point. It doesn't use espurna at all and is just a temporary solution until this current issue is resolved. I can upload it to my github later today so you can see what i did if you like?

Regards, Dan

danriches avatar Apr 07 '18 07:04 danriches

Hi, any updates on this ? I have multiple devices, running different SW versions. Occasionally after some networking event, which I'm unable to reproduce, all devices with SW 1.12.0 are not reconnecting to the MQTT broker, and their status stays as "not connected". the only way to fix this is reboot the device. mqtt reset command from the telnet interface doesn't help as well. Other devices running older 1.10.1 are working fine.

here is a snapshot form the telnet session:

[MQTT] Connecting to broker at 192.168.86.13:1883 [MQTT] Client ID: ESPURNA_3B7D73 [MQTT] QoS: 0 [MQTT] Retain flag: 1 [MQTT] Will topic: /switch/SONOFF_2/status [MQTT] Connecting to broker at 192.168.86.13:1883 [MQTT] Client ID: ESPURNA_3B7D73 [MQTT] QoS: 0 [MQTT] Retain flag: 1 [MQTT] Will topic: /switch/SONOFF_2/status

mqtt.reset +OK [MQTT] Connecting to broker at 192.168.86.13:1883 [MQTT] Client ID: ESPURNA_3B7D73 [MQTT] QoS: 0 [MQTT] Retain flag: 1 [MQTT] Will topic: /switch/SONOFF_2/status [MQTT] Connecting to broker at 192.168.86.13:1883 [MQTT] Client ID: ESPURNA_3B7D73 [MQTT] QoS: 0 [MQTT] Retain flag: 1 [MQTT] Will topic: /switch/SONOFF_2/status [MQTT] Connecting to broker at 192.168.86.13:1883 [MQTT] Client ID: ESPURNA_3B7D73 [MQTT] QoS: 0 [MQTT] Retain flag: 1 [MQTT] Will topic: /switch/SONOFF_2/status

this goes on forever...

izhar

izharm avatar May 19 '18 06:05 izharm

I have the same situation as @danriches reported on 6 Apr.

When I make any change on the MQTT via web, web does not respond, MQTT disconnects and hardware resets. It is only possible to reach it by telnet.

Any help?

jsponz avatar Jun 19 '18 04:06 jsponz

Hi, I had the same problem on V1.13.0. My broker crach and when is up again, no reconnection from module. It's necessary reboot all module to make connection's good :-(

chryss44 avatar Aug 21 '18 09:08 chryss44

Hi, i've just updated to ESPURNA 1.13.3 but still experiencing the same issue, cant connect to mqtt mosquitto broker by host address. Not IP Address, Host Address. Is there any trick? when i change from the mosquitto broker IP address > mosquitto broker Host Address and Save the changes, it continues to work. But after a restart, mqtt protocol from ESPURNA cant connect to mosquitto, i have to change from the Host address > IP address. I want to connect without knowing IP address because of DHCP on networks, i want to be able to connect ESPURNA with mosquitto using Host Addresses. Has anyone experienced the same?

image

mgulius90 avatar Oct 25 '18 15:10 mgulius90

I believe I am also experiencing this issue. When my mosquitto resets / disconnects, the espurna devices all show "MQTT Status NOT CONNECTED" on the status page. The MQTT Server is up and Tasmota has already reconnected. If I reboot the node, then it connects to MQTT and everything works. I can somewhat reliably re-create the issue when restarting the MQTT broker. Forcing a Reconnect via the red reconnect button in the UI will get the MQTT connection working, as does a reboot of the esp8266.

I am seeing this on version ESPURNA 1.12.6a and 1.13.4-dev. Not sure when this became an issue

In the debug log, it shows

[608868] [MAIN] Free heap: 17208 bytes
[608870] [MAIN] Power: 2800 mV
[608871] [MAIN] Time: 2019-01-08 00:07:14
[699138] [MQTT] Connecting to broker at mqtt.example.org:1883
[699140] [MQTT] Connecting as user mymqttusername
[699141] [MQTT] Client ID: ESPURNA_1953C2
[699142] [MQTT] QoS: 0
[699143] [MQTT] Retain flag: 1
[699145] [MQTT] Keepalive time: 30s
[699148] [MQTT] Will topic: /places/ba6/eg/entry/smartnode/status

The above repeats every two minutes (MQTT_RECONNECT_DELAY_MAX)

Running a traffic capture, I see a DNS Query and answer (both correct) at the same interval:

23:06:45.962615 IP (tos 0x0, ttl 128, id 49355, offset 0, flags [none], proto UDP (17), length 59)
    192.168.140.179.4096 > 192.168.140.1.53: [udp sum ok] 167+ A? mqtt.example.org. (31)
23:06:45.963727 IP (tos 0x0, ttl 64, id 52718, offset 0, flags [none], proto UDP (17), length 75)
    192.168.140.1.53 > 192.168.140.179.4096: [bad udp cksum 0x9a4e -> 0x52d7!] 167* q: A? mqtt.example.org. 1/0/0 mqtt.example.org. A 192.168.140.14 (47)
23:08:45.962765 IP (tos 0x0, ttl 128, id 49441, offset 0, flags [none], proto UDP (17), length 59)
    192.168.140.179.4096 > 192.168.140.1.53: [udp sum ok] 121+ A? mqtt.example.org. (31)
23:08:45.963027 IP (tos 0x0, ttl 64, id 15283, offset 0, flags [none], proto UDP (17), length 75)
    192.168.140.1.53 > 192.168.140.179.4096: [bad udp cksum 0x9a4e -> 0x5305!] 121* q: A? mqtt.example.org. 1/0/0 mqtt.example.org. A 192.168.140.14 (47)

In addition to the above packets, I can see some mDNS activity on :5353, but there is not even an attempt to connect to the MQTT server.

I may try changing to IP to rule out any DNS issues, but based on the above resolution is working, there is no connection attempt to the broker.

Lookng at the logs from the MQTT server, after the forced reconnect

1546903245: New connection from 192.168.140.179 on port 1883.
[INFO] found mymqttusername on local database
1546903246: Client ESPURNA_1953C2 disconnected.
1546903246: New client connected from 192.168.140.179 as ESPURNA_1953C2 (c0, k30, u'mymqttusername').

So it appears MQTT server ha an old dead session which is thrown away when the connection is rebuilt, but this is the first log in hours from this IP

mitchese avatar Jan 07 '19 23:01 mitchese

@mitchese I do remember seeing related bug in async-mqtt-client repo regarding erroneous connection state, but that was never really tracked down. And best debugging option is to do it on the device itself + server, checking how tcp connection / espasynctcp library behaves.

Symptoms point to the espasynctcp<->async-mqtt-client interaction somewhere and having wrong assumptions about the flow when network performance is bad. But it is not really obvious how to reproduce it :/ Maybe recent lwip helps with such errors, as I've seen it happening extremely rarely before, but not recently with 2.5.0-git core / lwip2 (~ since the beginning of december)

mcspr avatar Jan 21 '19 16:01 mcspr

same problem here!

tota-5 avatar May 14 '19 21:05 tota-5

I also have this problem. Sometimes there are connectivity problems with my WiFi and afterwards MQTT does not reconnect. I have to reboot the device. I have tested with 1.13.5 and 1.13.6-dev.

JumpmanJunior avatar Jun 01 '19 20:06 JumpmanJunior

@JumpmanJunior @ariel-madril It would be useful to know whether SDK / Core version changes things (and lwip2 when core is >2.5.1). Do you have problems with binary releases or you build things yourself? If self-built, was it using PlatformIO or Arduino IDE?

I do see a report in #1748 which sounds something like this too, but that needs confirmation first.

mcspr avatar Jun 01 '19 21:06 mcspr

I have this problem with the binary release 1.13.5 and with a self-built 1.13.6-dev using PlatformIO.

JumpmanJunior avatar Jun 04 '19 16:06 JumpmanJunior

As previously stated, I did find some issues related to active connections & lwip1.4 & arp cache etc. At the very least, one thing is possible - to see if any are still active and try to tcp_abort(...) them, or try to workaround arp issues by sending them manually Other than that, not much to go on.

I was not being able to reproduce that behaviour with esp <-> ap <-> mqtt setup and unplugging mqtt broker computer cable for some time. Maybe will try again to do that. edit: only interesting thing is that esp was not aware of disconnection for some time. it kept sending mqtt messages, which were marked as sent to the network. assuming that router kept the mapping alive for longer than expected

Can you try changing platform for the dev build? That will require 3 manual edits to the platformio.ini:

  1. change common.platform from arduino_core_2_3_0 to arduino_core_2_5_2
  2. change -DPIO_FRAMEWORK_ARDUINO_LWIP_HIGHER_BANDWIDTH to -DPIO_FRAMEWORK_ARDUINO_LWIP2_HIGHER_BANDWIDTH in build_flags = ...
  3. change 4m3m4s to 4m or 1m0m1s to 1m (same with 2m) in build_flags_... = -Wl,-T... lines

mcspr avatar Jun 04 '19 20:06 mcspr

A firmware compiled with these parameters was behaving very strange (e.g. missing menu items). I then compiled another one with arduino_core_2_4_2, DPIO_FRAMEWORK_ARDUINO_LWIP2_HIGHER_BANDWIDTH and default build flags. This one is now running stable for a week without MQTT problems. So this would confirm your assumptions regarding lwIP 1.4,

JumpmanJunior avatar Jun 17 '19 14:06 JumpmanJunior

@JumpmanJunior thanks for checking that.

for the record, there was a small problem dev branch not triggering reconnect, fixed by c57a0abf7b57897d1be16e2b1f2eb42b38a6f8b8

i was also able to create conditions to force infinite reconnection cycle with async-mqtt-client, ref #1759 , https://github.com/marvinroger/async-mqtt-client/issues/105

  • long keepalive interval (300 seconds default is enough)
  • fast heartbeat interval (15 seconds), maybe even use json payload
  • remove network connection from the device that hosts mqtt broker

If mqtt host gets back on the network, we will trigger onDisconnect event and reconnect on the next cycle. If mqtt host changes ip (e.g. in dhcp environment), we are stuck. Sometimes for a very long time There is a timeout calculation bug in the mqtt client. It also looks like this depends on lwip versions, but they are broken other way around. But i need to prepare a smaller test case first to verify that theory

mcspr avatar Jun 17 '19 15:06 mcspr

I'm also getting this issue with 1.14.1. It's like the Espurna try to keep the MQTT connection up and doesn't detect when the MQTT broker is down.

0hax avatar Feb 05 '20 08:02 0hax

Just to check in that my Espurna on a Sonoff POW also exhibits this issue. Really odd!!

26tajeen avatar Feb 05 '20 11:02 26tajeen

It's like the Espurna try to keep the MQTT connection up and doesn't detect when the MQTT broker is down.

Does using this async-mqtt-client fix the issue? https://github.com/marvinroger/async-mqtt-client/pull/156 https://github.com/arihantdaga/async-mqtt-client/tree/temporaryfix-mqtt-stuck-loop

platformio.ini needs to have this change to lib_deps:

    diff --git a/code/platformio.ini b/code/platformio.ini
index eabcdbe5..c1d56b66 100644
--- a/code/platformio.ini
+++ b/code/platformio.ini
@@ -101,7 +101,7 @@ lib_extra_dirs =
 # ------------------------------------------------------------------------------
 lib_deps =
     [email protected]
-    https://github.com/marvinroger/async-mqtt-client#v0.8.1
+    https://github.com/arihantdaga/async-mqtt-client#d91c309c9482436867ae092cc4cf4d386ea338b3
     Brzo I2C
     https://github.com/xoseperez/debounceevent.git#2.0.5
     https://github.com/xoseperez/eeprom_rotate#0.9.2

mcspr avatar Feb 05 '20 16:02 mcspr

I have 14 devices (Sonoff Pow R2, TH, Blitzwolf) with firmware 1.14.1 from January 3rd (web update from 1.14.10, web update from 1.13.5). All devices work all the time and are connected to the same MQTT Broker (mosquitto 1.16.8 running on a dedicated Debian Intel x64 server). All infrastructure is protected by UPS. The WiFi channel is 12, no interference is detected with the WiFi analyzer. Phones, tablets, laptops are connected on 5GHz. So I'm sure it's not a WiFi connection problem.

Every day I find that one or more devices are randomly disconnected from the broker. For example, today I discovered that a Sonoff Pow R2 was disconnected at 5:30. I can't find a reason for that. I need to restart the device to restore functionality.

I have never had a problem for more than 10 months with any of these devices (version 1.13.5), same configuration. (only regular firmware updates for routers, updates, etc.). I also have another Sonoff TH with 1.13.6 Dev, modified for serial access running 24/24. There is no problem with it.

wVio avatar Feb 13 '20 13:02 wVio

The problem was solved 99% by reconfiguring the WiFi - professional parameters of the router, applying an workaround for Beamforming. The cause was identified on the ASUS RT-AC86U wireless router. Respectively after firmware update to version 3.0.0.4.384_81352, it frequently forcing WiFi clients to disconnect several times per minute. The log for router admin is usefull.

wVio avatar Apr 02 '20 11:04 wVio

I have had the same problem with Sonoff devices thought my Raspberry PI, with HA and MQTT. I first tried to downgrade the ver. 1.13.5, because it worked fine on a S20 device. This didn't solve the problem, but the I tried to remove the "Underscore" from the host name and from the MQTT Root Topic and then it worked fine again on both 1.14.1 and 1.13.5. Try it..!

krimo80 avatar Apr 16 '20 20:04 krimo80