paho.mqtt.golang

MQTT disconnects/reconnects due to missed KeepAlive interval to send ping

Open juliandroid opened this issue 6 years ago • 6 comments

The issue is related to https://github.com/eclipse/paho.mqtt.golang/issues/300 which aimed to "Use monotonic time for keep alive".

Unfortunately, the originally used time.Unix() truncates the time to whole seconds, which effectively rounds the elapsed-time comparison. Combined with checkInterval being set to KeepAlive/2, this meant that every second check succeeded and sent the Ping message, always leaving at least 0.5 seconds to send it.

With the current code I get seemingly "random" MQTT disconnects/reconnects caused by missed Pings. Because of the higher precision and the absence of "rounding", the code now sends either one or zero Ping messages per KeepAlive interval.

However, I don't believe the original code was intended to rely on this rounding to get the Ping message sent. So I've split the KeepAlive interval into five checks, and the Ping is sent at the 4th check, i.e. in the last fifth of the KeepAlive interval.

Pull request: https://github.com/eclipse/paho.mqtt.golang/pull/316

Could you please review and approve it? I don't want to create yet another account for Eclipse :)

juliandroid, May 17 '19 16:05

We encountered a serious problem while running regression tests against the current version: as you mentioned, random disconnections, plus problems with reconnecting as well. For now we have reverted to the previous version.

ovaltzer, May 19 '19 20:05

The older version essentially relies on the rounding behaviour of Unix() and leaves the PING only about 0.5s to be sent and answered, which may not be enough on a heavily loaded system. The new code merely exposes the underlying problem.

@odedva You can also try https://github.com/eclipse/paho.mqtt.golang/pull/316/files

juliandroid, May 19 '19 20:05

We have actually been dealing for a very long time with connect/reconnect issues in bad-network scenarios with this client, mainly due to the nature of the publish channels, etc. Our next goal will probably be to get something running on top of the C client, as we cannot find a better solution for synchronous connections to MQTT.

odedva, May 19 '19 22:05

I haven't tried it in a harsh network environment, but with the current implementation this 0.5s could be the issue. I'm not sure I understand what problems with the publish channels you are having. Is there a ticket for that here?

It is a bit strange that nobody has reacted to this major issue in the last 10 days. For the near future I won't be using the MQTT library anymore, so someone else will have to carry on this fight :)))

juliandroid, May 29 '19 15:05

As per the spec, the server is supposed to allow 1.5 times the keepalive interval to receive a PINGREQ:

If the Keep Alive value is non-zero and the Server does not receive a Control Packet from the Client within one and a half times the Keep Alive time period, it MUST disconnect the Network Connection to the Client as if the network had failed

I can see this would be a problem if the keepalive interval is short. I appreciate the work in the associated PR, but I cannot merge it without a signed ECA (Eclipse Contributor Agreement).

alsm, Jul 03 '19 12:07

I've run into this when the client is under load and orderMatters is set to true. https://github.com/eclipse/paho.mqtt.golang/issues/210

We also found that when we disabled ordering, the app would overflow with goroutines under load. I'm not sure whether this is still the case, but we ended up forking and modifying the library as a fix: https://github.com/meshifyiot/paho.mqtt.golang

ashtonian, Jul 16 '19 23:07