umqtt.simple socket behaviour when WiFi is degraded

Open peterhinch opened this issue 7 years ago • 16 comments

umqtt.robust works when WiFi fails completely. In testing there are circumstances when WiFi performance can be degraded to a point where blocking sockets can hang indefinitely without throwing an OSError. The behaviour is hard to replicate consistently. Moving slowly to the limit of WiFi range can (rarely) provoke it. It occurs here every few hours even when well within range, perhaps owing to RF interference. I have seen the following in many hours of testing:

  1. Socket read and write methods hanging for long periods.
  2. Publish with qos==1 failing to receive a PUBACK.

These will cause an application to hang. Possible fixes:

  1. Use a socket timeout rather than a blocking socket.
  2. Implement a timeout in the qos==1 loop which waits for a PUBACK. This raises the question of what to do if a PUBACK has not been received when the timeout expires. Arguably the publication should be repeated with dup==1 on the grounds that the broker may not have received the original message.
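
The two fixes can be combined in one place. What follows is a minimal sketch rather than the umqtt.simple implementation: the names `build_publish`, `read_exact` and `publish_qos1`, the default timeout and the retry count are all assumptions made for illustration. It also assumes a publish-only client, so that the next four bytes from the broker are the PUBACK. The socket timeout turns a hung read into an `OSError`, and each repeat is sent with the DUP flag set.

```python
import struct

def encode_remaining_length(n):
    # MQTT variable-length field: 7 bits per byte, high bit set while more bytes follow.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            byte |= 0x80
        out.append(byte)
        if not n:
            return bytes(out)

def build_publish(topic, payload, pid, dup=False):
    # Fixed header: PUBLISH is 0x30, DUP is bit 3 (0x08), QoS 1 is 0x02.
    header = 0x30 | (0x08 if dup else 0) | 0x02
    body = struct.pack("!H", len(topic)) + topic + struct.pack("!H", pid) + payload
    return bytes([header]) + encode_remaining_length(len(body)) + body

def read_exact(sock, n):
    # Read exactly n bytes; a timeout or a closed connection surfaces as OSError.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise OSError("connection closed")
        buf += chunk
    return buf

def publish_qos1(sock, topic, payload, pid, timeout_s=5, retries=3):
    # Hypothetical helper: returns True once a matching PUBACK arrives,
    # False if every attempt times out or fails.
    sock.settimeout(timeout_s)  # reads and writes now raise OSError instead of hanging
    expected = b"\x40\x02" + struct.pack("!H", pid)  # PUBACK for this packet id
    for attempt in range(retries + 1):
        try:
            # The first attempt has dup=0; every repeat sets dup=1.
            sock.sendall(build_publish(topic, payload, pid, dup=attempt > 0))
            if read_exact(sock, 4) == expected:
                return True
        except OSError:
            pass  # timed out or failed; fall through to the next attempt
    return False
```

A `False` return leaves the decision with the application: a real client would probably close the socket and reconnect at that point, since an `OSError` other than a timeout usually means the connection is dead.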

peterhinch avatar Sep 19 '16 08:09 peterhinch

Does it also hang if you restart the broker only?

phieber avatar Oct 01 '16 06:10 phieber

I tried this using example_sub_robust.py minimally modified to connect with my server. I published a couple of entries using mosquitto_pub on a PC to verify the code on the ESP8266 was working correctly, and then killed the broker process on the server. The ESP8266 immediately rebooted. This behaviour seems repeatable.

Is this intended or expected? My feeling is that this constitutes a bug and that an exception should be raised.

peterhinch avatar Oct 01 '16 16:10 peterhinch

It's perhaps worth noting that others are experiencing these issues but incorrectly raising them against micropython rather than here. See this issue.

peterhinch avatar Oct 29 '16 16:10 peterhinch

timeout property was added in https://github.com/micropython/micropython-lib/commit/1b15e3d7b72fd3f117af28b824f254a6ac8fe36c

pfalcon avatar Oct 29 '16 21:10 pfalcon

In such a simple form it will probably be useful only for publishing; subscription will require additional handling.

pfalcon avatar Oct 29 '16 21:10 pfalcon

Great, that's a step forward. I did try something along those lines when I did my testing, but it didn't solve all the problems, notably the missing PUBACK one.

peterhinch avatar Oct 30 '16 06:10 peterhinch

As I hinted in https://github.com/micropython/micropython/issues/2568#issuecomment-257143287, nothing will "solve all the problems". And there's a problem with this "arms race" of adding features to umqtt.simple: it's no longer simple, but becomes a bloated mess (soon people will report getting OOM with it). The idea behind it was to show that one can write a simple MQTT client in Python, not to make something which will work well with a weak WiFi signal or be resistant to cosmic rays. Ditto for umqtt.robust: it was an example of how one can add configurable error handling on top of that "simple" thing, not an attempt to make it 100% robust (you can't make it such; you can only optimize for a particular behavior). None of that is intended to work for all possible cases (you can't even imagine all possible cases), just to be a working example for many (not all) possible cases. You may need to tweak it, maybe heavily, maybe rewrite it, for your particular case. It's like "what to do when it rains?" There's no single answer: "stay home", "take an umbrella" and "shelter under a building's canopy" are valid and very different answers, and "invent an antigrav machine to make drops dodge you" is also a plausible answer. On a personal note, I wish you would write not your cooperative scheduler module, but your MQTT module ;-).

pfalcon avatar Oct 30 '16 11:10 pfalcon

timeout property was added in 1b15e3d

Now that's why I didn't add it before - because I'm working on 10 other things already and don't have time to test such changes, and that leads to errors:

https://github.com/micropython/micropython-lib/issues/110#issuecomment-257146648

So, I'm afraid that's not leading anywhere, and we (MicroPython maintainers) need to concentrate on further development of esp8266 instead of spreading ourselves thin.

In this regard, umqtt.simple implements a large part of the MQTT protocol and is known to work. umqtt.robust is also known to handle conditions like a server restart. Their usage in conditions like an unreliable network is outside their scope. We welcome community research and development trying to address that problem. (I would also be personally interested in looking into that, when time permits.)

Timeout changes will be reverted.

pfalcon avatar Oct 30 '16 12:10 pfalcon

Can anyone suggest an alternative for unreliable situations, if MQTT is not a good fit there?

My case is that I want a reliable solution for real-time data where WiFi is not guaranteed. If the network does not work, the program should just continue and discard unsent data; when the network is up again, no old data should be sent, only the current value. It would be nice to know whether sending the current data worked, but I can live without it. At a low level I imagine just pushing raw UDP packets instead of using a TCP connection, which would also save the device time and energy. But I'd prefer to use some standard high-level stack on both the sender (MicroPython client) and server (Python) sides. MQTT looks nice at the functional level and there are many ready-made implementations, so I started with it for my project, but it does not seem to support microcontroller-level optimizations like this?

I tried setting mqtt_client.timeout = 1 (before mqtt_client.connect()), but this does not seem to change anything: if the network is bad it still seems to hang forever.

So neither umqtt.simple nor umqtt.robust seems to fit my needs. They have the following issues:

  • They keep data and resend it later. For me this is garbage, which is also hard to detect on the server side. Sometimes no data is better than old data.
  • I do not know whether a message was sent to the server or not; publish does not tell my program whether it succeeded.
  • In the worst situation (a somewhat longer network loss) the whole program seems to stop.
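
The "current value only" behaviour over raw UDP might look something like the sketch below. It is an illustration, not part of umqtt: `make_sender` and `send_latest` are hypothetical names, and the one-second timeout is an arbitrary choice. Note that UDP gives no delivery confirmation, so a `True` result only means the datagram was handed to the network stack.

```python
import socket

def make_sender(timeout_s=1):
    # Datagram socket with a timeout so no call can block indefinitely.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout_s)
    return sock

def send_latest(sock, addr, reading):
    # Try once. On any network error the reading is dropped, never queued,
    # so stale data cannot be delivered later.
    try:
        sock.sendto(reading, addr)
        return True
    except OSError:
        return False
```

The caller would invoke `send_latest` with each fresh reading and carry on regardless of the result. On MicroPython, `addr` would normally be obtained from `socket.getaddrinfo(host, port)[0][-1]`.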

jaakla avatar Jul 01 '18 11:07 jaakla

This repo contains my attempt at a resilient, asynchronous MQTT implementation.

peterhinch avatar Jul 01 '18 12:07 peterhinch

Hi. Is there any solution for this problem? I'm using umqtt and my code gets stuck in the publish call forever when my internet connection is lost, which is very bad! There is no timeout or other option that I can use for the situation where the WiFi/internet connection is lost.

alirezaimi avatar Oct 30 '19 08:10 alirezaimi

I'm not sure why you gave mqtt_as a thumbs-down. Are you experiencing problems with it?

FYI I've been performing a long term test on the library using two Pyboard D's in different locations. So far they have accumulated >6 weeks combined runtime doing qos==1 transfers every 5 seconds without data loss. In that period WiFi went down twice on one unit and once on the other. Both resumed operation without data loss.

peterhinch avatar Oct 30 '19 10:10 peterhinch

@peterhinch Yes! There is a big problem with it for TLS on port 8883 that I cannot find any solution for, and you mention it in the known problems part of the README. https://forum.micropython.org/viewtopic.php?p=40657&sid=c8779b6e259a21eba90f1dc7cd423d5a#p40657

But I use umqtt.simple with simply the ssl=True param and everything works fine. If there were a timeout option in the umqtt lib, or any other option that let it recover easily from a broken connection, that would be awesome!

alirezaimi avatar Oct 30 '19 11:10 alirezaimi

In extensive testing by myself and Kevin Köck mqtt_as has proved reliable on ESP8266, ESP32, Pyboard 1.x and Pyboard D.

I appreciate the desire for an official solution, but the task of achieving and proving long term reliability in conditions of poor and intermittent WiFi connectivity was (inevitably) time consuming. I'm not a maintainer so the following is just a personal opinion: achieving an official solution could be expensive - assuming the developer time is available.

peterhinch avatar Jul 27 '20 14:07 peterhinch

Sometimes you don't need umqtt.robust; you just need to reconnect when something goes wrong:

from umqtt.simple import MQTTClient

Client = MQTTClient(CLIENT_ID, SERVICE_IP)
# Close the client's socket if one is open
if Client.sock:
    Client.sock.close()

yingshaoxo avatar Nov 14 '20 11:11 yingshaoxo

This is true where the client only ever publishes. One strategy is to connect to WiFi only when you are ready to publish, and to disconnect afterwards. You can detect failure to connect; you can also detect timeouts in qos==1 publications. In these cases you can react accordingly. In such applications the official clients can work.
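
The connect, publish, disconnect strategy described above can be sketched as follows. This is a hedged illustration, not code from the official clients: `wait_until` and `publish_cycle` are hypothetical names, and the WiFi and MQTT operations are passed in as callables so that the control flow stands on its own.

```python
import time

def wait_until(predicate, timeout_s, poll_s=0.1):
    # Poll predicate() until it returns true or timeout_s has elapsed.
    deadline = time.time() + timeout_s
    while not predicate():
        if time.time() >= deadline:
            return False
        time.sleep(poll_s)
    return True

def publish_cycle(wifi_connect, wifi_is_up, wifi_disconnect, publish, timeout_s=10):
    # Bring WiFi up, publish once, and always take WiFi down again.
    # Returns "ok", "no_wifi" or "publish_failed" so the caller can react.
    wifi_connect()
    try:
        if not wait_until(wifi_is_up, timeout_s):
            return "no_wifi"
        try:
            publish()
        except OSError:
            return "publish_failed"
        return "ok"
    finally:
        wifi_disconnect()
```

On a MicroPython board the first three callables would typically wrap the `connect()`, `isconnected()` and `disconnect()` methods of a `network.WLAN` station interface, and `publish` would wrap the `connect()`, `publish()` and `disconnect()` calls of an `MQTTClient`. A socket timeout is still needed inside `publish` for the `OSError` path to be reached on a degraded link.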

The difficulty arises when a client subscribes. The official clients have numerous failure modes when connectivity is variable and achieving resilience is problematic.

peterhinch avatar Nov 15 '20 07:11 peterhinch