micropython-mqtt

purpose of max_repubs?

Open: tve opened this issue 4 years ago • 8 comments

In mqtt_as' MQTT_base you have a _max_repubs config variable which defaults to 4. It governs how many times publish republishes a message on the existing connection (https://github.com/peterhinch/micropython-mqtt/blob/master/mqtt_as/mqtt_as.py#L361). I'm a bit puzzled: what is the value of republishing on the same connection? I.e., given that the connection is TCP-based, what kind of error does this try to overcome? It seems to me that if no ACK is received on the connection then there's no point resending on the same connection.

tve, Mar 08 '20 16:03
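
For readers following along, the loop being asked about can be modelled roughly as below. This is a simplified sketch, not the code in mqtt_as.py: FakeLink, publish_qos1 and attempts_on_dead_link are hypothetical names, the wait for an acknowledgement is faked with a short sleep, and raising OSError on giving up is an assumption of the sketch. It only shows how a max_repubs value bounds the number of sends on one connection.

```python
import asyncio


class FakeLink:
    """Hypothetical stand-in for a broker connection; not part of mqtt_as."""

    def __init__(self, ack_on_attempt=None):
        self.sends = 0                        # total PUBLISH packets written
        self.ack_on_attempt = ack_on_attempt  # None means "never acknowledge"

    async def send(self, pid, dup):
        self.sends += 1

    async def wait_ack(self, pid, timeout):
        if self.ack_on_attempt is not None and self.sends >= self.ack_on_attempt:
            return True
        await asyncio.sleep(timeout)  # stands in for waiting on an acknowledgement
        return False


async def publish_qos1(link, pid, max_repubs=4, timeout=0.01):
    """Send once, then resend up to max_repubs times on the same link.

    Returns the number of republishes that were needed.
    """
    await link.send(pid, dup=False)
    count = 0
    while True:
        if await link.wait_ack(pid, timeout):
            return count
        if count >= max_repubs:
            raise OSError(-1)  # assumption: caller treats this as "link down"
        await link.send(pid, dup=True)
        count += 1


def attempts_on_dead_link(max_repubs):
    """Count how many sends happen when no acknowledgement ever arrives."""
    link = FakeLink()
    try:
        asyncio.run(publish_qos1(link, pid=1, max_repubs=max_repubs))
    except OSError:
        pass
    return link.sends
```

With the default of 4, a link that never acknowledges sees one initial send plus four republishes before the sketch gives up; with max_repubs=0 it gives up after the first timeout.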

Consider the characteristics of radio links. Near the limits of range, connectivity can come and go. TCP can no longer provide a guaranteed connection - there is some phrase along the lines of "best effort" to describe its behaviour when physical connectivity is unreliable. In (a lot of) practical testing I concluded that it was worth making more than one attempt before initiating the slow process of declaring the link down and initiating reconnection.

I'm not an expert on TCP so I'm not qualified to judge the extent to which this observed behaviour of MicroPython platforms is correct. When I was developing the module I asked in the forum how TCP handled intermittent connectivity and I was informed of the "best effort" notion (this may not be the correct terminology).

peterhinch, Mar 09 '20 07:03

This is not a "best effort" thing. If you send a bunch of chars on a TCP connection and then a bunch more chars, the second set is never delivered if the first one isn't. It's totally pointless to "retransmit" something on a TCP connection: it cannot possibly be delivered to the remote application if the prior instance isn't.

tve, Mar 09 '20 15:03
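
The in-order argument above can be illustrated with a toy receiver. This is only a sketch of the byte-stream rule being described, under the assumption that segments never partially overlap; InOrderReceiver is a hypothetical name, and the code does not model LwIP or any real stack, nor does it say anything about the behaviour observed in testing.

```python
class InOrderReceiver:
    """Toy model: bytes reach the application only in sequence order."""

    def __init__(self):
        self.next_seq = 0     # next byte offset the application is waiting for
        self.pending = {}     # out-of-order segments keyed by starting offset
        self.delivered = b""  # what the application has actually been given

    def on_segment(self, seq, data):
        if seq + len(data) <= self.next_seq:
            return  # already delivered; a duplicate segment is dropped
        self.pending[seq] = data
        # Hand over every segment that is now contiguous with delivered data.
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            self.delivered += chunk
            self.next_seq += len(chunk)


rx = InOrderReceiver()
# The application wrote "PUB#1" twice: bytes 0-4, then a "republish" at bytes 5-9.
rx.on_segment(5, b"PUB#1")       # the republish arrives, the original was lost
held_back = rx.delivered         # still empty: bytes 0-4 are missing
rx.on_segment(0, b"PUB#1")       # the stack's own retransmission fills the gap
print(held_back, rx.delivered)
```

In this model the second copy is held until the first arrives, so the application ends up with both copies or neither.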

In our testing for micropython-iot we often saw that the expected behavior can't be relied on because we are working with an unreliable low-level WiFi system. We'd often get very short outages that just made messages vanish, but as far as I remember it was not always necessary to reconnect completely. Sometimes just resending would be enough, but my memory of these specific tests is a bit foggy.

My point is, it seems like you can't rely on how TCP should work in theory.

kevinkk525, Mar 09 '20 15:03

Sorry, but you're implying that LwIP's implementation of TCP is completely broken and somehow the thousands of developers using it haven't noticed.

tve, Mar 09 '20 16:03

Well, try to develop a resilient communication library for an ESP8266, testing at very low dBm, and see for yourself what we experienced. How you interpret these results is up to you.

But as I said, I'm not completely sure about the tests we did and how the republish worked, or even if we did it the same way in the micropython-iot library. I just know that we were relying on TCP to handle the connection state correctly and to resend dropped messages (due to WiFi problems) by itself, but it just didn't do it.

kevinkk525, Mar 09 '20 16:03

I'm not using MP on ESP8266 so no interest in debugging that :-). However, if LwIP delivered TCP data with a gap then that would show up majorly in many projects. I have used the ESP8266 enough to know that that doesn't happen. What you are most likely to have observed is that repeated send attempts cause LwIP to notice a retransmission failure sooner. But your implementation already retransmits ping packets every second when there's an impending failure, so retransmitting user data doesn't improve anything and just adds to complexity.

You mention your testing: do you have any artifacts from that?

tve, Mar 09 '20 16:03

I think this should be set in context.

Changes related to connectivity outages require a great deal of time-consuming testing. In the past such testing threw up many surprising edge cases, in one instance involving my sending hardware to Australia for Damien to evaluate. That one remains unresolved.

At one point, embarrassed by the complexity of the code, I started a new project with the sole purpose of finding a minimal solution to a resilient socket-like connection. The end result, after a lot of work and in collaboration with @kevinkk525, demonstrated that the complexity was actually required.

Every so often a network guru emerges declaring some of the code to be unnecessary. I fully accept that it is an empirical solution: arguably this is both a strength and a weakness.

I would greatly welcome someone with a deep grasp of networking issues and a willingness to tackle the underlying libraries, backing up their statements with fixes and enabling us to simplify the module. It is beyond my capabilities. Proving such an effort would involve much testing.

In the meantime this solution has the merit of empirical resilience on ESP8266, ESP32 and Pyboard D platforms. It may be that this particular value can be set to zero without ill effects. It is user-configurable: feel free to set it to zero and re-test on all platforms at the limits of range.

peterhinch, Mar 10 '20 17:03

Great, thanks for making this nice library available! I'm trying to add streaming publish capability and it's not trivial, which is why I have gone through the code with a fine-toothed comb. I'm also only interested in ESP32 at the moment (maybe I'll try ESP8266, but my tolerance for non-HTTPS has pretty much reached zero and I don't think MP + HTTPS is practical on the ESP8266). In any case, I'll stop bothering you with GitHub issues.

tve, Mar 10 '20 21:03