paho.mqtt.golang icon indicating copy to clipboard operation
paho.mqtt.golang copied to clipboard

Rearchitect status handling

Open MattBrittan opened this issue 3 years ago • 2 comments

Currently this package stores the connection status in a uint32 protected by a RWMutex:

status       uint32 // see const definitions at top of file for possible values
sync.RWMutex        // Protects the above two variables (note: atomic writes are also used somewhat inconsistently)

Unfortunately the status is often accessed without locking the Mutex e.g. here. While these accesses are usually via atomic.LoadUint32 they are not particularly threadsafe and the code has been complicated to work around the deficiencies (e.g. here). This is especially apparent with auto reconnection (what happens if Disconnect is called while the system is attempting to reconnect?).

Currently the available statuses are:

const (
	disconnected uint32 = iota
	connecting
	reconnecting
	connected
)

This is fairly limiting. For instance the addition of a disconnecting status would make it obvious that the disconnection process was underway avoiding the need for workarounds.

Whilst I believe that workarounds are in place for all known issues (i.e. it works) the code can be difficult to follow which discourages new contributors (e.g. this PR) and makes checking/testing contributions more difficult then it needs to be.

I have been putting off changing this for some time, due to the potential to introduce subtle bugs, but feel that the time has come where a change is needed so will be submitting a PR shortly. I would expect this change to require a few iterations; the initial change will introduce new status handling code and follow-up PR's will (not necessarily by me) will refactor code to remove workarounds introduced due to the limitations of the old mechanism.

Note: The new code will focus on being readable over performance (for example I'm aiming to remove uses of atomic until benchmarking provides a solid rationale for using them). This change will not change the API and should make no difference to users (unless bugs are introduced!).

MattBrittan avatar Aug 09 '22 02:08 MattBrittan

Changes have been merged but I'll leave this issue open for any issues arising.

MattBrittan avatar Aug 10 '22 04:08 MattBrittan

I've been running this live in four systems (some with very poor connections) for a couple of weeks without experiencing any issues. Will leave the issue open until after the next release but it appears that the change might be mostly bug free (but I would not be surprised if there is an edge case I have missed). Pull requests to clean up old work arounds would be welcome (there are a few places where steps were taken to avoid deadlocks that are no longer needed because of the centrally controlled status).

MattBrittan avatar Sep 06 '22 03:09 MattBrittan

Hi!

We currently receive sporadic errors of the following kind when subscribing to topics: "not currently connected and ResumeSubs not set"

The issue here is, if you tcpdump, there is never an interruption or a reconnect or anything happening while this error comes up. ResumeSubs is not set like the error already correctly says and in our case the application does this a lot:

  • subscribe
  • send something
  • wait for an answer on the subscribed topic
  • unsubscribe

Do you know any good ways to debug this further within the library. This is very sporadic and mostly happens on production work load and very infrequently, we still would like to see if we can track it down.

I'm kind of assuming right now, that this might relate to the merged PR from August, since its description fits our symptoms.

Thanks!

peterhoneder avatar Jan 26 '23 15:01 peterhoneder

@peterhoneder looking at the code this can only happen if !c.IsConnectionOpen() and I cannot see any way that can occur unless the connection has dropped. Of course I may be missing something and, as I wrote the code, it would be good to have another set of eyes run through it. Can you please confirm what version you are running? (if anything I would expect the changes to decrease the likelihood of false positives on this).

This would probably be better logged as a separate issue but would require a lot more information to be actionable (ideally code extracts and debug logs). This kind of issue can be very hard to track down (and is next to impossible without debug logs when I cannot duplicate the problem).

MattBrittan avatar Jan 26 '23 23:01 MattBrittan

Closing this issue off as the changes to status handling appear to be working OK (without more details I'm unable to follow up on the one comment above).

MattBrittan avatar Jul 06 '23 21:07 MattBrittan