paho.mqtt.c
paho.mqtt.c copied to clipboard
Possibly premature call of MQTTAsync_terminate leading to crashes
We have a scenario in which we're repeatedly creating new clients with MQTTAsync_create and connecting them with MQTTAsync_connect. Both of these calls are on the same thread for a particular client, though there may be other sequences of create -> connect going on for different clients on other threads.
When we get an onFailure callback for that connection (authentication error in this situation, but I don't know how relevant that is), we're signalling another thread of ours to MQTTAsync_destroy the client, then returning from the callback.
We've disabled automatic reconnect when creating the client, so I don't think that's a factor.
So there's a lot of client/connection churn going on. After repeating this for a while, we hit a seqfault inside the Paho library, usually in the _destroy function, or in the receive thread. I've attached Valgrind output showing both of these things happening. Unfortunately I've had to put "REDACTED" in place of lines mentioning our own internal code, sorry! There's absolutely nothing inside Paho in those bits of the stack though.
It looks like MQTTAsync_terminate is being called by an earlier invocation of MQTTAsync_destroy, when the client count drops to zero. However, the client count dropping to zero is a temporary state of affairs, and we'll be calling into MQTTAsync_create->MQTTAsync_connect on another thread after this happens, or even while it's happening.
Is there any chance we're misbehaving with the API? It's unlikely, as Paho seems to be very threadsafe, looking at the implementation, but I thought I should ask.
I wish we could be more helpful - we've not got a minimal testcase for this, it's happening when embedded inside our product while it's being stress tested.
Are you creating lots of new client objects so frequently that it would be better to reuse them rather than create new ones?
Hi Ian, In practice, we are not actually creating them that frequently; this is very much an extreme abusive test to find robustness and thread-safety issues in our own code, where we stumbled across this crash.
However, the pattern of use, in which sometimes more than one client is created and later destroyed while others are active is something that does sometimes happen in practice. The only difference is in volume, so this bug that we find in our extreme testing is something that could possibly occur in a real scenario, just as a lower probability.
I think the only way we could avoid this problem 100% is to have at most one client and destroy it on shutdown of our own process. This isn't really feasible as we're sometimes connecting to more than one distinct broker.
Thanks, Richard
The library is trying to be helpful in destroying its internal structures when the last client is destroyed. It seems there is still a timing window in that processing, probably when the next create client call is made while the previous destroy is taking place (guessing).
A circumvention would be to make the cleaning up of the client library data structures a separate function which the application could call when it finishes using the MQTT client library. This could be an option in the global_init call. This might be ultimately necessary as the setup/teardown processing creates and destroys the mutexes needed for locking, so my trying to be helpful won't work in a high volume client object create/destroy environment when the number of client objects goes back to 0.
Thanks Ian. I'm pondering a local workaround where we simply patch out the MQTTAsync_terminate function; so when the client count goes to zero, none of the globals (mutexes, receive + send threads) are terminated, and they're all still there when the client count increases again. Obviously it means we don't get the automatic cleanup, but it'll probably fix the crash.
Does that sound safe, or a naive and dangerous idea? :-)
When I have a spare moment I'll have another look around the code for thread-safety issues. Nothing obvious cropped up when I looked at the create/destroy functions the last time, so I'll cast my net wider.
What OS are you using? I think the safest way which will not be too difficult to add, will be to follow my suggestion of an explicit cleanup function option.
Several OSes, but the one that's easiest for me to work with (to do repro and analysis) is Linux. RHEL 7 with gcc 4.8.5 , if that helps. I'm happy to try out an early patch against our extreme-abuse testing, if you want to send one my way.