aws-c-mqtt
aws-c-mqtt copied to clipboard
Disconnect instantly while reconnecting
We noticed that trying to shut down clients in reconnecting-state may take several minutes before the on_disconnect
callback gets called. The reason appears to be that when the connection
has no channel
when it has no network, the disconnection request will not be handled until the next scheduled reconnect-task is run. This may take up to 128 seconds, which is a bit long for e.g. demanding people to wait until they can reboot their computer. In addition to this, if the client is running as a Windows service and the computer is in S0 low-power idle state, the service process gets only a fraction of normal CPU time, and it may take over 30 minutes before the reconnect task gets run.
It looks like the disconnection task is heavily tied with a channel
of the connection
, which is not available when there's no network, so currently we wait until a reconnect-task tries to create a new channel
. However once the disconnect-request has been made, the new channel
will be only used for calling s_mqtt_client_shutdown
, which is to be called after the channel
has completed its shutdown, so having a channel
should not be needed?
This patch skips the wait for the reconnect-task, and instead lets the disconnect-task to call s_mqtt_client_shutdown
directly in case there's no channel. This shutdown function was moved from client_channel_handler.c to client.c to get access to static void s_mqtt_client_shutdown()
defined within client.c. It also feels like it fits there almost better than the channel-file, as the only link to channels appears to be that s_mqtt_disconnect_task
should ensure that the channel has been shut down before s_mqtt_client_shutdown
gets called.
Testing this did not show any obviuous side-effects even for the case where reconnect task gets run after disconnect, as the call aws_atomic_store_ptr(&connection->reconnect_task->connection_ptr, NULL)
from s_mqtt_disconnect_task
appears to cause the reconnect task s_attempt_reconnect
to bail out due to connection pointer being NULL.
One worry pointed out by @kankri about this patch is the small but possible chance that the s_mqtt_disconnect_task
and s_attempt_reconnect
tasks would get run simultaneously by two separate aws event loop threads, and then s_attempt_reconnect
acquires a pointer to connection
whose content might get freed and/or invalidated by s_mqtt_disconnect_task
before s_attempt_reconnect
is done. But I guess this is already a potential problem, because the comment "If there is an outstanding reconnect task, cancel it" inside s_mqtt_disconnect_task
already assumes that there might be a reconnect task scheduled for the short term future? We could make a separate issue out of that.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
I just started oncall and will try to take a look at this as part of that. We have a longer-term refactor/evaluation of the reconnection logic planned as part of the overall iot quality pass we're working through. I think this would be a good thing to get in beforehand. I would like to move the connection's event loop to a stable one-time choice so that even when there's no channel we execute all tasks (like reconnect/shutdown) on a single thread of execution, but that will need to wait for the refactor.
I would like to move the connection's event loop to a stable one-time choice so that even when there's no channel we execute all tasks (like reconnect/shutdown) on a single thread of execution
That sounds like a good idea, as channels come and go together with typical network connectivity issues, so having a static event loop throughout the lifetime of a connection (shared with its channels?) sounds more stable.
In this PR I focused just on minimizing the changes, and therefore there's the two paths to shutdown, the old one where channel(+connection) shutdown gets scheduled into the channel event loop when possible, and the new where only connection shutdown is run in client bootstrap event loop when there's no channel. I'm also fully OK with you modifying these changes as you see fit, as I'm not on super firm ground on understanding the whole channel/connection/event loop -system, and also because our opposing time zones might make back-and-forth changes slow.