Improve handling of adapter shutdown in case of many devices subscribed for commands
The given scenario is that of an AMQP or MQTT adapter being shut down while there are a lot of devices still connected and subscribed for commands. The adapter will then close each command subscription and send corresponding "disconnectedTtdEvent" messages.
The many "unregister command consumer" requests in that scenario may cause "no credits" errors on the Command Router connection AMQP link.
A way to improve the handling here would be to combine the data of the "unregister command consumer" requests into fewer batch requests.
A similar, but less common, scenario would be that of a gateway disconnecting while having previously created a lot of command subscriptions for individual devices.
Or we might simply not close the consumers at all. In the end, the same thing happens when the adapter crashes, right? So we need to be able to cope with this situation anyway, don't we?
Yes, we could just skip closing the consumers (cleanup needed for some special cases is part of #2028).
But I guess we do have to send the "disconnectedTtdEvent" messages still - the corresponding devices might not automatically reconnect right away. Looking into this in more detail, there are cases where we currently skip sending the "disconnected" event. That happens. if the "unregister command consumer" remote call has revealed that the consumer wasn't actually active any more (because the device had already connected to a different adapter in the meantime, see #2244). We wouldn't be able to do that when skipping the "unregister command consumer" invocation altogether. But maybe we could let the command router handle sending the "disconnectedTtdEvent" messages in the case of the adapter shutting down. Technically the Command Router has the needed information in the form of the 'adapter instance to device' mappings (although this would currently require traversal over all cache entries, ~~but this is probably a problem to be solved already for #2028~~ EDIT: not necessarily).
But I guess we do have to send the "disconnectedTtdEvent" messages still - the corresponding devices might not automatically reconnect right away.
Again, we do/can not do that when an adapter crashes, so I guess we need to find a way to deal with this situation that does not depend on the adapter shutting down gracefully. Let's assume that we do not send out any messages when shutting down. Some devices will re-connect, some won't. For those that do reconnect, we do not need to do anything special, right?
For those devices that don't reconnect we need to deal with the stale information in the Command Router and the incorrect connection status. The former is already addressed by #2028, if I am not mistaken. The latter could probably be addressed when it actually becomes evident, i.e. when an application tries to send a command to such a device and delivery of the command fails due to incorrect routing information. The Command Router could then remove the wrong routing information and send a corresponding disconnected event on behalf of the adapter. Better late than never ...
WDYT? @calohmn
For those devices that don't reconnect we need to deal with the stale information in the Command Router and the incorrect connection status. [...] The latter could probably be addressed when it actually becomes evident, i.e. when an application tries to send a command to such a device and delivery of the command fails due to incorrect routing information.
Yes, the scenario would be that there still is a mapping entry for the device entry in the Infinispan cache, but the corresponding target protocol adapter instance has been found to be dead. Along with deleting the cache entry, we could send a "disconnected" event message in the Command Router. Decoupling the sending of "disconnected" event messages in that case from the actual point in time when the device is disconnected, and sending "connected" and "disconnected" events from different Hono components, could create timing issues though. If a command is sent at the device at roughly the same time that the device reconnects, and if the Command Router Kafka sender is slow to send out messages, the "disconnected" message could end up being sent last. So the downstream application will consider the connection status to be "disconnected" although the device is actually connected. Maybe we should consider moving the sending of all "connected" and "disconnected" event messages into the Command Router then.
Hi, could you assign it to me? As it will be implemented within https://github.com/eclipse/hono/issues/2069
So, what is the status here? Do we agree to always let the Command Router send the connected/disconnected event on behalf of the device? @calohmn @n-deliyski
@sophokles73 Yes, that is the idea.
Then my understanding of the discussion above is that there is no need for the protocol adapters to explicitly unregister command consumers from the Command Router nor send out disconnected events during shutdown of the adapter instance, right?
On shutdown, the protocol adapter itself doesn't have to do the sending of the disconnected events. But, considering the case that some devices might not reconnect, there should be disconnected events getting sent for these from somewhere, notably the Command Router, possibly with some delay. An explicit delay on the Command Router part before sending the events here and a prior check whether the device has disconnected in between could let the Command Router skip events that are obsolete. See https://github.com/eclipse-hono/hono/pull/3386#issuecomment-1248919429.
Summing up the proposal for optimization:
- On shutdown, Protocol Adapter will not explicitly unregister Command Consumers and will not send
disconnectedTtdEvent. - Command Router will send
disconnectedTtdEventin case of command from north bount Application to disconnected device. This will happen only if the device is marked as connected to Adapter that was already shutdown. - The sending of
connectedTtdEvent/disconnectedTtdEventwill be moved from Adapter to Command Router.
Fixed via #3422, #3434, #3436.