hono Find ways to reduce cache operations when command consumers get (temporarily) disconnected

When an AMQP/MQTT device, listening for command messages, gets disconnected from the protocol adapter, the association between device id and protocol adapter instance id gets removed in the ~~device connection~~ command router service (~~removeCommandHandlingAdapterInstance~~ unregisterCommandConsumer). When using the Infinispan cache based service implementation, this means a cache.remove operation.

When a large number of devices get disconnected at the same time, quite a lot of cache traffic will be generated. This is also due to the fact that ~~removeCommandHandlingAdapterInstance~~ unregisterCommandConsumer will trigger 2 cache operations, getWithMetadata and removeWithVersion (making sure that no newer mapping entry gets removed).

In most cases however, the removal of these kinds of cache entries isn't even necessary. When the corresponding device reconnects and subscribes for commands again, the cache entry will be overwritten.

There is only one case where a remaining, obsolete cache entry will cause problems: if a gateway subscribes for commands for all its devices, after one of these devices has previously subscribed (and unsubscribed) for commands (see #1858).

The aim here is to find a way to reduce unnecessary cache.remove operations while preventing the above possible issue with stale entries.

Jul 03 '20 17:07 calohmn

There is only one case where a remaining, obsolete cache entry will cause problems: if a gateway subscribes for commands for all its devices, after one of these devices has previously subscribed (and unsubscribed) for commands (see #1858).

An idea would be: If a device removes its command subscription, only remove the mapping entry in the device connection service if the device has a via entry set, i.e. if potentially a gateway could subscribe and handle commands for that device. In all other cases (e.g. when a gateway subscribes for commands for all its devices), removal of the mapping entry could be skipped.

A problem here would arise, if a device without a via entry has already received commands by connecting directly to a protocol adapter, and only later on the device gets a via entry. There is currently no hook to have the mapping entry get deleted on that device registration entry update.

Jul 09 '20 17:07 calohmn

There is currently no hook to have the mapping entry get deleted on that device registration entry update.

@calohmn I think we have such a hook in the meantime by means of the notifications that the registry sends, don't we?

Jun 13 '22 06:06 sophokles73

Hi, the topic looks interesting. Could I join here?

Jul 04 '22 06:07 n-deliyski

@n-deliyski Sure.

Jul 04 '22 06:07 calohmn

When this ticket was created, the protocol adapters were still directly accessing the Infinispan cache (via the Device Connection API) to update the "subscribed device id" -> "adapter instance id" mapping. Now, the Command Router API is used instead. That means there is first an AMQP request message sent to the Command Router (unregister-cmd-consumer) and the Infinispan cache entry is removed there. So, nowadays the optimization discussed here could be about skipping the unregister-cmd-consumer AMQP request message altogether, if possible.

A problem here would arise, if a device without a via entry has already received commands by connecting directly to a protocol adapter, and only later on the device gets a via entry. There is currently no hook to have the mapping entry get deleted on that device registration entry update.

@calohmn I think we have such a hook in the meantime by means of the notifications that the registry sends, don't we?

I don't know whether the device notifications would help here. The scenario would be:

Device without via entry subscribes for commands
Device disconnects. Because of the optimization discussed here, the Command-Router API unregister-cmd-consumer method isn't invoked. That means the mapping entry associating device and adapter instance isn't deleted.
At some point later, the device gets a via entry.
The corresponding gateway then subscribes for commands.

=> Commands for the device won't reach the gateway because there is still the mapping entry with the device id (such a mapping entry always has precedence over a gateway mapping entry).

Jul 04 '22 06:07 calohmn

Also relevant is #2760 here. As mentioned there already, one thing to consider here is how to deal with sending the disconnectedTtdEvent event message. Currently, the unregister-cmd-consumer Command Router request result is defining whether we send the disconnectedTtdEvent event message or not (see AbstractVertxBasedMqttProtocolAdapter.MqttDeviceEndpoint.onCommandSubscriptionRemoved()). This is important in order not to send a "disconnected" event after the device has already connected to another protocol adapter instance.

An idea mentioned in #2760 is to move the sending of the "connected" / "disconnected" event messages into the Command Router. If we don't usually make the unregister-cmd-consumer requests in the protocol adapter anymore, the question is when to send the "disconnected" event messages. As suggested in #2760, we could do so when a stale mapping entry is found upon receiving a command message. This could be quite late and would only be triggered by the downstream application. An additional way to send the "disconnected" event messages already earlier could be, that the protocol adapters keep track of which devices have recently unsubscribed/disconnected and then do periodic batch requests (e.g. every 60s) with all these device IDs to the Command Router. The Command Router would check the current mapping information to see whether a device in the list has connected to another adapter in the meantime, and, if not, send the "disconnected" event message.

Jul 04 '22 10:07 calohmn

I have collected some thoughts regarding https://github.com/eclipse/hono/issues/2069 and https://github.com/eclipse/hono/issues/2760 as both issues seems to address similar problems.

Directions for improvement:

1 Remove call to unregister-cmd-consumer. Move sending of "connected" and "disconnected" events to command router. Keep track of unsubscribed/closed devices and regularly (e.g 10 seconds) notify the command router with the list. The command router will publish "disconnected" event if needed. Having this notification delay will lead to lag in the service consuming the event (e.g Ditto) when showing the connectivity status. Some customers may be not happy with that. The approach will reduce infinispan cache operations and the AMQP calls between the adapter and command router.

2 Similar to 1). But have two different cases: if adapter is running /a/ and if adapter is stopping /b/ a) on device unsubscribe notify command router directly. b) on device unsubscribe add to list and notify command router on time interval or if the list becomes greater than a limit. The benefit is that the "disconnected" event lag is only on adapter stop. The approach will reduce infinispan cache operations and the AMQP calls between the adapter and command router only during pod restart.

3 Call unregister-cmd-consumer in batches independent of adapter state. The approach will reduce the AMQP calls between the adapter and command router. The usage of batches will increase the complexity as we have to keep try of each individual response to trigger or not the "disconnected" event.

4 Call unregister-cmd-consumer in batches only if the adapter is stopping. The benefit is that the "disconnected" event lag is only on adapter stop. The approach will reduce the AMQP calls between the adapter and command router only during pod restart. The usage of batches increases the complexity as mentioned on 1.3.

5 Handling of problematic via device use case. Use DeviceChangeNotification when a device gets via entry. Assume there is a way to identify exactly the case when via entry is added (filter the other changes). Adapter notify command router "this device got via please update its cache status properly". Additional point to have in mind: The device could be online or offline while the via entry is added. It is edge case if it is online. But this situation is possible into the current implementation too. A variant that may help here is to keep track of disconnected devices and provide this info to the command router too.

@calohmn @sophokles73 Any opinion what path to take?

Jul 12 '22 16:07 n-deliyski

I would suggest going for the first option here, introducing the unregister-cmd-consumers batch requests sent in a configurable interval, and moving the "connected" and "disconnected" events to the command router. The configured interval could be set to a low value, so that the delay in the connectivity status update would be small, while still preventing status updates if the device disconnects and then reconnects directly afterwards.

I would introduce the batch handling not only on adapter shutdown, as a significant number of devices could be controlled by some external service and be triggered to disconnect at the same time. I would move sending the "connected" and "disconnected" events to the command router in order to simplify the batch method return value - not having to specify for each device if the cache value got updated or not. Moving the event sending also has the benefit of reducing the delay between cache update and sending the event a bit, thereby reducing the possibility of timing issues (see https://github.com/eclipse/hono/issues/2760#issuecomment-1173592184).

Jul 13 '22 08:07 calohmn

I would suggest going for the first option here, introducing the unregister-cmd-consumers batch requests sent in a configurable interval, and moving the "connected" and "disconnected" events to the command router.

In order to make it a little more robust, the adapters should send a batch as soon as the batch size has been reached, i.e. they should not wait for the interval to have passed before sending the batch in this case.

When an adapter instance crashes, the worst thing that can happen is that we loose all the piled up device disconnected information and the Command Router will not be able to send corresponding disconnected events downstream. So for the time it takes the devices to connect to another adapter instance, downstream applications will still believe that the devices are connected but commands that they send to these devices will be failed by the command router, right? This time period should be short, however, and the applications need to be prepared to handle failed commands anyway because even if they did not receive a disconnect event for a device yet, the device may already have disconnected from the adapter and we simply did not notice yet (due to heart beat intervals).

Jul 13 '22 08:07 sophokles73

Here are some findings that came together with @calohmn.

At first I thought to add batch operations to both register-cmd-consumer and unregister-cmd-consumer Command Router APIs. But then I face the fact of synchronous execution. When device does MQTT subscribe request, Hono returns response only after the command consumer is created. (The synchronous flow is applicable to unsubscribe flow too.) It is required to keep that when applying the batch functionality. This means the response will be returned after the batch call is executed. This will impose a delay on the corresponding client requests. That's not so good.

To prevent any delays, we could do the batch operations only when doing the unregister as part of a AMQP/MQTT connection disconnect, or when the adapter shuts down. There, the device doesn't need to wait for the "unregister" result.

Aug 19 '22 13:08 n-deliyski

I am thinking how to proceed here.

The issue title is scoped to reduce of (Infinispan) cache operations. #2760 has already implemented an optimization during adapter shutdown.

The first comment mentions that two cache calls are involved during unregisterCommandConsumer - getWithMetadata and removeWithVersion. @calohmn Is that still applicable?

Nov 03 '22 15:11 n-deliyski

@calohmn @n-deliyski are you still working on this one? Otherwise, we could also re-schedule it to 2.3.0 ...

Nov 14 '22 08:11 sophokles73

@calohmn @n-deliyski Any news?

Nov 22 '22 07:11 sophokles73

The first comment mentions that two cache calls are involved during unregisterCommandConsumer - getWithMetadata and removeWithVersion. @calohmn Is that still applicable?

Not any more, see https://github.com/eclipse-hono/hono/commit/5505c3c6f6d4cb2ba70828188c9e38af2d02f1de.

I am thinking how to proceed here.

The issue title is scoped to reduce of (Infinispan) cache operations.

As mentioned in https://github.com/eclipse-hono/hono/issues/2069#issuecomment-1173423185 above, the focus here has shifted from reducing the (Infinispan) cache operations to reducing the unregisterCommandConsumer requests to the Command Router (since protocol adapters aren't communicating with the Infinispan server directly any more). As also mentioned above, this could be done using batch requests.

To avoid confusion regarding issue title and description, I think we could close this issue here and create a new one about reducing Command Router requests (and possible "no more credits" errors) by means of creating batch requests for "register/unregister CommandConsumer". The solution there could be based on existing work done in the abandoned #3386 PR. @n-deliyski Did you want to follow-up on that?

In general, since changes in #3386 were rather big, I think it would be good to first evaluate the real-world benefits of doing the batch requests. Since an adapter shutdown doesn't produce "unregisterCommandConsumer" requests any more, the need for "unregisterCommandConsumer" batch requests is probably not so big any more. On the other hand, if a large number of MQTT devices get disconnected by an adapter shutdown, the subsequent reconnects to another adapter instance would produce many "registerCommandConsumer" requests, which could probably better be done as batch requests.

Nov 22 '22 08:11 calohmn

@calohmn Sounds like we should remove this issue from 2.2.0 then, right?

Nov 22 '22 09:11 sophokles73

@sophokles73 yes, I've removed it from 2.2.0. I have also created #3445 as a follow-up concerning the batch requests. Discussion can be continued there.

Nov 22 '22 09:11 calohmn