mosquitto stops persisting messages when the Rust bridge disconnects
Hi,
we have the behaviour that mosquitto stops persisting messages when the tedge_mqtt_bridge connection to the cloud is interrupted (see screenshot tedge_mqtt_bridge_disconnect):
2024-08-23T06:42:30.112741403Z ERROR tedge_mqtt_bridge::health: MQTT bridge failed to connect to cloud broker: Mqtt state: Last pingreq isn't acked
The persistence store continues to grow, as expected, until 6:48:35 AM on 08/23/2024 (see the attached screenshots).
But then it stops growing, which is strange.
We use the Rust bridge and thin-edge.io 1.2.
This error does not happen when we switch to the mosquitto bridge. Is this a bug, or do we have to configure something?
When the connection is finally restored, the previously persisted messages are successfully transmitted. But these are only the old messages; messages received after the point where the persistence store stopped growing (persistence_store_grow_end) are discarded.
Hi,
we set `max_queued_messages` and `max_queued_bytes` to 0 (no limitation).
With the Mosquitto Bridge, persistence worked perfectly. After about a quarter of an hour and about 300 KB of persisted data, we stopped the test.
With the Rust Bridge, persistence stopped after approx. 5 minutes and approx. 215 KB of persisted data. At the same time, we received the following output in the MQTT Broker log
Client tedge-mapper-bridge-c8y has exceeded timeout, disconnecting.
The persisted messages are not rolling, i.e. when reconnecting, only the data directly after the disconnect is sent and not the data shortly before the reconnect.
There are two independent issues here:
- If too many messages are queued by mosquitto for the builtin bridge during a network interruption, then messages are silently dropped by mosquitto and definitely lost.
  - An entry is added in the mosquitto log telling that `Outgoing messages are being dropped for client tedge-mapper-bridge-c8y`.
  - The mosquitto clients are not even warned and can do nothing.
  - => The recommendation is to configure mosquitto with `max_queued_messages` and `max_queued_bytes` set to `0` (no limit); a sample configuration snippet is given after this list.
  - I did two experiments (1 hour then 10 hours, both with 2 messages per second) and I can confirm that `max_queued_messages 0` fixes the main issue.
- The second comment is about something different.
  - The error `Client tedge-mapper-bridge-c8y has exceeded timeout, disconnecting.` is due to the builtin bridge failing to send a keep-alive message in time.
  - => We have to address this. If the builtin bridge fails for some reason to respond on time and is disconnected by mosquitto, it should be able to properly reconnect to the local broker even if it is disconnected from the cloud at that time.
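For reference, a minimal mosquitto configuration fragment applying this recommendation could look like the snippet below. The drop-in file name and location are illustrative; any file picked up by your mosquitto configuration works.

```conf
# Illustrative drop-in file, e.g. /etc/mosquitto/conf.d/queue-limits.conf
# For both options, 0 means "no limit": messages queued for a disconnected
# or slow client (such as the bridge) are kept instead of being dropped.
max_queued_messages 0
max_queued_bytes 0
```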
Using this script to publish easy-to-check data, I checked the following (an illustrative sketch of such a publisher is given after this list):

- If the c8y mapper (which runs the builtin bridge) is stopped for a while, then no messages are lost.
- If the cloud network connection is lost and recovered before `max_queued_messages` and `max_queued_bytes` are reached, then no messages are lost.
- If the cloud connection is lost for so long that `max_queued_messages` or `max_queued_bytes` is reached, then one can observe in the mosquitto log that `Outgoing messages are being dropped for client tedge-mapper-bridge-c8y` and, when the bridge is re-established, all the messages sent between these two events are lost.
- Setting `max_queued_messages` and `max_queued_bytes` to `0` (i.e. no limit), the messages are properly queued when the connection is lost and properly published to the cloud when the connection is back. I did a first experiment with a one-hour disconnection, then a second one lasting 10 hours. In both cases no error messages were observed (neither on mosquitto nor on the builtin bridge). In the 1-hour case, all the messages were actually received on the cloud. For the 10-hour case, things are a bit more difficult to assess as C8Y aggregates older messages. The aggregated values indicate that the messages have been received for the whole period ~~but half of the messages are still persisted in the mosquitto store, which is strange~~ (these messages were correctly stored for another mapper).
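The script itself is not attached in this thread; purely as an illustration, a publisher producing the same kind of easy-to-check data (a strictly increasing counter, so any gap on the cloud side reveals a lost message) might look like the sketch below. The client id, topic, payload format and publish rate are assumptions, not the actual test setup.

```rust
// Illustrative publisher (not the script used in the test): publishes an
// increasing sequence number so that any gap in what reaches the cloud
// immediately reveals a lost message.
use rumqttc::{Client, MqttOptions, QoS};
use std::{thread, time::Duration};

fn main() {
    // Connect to the local mosquitto broker; the client id is a placeholder.
    let options = MqttOptions::new("easy-to-check-publisher", "localhost", 1883);
    let (mut client, mut connection) = Client::new(options, 10);

    // Drive the MQTT connection in a background thread so publishes go out.
    thread::spawn(move || {
        for _notification in connection.iter() {
            // Nothing to do with the notifications; iterating keeps the
            // connection alive and flushes outgoing packets.
        }
    });

    // 2 messages per second, matching the rate mentioned in the experiments.
    // The topic and payload are placeholders: in a real test they must be
    // ones that the bridge actually forwards to the cloud.
    for seq in 0u64.. {
        client
            .publish(
                "c8y/s/us",
                QoS::AtLeastOnce,
                false,
                format!("400,test_counter,counter {seq}"),
            )
            .expect("failed to queue the message");
        thread::sleep(Duration::from_millis(500));
    }
}
```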
> * The [second comment](https://github.com/thin-edge/thin-edge.io/issues/3083#issuecomment-2309913209) is about something different.
>   * The error [`Client tedge-mapper-bridge-c8y has exceeded timeout, disconnecting.`](https://github.com/eclipse/mosquitto/issues/2124#issuecomment-794452620) is due to the builtin bridge failing to send a keep-alive message in time.
>   * => We have to address this. If the builtin bridge fails for some reason to respond on time and is disconnected by mosquitto, it should be able to properly reconnect to the local broker _even_ if it is disconnected from the cloud at that time.
I guess this means we need the bridge to process forwarding messages in a separate task, so we can poll the event loop in the meantime? If I understand correctly, mosquitto will only send up to 100 messages (or however many we configure in practice) before it receives an ack, in which case this should be reasonably possible to implement.
An alternative solution: can we just increase the channel size to be greater than the maximum number of in-flight messages, so we're never blocked publishing an event? That would be a really simple solution if it works, and it may mean we can avoid `tokio::spawn` when calling `subscribe` too.
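To make these ideas concrete, here is a rough sketch of polling the local event loop in one task while a separate task performs the cloud publishes, with a forwarding channel sized above the in-flight window. This is an assumption-laden illustration built directly on rumqttc, not the actual thin-edge bridge code; host names, client ids, topics, QoS and channel sizes are all placeholders.

```rust
// Hypothetical sketch: keep polling the local event loop (which answers
// mosquitto's keep-alive) while a separate task forwards messages to the
// cloud at the cloud link's own pace.
use rumqttc::{AsyncClient, Event, MqttOptions, Packet, Publish, QoS};
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Local side: connect to mosquitto and subscribe to the bridged topics.
    let local_opts = MqttOptions::new("tedge-mapper-bridge-c8y", "localhost", 1883);
    let (local_client, mut local_eventloop) = AsyncClient::new(local_opts, 10);
    local_client
        .subscribe("c8y/#", QoS::AtLeastOnce)
        .await
        .expect("failed to queue the subscription");

    // Cloud side: placeholder endpoint.
    let cloud_opts = MqttOptions::new("bridge-cloud-side", "example.cumulocity.com", 8883);
    let (cloud_client, mut cloud_eventloop) = AsyncClient::new(cloud_opts, 10);

    // Channel sized above the expected in-flight window so handing a message
    // over rarely blocks the local polling loop.
    let (tx, mut rx) = mpsc::channel::<Publish>(200);

    // Forwarding task: publishes to the cloud independently of local polling.
    tokio::spawn(async move {
        while let Some(msg) = rx.recv().await {
            // A real bridge would also remap topic prefixes and handle
            // errors, acks and re-delivery here.
            let _ = cloud_client
                .publish(msg.topic, QoS::AtLeastOnce, msg.retain, msg.payload.to_vec())
                .await;
        }
    });

    // The cloud event loop must be polled too, for its pings and acks.
    tokio::spawn(async move {
        loop {
            if cloud_eventloop.poll().await.is_err() {
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    });

    // Local polling loop: never performs the (possibly slow) cloud publish
    // itself, so mosquitto's keep-alive keeps being answered even while the
    // cloud is unreachable.
    loop {
        match local_eventloop.poll().await {
            Ok(Event::Incoming(Packet::Publish(publish))) => {
                if tx.send(publish).await.is_err() {
                    break; // forwarding task has stopped
                }
            }
            Ok(_) => {}
            Err(_) => tokio::time::sleep(Duration::from_secs(1)).await,
        }
    }
}
```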
This issue was a combination of:
- A mosquitto misconfiguration: in order to avoid message loss during a network disconnection, one has to configure mosquitto with `max_queued_messages` and `max_queued_bytes` set to `0`, i.e. to set no maximum on the number of messages queued by mosquitto while waiting to be delivered.
- A thin-edge bug: under heavy load, notably when the network comes back after a long disconnection, the builtin bridge was failing to make any progress and was finally disconnected by mosquitto (for failing to send its heartbeats in time). This issue has been fixed by https://github.com/thin-edge/thin-edge.io/pull/3122
We looked into using the built-in bridge since we ran into the following issues with the mosquitto bridge:
- https://github.com/eclipse/mosquitto/issues/1334
- https://github.com/eclipse/mosquitto/issues/2795
Tested with a Raspberry Pi and I can confirm that the problem is not reproducible.