mosquitto stops persisting messages when the Rust bridge disconnects
Hi,
we have the behaviour that mosquitto stops persisting messages when the tedge_mqtt_bridge connection to the cloud is interrupted (see screenshot tedge_mqtt_bridge_disconnect):
2024-08-23T06:42:30.112741403Z ERROR tedge_mqtt_bridge::health: MQTT bridge failed to connect to cloud broker: Mqtt state: Last pingreq isn't acked
The persistence store continues to grow, as expected, until 6:48:35 AM on 08/23/2024 (see the attached screenshots).
But then it stops growing, which is strange.
We use the Rust bridge and thin-edge.io 1.2.
This error does not happen when we switch to the mosquitto bridge. Is this a bug, or do we have to configure something?
When the connection is finally restored, the previously persisted messages are successfully transmitted. But these are only the old messages; messages received after the point where the persistence store stopped growing (persistence_store_grow_end) are discarded.
Hi,
we set `max_queued_messages` and `max_queued_bytes` to 0 (no limitation).
With the Mosquitto Bridge, persistence worked perfectly. After about a quarter of an hour and about 300 KB of persisted data, we stopped the test.
With the Rust Bridge, persistence stopped after approx. 5 minutes and approx. 215 KB of persisted data. At the same time, we received the following output in the MQTT Broker log
Client tedge-mapper-bridge-c8y has exceeded timeout, disconnecting.
The persisted messages are not rolling, i.e. when reconnecting, only the data directly after the disconnect is sent and not the data shortly before the reconnect.
There are two independent issues here:
- If too many messages are queued by mosquitto for the builtin bridge during a network interruption, then messages are silently dropped by mosquitto and definitely lost.
  - An entry is added in the mosquitto log telling that `Outgoing messages are being dropped for client tedge-mapper-bridge-c8y`.
  - The mosquitto clients are not even warned and can do nothing.
  - => The recommendation is to configure mosquitto with `max_queued_messages` and `max_queued_bytes` set to `0` (no limit); a sample configuration snippet is given after this list.
  - I did two experiments (1 hour then 10 hours, both with 2 messages per second) and I can confirm that `max_queued_messages 0` fixes the main issue.
- The second comment is about something different.
  - The error `Client tedge-mapper-bridge-c8y has exceeded timeout, disconnecting.` is due to the builtin bridge failing to send a keep-alive message in time.
  - => We have to address this. If the builtin bridge fails for some reason to respond on time and is disconnected by mosquitto, it should be able to properly reconnect to the local broker even if it is disconnected from the cloud at that time.
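For reference, a minimal mosquitto configuration fragment applying this recommendation could look like the snippet below. The drop-in file name and location are illustrative; any file picked up by your mosquitto configuration works.

```conf
# Illustrative drop-in file, e.g. /etc/mosquitto/conf.d/queue-limits.conf
# For both options, 0 means "no limit": messages queued for a disconnected
# or slow client (such as the bridge) are kept instead of being dropped.
max_queued_messages 0
max_queued_bytes 0
```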
Using this script to publish easy-to-check data, I checked the following (an illustrative sketch of such a publisher is given after this list):

- If the c8y mapper (which runs the builtin bridge) is stopped for a while, then no messages are lost.
- If the cloud network connection is lost and recovered before `max_queued_messages` and `max_queued_bytes` are reached, then no messages are lost.
- If the cloud connection is lost for so long that `max_queued_messages` or `max_queued_bytes` is reached, then one can observe in the mosquitto log that `Outgoing messages are being dropped for client tedge-mapper-bridge-c8y` and, when the bridge is re-established, all the messages sent between these two events are lost.
- Setting `max_queued_messages` and `max_queued_bytes` to `0` (i.e. no limit), the messages are properly queued when the connection is lost and properly published to the cloud when the connection is back. I did a first experiment with a one-hour disconnection, then a second one lasting 10 hours. In both cases no error messages were observed (neither on mosquitto nor on the builtin bridge). In the 1-hour case, all the messages were actually received on the cloud. For the 10-hour case, things are a bit more difficult to assess as C8Y aggregates older messages. The aggregated values indicate that the messages have been received for the whole period ~~but half of the messages are still persisted in the mosquitto store, which is strange~~ (these messages were correctly stored for another mapper).
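The script itself is not attached in this thread; purely as an illustration, a publisher producing the same kind of easy-to-check data (a strictly increasing counter, so any gap on the cloud side reveals a lost message) might look like the sketch below. The client id, topic, payload format and publish rate are assumptions, not the actual test setup.

```rust
// Illustrative publisher (not the script used in the test): publishes an
// increasing sequence number so that any gap in what reaches the cloud
// immediately reveals a lost message.
use rumqttc::{Client, MqttOptions, QoS};
use std::{thread, time::Duration};

fn main() {
    // Connect to the local mosquitto broker; the client id is a placeholder.
    let options = MqttOptions::new("easy-to-check-publisher", "localhost", 1883);
    let (mut client, mut connection) = Client::new(options, 10);

    // Drive the MQTT connection in a background thread so publishes go out.
    thread::spawn(move || {
        for _notification in connection.iter() {
            // Nothing to do with the notifications; iterating keeps the
            // connection alive and flushes outgoing packets.
        }
    });

    // 2 messages per second, matching the rate mentioned in the experiments.
    // The topic and payload are placeholders: in a real test they must be
    // ones that the bridge actually forwards to the cloud.
    for seq in 0u64.. {
        client
            .publish(
                "c8y/s/us",
                QoS::AtLeastOnce,
                false,
                format!("400,test_counter,counter {seq}"),
            )
            .expect("failed to queue the message");
        thread::sleep(Duration::from_millis(500));
    }
}
```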
> * The [second comment](https://github.com/thin-edge/thin-edge.io/issues/3083#issuecomment-2309913209) is about something different.
>   * The error [`Client tedge-mapper-bridge-c8y has exceeded timeout, disconnecting.`](https://github.com/eclipse/mosquitto/issues/2124#issuecomment-794452620) is due to the builtin bridge failing to send a keep-alive message in time.
>   * => We have to address this. If the builtin bridge fails for some reason to respond on time and is disconnected by mosquitto, it should be able to properly reconnect to the local broker _even_ if it is disconnected from the cloud at that time.
I guess this means we need the bridge to process forwarding messages in a separate task, so we can poll the event loop in the meantime? If I understand correctly, mosquitto will only send up to 100 messages (or however many we configure in practice) before it receives an ack, in which case this should be reasonably possible to implement.
An alternative solution: can we just increase the channel size to be greater than the maximum number of in-flight messages, so we're never blocked publishing an event? That would be a really simple solution if it works, and it may mean we can avoid `tokio::spawn` when calling `subscribe` too.
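To make these ideas concrete, here is a rough sketch of polling the local event loop in one task while a separate task performs the cloud publishes, with a forwarding channel sized above the in-flight window. This is an assumption-laden illustration built directly on rumqttc, not the actual thin-edge bridge code; host names, client ids, topics, QoS and channel sizes are all placeholders.

```rust
// Hypothetical sketch: keep polling the local event loop (which answers
// mosquitto's keep-alive) while a separate task forwards messages to the
// cloud at the cloud link's own pace.
use rumqttc::{AsyncClient, Event, MqttOptions, Packet, Publish, QoS};
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Local side: connect to mosquitto and subscribe to the bridged topics.
    let local_opts = MqttOptions::new("tedge-mapper-bridge-c8y", "localhost", 1883);
    let (local_client, mut local_eventloop) = AsyncClient::new(local_opts, 10);
    local_client
        .subscribe("c8y/#", QoS::AtLeastOnce)
        .await
        .expect("failed to queue the subscription");

    // Cloud side: placeholder endpoint.
    let cloud_opts = MqttOptions::new("bridge-cloud-side", "example.cumulocity.com", 8883);
    let (cloud_client, mut cloud_eventloop) = AsyncClient::new(cloud_opts, 10);

    // Channel sized above the expected in-flight window so handing a message
    // over rarely blocks the local polling loop.
    let (tx, mut rx) = mpsc::channel::<Publish>(200);

    // Forwarding task: publishes to the cloud independently of local polling.
    tokio::spawn(async move {
        while let Some(msg) = rx.recv().await {
            // A real bridge would also remap topic prefixes and handle
            // errors, acks and re-delivery here.
            let _ = cloud_client
                .publish(msg.topic, QoS::AtLeastOnce, msg.retain, msg.payload.to_vec())
                .await;
        }
    });

    // The cloud event loop must be polled too, for its pings and acks.
    tokio::spawn(async move {
        loop {
            if cloud_eventloop.poll().await.is_err() {
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    });

    // Local polling loop: never performs the (possibly slow) cloud publish
    // itself, so mosquitto's keep-alive keeps being answered even while the
    // cloud is unreachable.
    loop {
        match local_eventloop.poll().await {
            Ok(Event::Incoming(Packet::Publish(publish))) => {
                if tx.send(publish).await.is_err() {
                    break; // forwarding task has stopped
                }
            }
            Ok(_) => {}
            Err(_) => tokio::time::sleep(Duration::from_secs(1)).await,
        }
    }
}
```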
This issue was a combination of:
- A mosquitto misconfiguration: in order to avoid message loss during a network disconnection, one has to configure mosquitto with `max_queued_messages` and `max_queued_bytes` set to `0`, i.e. to set no maximum on the number of messages queued by mosquitto while waiting to be delivered.
- A thin-edge bug: under heavy load, notably when the network comes back after a long disconnection, the builtin bridge was failing to make any progress and was finally disconnected by mosquitto (for failing to send its heartbeats in time). This issue has been fixed by https://github.com/thin-edge/thin-edge.io/pull/3122
We looked into using the built-in bridge since we ran into the following issues with the mosquitto bridge:
- https://github.com/eclipse/mosquitto/issues/1334
- https://github.com/eclipse/mosquitto/issues/2795
Tested with a Raspberry Pi and I can confirm that the problem is not reproducible.