Improve synchronisation between tedge daemons
**Is your feature request related to a problem? Please describe.**
There are many interactions between tedge daemons where one daemon expects its peer to be up and running to respond to the requests that it sends. Here are some examples:
- Once the c8y-mapper starts, request the software list from tedge-agent
- Once the c8y-bridge is created, make c8y-config-plugin send the list of all supported config files
- Once the c8y-bridge is created, make c8y-log-plugin send the list of all supported log files
- Once the c8y-mapper starts, make tedge-agent send the status of the last "executing" operation, if any
Because of these dependencies, some of these messages could be lost if the receiving service is not up and running when the requesting service sends the request.
Currently, we rely on the MQTT broker's persistent-session feature to keep such requests persisted even when the receiving service is not up and running, so that the broker can deliver them once that service starts. But this persistent-session feature is not very stable on the mosquitto broker that we use, hence we need a solution that doesn't fully rely on it.
**Describe the solution you'd like**
Make daemons request data from other daemons only once their liveness is validated. For example, the c8y-mapper should request the software list from tedge-agent only once it can confirm that tedge-agent is up and running. The `tedge/health` endpoints of these daemons could be used to check this liveness.
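As an illustration of this gating pattern, here is a minimal sketch using the rumqttc crate (the MQTT client thin-edge.io builds on). The client id, the software-list request topic and payload, and the substring check on the health payload are assumptions for the sake of the example, not the production protocol:

```rust
use rumqttc::{Client, Event, MqttOptions, Packet, QoS};

fn main() {
    let options = MqttOptions::new("c8y-mapper-demo", "localhost", 1883);
    let (client, mut connection) = Client::new(options, 10);

    // Watch the agent's health endpoint before sending it any request.
    // The status messages are retained, so even a late subscriber sees
    // the current state immediately.
    client
        .subscribe("tedge/health/tedge-agent", QoS::AtLeastOnce)
        .unwrap();

    for event in connection.iter() {
        if let Ok(Event::Incoming(Packet::Publish(msg))) = event {
            let payload = String::from_utf8_lossy(&msg.payload);
            // The health payload is JSON with a status field; a naive
            // substring check stands in for proper JSON parsing here.
            if payload.contains(r#""up""#) {
                // The peer is confirmed alive: it is now safe to send a
                // request that would otherwise risk being lost.
                client
                    .publish(
                        "tedge/commands/req/software/list",
                        QoS::AtLeastOnce,
                        false,
                        r#"{"id":"123"}"#,
                    )
                    .unwrap();
                break;
            }
        }
    }
}
```

In the real daemons this check would live inside the existing event loop rather than block start-up.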
**Describe alternatives you've considered**
Defining systemd service dependencies could be an alternative, but there are many cases where service pairs depend on each other, leading to cyclic dependencies. And even otherwise, it would be a systemd-specific solution.
I want to highlight that the issue goes beyond the stability issues we observed with mosquitto 2.0.
- The agent can successfully receive a message published by the mapper even if it was down when the message was published. But the agent must have been launched at least once, creating a subscription persisted in a named session. Messages sent just after the very first installation might be lost. This is why we introduced the `tedge_agent --init` option to create this persisted session on install (see the sketch after this list) ... with a major drawback: messages are consumed and discarded if the option is used after install.
- The issue with the bridge to Cumulocity is deeper. The bridge is created on `tedge connect c8y` and removed on `tedge disconnect c8y`. After the first install and after a disconnect, there is no bridge anymore, i.e. the `c8y/#` topics are topics without any subscribers. Any message published on a `c8y/#` topic before a `tedge connect c8y` will be lost. Here the `--init` trick to create a session cannot work, because the subscriber is mosquitto itself.
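For context, the session created by `--init` boils down to the following: open a named, non-clean session, register the subscription, and leave; the broker then queues matching QoS 1 messages for that session even while the subscriber is offline. A rough sketch with rumqttc, where the client id and topic filter are illustrative assumptions:

```rust
use rumqttc::{Client, Event, MqttOptions, Outgoing, Packet, QoS};

fn main() {
    // The session is identified by the client id; clean_session = false
    // asks the broker to keep the session (and its subscriptions) alive
    // after this client goes away.
    let mut options = MqttOptions::new("tedge-agent-demo", "localhost", 1883);
    options.set_clean_session(false);

    let (client, mut connection) = Client::new(options, 10);

    // The subscription must be made at QoS 1 (or 2): brokers only queue
    // messages for an offline session on QoS >= 1 subscriptions.
    client
        .subscribe("tedge/commands/req/#", QoS::AtLeastOnce)
        .unwrap();

    for event in connection.iter() {
        match event {
            // The broker has acknowledged the subscription: it is now part
            // of the persisted session and we can leave.
            Ok(Event::Incoming(Packet::SubAck(_))) => client.disconnect().unwrap(),
            // Stop once the DISCONNECT packet has gone out (or on any error).
            Ok(Event::Outgoing(Outgoing::Disconnect)) | Err(_) => break,
            _ => {}
        }
    }
}
```

This also shows why the trick cannot help for the bridge: there is no client that could run such an init step on behalf of mosquitto itself.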
I see the fix proposed here as the right approach.
- The thin-edge daemons must send an up status on start and a down status as their last will (see the sketch after this list).
- Init messages sent to peers must be published only as a reaction to up-status messages received from those peers.
- It's okay to re-send init messages after each restart of the bridge or of one of the thin-edge daemons. However, we must avoid resending them after each health-check response.
- The `--init` option of the mapper and the agent must no longer create a session, given the risk of discarding messages by using `--init` inadvertently.
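A sketch of the first point with rumqttc, assuming a JSON status payload on the `tedge/health` topic (the exact payload format is an assumption): the daemon registers a retained down-status as its last will before connecting, and publishes a retained up-status once connected:

```rust
use rumqttc::{Client, Event, LastWill, MqttOptions, Packet, QoS};

fn main() {
    let mut options = MqttOptions::new("tedge-agent-demo", "localhost", 1883);

    // Down-status as last will: the broker publishes it on our behalf if
    // the connection drops without a clean disconnect.
    options.set_last_will(LastWill::new(
        "tedge/health/tedge-agent",
        r#"{"status":"down"}"#,
        QoS::AtLeastOnce,
        true, // retained, so late subscribers still see the current state
    ));

    let (client, mut connection) = Client::new(options, 10);

    // Up-status on start, also retained: peers reacting to this message
    // know it is now safe to send their init requests.
    client
        .publish(
            "tedge/health/tedge-agent",
            QoS::AtLeastOnce,
            true,
            r#"{"status":"up"}"#,
        )
        .unwrap();

    // Drive the event loop until the broker acknowledges the QoS 1 publish.
    for event in connection.iter() {
        if let Ok(Event::Incoming(Packet::PubAck(_))) = event {
            break;
        }
    }
}
```

Peers then implement the second point by publishing their init messages from the handler that receives this up-status.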
Addressed this through the PRs below:
https://github.com/thin-edge/thin-edge.io/pull/2065
https://github.com/thin-edge/thin-edge.io/pull/2050
Created a follow-up ticket: https://github.com/thin-edge/thin-edge.io/issues/2070