thin-edge.io
Observation: Message on tedge/errors is recently missing when we are not connected
Describe the bug
This is an observation about functionality that has recently changed and that was detected by two system tests. Not sure whether the tests need to be updated or whether this is a regression.
Previously, when we caused a parsing error, e.g. with:
sudo tedge mqtt pub tedge/measurements {
we received a complaint on topic tedge/errors, as long as tedge-mapper-c8y was running.
Recently, it looks like we actually need to be connected to C8y to get the same error message on the error topic.
To Reproduce
These two tests were testing the behavior:
pysys.py run tedge_mapper_c8y_positive
pysys.py run tedge_mapper_c8y_negative
Expected behavior
The tests should pass or be fixed. This could be achieved by subclassing the tests from EnvironmentC8y.
Screenshots
First approach: we connect before causing the error:
sudo tedge connect c8y
sudo tedge mqtt pub tedge/measurements {
On MQTT: the error message shows up on tedge/errors (screenshot omitted).
Second approach: we only start the tedge-mapper-c8y service and then cause the error (no tedge connect c8y):
sudo systemctl start tedge-mapper-c8y.service
sudo tedge mqtt pub tedge/measurements {
On MQTT: no message shows up on tedge/errors (screenshot omitted).
Test tedge_mapper_c8y_packet_threshold_size also seems to be affected: the messages on the error topic appear only when we are connected, not when only the c8y mapper is running.
A similar issue also occurs in this test: pysys.py run mapper_awaits_before_reconnect
I think the tests need to be updated.
- tedge-mapper-c8y tries to get its c8y object ID using a JWT token and an HTTP request at mapper start-up.
- If the device is not connected to c8y, the retrieval fails.
- There is no use case where Thin Edge JSON measurements are sent to the Cumulocity mapper without connecting to c8y (in other words, without the c8y mosquitto bridge configuration).
Updating the tests is simple. We would just lose the opportunity to test the mapper in isolation (or we invent test doubles for the JWT things : )
For me this is a bad sign of internal complexity, even if the issue is minor.
- Sure, the c8y mapper is useless if the device is never connected to Cumulocity, but it should keep working during a network blip, delaying only the responses that require an interaction with Cumulocity.
- Somehow the c8y mapper is too complex or its complexity is not properly managed. Interleaving Pub/Sub over MQTT and Request/Response over HTTP is not free.
Before fixing the tests, I think we need to understand why the HTTP proxy initialization is blocking the main message loop.
Oops, I hadn't seen the last message. Here is a draft PR that makes the tests connect: https://github.com/thin-edge/thin-edge.io/pull/967 . Let's decide how we continue with the message loop in advance.
We will fix this by making the two loops (the JWT token loop and the translation loop) work independently.
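A hedged sketch of that decoupling (tokio-based; every name below is a hypothetical placeholder, not the actual mapper code): run the JWT/HTTP initialization in a background task so the translation loop starts immediately, even when the device is offline.

```rust
use tokio::task;
use tokio::time::{sleep, Duration};

// Hypothetical stand-in for the JWT-over-MQTT + HTTP retrieval
// of the device's c8y internal id, retried until it succeeds.
async fn fetch_internal_id_with_retries() -> String {
    sleep(Duration::from_millis(100)).await;
    "internal-id".to_string()
}

// Hypothetical stand-in for the Thin Edge JSON -> SmartREST loop.
// Parse failures are reported on tedge/errors; none of this needs HTTP.
async fn run_translation_loop() {
    sleep(Duration::from_millis(10)).await;
}

#[tokio::main]
async fn main() {
    // Start the HTTP/JWT initialization in the background...
    let init = task::spawn(fetch_internal_id_with_retries());

    // ...so the translation loop runs immediately, even while offline.
    run_translation_loop().await;

    // Await the internal id only at the point HTTP is actually needed.
    let internal_id = init.await.expect("init task failed");
    println!("HTTP proxy ready, internal id: {internal_id}");
}
```

With this shape, publishing complaints to tedge/errors never has to wait on the HTTP side.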
I propose that on first cloud connect, we save the internal id to a text file.
When the network is down, the http client can load the internal id from this text file.
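A minimal sketch of that proposal (the file location and function names are assumptions, not the actual implementation):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Persist the c8y internal id after the first successful cloud connect.
fn save_internal_id(path: &Path, internal_id: &str) -> io::Result<()> {
    fs::write(path, internal_id)
}

/// When the network is down, fall back to the cached id instead of
/// fetching it over HTTP.
fn load_internal_id(path: &Path) -> io::Result<String> {
    Ok(fs::read_to_string(path)?.trim().to_string())
}
```

Note this only covers the internal id; as discussed below, a fresh JWT token still cannot be obtained while the bridge is down.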
- What happens to events that arrive when the network is down? I believe they are dropped.
Before proposing a solution, here is a description of the root causes of this issue.
- MQTT and HTTP aspects are tangled due to the protocol used by c8y.
- When there is no MQTT connection to c8y, the tedge mapper cannot send data to c8y via HTTP - because a JWT token must be requested via MQTT (see the sketch after this list).
- Furthermore, to post an HTTP request to c8y one needs an internal id of the device - and this internal id has to be retrieved via HTTP.
- The c8y mapper tries to get the internal id on start-up before doing anything else, so if there is no MQTT connection to c8y when the c8y mapper starts, it is stuck in this init loop. Requesting the internal id only when required just pushes the issue a bit further: if the device is not connected to c8y via MQTT when HTTP is required for the first time - say, because an event needs to be sent over HTTP - then the mapper will be stuck in a loop trying to get a JWT token.
- Storing the internal id + JWT token once retrieved might help - but if there is no MQTT connection, then there is no way to have a fresh JWT token, hence HTTP connections will start failing too.
- The main difference between the MQTT messages sent by the mapper and the HTTP requests posted by the mapper is persistence. The MQTT messages are persisted by the local MQTT broker, while the HTTP requests are lost when the mapper fails to post them - hence the infinite loop that is the root cause of this issue.
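To make the MQTT-mediated token flow concrete, here is a hypothetical standalone sketch using the rumqttc crate. The c8y/s/uat and c8y/s/dat topics and the 71, response prefix follow Cumulocity's SmartREST convention as bridged by thin-edge; the rest is illustrative, not the mapper's actual code.

```rust
use rumqttc::{Client, Event, MqttOptions, Packet, QoS};
use std::time::Duration;

fn main() {
    // Connect to the local broker that hosts the c8y bridge.
    let mut opts = MqttOptions::new("jwt-demo", "localhost", 1883);
    opts.set_keep_alive(Duration::from_secs(5));
    let (client, mut connection) = Client::new(opts, 10);

    // Tokens come back on c8y/s/dat as "71,<jwt>" (SmartREST).
    client.subscribe("c8y/s/dat", QoS::AtLeastOnce).unwrap();
    // An empty message on c8y/s/uat asks the bridge for a token.
    client.publish("c8y/s/uat", QoS::AtLeastOnce, false, "").unwrap();

    for event in connection.iter() {
        if let Ok(Event::Incoming(Packet::Publish(msg))) = event {
            let payload = String::from_utf8_lossy(&msg.payload);
            if let Some(token) = payload.strip_prefix("71,") {
                println!("JWT token: {token}");
                break;
            }
        }
    }
    // If the bridge is down, no token ever arrives and this loop
    // waits forever - which is exactly the failure mode described above.
}
```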
In summary:
- If the c8y bridge is down, then there is no point posting an HTTP request to c8y.
- When the bridge is down, the c8y mapper must either fail or postpone and persist any request that requires HTTP (see the sketch after this list).
- When the bridge is up again, the c8y mapper must send all the pending requests that require HTTP.
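A minimal in-memory sketch of that postpone-and-flush behavior (illustrative only: none of these names exist in thin-edge.io, and a real fix would persist the queue to disk so requests survive a mapper restart):

```rust
use std::collections::VecDeque;

/// Hypothetical outbox for HTTP-bound requests while the bridge is down.
struct HttpOutbox {
    pending: VecDeque<String>, // serialized requests, e.g. event payloads
}

impl HttpOutbox {
    fn new() -> Self {
        HttpOutbox { pending: VecDeque::new() }
    }

    /// Bridge down: postpone the request instead of losing it.
    fn postpone(&mut self, request: String) {
        self.pending.push_back(request);
    }

    /// Bridge up again: drain and send every pending request, in order.
    fn flush(&mut self, mut send: impl FnMut(&str)) {
        while let Some(request) = self.pending.pop_front() {
            send(&request);
        }
    }
}
```

This mirrors for HTTP what the local MQTT broker already provides for MQTT messages: persistence until delivery is possible.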
This ticket has expanded beyond the original description. The integration tests have been switched to Robot Framework, so I feel it is no longer applicable. Closing.