Observation: Message on tedge/errors is recently missing when we are not connected

Open abelikt opened this issue 2 years ago • 8 comments

Describe the bug

This is an observation about functionality that has recently changed and that was detected by two system tests. I am not sure whether the tests need to be updated or whether something else needs fixing.

Previously, when we caused a parsing error, e.g. with this:

sudo tedge mqtt pub tedge/measurements {

we received a complaint on topic tedge/errors as long as tedge-mapper-c8y was running.

Recently, it looks like we really need to connect to C8y to get the same error message on the error topic.

To Reproduce

These two tests were testing the behavior:

pysys.py run tedge_mapper_c8y_positive
pysys.py run tedge_mapper_c8y_negative

Expected behavior

The tests should pass or be fixed. This could be achieved by subclassing the tests from EnvironmentC8y.

Screenshots

First approach. We connect before causing an error:

sudo tedge connect c8y
sudo tedge mqtt pub tedge/measurements {

image

On MQTT: image

Second approach: we only start the tedge-mapper-c8y service and then cause the error (no

sudo systemctl start tedge-mapper-c8y.service
sudo tedge mqtt pub tedge/measurements {

image

On MQTT: image

abelikt avatar Mar 01 '22 13:03 abelikt

Test tedge_mapper_c8y_packet_threshold_size also seems to be affected. The messages on the error topic appear only when we are connected, not when merely the c8y mapper is running.

image

A similar issue also shows up in this test: pysys.py run mapper_awaits_before_reconnect

abelikt avatar Mar 01 '22 14:03 abelikt

I think the tests need to be updated.

  • tedge-mapper-c8y tries to get its c8y object ID at mapper start-up, using a JWT token and an HTTP request.
  • If the device is not connected to c8y, the retrieval fails.
  • There is no use case for sending Thin Edge JSON measurements to the Cumulocity mapper without being connected to c8y (in other words, without having the c8y mosquitto bridge configuration).

rina23q avatar Mar 01 '22 16:03 rina23q

Updating the tests is simple. We would just lose the opportunity to test the mapper in isolation (or we invent test-doubles for the JWT things : )

abelikt avatar Mar 02 '22 08:03 abelikt

For me this is a bad sign of internal complexity, even if the issue is minor.

  • Sure the c8y mapper is useless if the device is never connected to Cumulocity, but it should work during a network blip, delaying only the responses that require an interaction with Cumulocity.
  • Somehow the c8y mapper is too complex or its complexity is not properly managed. Interleaving Pub/Sub over MQTT and Request/Response over HTTP is not free.

Before fixing the tests, I think we need to understand why the HTTP proxy initialization is blocking the main message loop.

didier-wenzek avatar Mar 02 '22 09:03 didier-wenzek

Oops, I hadn't seen the last message. Here is a draft PR that makes the tests connect: https://github.com/thin-edge/thin-edge.io/pull/967 . Let's decide in advance how we continue with the message loop.

abelikt avatar Mar 02 '22 14:03 abelikt

We will fix this by making the two loops (the JWT token loop and the translation loop) work independently.
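As an illustration of that idea, here is a minimal sketch in Python with asyncio. It is not the mapper's actual code, and every name in it (fetch_internal_id, translation_loop, the made-up id value) is hypothetical. It only shows that a translation loop running as its own task can still report a parsing error while the id retrieval is blocked because the bridge is down.

```python
import asyncio
import json


async def fetch_internal_id(bridge_up: asyncio.Event) -> str:
    """Hypothetical init step: blocks until the cloud bridge is up."""
    await bridge_up.wait()
    return "internal-id-123"  # made-up value, for illustration only


def translate(payload: str) -> str:
    """Stand-in for the translation step: invalid JSON yields an error message."""
    try:
        json.loads(payload)
        return "ok"
    except json.JSONDecodeError as err:
        return f"error: {err.msg}"


async def translation_loop(inbox: asyncio.Queue, outbox: list) -> None:
    """Translate messages until a None sentinel arrives."""
    while True:
        payload = await inbox.get()
        if payload is None:
            break
        outbox.append(translate(payload))


async def main() -> list:
    bridge_up = asyncio.Event()  # never set: simulates "not connected"
    inbox: asyncio.Queue = asyncio.Queue()
    outbox: list = []
    # The two loops run as separate tasks, so neither blocks the other.
    id_task = asyncio.create_task(fetch_internal_id(bridge_up))
    loop_task = asyncio.create_task(translation_loop(inbox, outbox))
    await inbox.put("{")   # the malformed measurement from this issue
    await inbox.put(None)  # stop the translation loop
    await loop_task
    id_task.cancel()       # the id retrieval is still waiting for the bridge
    return outbox


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch the error for the malformed payload is produced even though the id task never completes, which is the behavior the system tests were relying on.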

rina23q avatar Mar 08 '22 11:03 rina23q

I propose that on first cloud connect, we save the internal id to a text file.

When the network is down, the http client can load the internal id from this text file.

  • What happens to events that arrive when the network is down? I believe they are dropped.
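A rough sketch of the caching part of this proposal, in Python, could look as follows. The function name, the cache file and the fetch callback are all hypothetical, and the sketch covers only the internal id; it does not answer the question about events that arrive while the network is down.

```python
from pathlib import Path
from typing import Callable


def get_internal_id(cache_file: Path, fetch: Callable[[], str]) -> str:
    """Return the cached internal id if present; otherwise fetch and cache it.

    `fetch` stands in for the HTTP lookup (an assumption of this sketch)
    and may raise when the network is down.
    """
    if cache_file.exists():
        cached = cache_file.read_text(encoding="utf-8").strip()
        if cached:
            return cached
    internal_id = fetch()  # only reached when nothing is cached yet
    cache_file.write_text(internal_id, encoding="utf-8")
    return internal_id
```

On the first successful connect the id is written to the file; later calls return the stored value without touching the network, so a failing fetch no longer matters once the cache exists.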

cmosd avatar Jul 19 '22 15:07 cmosd

Before proposing a solution, here is a description of the root causes of this issue.

  • MQTT and HTTP aspects are tangled due to the protocol used by c8y.
  • When there is no MQTT connection to c8y, the tedge mapper cannot send data to c8y via HTTP - because a JWT token must be requested via MQTT.
  • Furthermore, to post an HTTP request to c8y one needs an internal id of the device - and this internal id has to be retrieved via HTTP.
  • The c8y mapper tries to get the internal id on start, before doing anything else. So if there is no MQTT connection to c8y when the c8y mapper starts, it is stuck in this init loop. Requesting the internal id only when required just pushes the issue a bit further: if the device is not connected to c8y via MQTT when HTTP is required for the first time - say because an event needs to be sent over HTTP - then the mapper will be stuck in a loop trying to get a JWT token.
  • Storing the internal id + JWT token once retrieved might help - but if there is no MQTT connection, then there is no way to get a fresh JWT token, hence HTTP connections will start failing too.
  • The main difference between the MQTT messages sent by the mapper and the HTTP requests posted by the mapper is persistence. The MQTT messages are persisted by the local MQTT broker, while the HTTP requests are lost when the mapper fails to post them - hence the infinite loop that is the root cause of this issue.

In summary:

  • If the c8y bridge is down, then there is no point posting an HTTP request to c8y.
  • When the bridge is down, the c8y mapper must either fail or postpone and persist any request that requires HTTP.
  • When the bridge is up again, the c8y mapper must send all the pending requests that require HTTP.
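The "postpone and persist" option summarized above could be sketched roughly like this in Python. The class, the JSON-lines file and the send callback are assumptions made for illustration, not thin-edge code; how the mapper would detect the bridge state is left out.

```python
import json
from pathlib import Path
from typing import Callable


class PendingRequests:
    """Hypothetical JSON-lines store for requests that need HTTP."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def postpone(self, request: dict) -> None:
        """Persist a request while the bridge is down."""
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(request) + "\n")

    def pending(self) -> list:
        """Return the persisted requests in arrival order."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def flush(self, send: Callable[[dict], None]) -> int:
        """Send pending requests in order; unsent ones stay persisted."""
        requests = self.pending()
        sent = 0
        try:
            for request in requests:
                send(request)
                sent += 1
        finally:
            # Rewrite the file with whatever could not be sent.
            remaining = requests[sent:]
            self.path.write_text(
                "".join(json.dumps(r) + "\n" for r in remaining),
                encoding="utf-8",
            )
        return sent


def handle(request: dict, bridge_up: bool, store: PendingRequests,
           send: Callable[[dict], None]) -> None:
    """Post directly when the bridge is up, otherwise persist for later."""
    if bridge_up:
        store.flush(send)  # older requests go out first
        send(request)
    else:
        store.postpone(request)
```

Because the pending requests live in a file, they would survive a mapper restart, which gives HTTP requests a persistence similar to what the local broker provides for MQTT messages.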

didier-wenzek avatar Jul 19 '22 16:07 didier-wenzek

This ticket has expanded from the original description. The integration tests have switched to robot framework so I feel it is no longer applicable. Closing.

reubenmiller avatar Dec 01 '22 14:12 reubenmiller