thin-edge.io
thin-edge.io copied to clipboard
Connecting to MQTT broker spins when invalid authentication config
Describe the bug
When using MQTT server authentication and invalid CA certificate was provided in tedge-config, opening connection spins on the error to verify server certificate:
2024-02-19T10:39:23.582843165Z ERROR mqtt_channel::connection: MQTT: failed to connect to broker at 'localhost:8883': TLS: I/O: invalid peer certificate: BadSignature
generating a lot of logs and wasting a lot of CPU.
To Reproduce
-
Create file
/etc/tedge/mosquitto-conf/~secure-listener.conf
[^1] with following contents:listener 8883 allow_anonymous true require_certificate false cafile /etc/mosquitto/ca_certificates/ca.crt certfile /etc/mosquitto/ca_certificates/server.crt keyfile /etc/mosquitto/ca_certificates/server.key
-
Run
gen-certs.sh
, which will generate and place a server certificate into/etc/mosquitto/ca_certificates
tests/RobotFramework/tests/mqtt/gen_certs.sh
-
Generate a different CA certificate which will be used by tedge to try to verify broker certificate and use it as
mqtt.client.auth.ca_file
openssl req \ -new \ -x509 \ -days 7 \ -extensions v3_ca \ -nodes \ -subj "/C=US/ST=Denial/L=Springfield/O=Dis/CN=ca" \ -keyout ca.key \ -out ca.crt tedge config set mqtt.client.auth.ca_file $(realpath ca.crt)
-
Ensure all thin-edge is disconnected, or tedge services are off, and restart mosquitto.
-
Try to start a tedge service, e.g.
tedge-agent
-
A lot of messages like
2024-02-19T10:39:23.54226273Z ERROR mqtt_channel::connection: MQTT: failed to connect to broker at 'localhost:8883': TLS: I/O: invalid peer certificate: BadSignature
should be generated very quickly.
Expected behavior
On failures to connect, a reasonable delay should be used, e.g. 1s after every attempt.
Additional context
I could just do a quick PR adding a delay when this particular error case, but I see that we have a match where we decide which errors should be given a timeout and which should not:
https://github.com/thin-edge/thin-edge.io/blob/64def4e4c5ed8e600d3ab090725ad3af57f1579f/crates/common/mqtt_channel/src/connection.rs#L308-L315
So I have a question: what's the reason that we apply a delay only to some of the connect errors? Should any other cases be added other than BadSignature
case which this ticket is about? Why not apply the delay to all errors when trying to connect?
[^1]: ~
is prepended to the filename so this file is loaded after the one with per_listener_settings true
directive, without it mosquitto fails to start with an error. I guess we could perhaps use the xx-filename.conf
convention for mosquitto configs to more precisely control the load order of configuration files
So I have a question: what's the reason that we apply a delay only to some of the connect errors? Should any other cases be added other than
BadSignature
case which this ticket is about? Why not apply the delay to all errors when trying to connect?
There is no good reason for that beyond that a delay has been introduced for likely transient errors (broker not started or restarting / network hiccup).
In case of a TlsError
or worse a StateError
, which is unlikely a transient error, we should consider to give up with a fatal error.
In case of a
TlsError
or worse aStateError
, which is unlikely a transient error, we should consider to give up with a fatal error.
I agree, the most important thing is to either exit with a fatal error, or apply a reasonable timeout and retry, and never do the third thing, i.e. retry immediately, because it results in spinning. So for any unknown errors, we need to decide whether we retry after delay, or exit. IMO the best would be to exit on all errors, unless it's some transient error, then we can apply a delay and retry.
QA has thoroughly checked the bug and here are the results:
- [ ] Test for ticket exists in the test suite.
- [x] QA has tested the bug and it's not reproducable anymore