thin-edge.io icon indicating copy to clipboard operation
thin-edge.io copied to clipboard

Connecting to MQTT broker spins when invalid authentication config

Open Bravo555 opened this issue 1 year ago • 2 comments

Describe the bug

When using MQTT server authentication and invalid CA certificate was provided in tedge-config, opening connection spins on the error to verify server certificate:

2024-02-19T10:39:23.582843165Z ERROR mqtt_channel::connection: MQTT: failed to connect to broker at 'localhost:8883': TLS: I/O: invalid peer certificate: BadSignature

generating a lot of logs and wasting a lot of CPU.

To Reproduce

  1. Create file /etc/tedge/mosquitto-conf/~secure-listener.conf [^1] with following contents:

    listener 8883
    allow_anonymous true
    require_certificate false
    cafile   /etc/mosquitto/ca_certificates/ca.crt
    certfile /etc/mosquitto/ca_certificates/server.crt
    keyfile  /etc/mosquitto/ca_certificates/server.key
    
  2. Run gen-certs.sh, which will generate and place a server certificate into /etc/mosquitto/ca_certificates tests/RobotFramework/tests/mqtt/gen_certs.sh

  3. Generate a different CA certificate which will be used by tedge to try to verify broker certificate and use it as mqtt.client.auth.ca_file

    openssl req \
        -new \
        -x509 \
        -days 7 \
        -extensions v3_ca \
        -nodes \
        -subj "/C=US/ST=Denial/L=Springfield/O=Dis/CN=ca" \
        -keyout ca.key \
        -out ca.crt
    
    tedge config set mqtt.client.auth.ca_file $(realpath ca.crt)
    
  4. Ensure all thin-edge is disconnected, or tedge services are off, and restart mosquitto.

  5. Try to start a tedge service, e.g. tedge-agent

  6. A lot of messages like 2024-02-19T10:39:23.54226273Z ERROR mqtt_channel::connection: MQTT: failed to connect to broker at 'localhost:8883': TLS: I/O: invalid peer certificate: BadSignature should be generated very quickly.

Expected behavior

On failures to connect, a reasonable delay should be used, e.g. 1s after every attempt.

Additional context

I could just do a quick PR adding a delay when this particular error case, but I see that we have a match where we decide which errors should be given a timeout and which should not:

https://github.com/thin-edge/thin-edge.io/blob/64def4e4c5ed8e600d3ab090725ad3af57f1579f/crates/common/mqtt_channel/src/connection.rs#L308-L315

So I have a question: what's the reason that we apply a delay only to some of the connect errors? Should any other cases be added other than BadSignature case which this ticket is about? Why not apply the delay to all errors when trying to connect?

[^1]: ~ is prepended to the filename so this file is loaded after the one with per_listener_settings true directive, without it mosquitto fails to start with an error. I guess we could perhaps use the xx-filename.conf convention for mosquitto configs to more precisely control the load order of configuration files

Bravo555 avatar Feb 19 '24 11:02 Bravo555

So I have a question: what's the reason that we apply a delay only to some of the connect errors? Should any other cases be added other than BadSignature case which this ticket is about? Why not apply the delay to all errors when trying to connect?

There is no good reason for that beyond that a delay has been introduced for likely transient errors (broker not started or restarting / network hiccup).

In case of a TlsError or worse a StateError, which is unlikely a transient error, we should consider to give up with a fatal error.

didier-wenzek avatar Feb 19 '24 17:02 didier-wenzek

In case of a TlsError or worse a StateError, which is unlikely a transient error, we should consider to give up with a fatal error.

I agree, the most important thing is to either exit with a fatal error, or apply a reasonable timeout and retry, and never do the third thing, i.e. retry immediately, because it results in spinning. So for any unknown errors, we need to decide whether we retry after delay, or exit. IMO the best would be to exit on all errors, unless it's some transient error, then we can apply a delay and retry.

Bravo555 avatar Feb 22 '24 11:02 Bravo555

QA has thoroughly checked the bug and here are the results:

  • [ ] Test for ticket exists in the test suite.
  • [x] QA has tested the bug and it's not reproducable anymore

gligorisaev avatar Apr 24 '24 06:04 gligorisaev