micropython-mqtt
`ssl.wrap_socket()` hangs on reconnecting
If the internet connection is poor, ssl.wrap_socket() sometimes blocks during reconnection. This usually causes other coroutines to stop responding. In my case, it triggers the watchdog timer, which restarts the MCU.
I found that setting do_handshake = False in ssl_params solves the issue, as wrap_socket() no longer tries to do the handshake.
My questions are:
Is this expected behavior?
If so, would you kindly add it to the documentation, or make do_handshake = False the default behavior?
Many thanks for the great project!
The mqtt_as module uses the ssl library to create an encrypted socket - see the code. Once this is done, the socket is used in exactly the same way as a normal one. Consequently where the behaviour of an SSL connection differs from that of a normal one, the cause is traceable to the ssl library. For example, as you point out, the connection process of an ssl socket can block.
I don't claim a detailed knowledge of the ssl library. In particular I don't know the implications of setting do_handshake = False. If you could enlighten me, or point me to a reference, I'd be grateful.
In terms of actions the best option (once I understand it) is for me to document this. I think specifying ssl_params should remain the responsibility of the user.
Unfortunately I don't have much knowledge of the ssl module either. I only did some quick research in the MicroPython and CPython documentation and browsed the mqtt_as code.
Here is briefly what I found:
In the MicroPython docs, the following statement in particular prompted me to try setting do_handshake, as I found that mqtt_as uses non-blocking sockets:
For non-blocking sockets (i.e. when the sock passed into wrap_socket is in non-blocking mode) the handshake should generally be deferred because otherwise wrap_socket blocks until it completes.
That's why I tried to set do_handshake = False.
The MicroPython docs don't say much more about this flag, though they refer to the CPython ssl module docs. The CPython docs say that one should call do_handshake() manually if do_handshake_on_connect is False. I don't see an equivalent in MicroPython, and the MicroPython docs seem to imply that the handshake is performed automatically on the first read or write.
Note: MicroPython's ssl.wrap_socket(do_handshake=False) is equivalent to CPython's ssl.wrap_socket(do_handshake_on_connect=False). This seems to be a naming discrepancy between the two implementations.
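For comparison, the manual handshake that the CPython docs describe can be sketched roughly as follows. This is a CPython-only sketch, not MicroPython code: it assumes tls_sock came from SSLContext.wrap_socket(..., do_handshake_on_connect=False) on a socket already set to non-blocking mode. The name finish_handshake and the poll_ms parameter are made up for illustration.

```python
import asyncio
import ssl


async def finish_handshake(tls_sock, poll_ms=10):
    """Drive a deferred TLS handshake without blocking the event loop.

    Assumption: tls_sock is a non-blocking CPython ssl.SSLSocket that was
    wrapped with do_handshake_on_connect=False.
    """
    while True:
        try:
            tls_sock.do_handshake()
            return
        except (ssl.SSLWantReadError, ssl.SSLWantWriteError):
            # The handshake needs more network I/O. Yield so that other
            # coroutines can run, then try again.
            await asyncio.sleep(poll_ms / 1000)
```

Polling with a short sleep is the simplest option here; waiting on the socket with select or the event loop's reader/writer callbacks would avoid the repeated retries.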
The MP doc states
Note that in AXTLS the handshake can be deferred until the first read or write but it then blocks until completion.
This seems to imply:
- The handshake always blocks until complete.
- If not done on connect it will be done when the socket is first used.
Under your conditions of poor connectivity, and with do_handshake=False, do you have evidence of blocking after the initial connection is complete?
The reason I ask is that, in ._connect() the code performs the following steps:
- Create a nonblocking socket and connect to broker address.
- Wrap in SSL.
- Write to the broker (clean session status, last will, user logon etc).
I'd therefore expect setting do_handshake=False to have little effect, merely deferring the blocking for a very brief period until that first write is performed.
I agree that the MP doc implies setting do_handshake=False simply defers the blocking. However, in my testing, this isn't the case. What I observe is:
- _connect() calls self._sock.setblocking(False).
- _connect() calls ssl.wrap_socket() if SSL is enabled. This call doesn't block with do_handshake=False when the connection isn't good.
- _connect() calls _as_write(), which in turn calls sock.write(). sock.write() doesn't block at all. When the connection isn't good, it simply returns 0 immediately. Then await asyncio.sleep_ms(0) yields the current coro and allows other coros to execute.
- Eventually the timeout is triggered and _as_write() fails.
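The write-and-yield behaviour in the steps above can be sketched on desktop CPython as follows. This is not the mqtt_as code: StalledSocket, as_write and demo are hypothetical names, and the fake socket only imitates a write() that returns 0 immediately on a dead link.

```python
import asyncio
import time


class StalledSocket:
    """Hypothetical stand-in for a non-blocking socket on a dead link."""

    def write(self, data):
        return 0  # nothing accepted, but the call returns at once


async def as_write(sock, data, timeout_s=1.0):
    """Write all of data, yielding between attempts.

    Raises OSError if no progress is made for timeout_s seconds.
    """
    view = memoryview(data)
    last_progress = time.monotonic()
    while view:
        n = sock.write(view)
        if n:
            view = view[n:]
            last_progress = time.monotonic()
        elif time.monotonic() - last_progress > timeout_s:
            raise OSError(-1, "Timeout on socket write")
        await asyncio.sleep(0)  # let other coroutines run


async def demo():
    ticks = 0

    async def other_task():
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.05)

    task = asyncio.create_task(other_task())
    try:
        await as_write(StalledSocket(), b"CONNECT", timeout_s=0.3)
        failed = False
    except OSError:
        failed = True
    task.cancel()
    return failed, ticks
```

Running asyncio.run(demo()) should report that the write failed with a timeout while other_task kept ticking, which matches what I see on the device.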
I tried the above scenario on an ESP32S3. I simulated a bad connection by setting the MQTT server to youtube.com, which obviously doesn't support MQTT. With do_handshake=True, the setup blocks at ssl.wrap_socket(); simply setting the flag to False makes it non-blocking all the way through.
I'm out this week. I'll come up with some example code to demonstrate the behavior and do some further testing next week. For instance, is the blocking behavior as documented if self._sock isn't set to non-blocking? I'll also try it on an RP2040 chip to see if the behavior differs from port to port.
I'll be very interested to see your results. I gather some ports use axtls and others mbedtls. The behaviour of non-blocking sockets under TLS is poorly documented. I will document your observations (with attribution).
A factor to bear in mind is that mqtt_as handles initial connection in a different way to reconnection. If the initial connection fails, the exception is thrown to the application. Reconnection is transparent. The reason for this is that failure of initial connection is typically a result of a condition needing human intervention e.g. wrong IP or credentials.
I don't think this has any bearing on your observations but the behaviour has caused confusion in the past.
This doc may be relevant. I wonder if do_handshake=False changes the handshake mode to 1?
In this branch I added some testing code to show the difference. Also a pull-request to show the diff.
Here are the logs when toggling do_handshake:
# do_handshake = True
❯ mpr -m . run mqtt_as/test_ssl_blocking.py
Local directory . is mounted at /remote
Connecting to MQTT broker, Current ticks_ms: 862177
Performing another task. tick_ms: 862180
Performing another task. tick_ms: 865190
Performing another task. tick_ms: 868200 # These prints are printed before entering `ssl.wrap_socket`. Then it blocks for 20+ seconds.
Connect() finished/aborted. Current ticks_ms: 886740 Time taken: 24563 ms
Traceback (most recent call last):
File "<stdin>", line 42, in <module>
File "asyncio/core.py", line 1, in run
File "asyncio/core.py", line 1, in run_until_complete
File "asyncio/core.py", line 1, in run_until_complete
File "<stdin>", line 36, in main
File "mqtt_as/__init__.py", line 800, in connect
File "mqtt_as/__init__.py", line 315, in _connect
File "ssl.py", line 1, in wrap_socket
File "ssl.py", line 1, in wrap_socket
OSError: [Errno 113] ECONNABORTED # This is the error raised by the SSL library
# do_handshake = False
❯ mpr -m . run mqtt_as/test_ssl_blocking.py
Local directory . is mounted at /remote
Connecting to MQTT broker, Current ticks_ms: 924911
Performing another task. tick_ms: 924913
Performing another task. tick_ms: 927920
Performing another task. tick_ms: 930930
Performing another task. tick_ms: 933930
Performing another task. tick_ms: 936930
Performing another task. tick_ms: 939930
Connect() finished/aborted. Current ticks_ms: 941156 Time taken: 16245 ms
Traceback (most recent call last):
File "<stdin>", line 42, in <module>
File "asyncio/core.py", line 1, in run
File "asyncio/core.py", line 1, in run_until_complete
File "asyncio/core.py", line 1, in run_until_complete
File "<stdin>", line 36, in main
File "mqtt_as/__init__.py", line 800, in connect
File "mqtt_as/__init__.py", line 341, in _connect
File "mqtt_as/__init__.py", line 276, in _as_write
OSError: (-1, 'Timeout on socket write') # Timeout raised by mqtt_as code
I ran the code on ESP32S3. I wanted to try on RP2040 as well. But I forgot it doesn't have WiFi capability 😂
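The difference between the two logs can be reproduced in miniature on desktop CPython with no network at all. This is only an illustration, not the test script from the branch: heartbeat, blocking_connect, yielding_connect and count_beats are hypothetical names, and time.sleep() merely stands in for a wrap_socket() call that blocks.

```python
import asyncio
import time


async def heartbeat(log, period_s=0.05):
    """Plays the role of the 'Performing another task' prints."""
    while True:
        log.append(time.monotonic())
        await asyncio.sleep(period_s)


async def blocking_connect(duration_s):
    # Stands in for a blocking wrap_socket(): the event loop is starved.
    time.sleep(duration_s)


async def yielding_connect(duration_s):
    # Stands in for a deferred handshake driven by polled, non-blocking I/O.
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        await asyncio.sleep(0.01)


async def count_beats(connect, duration_s=0.3):
    """Return how many heartbeats ran while connect() was in progress."""
    log = []
    task = asyncio.create_task(heartbeat(log))
    await asyncio.sleep(0)  # let the heartbeat start
    start = len(log)
    await connect(duration_s)
    beats = len(log) - start
    task.cancel()
    return beats


if __name__ == "__main__":
    print("blocking:", asyncio.run(count_beats(blocking_connect)))
    print("yielding:", asyncio.run(count_beats(yielding_connect)))
```

The blocking variant records no heartbeats during the connect, while the yielding variant records several, which is the same pattern as in the logs above.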
After reading the doc you linked about handshake modes, I thought it would be interesting to supply a fake CA cert and see what happens when connecting to my real MQTT server on AWS IoT Core.
And the result is surprising.
Regardless of whether I use do_handshake = True or False, the client connects and subscribes without issues. If my understanding is correct, does this mean we are using handshake mode 1 in both cases?
Thanks for that interesting result. Setting do_handshake=False evidently fixes the blocking and speeds connection, but I still have some queries.
- Does it reduce the level of security by changing the handshake mode?
- Does it defer the handshake until the first transfer?
I'm unclear whether the test clarifies the second point. The MQTT client attempts a socket write immediately after connecting, but that write fails: whether a handshake was attempted is moot.
You might try the test on a public MQTT broker which supports TLS: with do_handshake=False see if any blocking occurs after the initial socket connect when the initial write is performed. There are some public brokers listed in my docs.
- Does it reduce the level of security by changing the handshake mode?
I fully understand your concern. I'll look into it a bit as well.
- Does it defer the handshake until the first transfer?
In my particular case, handshake is usually fast enough when the connection quality is good.
My concern is rather with what happens when the server fails to respond. With do_handshake = True, it always blocks for 20+ seconds until it fails, and my hardware is unresponsive during that time. My test shows that with do_handshake = False there is no such blocking.
My application actually uses AWS IoT Core. When the connection quality is good, I don't notice any blocking with do_handshake = False in normal operation. But I'll try to test later to identify any "shorter" blocks.
After some real-world usage, I am now sure that, regardless of the value of do_handshake, the ssl module uses the CA certificate to verify the server's identity if config['ssl_params']['cert_reqs'] == ssl.CERT_REQUIRED. Otherwise, the server's certificate is not verified.
config['ssl_params'] = {
    'key': settings.MQTT_PRIVATE_KEY_PATH,
    'cert': settings.MQTT_CERTIFICATE_PATH,
    'cadata': ca,
    'server_hostname': config['server'],
    'cert_reqs': ssl.CERT_REQUIRED,
    'do_handshake': False,  # Important. Otherwise it blocks when wrapping the socket in SSL
}
I tested this by replacing my CA certificate with an invalid one.
- config['ssl_params']['cert_reqs'] = ssl.CERT_REQUIRED, correct CA, correct server: connects
- config['ssl_params']['cert_reqs'] = default, incorrect CA, correct server: connects
- config['ssl_params']['cert_reqs'] = ssl.CERT_REQUIRED, incorrect CA, correct server: refuses to connect
Thanks for the feedback. In README.md I added a pointer to this thread. You might like to post a summary of your findings for the benefit of others. As I understand it, adding do_handshake=False improves connection speed and reliability with no downsides. Is this correct?
Great! Let me summarize my findings.
With the default setting config['ssl_params']['do_handshake']=True, mqtt_client.connect() blocks. This is usually fine, as a normal connection is reasonably fast. However, if the server isn't reachable, connect() blocks for tens of seconds, if not minutes, until it times out, making my main application unresponsive.
To make connect() non-blocking, I found that config['ssl_params']['do_handshake']=False can be added. In my testing, this makes connect() and subsequent reconnections non-blocking.
I have not found that the do_handshake setting has any security implications or other negative side effects.
I did find that adding config['ssl_params']['cert_reqs'] = ssl.CERT_REQUIRED makes the client validate the identity of the MQTT broker, regardless of the do_handshake setting. I'd recommend adding this as well.
This is my version of ssl_params in my codebase:
config['ssl_params'] = {
    'key': settings.MQTT_PRIVATE_KEY_PATH,
    'cert': settings.MQTT_CERTIFICATE_PATH,
    'cadata': ca,
    'server_hostname': config['server'],
    'cert_reqs': ssl.CERT_REQUIRED,  # Makes the client validate the identity of the broker
    'do_handshake': False,  # Makes connect() non-blocking
}
Thank you - an excellent summary!
I'll leave this issue open in case anyone wishes to add further observations.