Client xxxx disconnected: Protocol error: Caused by Packet fragmentation; TLS connection
I'm investigating a possible bug with mosquitto 2.0.15, where the server disconnects TLS clients after some time.
Our server has 4000+ TLS connections open with various clients (libmosquitto). The server is built from source (2.0.15 tag) and runs OpenSSL 1.1.1f. This setup was running fine on an Amazon AWS EC2 instance until one week ago, when the server suddenly started to close mosquitto client connections (several per minute), with the following in the log file:
1663166020: Client UB1000000217 disconnected: Protocol error.
1663166041: Client UB1000000437 disconnected: Protocol error.
1663166055: Client UB1000000437 disconnected: Protocol error.
1663166080: Client UB1000000217 disconnected: Protocol error.
1663166115: Client UB1000000437 disconnected: Protocol error.
1663166141: Client UB1000000217 disconnected: Protocol error.
1663166175: Client UB1000000437 disconnected: Protocol error.
1663166183: Client UB1000000437 disconnected: Protocol error.
1663166201: Client UB1000000217 disconnected: Protocol error.
1663166245: Client UB1000000437 disconnected: Protocol error.
1663166260: Client UB100000007D disconnected: Protocol error.
1663166262: Client UB1000000217 disconnected: Protocol error.
1663166269: Client UB10000000F2 disconnected: Protocol error.
1663166306: Client UB1000000437 disconnected: Protocol error.
1663166313: Client UB1000000437 disconnected: Protocol error.
1663166322: Client UB1000000217 disconnected: Protocol error.
1663166376: Client UB1000000437 disconnected: Protocol error.
1663166382: Client UB1000000217 disconnected: Protocol error.
1663166515: Client UB10000002EA disconnected: Protocol error.
1663166531: Client UB10000002EA disconnected: Protocol error.
I think it is related to MTU packet fragmentation (somehow). It seems that mosquitto tries to read data (SSL_read) and the read function returns 0; this is then interpreted in net__handle_ssl as the client closing the connection. I don't think that is correct. I'm still investigating, but I was wondering whether somebody has seen this before?
I made some changes in mux_epoll.c and packet_mosq.c, and the issue seems to be solved. I still need to clean up the fix and will submit it then.
The large number of disconnects (TLS connections, with messages > 100 kbyte) seems to be fixed.
Hey @Kbij,
Any update on this issue? I am having the same issue. I see it on a bridge connection; the local connection is still working.
Thanks
Hi,
I created a patch, and connections are no longer being dropped. However, I think I now have an issue with 100% CPU usage, so it needs further fixing (and cleanup). If you want, I can send you the patch files.
@Kbij : I would appreciate it, if you could send me the changes that you made.
This issue is happening for me again, on only one bridge connection (so far). Do you have any more info on what is causing it? I am using a self-signed cert. Could that be the issue, or is it internal to the library?
Hi,
I have shared the patch here: https://drive.google.com/file/d/1JXs5m-CltQODG6kDm5horXbqeEmVe-8e/view?usp=share_link
The patch is not complete: it causes 100% CPU usage when a connection is closed. I think I will spend some time this week to fix that 100% CPU usage as well.
About the cause of the error: see above. It is caused by an SSL read function returning 0 bytes (which normally means: connection closed). But here 0 bytes does not mean that the connection is closed. When using SSL, it can mean that not all data needed for decryption has arrived yet, so currently 0 decrypted bytes are available.
I tried to override this condition, but now I can't detect a normal closure of the socket; I probably need a different approach.
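To make the "which kind of 0 is this" question concrete, here is a minimal sketch of how a non-positive SSL_read result could be classified. It is not the code from mosquitto or from the patch. The OpenSSL documentation says that for any return value <= 0 the caller should ask SSL_get_error() for the reason rather than rely on the 0 vs. -1 distinction. The names `read_action` and `classify_ssl_read` are made up for illustration, and the `#define` fallbacks only exist so the snippet compiles without the OpenSSL headers.

```c
/* In real code: #include <openssl/ssl.h> before this block and pass
 * SSL_get_error(ssl, ret) as ssl_err. The fallbacks below mirror the
 * values in <openssl/ssl.h> and are skipped when that header is present. */
#ifndef SSL_ERROR_NONE
#define SSL_ERROR_NONE        0
#define SSL_ERROR_SSL         1
#define SSL_ERROR_WANT_READ   2
#define SSL_ERROR_WANT_WRITE  3
#define SSL_ERROR_SYSCALL     5
#define SSL_ERROR_ZERO_RETURN 6
#endif

/* Hypothetical result type for this sketch. */
enum read_action {
    READ_GOT_DATA,     /* ret > 0: decrypted bytes were returned */
    READ_RETRY_LATER,  /* record incomplete: wait for the next socket event */
    READ_PEER_CLOSED,  /* TLS close_notify received: clean shutdown */
    READ_FATAL         /* anything else: treat the connection as broken */
};

/* ret is the SSL_read() return value, ssl_err the SSL_get_error() result. */
enum read_action classify_ssl_read(int ret, int ssl_err)
{
    if (ret > 0) {
        return READ_GOT_DATA;
    }
    switch (ssl_err) {
    case SSL_ERROR_WANT_READ:
    case SSL_ERROR_WANT_WRITE:
        /* Not a close: the TLS layer needs more I/O before it can
         * hand over any application data. */
        return READ_RETRY_LATER;
    case SSL_ERROR_ZERO_RETURN:
        return READ_PEER_CLOSED;
    default:
        /* SSL_ERROR_SYSCALL, SSL_ERROR_SSL, ... */
        return READ_FATAL;
    }
}
```

Only the READ_RETRY_LATER case should leave the socket open; the other non-data cases still need the connection to be closed, otherwise a closed socket keeps generating events.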
Thanks for all the info.
I will look into the patch as well.
Hi,
I ran into a similar issue where the mosquitto client disconnected with rc = 14 (ERRNO). Unfortunately, we don't have the errno printed. Strangely, after this disconnect mosquitto never reconnects (mosquitto_loop_forever). When we restarted the process, the initial async_connect succeeded. This pattern happens frequently. The reason for rc = 14 is still not known, and I believe it happens only in 2.0.15. My suspicion is that the errno is EPROTO (because that's the only case where the loop returns).
The problem might be similar to this ticket. May I know if this patch, with a proper fix for the 100% CPU issue, has been submitted to the git main branch?
Thanks
Hi,
Unfortunately, I was unable to fix the bug, for multiple reasons ;-)
- The issue only occurs on our live system, so testing means restarting our live MQTT server.
- I don't really understand the issue properly. It seems complicated (an SSL/socket issue).
And as suddenly as it appeared, it disappeared. The 100% CPU issue is now gone, but the version with the patch is still active, so I don't know if the original (SSL) issue is gone as well. Again: I would need to restart our live MQTT server, and that's something we don't really want ;-)
But I understand where the 100% CPU usage comes from. The patch that I shared above creates the following loop: when a socket is closed, an epoll event is generated -> I don't close the socket (that is, roughly, the SSL fix) -> a new epoll event is generated -> and so on. I don't know how to differentiate between the different kinds of "0 bytes read":
- The SSL case, which is the issue that I noticed: the SSL library says 0 bytes read while the socket is still open. There are SSL library calls that should tell the difference, but they made no real difference.
- The socket case: 0 bytes read because the socket is really closed.
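One way to check the second case independently of the SSL layer is to peek at the raw socket. This is only a sketch under assumptions: it is not what mosquitto or the shared patch does, the helper name `socket_peer_closed` is made up, and it relies on `MSG_DONTWAIT`, which is available on Linux and the BSDs.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: returns true when the peer has closed the
 * connection and no unread bytes remain in the kernel receive buffer.
 * MSG_PEEK leaves any pending data in place for the TLS layer. */
bool socket_peer_closed(int fd)
{
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);

    if (n > 0) {
        return false;   /* data is still waiting to be read */
    }
    if (n == 0) {
        return true;    /* end of stream: the peer really closed */
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) {
        return false;   /* open, just nothing to read right now */
    }
    return true;        /* hard error such as ECONNRESET */
}
```

With a check like this, an event handler that decided to ignore a "0 bytes" result from the SSL layer could still close the socket when the peer is gone, which would break the event loop described above. On Linux, registering for `EPOLLRDHUP` is another way to be told about a peer shutdown.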
@Kbij Thanks again for the info.
I was only seeing this on one of our local servers (running on Windows, where the client installed some anti-virus software). The issue only occurs when the bridge connection (used for remote debugging) is enabled. I made the decision to disable the bridge for that client (not the best solution, I know, but I don't have the time at the moment to really dig into this). Thanks again for your help. I really appreciate the time you took to look into this; hopefully the next version of mosquitto will have a patch.
@gopicisco, about mosquitto never reconnecting: I had to map the return codes of mosquitto_loop_forever and make my application reconnect to the MQTT broker when they occur. (I was expecting it to always reconnect automatically, without my application intervening, but I'm unsure whether this behavior is expected or not.)
I had to trigger the reconnect on the following return codes, because the reconnect does not happen automatically for them:
- MOSQ_ERR_NOMEM
- MOSQ_ERR_PROTOCOL
- MOSQ_ERR_INVAL
- MOSQ_ERR_NOT_FOUND
- MOSQ_ERR_TLS
- MOSQ_ERR_PAYLOAD_SIZE
- MOSQ_ERR_NOT_SUPPORTED
- MOSQ_ERR_AUTH
- MOSQ_ERR_ACL_DENIED
- MOSQ_ERR_UNKNOWN
- MOSQ_ERR_EAI
- MOSQ_ERR_PROXY
- MOSQ_ERR_ERRNO (could not confirm whether this should be done for all cases or just for specific errno values)
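The list above can be turned into a small helper, sketched here under some assumptions. The function name `needs_manual_reconnect` is made up. The enum is a standalone copy of the relevant values from mosquitto.h (2.0.x) so that the snippet compiles without libmosquitto; in a real application, include mosquitto.h and delete the enum. Treating MOSQ_ERR_ERRNO as reconnect-worthy in all cases is also an assumption, since that point is unconfirmed.

```c
#include <stdbool.h>

/* Standalone copy of selected mosq_err_t values, for illustration only.
 * In real code: #include <mosquitto.h> and remove this enum. */
enum {
    MOSQ_ERR_SUCCESS = 0,
    MOSQ_ERR_NOMEM = 1,
    MOSQ_ERR_PROTOCOL = 2,
    MOSQ_ERR_INVAL = 3,
    MOSQ_ERR_NO_CONN = 4,
    MOSQ_ERR_CONN_REFUSED = 5,
    MOSQ_ERR_NOT_FOUND = 6,
    MOSQ_ERR_CONN_LOST = 7,
    MOSQ_ERR_TLS = 8,
    MOSQ_ERR_PAYLOAD_SIZE = 9,
    MOSQ_ERR_NOT_SUPPORTED = 10,
    MOSQ_ERR_AUTH = 11,
    MOSQ_ERR_ACL_DENIED = 12,
    MOSQ_ERR_UNKNOWN = 13,
    MOSQ_ERR_ERRNO = 14,
    MOSQ_ERR_EAI = 15,
    MOSQ_ERR_PROXY = 16
};

/* Hypothetical helper: true if rc, as returned by mosquitto_loop_forever(),
 * is one of the codes listed above, i.e. the application has to trigger
 * the reconnect itself. */
bool needs_manual_reconnect(int rc)
{
    switch (rc) {
    case MOSQ_ERR_NOMEM:
    case MOSQ_ERR_PROTOCOL:
    case MOSQ_ERR_INVAL:
    case MOSQ_ERR_NOT_FOUND:
    case MOSQ_ERR_TLS:
    case MOSQ_ERR_PAYLOAD_SIZE:
    case MOSQ_ERR_NOT_SUPPORTED:
    case MOSQ_ERR_AUTH:
    case MOSQ_ERR_ACL_DENIED:
    case MOSQ_ERR_UNKNOWN:
    case MOSQ_ERR_EAI:
    case MOSQ_ERR_PROXY:
    case MOSQ_ERR_ERRNO:   /* assumption: reconnect for every errno */
        return true;
    default:
        return false;
    }
}
```

In an application, this would wrap the loop call: run `mosquitto_loop_forever(mosq, -1, 1)`, and when the helper returns true for its result, wait briefly, call `mosquitto_reconnect(mosq)`, and enter the loop again.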