
Regression: 2.0.21 disconnects clients with `exceeded timeout, disconnecting.`

Open CRCinAU opened this issue 9 months ago • 12 comments

mosquitto.conf:

listener 1883
allow_anonymous true
connection_messages true
max_keepalive 60

Clients get disconnected rapidly with:

1741309598: Client auto-BF835801-605E-6610-8414-02AC56F590B8 has exceeded timeout, disconnecting.
1741309599: Client auto-A0C182C5-5451-C80E-654B-E3EEF4E44634 has exceeded timeout, disconnecting.
1741309599: New connection from 10.0.0.172:43825 on port 1883.
1741309599: New client connected from 10.0.0.172:43825 as auto-276EC781-505D-1292-9504-C020B814E5A4 (p2, c1, k60).
1741309600: Client auto-EBBE082C-0525-CC21-619B-A8A5CBD9ECCE has exceeded timeout, disconnecting.
1741309601: Client auto-734C449E-F32C-E244-3509-A412B8768D71 has exceeded timeout, disconnecting.
1741309601: New connection from 10.0.0.93:40210 on port 1883.
…1741309605: New client connected from 10.0.0.93:38393 as auto-01CE8C8E-F219-E6AC-6C62-C32BF0E8010C (p2, c1, k60).
1741309605: New connection from 10.0.0.196:56509 on port 1883.

Either changing max_keepalive from 60 -> 600 or reverting to 2.0.20 seems to fix the issue in the meantime.
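
For reference, a sketch of the interim workaround described above: the same listener config with only max_keepalive raised (600 is just the value used here, not a tuned recommendation):

listener 1883
allow_anonymous true
connection_messages true
max_keepalive 600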

CRCinAU avatar Mar 07 '25 01:03 CRCinAU

I've seen something similar. It happens randomly, and in this case I think we weren't even able to send the CONNECT packet:

2025-03-07T06:30:33+0000: New connection from 10.1.116.103:55396 on port 1883.
2025-03-07T06:30:33+0000: Client <unknown> has exceeded timeout, disconnecting.

Edit: this is the new 2.0.21 version.

thejan2009 avatar Mar 07 '25 06:03 thejan2009

@CRCinAU Can you show us a matching `New client connected` and `exceeded timeout` message pair (same client id) from your log?

ckrey avatar Mar 07 '25 09:03 ckrey

It looks like a lot of the logs were rate limited at ingest; however, the one UUID pasted above that I could find that made it through the ingest filters is:

1741309605: New client connected from 10.0.0.93:38393 as auto-01CE8C8E-F219-E6AC-6C62-C32BF0E8010C (p2, c1, k60)." container_name="mqtt-server" timestamp="2025-03-07 01:06:45.032203"
1741309651: Client auto-01CE8C8E-F219-E6AC-6C62-C32BF0E8010C has exceeded timeout, disconnecting." container_name="mqtt-server" timestamp="2025-03-07 01:07:31.024867"

The embedded boards don't have persistent logging, so I'm unable to retrieve further information from the board itself.

I did find a number of `Client <unknown> has exceeded timeout, disconnecting.` messages, which I can't track back to a corresponding connection message.

I hunted through some of the logs that did make it through the ingest rate limit and found a couple of examples; I'm having to search for the timeout message and then backtrack to find the matching connect message (see the pairing sketch below).

"1741308715: New client connected from 10.0.0.172:46954 as auto-0EE9CFDF-F136-F5E3-FE92-F4001B5D688E (p2, c1, k60)." container_name="mqtt-server" timestamp="2025-03-07 00:51:55.449111"
"1741308718: Client auto-0EE9CFDF-F136-F5E3-FE92-F4001B5D688E has exceeded timeout, disconnecting." container_name="mqtt-server" timestamp="2025-03-07 00:51:58.072345"

"1741309180: New client connected from 10.0.0.196:36757 as auto-550FD48A-B111-1962-3204-D7DB9D06EBA0 (p2, c1, k60)." container_name="mqtt-server" timestamp="2025-03-07 00:59:40.482495"
"1741309183: Client auto-550FD48A-B111-1962-3204-D7DB9D06EBA0 has exceeded timeout, disconnecting." container_name="mqtt-server" timestamp="2025-03-07 00:59:43.005568"

"1741309117: New client connected from 10.0.0.81:36110 as auto-9C16EE06-263D-6E65-1019-8C029992691C (p2, c1, k60)." container_name="mqtt-server" timestamp="2025-03-07 00:58:37.253039"
"1741309121: Client auto-9C16EE06-263D-6E65-1019-8C029992691C has exceeded timeout, disconnecting." container_name="mqtt-server" timestamp="2025-03-07 00:58:41.003367"

Our application logs showed the easiest symptom to track: messages along the lines of "mqtt server went away, reconnecting" in an endless loop. I'm not sure whether this was related to a subscription or a publish.

Raising max_keepalive from 60 -> 600 was the first thing attempted, and that stopped the initial rush of reconnections; behaviour then seemed mostly normal, but the decision was made to roll back to 2.0.20, where the above behaviour stopped immediately, even with max_keepalive set to 60.

CRCinAU avatar Mar 08 '25 01:03 CRCinAU

@CRCinAU on which platform/machines do you run mosquitto?

ckrey avatar Mar 08 '25 08:03 ckrey

This was using the official docker images on x86_64.

CRCinAU avatar Mar 08 '25 08:03 CRCinAU

@CRCinAU Great, thank you. I can see what the problem is and am working on a fix. @thejan2009 Can I just confirm with you whether you were setting max_keepalive?

ralight avatar Mar 08 '25 09:03 ralight

The problem only occurs when the client keepalive == max_keepalive.
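
For illustration, a minimal reproduction sketch of that condition, assuming a broker running 2.0.21 with the max_keepalive 60 config from the top of this issue. It uses the Python paho-mqtt 1.x client rather than the clients from the original reports, and the host and client id are placeholders:

import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc, properties=None):
    print("connected, rc =", rc)

def on_disconnect(client, userdata, *args):
    # With the regression reported above, this fires within a few seconds of
    # connecting, well inside the 60 second keepalive window.
    print("disconnected:", args)

client = mqtt.Client(client_id="keepalive-repro", protocol=mqtt.MQTTv5)
client.on_connect = on_connect
client.on_disconnect = on_disconnect

# Client keepalive equal to the broker's max_keepalive (60), the case
# described above as triggering the unexpected disconnects.
client.connect("localhost", 1883, keepalive=60)
client.loop_forever()

On 2.0.20, or with max_keepalive raised well above the client keepalive, the same sketch should stay connected.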

ralight avatar Mar 08 '25 11:03 ralight

@ralight yes, mosquitto is configured with max_keepalive 60 while the client has its keepalive set to 20, using the paho.golang package. The client reports that it was unable to connect, with EOF as the reason. I can probably put together a minimal reproduction tomorrow if needed.

thejan2009 avatar Mar 09 '25 07:03 thejan2009

Quick question - does this issue only affect MQTT 5 clients (max_keepalive)?

I reported the same symptoms on 2.0.18 with the MQTT 3.1 Paho Java client.

A solution was reportedly added to 2.0.21 ...

https://github.com/eclipse-mosquitto/mosquitto/commit/4304ac0af06153fed3766b595e7fb8fd18122bf3

RobLeighton67 avatar Mar 14 '25 11:03 RobLeighton67

Does this issue only affect the server/broker side, or could clients using libmosquitto also be affected?

dietssa avatar Mar 14 '25 13:03 dietssa

I had to roll back my docker container to 2.0 to solve this issue. I've tried every setting under the sun, including decreasing the keepalive on the clients to 15 and increasing max_keepalive to 3600. Every single client (Home Assistant, Zigbee2Mqtt, Teslamate, HASS Windows agent) kept getting disconnected with "timeout exceeded".

Rene-Sackers avatar Apr 18 '25 10:04 Rene-Sackers

An analysis and options for a fix are in #3286.

ckrey avatar Jun 11 '25 05:06 ckrey