mosquitto icon indicating copy to clipboard operation
mosquitto copied to clipboard

Windows broker seems to get connection errors with many broker connections at start up

Open NiclasLindgren opened this issue 1 month ago • 6 comments

I can't provide the exact config or logs as it is, I am posting here to see if anyone experience the same issues

We have a broker where we have split each topic into its own connection, so there are about 40 connections. (the other side has a broker that wants to load balance via connections)

When restarting the broker it can take up to 5 minutes of retries until all these connections are "connected"

I tried with 2.0.22 as well, had the same behavior. On 1.6.4 it works just fine

Is there something obvious wrong here with the config that has changed with 2.X mosquitto?

I have obfuscated parts of the logs, but to show the sequence.

2025-12-04T02:27:44: mosquitto version 2.0.15 starting 2025-12-04T02:27:44: Config loaded from D:\i4m\etc\mosquitto-external-v2.conf. 2025-12-04T02:27:44: Opening ipv6 listen socket on port 1883. 2025-12-04T02:27:44: Opening ipv4 listen socket on port 1883. 2025-12-04T02:27:44: Connecting bridge connection1 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection3 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection4 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection5 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection6 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection7 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection8 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection9 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection10 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection11 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection12 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection13 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection14 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection15 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection16 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection17 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection18 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection19 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection20 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection21 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection22 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection23 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection24 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection25 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection26 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection27 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection28 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection29 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection30 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection31 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection32 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection33 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection34 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection35 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection36 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection37 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection38 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection39 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection40 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection41 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection42 (xyz:8883) 2025-12-04T02:27:44: Connecting bridge connection2 (xyz:8883) 2025-12-04T02:27:44: mosquitto version 2.0.15 running 2025-12-04T02:27:44: Client local.connection3 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection4 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection5 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection6 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection7 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection8 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection9 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection10 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection11 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection12 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection13 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection14 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection15 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection16 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection17 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection18 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection19 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection20 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection21 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection22 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection23 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection24 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection25 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection26 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection27 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection28 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection29 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection30 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection31 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection32 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection33 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection34 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection35 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection36 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection37 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection38 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection39 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection40 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection41 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection42 disconnected: Unknown error. 2025-12-04T02:27:44: Client local.connection2 disconnected: Unknown error. 2025-12-04T02:27:45: New connection from 192.168.5.247:59096 on port 1883. . . . Eventually it stops reporting errors slowly, until they all are fine

2025-12-04T02:34:29: Connecting bridge connection1 (xyz:8883) 2025-12-04T02:34:44: Connecting bridge connection2 (xyz:8883) 2025-12-04T02:34:44: Client local.connection2 disconnected: Unknown error. 2025-12-04T02:35:26: Connecting bridge connection2 (xyz:8883) 2025-12-04T02:35:26: Client local.connection2 disconnected: Unknown error. 2025-12-04T02:36:08: Connecting bridge connection2 (xyz:8883)

Connections look like this

connection connection1 address xyz:8883

try_private true bridge_cafile D:\tmp\ca.crt start_type automatic remote_username user remote_password password restart_timeout 44 keepalive_interval 10 notifications false cleansession true

topic +/topic2 out 0 testtest/from/ test/from/

connection connection2 address xyz:8883

try_private true bridge_cafile D:\tmp\ca.crt start_type automatic remote_username user remote_password password restart_timeout 41 keepalive_interval 10 notifications false cleansession true

topic +/topic1 out 0 testtest/from/ test/from/

NiclasLindgren avatar Dec 04 '25 02:12 NiclasLindgren

I tried with 1.6.13 and 2.0.0, it works fine in 1.6.13 and not on 2.0.0

NiclasLindgren avatar Dec 04 '25 02:12 NiclasLindgren

I managed to build and run in the debugger, same thing happens on head. Funny observation though, if I put a break point on

Image

Row 283 in packet_mosq.c, on the switch when something goes wrong with a write, and hold the debugger på 1-2 seconds and then continue, then only 1 connection fails (the connection that the breakpoint was connection). And then it will just connect again for that one and it all works.

So there seems to be some kind of timing/init issue

NiclasLindgren avatar Dec 04 '25 03:12 NiclasLindgren

I think the issue is that bridge_check_pending sets the state for the bridge to mosq_cs_new, too early. if I just skip that until at least one poll event has happened it all works

this would also explain why breaking in the debugger for a second makes it work, the TCP connection finalizes before the code runs.

I don't know what the proper solution is, commenting out this makes it "work"

Image

NiclasLindgren avatar Dec 04 '25 03:12 NiclasLindgren

A very minimal "fix", which isn't a fix, but the last_check variable isn't really used in bridge_check, if we use that and return on the first ever check (maybe even do now + 10s or so) it will work fine

Image

NiclasLindgren avatar Dec 04 '25 03:12 NiclasLindgren

To pin point the issue I added a flag that mux_poll sets on the connect event

Image

If bridge_check checks that to make sure it can actually set the state to "new", it all works too

Image

It still seems to me the fix in check is still good, so that call skips tight checks, by updating the time stamp

The reason this is different for Linux and Windows I think is because getsockopt in windows will return "0" before connection happens.

NiclasLindgren avatar Dec 04 '25 04:12 NiclasLindgren

I could provide a pull request, but I have not fully understand when bridge__connect_step3 has to be called. I would have wanted this to either happen in the mux_poll, or there is a state flag that when bridge_check_pending sees it, it calls step3, but to me it seems that could should maybe not even be there at all and just removed.

NiclasLindgren avatar Dec 04 '25 18:12 NiclasLindgren