mosquitto icon indicating copy to clipboard operation
mosquitto copied to clipboard

1.6 bridge won't reconnect on bridge connection loss

Open caevv opened this issue 5 years ago • 35 comments

After upgrading to 1.6.3 from 1.5.x I found that when the bridge connection is lost the bridge won't recover when the connection is re-established.

Only solution has been restarting mosquitto.

Mosquitto running on docker.

mosquitto.conf

port 1883

pid_file /var/run/mosquitto.pid

persistence true
persistence_location /mosquitto/data/

# queue unlimited messages
max_queued_messages 0

log_type all
log_dest file /mosquitto/log/mosquitto.log

connection bridge-01
address mqtt.host.com:8883
topic # out 2 v1/vessel/ v1.parenttopic/xxxx/
remote_clientid xxxx
notifications true
cleansession false
keepalive_interval 180
restart_timeout 60 300
bridge_cafile /mosquitto/config/mqtt-ca.cert
bridge_certfile /mosquitto/config/mqtt-bridge-client.cert
bridge_keyfile /mosquitto/config/mqtt-bridge-client.key

Bridge name: bridge-01 Connection lost at 1562929130 Connection re-established at 1562929523

1562929130: Bridge local.xxxx doing local SUBSCRIBE on topic v1/topic/#
1562929130: Connecting bridge bridge-01 (mqtt.host.com:8883)
1562929135: Bridge xxxx sending CONNECT
...
1562929267: Socket error on client local.xxxx, disconnecting.
1562929273: Received PINGREQ from gateway.publisher
...
1562929328: Bridge local.xxxx doing local SUBSCRIBE on topic v1/topic/#
1562929328: Connecting bridge bridge-01 (mqtt.host.com:8883)
1562929333: Bridge xxxx sending CONNECT
...
1562929523: Bridge local.xxxx doing local SUBSCRIBE on topic v1/topic/#
1562929523: Connecting bridge bridge-01 (mqtt.host.com:8883)
1562929524: Bridge xxxx sending CONNECT
1562929525: Saving in-memory database to /mosquitto/data/mosquitto.db.
1562929529: Received CONNACK on connection local.xxxx.
1562929529: Bridge local.xxxx sending UNSUBSCRIBE (Mid: 11973, Topic: v1.parenttopic/xxxx/#)
1562929529: Sending PUBLISH to local.xxxx (d1, q2, r0, m11954, 'v1.parenttopic/xxxx/subtopic', ... (154 bytes))```

caevv avatar Jul 09 '19 09:07 caevv

Hi there,

I am having the same issue with mosquitto running on 1.6.3 and bridge on 1.6.3 or 1.5.8 after a client lost connection 1564044750: Client dev-someid has exceeded timeout, disconnecting.

Logs from the broker are very interesting as they show that messages are actually received but no PUBREL is sent back! Which is the case in working conditions. After a restart of my mosquitto client PUBREL are sent again and everything comes back to normal.

Expected logs (From a working state)

1564051594: New connection from x.x.x.x on port 8883.
1564051594: New bridge connected from x.x.x.x as dev-someid (p2, c0, k180, u'bridge.host.com').
1564051594: Will message specified (1 bytes) (r1, q1).
1564051594: 	$SYS/broker/connection/dev-someid/state
1564051594: Sending CONNACK to dev-someid (1, 0)
1564051594: Received PUBLISH from dev-someid (d0, q1, r1, m1, '$SYS/broker/connection/dev-someid/state', ... (1 bytes))
1564051594: Sending PUBACK to dev-someid (m1, rc0)
1564051594: Sending PUBLISH to auto-DA5F9AAD-5070-C704-6D22-89ECCBE38A0B (d0, q0, r0, m0, '$SYS/broker/connection/dev-someid/state', ... (1 bytes))
1564051594: Sending PUBREC to dev-someid (m1870, rc0)
1564051594: Received UNSUBSCRIBE from dev-someid
1564051594: 	some/topic/dev-someid/#
1564051594: dev-someid some/topic/dev-someid/#
1564051594: Sending UNSUBACK to dev-someid
1564051594: Received PUBREL from dev-someid (Mid: 1870)
1564051594: Sending PUBCOMP to dev-someid (m1870)
1564051594: Sending PUBREC to dev-someid (m1871, rc0)

Observed logs (No PUBREC is send)

1564048435: New connection from x.x.x.x on port 8883.
1564048435: New bridge connected from x.x.x.x as dev-someid (p2, c0, k180, u'bridge.host.com').
1564048435: Will message specified (1 bytes) (r1, q1).
1564048435: 	$SYS/broker/connection/dev-someid/state
1564048435: Sending CONNACK to dev-someid (1, 0)
1564048435: Received PUBLISH from dev-someid (d0, q1, r1, m1973, '$SYS/broker/connection/dev-someid/state', ... (1 bytes))
1564048435: Sending PUBACK to dev-someid (m1973, rc0)
1564048435: Sending PUBLISH to auto-DA5F9AAD-5070-C704-6D22-89ECCBE38A0B (d0, q0, r0, m0, '$SYS/broker/connection/dev-someid/state', ... (1 bytes))
1564048435: Received UNSUBSCRIBE from dev-someid
1564048435: 	some/topic/dev-someid/#
1564048435: dev-someid some/topic/dev-someid/#
1564048435: Sending UNSUBACK to dev-someid
1564048435: Received PUBLISH from dev-someid (d1, q2, r0, m1870, 'some/topic/dev-someid/my/topic', ... (79 bytes))
1564048435: Received PUBLISH from dev-someid (d1, q2, r0, m1871, 'some/topic/dev-someid/another/topic', ... (153 bytes))
1564048442: Received PUBLISH from dev-someid (d0, q2, r0, m1977, 'some/topic/dev-someid/not/the/same/topic', ... (74 bytes))

minjatJ avatar Jul 25 '19 12:07 minjatJ

Thanks for the second report @minjatJ, I haven't yet reproduced this. Are you also running on Docker?

ralight avatar Jul 25 '19 16:07 ralight

Thanks for the different reports both of you, I'm struggling to reproduce this. Do you have any more hints on what you're seeing?

ralight avatar Jul 25 '19 19:07 ralight

Hi @ralight thanks for looking into this. I am running on Docker as well. Unfortunately the issue is very intermittent (Seems to be depending a lot on when the connection is dropped and comes back). I can't manage to reproduce it consistently. I am still trying to figure out what is the exact scenario. I'll post updates if I have any through the day.

minjatJ avatar Jul 26 '19 08:07 minjatJ

I can reproduce it:

  • broker to bridge to (host: "bim"): mosquitto, 1.6.9-1 (Ubuntu package)
  • broker to bridge from (host: "denise"): mosquitto, 2.0.7 (official docker image)

Having an existing and working bridge, when restarting the broker on "bim", connection from "denise" dies and is not able to recover without restarting the Mosquitto instance on "denise".

Logs from denise around the time of restart of "bim":

1613997839: Client local.denise closed its connection.
1613997845: Connecting bridge bim (bim.fritz.box:1883)
1613997845: Client local.denise disconnected: Broken pipe.
1613997851: Connecting bridge bim (bim.fritz.box:1883)
1613997917: Connecting bridge bim (bim.fritz.box:1883)
1613997983: Connecting bridge bim (bim.fritz.box:1883)
1613998049: Connecting bridge bim (bim.fritz.box:1883)
1613998115: Connecting bridge bim (bim.fritz.box:188

Logs from bim after having it restarted:

1613997849: mosquitto version 1.6.9 starting
1613997849: Config loaded from /etc/mosquitto/mosquitto.conf.
1613997849: Opening ipv4 listen socket on port 1883.
1613997849: Opening ipv6 listen socket on port 1883.
1613997851: New connection from 192.168.33.51 on port 1883.
1613997854: New connection from 192.168.33.31 on port 1883.
1613997911: Socket error on client <unknown>, disconnecting.
1613997917: New connection from 192.168.33.51 on port 1883.
1613997977: Socket error on client <unknown>, disconnecting.
1613997983: New connection from 192.168.33.51 on port 1883.
1613998043: Socket error on client <unknown>, disconnecting.
1613998049: New connection from 192.168.33.51 on port 1883.
1613998109: Socket error on client <unknown>, disconnecting.
1613998115: New connection from 192.168.33.51 on port 1883.

m0wlheld avatar Feb 22 '21 12:02 m0wlheld

@m0wlheld I've just tried that with a fresh remote ubuntu 20.04 VM and can't reproduce it. Is there anything special in your configuration files I should know about?

ralight avatar Feb 25 '21 14:02 ralight

Nothing special (I think), I'll share.

/etc/mosquitto/mosquitto.conf on Ubuntu host "bim.fritz.box":

pid_file /var/run/mosquitto.pid
persistence true
persistence_location /var/lib/mosquitto/
log_dest file /var/log/mosquitto/mosquitto.log
include_dir /etc/mosquitto/conf.d

/etc/mosquitto/conf.d/auth.conf on Ubuntu host "bim.fritz.box":

allow_anonymous false
password_file /etc/mosquitto/passwd

/mosquitto/config/mosquitto.conf for container instance on Raspbian host "denise.fritz.box":

listener 1883 0.0.0.0
persistence false
allow_anonymous false
password_file /mosquitto/config/mosquitto.passwd
connection bim
address bim.fritz.box:1883
remote_clientid denise
remote_password ***************
remote_username denise

m0wlheld avatar Feb 25 '21 15:02 m0wlheld

Thanks - I agree it doesn't look much special, but it's always nice to have exactly the same conditions to work with.

ralight avatar Feb 25 '21 17:02 ralight

I observe the same problem here: The setup is a bridge connection:

  • to broker (host: "A"): mosquitto, 2.0.7 (official docker image: eclipse-mosquitto:2.0.7, x64)
  • from broker (host: "B"): mosquitto, 2.0.7 (official docker image: arm32v6/eclipse-mosquitto:2.0.7)

Procedure:

  1. start A, then start B -> everything works fine: logs A:
1614955646: mosquitto version 2.0.7 starting
1614955646: Config loaded from /mosquitto/config/mosquitto.conf.
1614955646: Opening ipv4 listen socket on port 1883.
1614955646: mosquitto version 2.0.7 running
1614955677: New connection from 172.20.0.1:33862 on port 1883.
1614955677: New client connected from 172.20.0.1:33862 as A-bridge (p5, c1, k60, u'A').

logs B:

1614956548: mosquitto version 2.0.7 starting
1614956548: Config loaded from /mosquitto/config/mosquitto.conf.
1614956548: Opening ipv4 listen socket on port 1883.
1614956548: Connecting bridge A-bridge (<ip-A>:1883)
  1. start B, then start A -> bridge won't connect: logs A:
1614956924: mosquitto version 2.0.7 starting
1614956924: Config loaded from /mosquitto/config/mosquitto.conf.
1614956924: Opening ipv4 listen socket on port 1883.
1614956924: mosquitto version 2.0.7 running
1614956936: New connection from 172.20.0.1:33936 on port 1883.
1614956996: Client <unknown> closed its connection.
1614957017: New connection from 172.20.0.1:33940 on port 1883.
1614957077: Client <unknown> closed its connection.

logs B:

1614956753: mosquitto version 2.0.7 starting
1614956753: Config loaded from /mosquitto/config/mosquitto.conf.
1614956753: Opening ipv4 listen socket on port 1883.
1614956753: Connecting bridge A-bridge (<ip-A>:1883)
1614956753: mosquitto version 2.0.7 running
1614956754: Client local.A-bridge disconnected: Broken pipe.
1614956774: Connecting bridge A-bridge (<ip-A>:1883)
1614956855: Connecting bridge A-bridge (<ip-A>:1883)
1614956936: Connecting bridge A-bridge (<ip-A>:1883)

A: mosquitto.conf:

log_dest stdout
log_type error
log_type warning
log_type notice
log_type information
log_type subscribe
log_type unsubscribe
log_type websockets

password_file /run/secrets/passwd

listener 1883 0.0.0.0

allow_anonymous false

B: mosquitto.conf

persistence true
persistence_location /mosquitto/data/

log_dest stdout
log_type error
log_type warning
log_type notice
log_type information
log_type subscribe
log_type unsubscribe

password_file /mosquitto/config/passwd

listener 1883 0.0.0.0
allow_anonymous true
 
connection A-bridge
bridge_protocol_version mqttv50
address <ip-A>:1883
topic # both
remote_username A
remote_password *******
remote_clientid A-bridge
restart_timeout 20 600

notifications true
cleansession true
start_type automatic
try_private true

toschoch avatar Mar 05 '21 15:03 toschoch

I see exactly same problem as Toschoch with Mosquitto 2.07 installed on Windows 10. Anybody know how to make a quick workaround?

perfeldbak avatar Apr 06 '21 20:04 perfeldbak

I see exactly same problem as Toschoch with Mosquitto 2.07 installed on Windows 10. Anybody know how to make a quick workaround?

Ubuntu 20.04 mosquitto version 2.0.8

I've found a workaround using docker-compose healthcheck.

a couple of steps

version: "3.8"
services:
  autoheal:
    image: "willfarrell/autoheal:latest"
    tty: true
    container_name: "autoheal"
    restart: always
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    
  mosquitto:
    image: "eclipse-mosquitto"
    ports:
      - "1883:1883/tcp"
      - "9001:9001/tcp"
    volumes:
      - "/host/mosquitto/config:/mosquitto/config"
      - "/host/mosquitto/data:/mosquitto/data"
      - "/host/mosquitto/log:/mosquitto/log"
    labels:
      - "autoheal=true"
    healthcheck:
      test: ["CMD", "/mosquitto/config/hc.sh"]
      interval: 5s
      timeout: 10s
      retries: 3
    restart: unless-stopped

/mosquitto/config/hc.sh

#!/bin/sh

function_to_fork() {
  HC=$( mosquitto_sub -h 127.0.0.1 -p 1883 -t "test/topic_hc" -u hass -P "MeGaPa$$wD" -C 1 )
  if test "$HC" != 'alive'; then
    echo "> $HC result"
    exit 1
  fi
}

function_to_fork &
mosquitto_pub -h 10.0.0.1 -p 1883 -t 'test/topic_hc' -m 'alive' -u hass -P 'MeGaPa$$wD'
wait -n

mike-sirs avatar Apr 07 '21 22:04 mike-sirs

I had this issue in 2.0.10, but in 2.0.11 it seems fine. (using docker)

peter-vanpoucke avatar Aug 18 '21 20:08 peter-vanpoucke

I'm still having the issue on 2.0.14, where I've got three brokers. When restarting everything (docker compose), two brokers start up at nearly the same time, which means that one isn't ready when the other is trying to connect. The other then never tries again. A workaround shouldn't be needed, right?

Here's how I've set it up: Broker A has a bridge set up to broker B Broker B has a bridge set up to broker C

If broker A restarts, it'll reconnect to broker B. If broker B restarts, it'll reconnect to broker C, but broker A will state Client broker_b closed its connection., and it never tries again. The same goes for broker C restarting: broker B will never try again.

In all mosquitto.conf I have the lines:

keepalive_interval 60
start_type automatic
restart_timeout 10 180

Why does it not reconnect?

vomov avatar Mar 29 '22 10:03 vomov

Just noticed something odd.

I actually have a fourth broker, which I don't really use. This broker is also connected as a bridge, and this one DOES reconnect no matter what. Brokers A, B, and C are all running as docker containers on RPi's (one on a RPi Zero W), and none of them attempt to reconnect. Broker D is running as a docker container on Debian.

Can this explain the issue?

vomov avatar Mar 31 '22 08:03 vomov

Hi,

I have a similar problem.

I created a Broker on a Raspberry PI and connected it to my main MQTT.

The device connects to the Raspberry Broker and works without problems.

However, the Broker disconnects from the main MQTT after some time and no longer reconnects.

When I restart it, it works again until it disconnects again.

Does anyone know why this is happening?

This is a very strange behavior, communication between MQTT servers works smoothly in both directions until disconnected.

17:55:07:: 1649699707: New bridge connected from 172.17.0.1:40846 as bridge01 (p2, c1, k15, u'hassio'). 17:55:43:: 1649699743: Client bridge01 has exceeded timeout, disconnecting.

Is it necessary to set an interval for the check or turn it off completely or something similar?

Thank you in advance for any answer, M.

michaelfanta avatar Apr 11 '22 18:04 michaelfanta

@michaelfanta; the timeout should be caught by keepalive_interval 60, which you may need to reduce to 30 or something. However, the restart_timeout 10 180 should make it reconnect as well.

vomov avatar Apr 12 '22 07:04 vomov

I just tried that and it still disconnects from the main MQTT server. Here I attach my config file. No need to set protocol type or something? It's weird that it always disconnects after a while.

listener 1883 0.0.0.0

persistence true
persistence_location /mosquitto/data/

allow_anonymous false
password_file /mosquitto/data/passwd

log_dest file /mosquitto/log/mosquitto.log

connection bridge01
address xxx.xxx.xxx.xxx:1883
topic # both 0
cleansession false
remote_username test
remote_password test
remote_clientid bridge01
restart_timeout 10 180
keepalive_interval 30

notifications true
start_type automatic
try_private true

I'm testing it on a Raspberry. I have there connected to MQTT one SonOff MINI (tasmota).

If I change the states of the switch, the connection seems to be stable. However, when the status stops changing, the MQTT Bridge disconnects from the main MQTT.

Shouldn't the MQTT Bridge somehow send regular data to the main MQTT to avoid disconnection?

I also tried to set "start_type" to "lazy" but without effect.

EDIT I also watched that when "keepalive_interval" was set to "30", it disconnected after 30 seconds. Therefore, I set this interval to the maximum value of 65535 and has not disconnected yet.

I also watched that the date and time did not display correctly in the log file. The time is displayed as: Sun Jan 100 4174038:2123955838:00 1900

So I set the value "privileged" to "true" in "docker-compose.yml" and the time and date are displayed correctly. Do you think this could be related to a previous disconnection?

michaelfanta avatar Apr 12 '22 10:04 michaelfanta

Privileged containers is almost never a good idea, especially not when used to circumvent another bug. For system time in the container, you can best do this: volumes: - "/etc/timezone:/etc/timezone:ro" - "/etc/localtime:/etc/localtime:ro" I suspect your loss-of-connection-bug is due to some other reason, where somehow it disconnects when the keepalive_interval passes. Perhaps it would be good to open another issue, with the logfiles of both sides of the bridge.

However, the fact that it doesn't reconnect after (wrongly) losing the connection is the same bug I am experiencing.

vomov avatar Apr 13 '22 15:04 vomov

Thank you so much, this is how it works without problems :)

As for MQTT, he probably didn't mind that there was no communication. Now that sensor states are being sent, the MQTT does not disconnect.

But as you say, when a connection is lost, it doesn't reconnect, which is weird.

I'm trying to solve something similar with Tasmot FW. It connects to the MQTT and when the MQTT is Offline for some time, the Tasmota device no longer logs on. Each time a connection attempt fails, the time is shifted until it finally stops trying to connect, which is incorrect.

I'm still trying to find some information about why "restart_timeout" isn't responding .. I was hoping that it would be enough to set "start_type" to "lazy", but even that doesn't affect me.

I even tried to set "threshold = 1" to respond to each message, but it will never reconnect.

For now, I'm dealing with a Python-based script, when the mqtt connection to the broker is lost, the MQTT on RPI restarts to reconnect, but it's not really a permanent solution.

michaelfanta avatar Apr 13 '22 22:04 michaelfanta

I have the same issue with 2.0.11 on both bridge and broker (Docker). I notice that sometimes when the bridge starts up and tries to connect the broker, it doesn't receive the CONNACK (even though I can see the broker sending the CONNACK in its logs, another problem I'm looking into). After that the keep alive fires every 30 seconds and the ping requests work fine on both ends, but the bridge never publishes a message to the broker after that and never tries to reconnect. When it does connect ok everything works perfectly. It's almost like it's waiting on the CONNACK forever. How can I get the bridge to retry the connection when it doesn't receive a CONNACK?

greg-mcnamara avatar Apr 14 '22 03:04 greg-mcnamara

Reading the mosquitto conf manual page here I see with the restart_timeout setting the default is to initially retry after 5 seconds and then increase up to maximum retry period of 30 seconds. So in theory I should see a reconnect attempt every 30 seconds (without configuring the restart_timeout) but I don't. I do see a keep alive (which is successful) and also happens to be set to 30 seconds. Is it possible the successful keep alive overrides the reconnect attempt and leaves the bridge in a permanent "happily disconnected" state?

greg-mcnamara avatar Apr 14 '22 03:04 greg-mcnamara

"Happily disconnected" does seem appropriate in this case, as the broker never appears to retry connecting. I'm starting to suspect there are multiple problems here:

  • It drops the connection, which may have something to do with a keep_alive
  • The missing CONNACK is worrying, and may also play a role
  • It does not attempt to reconnect

Let's say it does reconnect after five seconds, and it is indeed due to the keep_alive; this would mean that the broker drops messages for five seconds every time it tries to keep the connection alive.

I experienced the problem when restarting brokers, where the order in which they are restarted is dependent on which broker initiates the connection, and then only on a RPi Zero W.

vomov avatar Apr 14 '22 11:04 vomov

I've recently found this to be an issue with a friends' setup as well. He's also using a rPi Zero W which doesn't reconnect.

vomov avatar Jun 21 '22 11:06 vomov

I believe we may have a similar issue happening but it's intermittent (meaning if we restart mosquitto it may or may not happen) but it does seem roughly 50% of the time it happens.

I've noticed that for us it appears as though this log statement is always present when it happens:

2022-08-17T21:40:34Z mosquitto[24606]: 1660772434: Expiring client brXXXXXXX due to timeout.

everactivetim avatar Aug 18 '22 15:08 everactivetim

Confirming that 1.5.8 does not exhibit the issue (just tested 5 times and would have expected to see it once).

everactivetim avatar Aug 18 '22 16:08 everactivetim

1.6.15 did not exhibit the issue either. Actually I'm not sure this is the same issue for us. I think this broke for us between 2.0.14 and 2.0.15 when this (guessing):

/* FIXME - we should expire clients here if they have exceeded their time */

was replaced by this:

session_expiry__add_from_persistence(context, chunk.F.session_expiry_time);

Sorry for the noise on this issue.

everactivetim avatar Aug 18 '22 19:08 everactivetim

Reading the mosquitto conf manual page here I see with the restart_timeout setting the default is to initially retry after 5 seconds and then increase up to maximum retry period of 30 seconds. So in theory I should see a reconnect attempt every 30 seconds (without configuring the restart_timeout) but I don't. I do see a keep alive (which is successful) and also happens to be set to 30 seconds. Is it possible the successful keep alive overrides the reconnect attempt and leaves the bridge in a permanent "happily disconnected" state?

/* FIXME - we should expire clients here if they have exceeded their time */

was replaced by this:

session_expiry__add_from_persistence(context, chunk.F.session_expiry_time);

I'd say it's likely that the problem originates from something like this. Could we somehow get this issue 'bumped'?

vomov avatar Aug 21 '22 14:08 vomov

I'm not sure if this is related or not, but on 2.0.15, we're experiencing the same issue that @everactivetim described. We have a number of bridged Mosquitto brokers, and in many cases (maybe close to 50%) the message

Expiring client local.<bridge-name> due to timeout.

appears when Mosquitto is starting up. There is no actual connection timeout at play here - which happens within a second of service start. The bridge is then never actually created.

We're reverting to 2.0.14 to see if the issue is present there also; it may be a regression due to https://github.com/eclipse/mosquitto/commit/c99502a2567bf9147b3d811cb17e38ad51f08660 as @everactivetim suggested. The most relevant parts of the configuration are (probably)

# General
persistence true

# Bridge
cleansession false
start_type automatic
restart_timeout 10 120

svet-b avatar Sep 09 '22 14:09 svet-b

This may be a different problem, just like the problem mentioned by @everactivetim , but they all appear related.

What device/architecture are you using? I'm able to reproduce the problem on a(ny) RPi Zero W, but not on any other device.

vomov avatar Sep 12 '22 08:09 vomov

Yes after further investigation I do think it's a separate issue, caused by a regression in 2.0.15. Reverting to 2.0.14 appears to fix the Expiring client due to timeout issue I was experiencing - I'll create a ticket for this. This is on Raspberry Pi 3B+ / armv7l architecture.

svet-b avatar Sep 12 '22 09:09 svet-b