paho.mqtt.python icon indicating copy to clipboard operation
paho.mqtt.python copied to clipboard

Client disconnect with code 16 and cannot connect again

Open Flock1993 opened this issue 1 year ago • 13 comments

I have this paho-mqtt python client (I have omitted some script logic and DB accessing for simplification):

import json
import logging
import time

from paho.mqtt import client as mqtt

logging.basicConfig(level=logging.INFO)
mqtt_logger = logging.getLogger("MqttCon")


def on_connect(client, userdata, flags, rc):
    mqtt_logger.info("Device connected with result code: " + str(rc))
    client.subscribe("some_topic/#", qos=0)


def on_disconnect(client, userdata, rc):
    mqtt_logger.info("Device disconnected with result code: " + str(rc))


def pub_values(client, base_topic, values):
    msg = json.dumps(values)
    target_topic = base_topic
    mqtt_logger.info("Sending message to wheel...")
    client.publish(target_topic, msg, qos=0)


def on_message(client, userdata, msg):
    """
    Input messages format (can be one or list):
    {
        "tag_id": "string",
        "value": "int or float",
        "timestamp": "string format YYYY-MM-DDTHH:MM:SS.F"
    }
    """
    try:
        info = json.loads(str(msg.payload.decode("utf-8")))
        mqtt_logger.info(msg.topic)
        pub_msg = {"tag_ids": [], "timestamp": ""}
        mqtt_logger.info(f"Received message topic is {msg.topic}")
        if isinstance(info, list):
            # some logic
        else:
            # some logic, writing values to InfluxDB
        pub_values(client=client, base_topic="some_topic", values=pub_msg)
    except Exception as e:
        mqtt_logger.error(str(e))


def on_log(client, userdata, level, buf):
    mqtt_logger.info(f"SYSTEM: {buf}")


if __name__ == "__main__":
    client = mqtt.Client(protocol=mqtt.MQTTv311)
    client.on_connect = on_connect
    client.on_disconnect = on_disconnect
    client.on_message = on_message
    client.on_log = on_log
    client.connect_async('<my_mqtt_url>', port=1883)

    client.loop_start()
    try:
        while True:
            time.sleep(0)
    except KeyboardInterrupt:
        pass
    finally:
        mqtt_logger.warning("Stopping client")
        client.loop_stop()
        client.disconnect()

It works good for a few hours, but at some time it have disconnection:

INFO:MqttCon:Received message topic is telemetry/MESPIV/batch
INFO:MqttCon:Sending message to wheel...
INFO:MqttCon:SYSTEM: Sending PUBLISH (d0, q0, r0, m4228), 'b'router/newdata'', ... (11336 bytes)
INFO:MqttCon:Sending message to wheel...
INFO:MqttCon:SYSTEM: Sending PUBLISH (d0, q0, r0, m4229), 'b'router/newdata'', ... (11336 bytes)
INFO:MqttCon:Device disconnected with result code: 16
INFO:MqttCon:Device disconnected with result code: 16
INFO:MqttCon:SYSTEM: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b''
INFO:MqttCon:SYSTEM: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b''
# more Sending CONNECT

The client disconnects with code 16 (MQTT_ERR_KEEPALIVE) and tries to connect in an infinite loop, but it fails to connect (don't recive CONNACK). How can I solve this problem? For mqtt docker container i use eclipse-mosquitto image.

My first code version worked by usial connect, but it has the same problem.

if __name__ == "__main__":
    # setting mqtt client
    client.connect('<my_mqtt_url>', port=1883)
    client.loop_forever()
    client.disconnect()

An important addition: in the client logs there are disconnections with the code 16 with a successful reconnection, but it is impossible to reconnect from some kind.

Flock1993 avatar Jun 20 '23 15:06 Flock1993

Could it be linked to a momentary peak of messages? as described in #683 ?

raomin avatar Jun 20 '23 21:06 raomin

No i tested this case. I launched mqtt client which publish messages in infinite loop without time.sleep and this didn't drop service.

Flock1993 avatar Jun 21 '23 07:06 Flock1993

Kludge solution:

  • Becouse MQTT client works in Docker container, i modify on_disconnect in such way:
def on_disconnect(client, userdata, rc):
    mqtt_logger.info("Device disconnected with result code: " + str(rc))
    if rc == 16:
        exit(1)
  • Add restart: always to container. If there is dissconnect with code 16 -> application exits and container drops -> restarting container and it works for few hours.

At the time of restarting the container, there is a period of time during which data loss occurs from the mqtt broker.

Flock1993 avatar Jun 23 '23 06:06 Flock1993

Exit and restart... that's not really an option for me, I need to process all messages. @ralight, @PierreF any idea what could cause this? or any workaround?

raomin avatar Jun 26 '23 20:06 raomin

Similar (maybe same) problem here, after a first disconnection with rc=16 the client connect (at least enter the on_connect callback) and subscribe to the topics again. After that it immediately disconnect with rc=7 and keep trying to connect.

At the broker side I can see that the connection and subscription are successful, I don't know why the disconnection occurs after that. I am using the iotcore broker from AWS.

The script is running in 50 computers and the problem happens once every 2 days in one of the computers. It seems to be happening more where the internet connection is not very stable. I have already added a queue to limit less than 50 publish requests per second for a single client - a broker limit that triggers a disconnection.

rodolphoiwano avatar Sep 22 '23 21:09 rodolphoiwano

@rodolphoiwano sometimes i think about rewriting code on Golang Maybe it's python library specific problem.

Flock1993 avatar Nov 23 '23 14:11 Flock1993

@Flock1993 after some time I found a situation that exceeded the 50 publish requests per second for the AWS iotcore broker. It happened after a long disconnection that triggered the keepalive timeout.

Fixing it and using clean session always, may have solved the problem. It seems that after a server forced disconnection, it is not possible to reconnect to the same session.

However, the problem has never happened here when using a local broker (mosquitto), so I am considering it a problem related to the AWS iotcore broker limitations.

rodolphoiwano avatar Nov 23 '23 17:11 rodolphoiwano

in the client logs there are disconnections with the code 16 with a successful reconnection

Can you please share more of the logs (ideally from the last successful publish through to the second code 16 error). Can you please also confirm that you can duplicate this with the code provided (changing only <my_mqtt_url>). I have a few ideas but need more info to confirm.

MattBrittan avatar Dec 24 '23 01:12 MattBrittan

image MattBrittan,

  1. To duplicate problem i need to add some another code, which writes data from mqtt messages to InfluxDB. I will do some experiments on this week and return to you with results and logs
  2. Also, problem described above is accompanied by high CPU consumption, which you can see on attached message. Disconnection with code 16 occurs at the peak of CPU consumption, after that Docker container restart - we see low CPU consumption. When I rewrote the code on golang the problem is gone.

Flock1993 avatar Dec 25 '23 09:12 Flock1993

To duplicate problem i need to add some another code,

Thanks; I understand this can be difficult to provide, but unfortunately it's also difficult to trace an issue when you cannot duplicate it (and as there are not that many people working on issues in this repo the more you can do to help us replicate the issue the more likely it is that we will be able to solve it).

at the peak of CPU consumption,

It would be interesting to know what is causing the CPU use (i.e. is this high message volume, something to do with the number of messages written to the database, something else..). These issues can be difficult to track down (the log info I requested may help) and it's not uncommon for the issue to be in user code (rather than the library). I've put a lot of effort into tracking down a few of these issues in the Go client so happy to hear that's working for you!

MattBrittan avatar Dec 26 '23 00:12 MattBrittan

Exit and restart... that's not really an option for me, I need to process all messages. @ralight, @PierreF any idea what could cause this? or any workaround?

what we do is that we trigger a reconnect in the on_disconnect hook, however, that still leaves you up to lose some messages, if they come in very fast. This is a limitation of the MQTT Protocol, AFAIK there is no Clean way in the MQTT world to prevent this from happening.

Gornoka avatar Jan 26 '24 14:01 Gornoka

@Flock1993 on you initial logs, there is one thing that look really odd:

INFO:MqttCon:Device disconnected with result code: 16
INFO:MqttCon:Device disconnected with result code: 16
INFO:MqttCon:SYSTEM: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b''
INFO:MqttCon:SYSTEM: Sending CONNECT (u0, p0, wr0, wq0, wf0, c1, k60) client_id=b''

You really had 2 disconnect and 2 connect consecutively ? Could you add timestamp on those log to known whether they happen at the same time ?

Having two consecutive CONNECT might be possible during auto-reconnection. A first connection is trying and fail (at TCP level), it then retry later. Two consecutive disconnect seems like a bug. Is it possible that two network loop are running ?

If you can still reproduce this issue, I suggest you trying to do with the 2.0.0rc2 because their is some improvement on few network error handling and there is improved warning if you try to run two network loop (it's a pre-release, but available on PyPI).

PierreF avatar Jan 26 '24 19:01 PierreF

@PierreF Maybe as bit of notice. I'm also struggling with the rc 16 disconnect. But interestingly similar as to @Flock1993 I see the also 2x the on_disconnect being called.

2024-07-02T13:25:34+0000 [WARNING] Unexpected source MQTT disconnection. Will auto-reconnect
2024-07-02T13:25:34+0000 [WARNING] Disconnection code 16: No matching subscribers
2024-07-02T13:25:34+0000 [WARNING] Unexpected source MQTT disconnection. Will auto-reconnect
2024-07-02T13:25:34+0000 [WARNING] Disconnection code 16: No matching subscribers
2024-07-02T13:25:35+0000 [INFO] Source broker Connection Established
2024-07-02T13:27:35+0000 [WARNING] Unexpected source MQTT disconnection. Will auto-reconnect
2024-07-02T13:27:35+0000 [WARNING] Disconnection code 16: No matching subscribers
2024-07-02T13:27:35+0000 [WARNING] Unexpected source MQTT disconnection. Will auto-reconnect
2024-07-02T13:27:35+0000 [WARNING] Disconnection code 16: No matching subscribers
2024-07-02T13:27:36+0000 [INFO] Source broker Connection Established
2024-07-02T13:30:38+0000 [WARNING] Unexpected source MQTT disconnection. Will auto-reconnect
2024-07-02T13:30:38+0000 [WARNING] Disconnection code 16: No matching subscribers
2024-07-02T13:30:38+0000 [WARNING] Unexpected source MQTT disconnection. Will auto-reconnect
2024-07-02T13:30:38+0000 [WARNING] Disconnection code 16: No matching subscribers

note: this was on 1.6.1, but similar result in 2.1.0. Note2: https://github.com/marcelrv/victron-dbus-mqttmapper is what is running. It has 2 instances but connecting to 2 different mqtt servers.

marcelrv avatar Jul 02 '24 13:07 marcelrv