esp-mqtt icon indicating copy to clipboard operation
esp-mqtt copied to clipboard

mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack (IDFGH-16316)

Open hachi opened this issue 7 months ago • 9 comments

Checklist

  • [x] Checked the issue tracker for similar issues to ensure this is not a duplicate
  • [x] Read the documentation to confirm the issue is not addressed there and your configuration is set correctly
  • [x] Tested with the latest version to ensure the issue hasn't been fixed

How often does this bug occurs?

always

Expected behavior

No warning

Actual behavior (suspected bug)

When the underlying transport is able to deliver messages to the server, but PUBACK is not received back fast enough, this message is triggered. Client documentation says _publish should block on first attempt till done, but it isn't blocking till PUBACK... so _publish and _enqueue have similar behavior and we can run out of publish slots.

I believe the proper fix is that _publish should block until PUBACK for first delivery attempt, and a high/low watermark event must be added to the system so application code can be told that it is going to start dropping messages if we call _publish again and an automatic retry causes us to queue instead.

Error logs or terminal output

E (00:33:14.513) mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack                                                                                                             I (00:33:14.516) mqtt_client: MQTT5 publish check fail                                                                                                      
E (00:33:14.613) mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack
I (00:33:14.615) mqtt_client: MQTT5 publish check fail

Steps to reproduce the behavior

If you have a spotty internet connection it will just show up eventually

Project release version

ESP-IDF 5.5

System architecture

Intel/AMD 64-bit (modern PC, older Mac)

Operating system

Linux

Operating system version

This is not a build time problem

Shell

ZSH

Additional context

I am aware of #261 but I believe this should have been fixed long before ESP-IDF 5.5

I do not know how to debug this further, but am curious if you think this bug is fixed in ESP-IDF 5.5

#261 does not list a commit where the fix was applied

hachi avatar Aug 20 '25 01:08 hachi

I am less sure about how this is getting triggered, it is not directly caused by reconnects... in testing it takes me days to have it occur. I am fairly certain that once it happens publishing stops working, so the MQTT v5 client is unsafe for me.

hachi avatar Aug 20 '25 15:08 hachi

Ah, reconnect is /clearing/ the error... something before that is happening.

hachi avatar Aug 20 '25 16:08 hachi

In my testing I am finding that esp_mqtt_client_publish is not blocking until PUBACK... is this intended behavior?

hachi avatar Aug 20 '25 19:08 hachi

I (00:55:32.120) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:32.772) mqtt5_client: PUBLISHED 0 WITHOUT ACK 
I (00:55:32.926) mqtt5_client: PUBLISHED 1 WITHOUT ACK 
I (00:55:33.079) mqtt5_client: PUBLISHED 2 WITHOUT ACK 
I (00:55:33.179) mqtt5_client: PUBLISHED 3 WITHOUT ACK 
I (00:55:33.279) mqtt5_client: PUBLISHED 4 WITHOUT ACK 
I (00:55:33.379) mqtt5_client: PUBLISHED 5 WITHOUT ACK 
I (00:55:33.478) mqtt5_client: PUBLISHED 6 WITHOUT ACK 
I (00:55:33.540) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:33.553) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:33.578) mqtt5_client: PUBLISHED 5 WITHOUT ACK 
I (00:55:33.678) mqtt5_client: PUBLISHED 6 WITHOUT ACK 
I (00:55:34.011) mqtt5_client: PUBLISHED 7 WITHOUT ACK 
I (00:55:34.060) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.065) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.070) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.074) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.082) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.089) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.137) mqtt5_client: PUBLISHED 2 WITHOUT ACK 
I (00:55:34.165) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.171) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.268) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:35.313) mqtt5_client: PUBLISHED 0 WITHOUT ACK 

I added a debug log line of the current count any time it is evaluated. It looks like during times of high latency I can end up with multiple messages in flight before the PUBACK is returned for them. As a result I can sometimes overrun the limit set on the server.

The documentation states: QoS 1 and 2 messages that may need retransmission are always enqueued, but first transmission try occurs immediately if esp_mqtt_client_publish is used.

But then also states: This API might block for several seconds, either due to network timeout (10s) or if publishing payloads longer than internal buffer (due to message fragmentation)

This seems to imply that unless we time out (and need to retry) it should block until PUBACK is received, but isn't doing that.

The flow control isn't controlling the flow, so you can run above the high watermark.

Additionally though, the safety check then drops the message rather than enqueueing it for delivery once the flow control frees up.

I am not reaching the timeout, and I am only publishing from a single task/thread.

hachi avatar Aug 21 '25 01:08 hachi

@euripedesrocha @bogdankolendovskyy is this a bug or not?

hachi avatar Sep 03 '25 13:09 hachi

Well, for anyone who might happen across this... as of ESP-IDF 5.5 I would consider MQTT 5 to be unreliable or inconsistent behavior. I will not be using it till flow control is fixed.

hachi avatar Sep 23 '25 21:09 hachi

@hachi, thanks for reporting. The publish function will block only until the message is sent. We don't block until PUBACK. Similar behavior occurs when you are using QoS2 messages. I might be biased here, but that documentation never implied the blocking until completing the full message processing, we'll clarify it thanks for pointing out.

For the message dropping behavior, we can improve the outbox to track message state+number of messages sent. I'll try to address it in next release.

Thanks again for your contributions in pointing these improvements, very much appreciated.

euripedesrocha avatar Sep 30 '25 08:09 euripedesrocha

I am actually talking about QoS 2 messages here, which are being dropped /before/ their first delivery attempt, without being placed into the queue.

The docs do actually say that it will be queued if the first delivery attempt fails, even if you don't talk about the particulars of blocking until PUBACK arrives.

I would think that this implies that the library should not drop the message if PUBACK is clogged and at minimum it should put it into the transport queue.

Please don't turn this into a double-queue system, it would be better if we could just examine the state of the flow control and have events fired when it changes from 'full' to 'not full' and back again.

hachi avatar Oct 02 '25 09:10 hachi

Ok, I need to double-check the log messages as there are no PUBACK involved in QoS 2, anyway the behavior for enqueueing should be the same.

I don't intend to create a double-queue system, but I understand your point here. Our internal check function is restrictive in this sense. I was a bit off from your usage of the word drop, I was understanding that you were enqueueing and message being dropped after that, sorry for misunderstanding you.

This was a decision to simplify outbox management by preventing messages that we wouldn't be able to publish to enter. We'll put some work on this to improve the behavior.

euripedesrocha avatar Oct 02 '25 14:10 euripedesrocha