mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack (IDFGH-16316)
Checklist
- [x] Checked the issue tracker for similar issues to ensure this is not a duplicate
- [x] Read the documentation to confirm the issue is not addressed there and your configuration is set correctly
- [x] Tested with the latest version to ensure the issue hasn't been fixed
How often does this bug occur?
always
Expected behavior
No warning
Actual behavior (suspected bug)
When the underlying transport is able to deliver messages to the server but PUBACKs are not coming back fast enough, this warning is triggered. The client documentation says `_publish` should block on the first attempt until done, but it is not blocking until PUBACK, so `_publish` and `_enqueue` end up behaving similarly and we can run out of publish slots.
I believe the proper fix is for `_publish` to block until PUBACK on the first delivery attempt, and for a high/low watermark event to be added to the system, so that application code can be warned that messages are about to be dropped when a further `_publish` call or an automatic retry forces queuing instead.
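To make the watermark proposal concrete, here is a minimal, hypothetical sketch (not esp-mqtt API; all names are invented for illustration) of an in-flight counter with edge-triggered high/low watermark callbacks, so the application is told once when to pause publishing and once when it is safe to resume:

```c
#include <stdbool.h>

/* Hypothetical sketch, not part of esp-mqtt: count in-flight QoS1/2
 * PUBLISH packets and fire edge-triggered callbacks when crossing the
 * high watermark (pause publishing) and draining to the low one. */
typedef void (*fc_cb_t)(void *ctx);

typedef struct {
    int inflight;      /* PUBLISH packets sent, no PUBACK/PUBCOMP yet */
    int high;          /* pause publishing at or above this count */
    int low;           /* resume publishing at or below this count */
    bool paused;       /* current state, used for edge detection */
    fc_cb_t on_high;   /* called once when crossing the high watermark */
    fc_cb_t on_low;    /* called once when draining back to the low one */
    void *ctx;         /* opaque pointer passed to the callbacks */
} flow_ctl_t;

/* Call when a QoS1/2 PUBLISH is handed to the transport. */
static void fc_on_publish(flow_ctl_t *fc)
{
    fc->inflight++;
    if (!fc->paused && fc->inflight >= fc->high) {
        fc->paused = true;
        if (fc->on_high) fc->on_high(fc->ctx);
    }
}

/* Call when the matching PUBACK (QoS1) or PUBCOMP (QoS2) arrives. */
static void fc_on_ack(flow_ctl_t *fc)
{
    if (fc->inflight > 0) fc->inflight--;
    if (fc->paused && fc->inflight <= fc->low) {
        fc->paused = false;
        if (fc->on_low) fc->on_low(fc->ctx);
    }
}

/* True while the application should hold off on new publishes. */
static bool fc_full(const flow_ctl_t *fc)
{
    return fc->paused;
}
```

In esp-mqtt terms, the callbacks could be surfaced as new event IDs on the existing event loop; the point is that the application learns about the full/not-full transition instead of discovering it via a dropped message.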
Error logs or terminal output
E (00:33:14.513) mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack
I (00:33:14.516) mqtt_client: MQTT5 publish check fail
E (00:33:14.613) mqtt5_client: Client send more than 20 QoS1 and QoS2 PUBLISH packet without no ack
I (00:33:14.615) mqtt_client: MQTT5 publish check fail
Steps to reproduce the behavior
If you have a spotty internet connection it will just show up eventually
Project release version
ESP-IDF 5.5
System architecture
Intel/AMD 64-bit (modern PC, older Mac)
Operating system
Linux
Operating system version
This is not a build time problem
Shell
ZSH
Additional context
I am aware of #261, but I believe this should have been fixed long before ESP-IDF 5.5.
I do not know how to debug this further, but I am curious whether you think this bug is fixed in ESP-IDF 5.5.
#261 does not list a commit where the fix was applied.
I am less sure about how this is getting triggered; it is not directly caused by reconnects, and in testing it takes days to occur. I am fairly certain that once it happens, publishing stops working, so the MQTT v5 client is unsafe for me.
Ah, reconnect is /clearing/ the error... something happening before that is the actual cause.
In my testing I am finding that esp_mqtt_client_publish is not blocking until PUBACK... is this intended behavior?
I (00:55:32.120) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:32.772) mqtt5_client: PUBLISHED 0 WITHOUT ACK
I (00:55:32.926) mqtt5_client: PUBLISHED 1 WITHOUT ACK
I (00:55:33.079) mqtt5_client: PUBLISHED 2 WITHOUT ACK
I (00:55:33.179) mqtt5_client: PUBLISHED 3 WITHOUT ACK
I (00:55:33.279) mqtt5_client: PUBLISHED 4 WITHOUT ACK
I (00:55:33.379) mqtt5_client: PUBLISHED 5 WITHOUT ACK
I (00:55:33.478) mqtt5_client: PUBLISHED 6 WITHOUT ACK
I (00:55:33.540) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:33.553) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:33.578) mqtt5_client: PUBLISHED 5 WITHOUT ACK
I (00:55:33.678) mqtt5_client: PUBLISHED 6 WITHOUT ACK
I (00:55:34.011) mqtt5_client: PUBLISHED 7 WITHOUT ACK
I (00:55:34.060) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.065) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.070) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.074) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.082) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.089) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.137) mqtt5_client: PUBLISHED 2 WITHOUT ACK
I (00:55:34.165) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.171) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:34.268) mqtt5_client: MQTT_MSG_TYPE_PUBACK return code is 0
I (00:55:35.313) mqtt5_client: PUBLISHED 0 WITHOUT ACK
I added a debug log line that prints the current count whenever it is evaluated. It looks like during times of high latency I can end up with multiple messages in flight before their PUBACKs are returned. As a result I can sometimes overrun the limit set on the server.
The documentation states: "QoS 1 and 2 messages that may need retransmission are always enqueued, but first transmission try occurs immediately if esp_mqtt_client_publish is used."
But it also states: "This API might block for several seconds, either due to network timeout (10s) or if publishing payloads longer than internal buffer (due to message fragmentation)."
This seems to imply that, unless we time out (and need to retry), it should block until PUBACK is received, but it isn't doing that.
The flow control isn't controlling the flow, so you can run above the high watermark.
Additionally, the safety check then drops the message rather than enqueueing it for delivery once the flow control frees up.
I am not reaching the timeout, and I am only publishing from a single task/thread.
@euripedesrocha @bogdankolendovskyy is this a bug or not?
Well, for anyone who might happen across this: as of ESP-IDF 5.5, I consider the MQTT 5 client's behavior unreliable and inconsistent. I will not be using it until flow control is fixed.
@hachi, thanks for reporting. The publish function blocks only until the message is sent; we don't block until PUBACK. Similar behavior occurs when you are using QoS 2 messages. I might be biased here, but that documentation never implied blocking until the full message processing completes. We'll clarify it; thanks for pointing it out.
For the message-dropping behavior, we can improve the outbox to track message state and the number of messages sent. I'll try to address it in the next release.
Thanks again for your contributions in pointing out these improvements; they are very much appreciated.
I am actually talking about QoS 2 messages here, which are being dropped /before/ their first delivery attempt, without being placed into the queue.
The docs do actually say the message will be queued if the first delivery attempt fails, even if they don't discuss the particulars of blocking until PUBACK arrives.
I would think this implies the library should not drop the message when the PUBACK path is clogged; at minimum it should put the message into the transport queue.
Please don't turn this into a double-queue system, it would be better if we could just examine the state of the flow control and have events fired when it changes from 'full' to 'not full' and back again.
OK, I need to double-check the log messages, as no PUBACK is involved in QoS 2 (that flow uses PUBREC/PUBREL/PUBCOMP); in any case, the enqueueing behavior should be the same.
I don't intend to create a double-queue system, but I understand your point; our internal check function is restrictive in this sense. I was thrown off by your use of the word "drop": I understood you to mean a message was enqueued and then dropped afterwards. Sorry for misunderstanding you.
This was a decision to simplify outbox management by preventing messages we would not be able to publish from entering it. We'll put some work into improving this behavior.