esp-mqtt
esp-mqtt copied to clipboard
Improve/fix handling of read errors (IDFGH-7937)
(Edited)
This change simplifies and should correct socket error handling in the MQTT implementation for all ESP-IDF and MQTT versions, giving improved performance and correctness.
Explanation:
Previously, an error (return value <= 0) from esp_transport_read() from any locus was followed by a call to esp_mqtt_handle_transport_read_error(), whose semantics are unclear but would return 0 for closed connection OR timeout, and return -1 (and log an error) for any other error type. The calling sites would generally return error for -1, and return success (but still abandon the operation) for 0.
This was often not quite right.
- When waiting on an idle socket for a message, a timeout is not an error, but a closed connection IS an error. Handling a closed connection as a non-error delays recognition of a closed connection (it has to be discovered when a message/keepalive is sent, or by failing an idleness check).
- When waiting for the rest of a message body, both timeout (generally the 10sec network timeout) and closed connection are errors. In any case returning ESP_OK when the operation did not actually succeed is not right.
Change:
Revise the error handling to make esp_mqtt_handle_transport_read_error() unconditionally describe the error and call esp_mqtt_client_dispatch_transport_error(). The one place where timeouts should be accepted (the first-byte read) wraps that call accordingly. Comments are added and error messages improved.
Hi @egnor
We have already branched-off of the v4.x support, just to keep the code simple(-er), with less #ifdef's.
There's a branch idf_v4.x for IDF<=v4.4 that would still accept all fixes and important updates.
@david-cermak Thank you for the quick reply. On further investigation I now think this is an improvement (bug fix) even for ESP-IDF v5. I've updated the PR description accordingly. Can you take another look and let me know if this is something that makes sense to you?
Sorry for the delay and thanks for updating the PR. I agree that in certain places we can (and should!) treat the timeouts as errors.
Also please update the docs of the internal function, e.g.
/*
* Returns:
- * -1 in case of failure
+ * -1 in case of failure or timeout while one message reading is in progress
* 0 if no message has been received
* 1 if a message has been received and placed to client->mqtt_state:
* message length: client->mqtt_state.message_length
* message content: client->mqtt_state.in_buffer
*/
static int mqtt_message_receive(esp_mqtt_client_handle_t client, int read_poll_timeout_ms)
Sorry about the long hiatus! I hope you can remember what this is all about. I've updated the PR to be against current head, and did my best to address your requests; see what you think?
@egnor @ESP-YJM Thanks for taking the time and effort in posting the changes and reviewing. There were indeed few issues with the mqtt client, but the changes are very much different to those you proposed, so I fixed it in a separate commit. You can check ddde502 for more details.
There were two issues:
-
Treat EOF as an error, since the connection is closed (FIN) from the server side. If we didn't we would try to read (in the next iteration) from the same socket that has been already closed and get an error.
-
Treat timeouts in the middle of MQTT message reading as errors (if timeouted for the second time and didn't read a byte)
The point 2) is a bit complicated, since the we know about the problem only in the second iteration.