issues icon indicating copy to clipboard operation
issues copied to clipboard

MQTT component causes loop watchdog resets, measurement errors

Open dwmw2 opened this issue 1 year ago • 3 comments

The problem

Originally commented on #4717 but I think this deserves its own separate bug as it's causing real problems, not just a warning.

With poor WiFi connectivity, I see the MQTT component taking so long in the loop that I've had to disable the watchdog timer on the loop thread, as it was causing resets. Even with the watchdog disabled, the mqtt component still sometimes spends over 10 seconds in its call_loop(), and it's causing other sensors to break (in my case a rotary encoder is missing full revolutions as discussed in https://github.com/joshr120/PD-Stepper/issues/23#issuecomment-2678120522).

With some added debugging, I've seen the call to this->send_initial_state() in MQTTComponent::call_loop() taking over 20 seconds.

--- a/esphome/components/mqtt/mqtt_component.cpp
+++ b/esphome/components/mqtt/mqtt_component.cpp
@@ -261,7 +261,10 @@ void MQTTComponent::call_loop() {
   if (this->is_internal())
     return;
 
+  uint32_t times[4];
+  times[0] = millis();
   this->loop();
+  times[1] = millis();
 
   if (!this->resend_state_ || !this->is_connected_()) {
     return;
@@ -273,9 +276,16 @@ void MQTTComponent::call_loop() {
       this->schedule_resend_state();
     }
   }
+  times[2] = millis();
   if (!this->send_initial_state()) {
     this->schedule_resend_state();
   }
+  times[3] = millis();
+  if (times[3] > times[0] + 30) {
+	 ESP_LOGE(TAG, "call_loop() took too long: %d ms (%d + %d + %d)",
+		  times[3] - times[0], times[1] - times[0],
+		  times[2] - times[1], times[3] - times[2]);
+  }
 }
 void MQTTComponent::call_dump_config() {
   if (this->is_internal())

Feb 24 13:13:01 landing-blind-3 mqtt.component [E][mqtt.component:284]: call_loop() took too long: 389 ms (0 + 388 + 1)
Feb 24 13:13:01 landing-blind-3 component [W][component:237]: Component mqtt took a long time for an operation (392 ms).
Feb 24 13:13:01 landing-blind-3 component [W][component:238]: Components should block for at most 30 ms.
Feb 24 13:13:20 landing-blind-3 esp-idf [I][esp-idf:000]: #033[0;33mW (515582) transport_base: Poll timeout or error, errno=No more processes, fd=55, timeout_ms=10000#033[0m
Feb 24 13:13:20 landing-blind-3 esp-idf [I][esp-idf:000]: #033[0;31mE (525587) mqtt_client: Writing didn't complete in specified timeout: errno=11#033[0m
Feb 24 13:13:20 landing-blind-3 esp-idf [I][esp-idf:000]: #033[0;33mW (525589) mqtt_client: Publish: Losing qos0 data when client not connected#033[0m
Feb 24 13:13:20 landing-blind-3 esp-idf [I][esp-idf:000]: #033[0;33mW (525591) mqtt_client: Publish: Losing qos0 data when client not connected#033[0m
Feb 24 13:13:20 landing-blind-3 component [W][component:157]: Component mqtt set Warning flag: unspecified
Feb 24 13:13:20 landing-blind-3 mqtt.component [E][mqtt.component:284]: call_loop() took too long: 20019 ms (0 + 3 + 20016)
Feb 24 13:13:20 landing-blind-3 component [W][component:237]: Component mqtt took a long time for an operation (20022 ms).
Feb 24 13:13:20 landing-blind-3 component [W][component:238]: Components should block for at most 30 ms.

That log seems to show this->send_discovery_() taking 388ms, followed by this->send_initial_state() taking 20016ms. It seems the latter is actually waiting for a network write to complete, which we should never do synchronously from the loop.

Which version of ESPHome has the issue?

2025.3.0-dev

What type of installation are you using?

pip

Which version of Home Assistant has the issue?

No response

What platform are you using?

ESP32-IDF

Board

https://github.com/joshr120/PD-Stepper

Component causing the issue

mqtt

YAML Config


Anything in the logs that might be useful for us?


Additional information

No response

dwmw2 avatar Feb 24 '25 13:02 dwmw2

Using idf_send_async causes other problems of delayed and dropped messages due to https://github.com/espressif/esp-idf/issues/15458

dwmw2 avatar Feb 24 '25 16:02 dwmw2

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 25 '25 02:06 github-actions[bot]

The PR has been waiting for review for a long time, finally got some feedback which is being addressed...

dwmw2 avatar Jun 25 '25 06:06 dwmw2

I also have this problem .. on the latest ESPHome ESPHome version 2025.7.3 Has the fix made it into the latest ESPhome yet?

[10:58:25][W][component:407]: mqtt took a long time for an operation (105 ms) [10:58:25][W][component:408]: Components should block for at most 30 ms [10:58:25][W][component:407]: mqtt took a long time for an operation (290 ms) [10:58:25][W][component:408]: Components should block for at most 30 ms [10:58:28][W][component:407]: mqtt took a long time for an operation (2418 ms) [10:58:28][W][component:408]: Components should block for at most 30 ms [10:58:28][W][component:407]: mqtt took a long time for an operation (226 ms) [10:58:28][W][component:408]: Components should block for at most 30 ms [10:58:33]E (15622) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time: [10:58:33]E (15622) task_wdt: - loopTask (CPU 0) [10:58:33]E (15622) task_wdt: Tasks currently running: [10:58:33]E (15622) task_wdt: CPU 0: IDLE [10:58:33]E (15622) task_wdt: Aborting.

mikejlaurie avatar Jul 24 '25 09:07 mikejlaurie