
Soft WDT reset on MQTT connect

Open lafrank opened this issue 2 months ago • 4 comments

My platform is a NodeMCU 1.0 (ESP-12E).

I have read similar reports elsewhere that traced this to FastLED or other libraries, but I am not using any of those.

About 60% of the time, the connection attempt causes a Soft WDT reset: the attempt blocks for too long, and apparently I have no control over it. Setting the socket timeout to 1 second doesn't help either.

There is obviously a slight chance that something else is causing the issue, but I doubt it, since the watchdog reset always occurs while MQTT is trying to connect. How can I find out whether something else is blocking?

I also tried adding the following, without success:

mqtt_client.setSocketTimeout(1);

The problematic part of the code:


// Note: this #define only affects the sketch's own copy of PubSubClient.h;
// PubSubClient.cpp is compiled separately with the library default, so the
// timeout has to be changed at runtime with mqtt_client.setSocketTimeout(1).
#define MQTT_SOCKET_TIMEOUT 1
// ...
int retries = 0;
bool mqtt_connected = false;
while (!mqtt_client.connected() && retries < 10)
{
	unsigned long connectStart = millis();  // start of this attempt, for the timing message
	SerialPrint("-> MQTT connecting ... ");
	// PubSubClient::connect(const char *id, const char *user, const char *pass, const char* willTopic, uint8_t willQos, boolean willRetain, const char* willMessage, boolean cleanSession)
	if (!CleanSession) mqtt_connected = mqtt_client.connect(mqtt_client_id, mqtt_username, sas_token, NULL, 0, false, NULL, false);
	else mqtt_connected = mqtt_client.connect(mqtt_client_id, mqtt_username, sas_token);
	if (mqtt_connected)
	{
		char connMsg[60];
		// millis() returns unsigned long, so print it with %lu rather than %d
		snprintf(connMsg, sizeof(connMsg), "connected in %lu ms", (unsigned long)(millis() - connectStart));
		SerialPrintln(connMsg);
	}
	else
	{
		int c = mqtt_client.state();
		char errMsg[64];
		// Default text, so errMsg is never printed uninitialized if no branch below matches
		snprintf(errMsg, sizeof(errMsg), " unknown state, status code = %d", c);
		if (c == MQTT_CONNECTION_TIMEOUT) snprintf(errMsg, sizeof(errMsg), " connection timeout, status code = %d", c);
		if (c == MQTT_CONNECTION_LOST) snprintf(errMsg, sizeof(errMsg), " connection lost, status code = %d", c);
		if (c == MQTT_CONNECT_FAILED) snprintf(errMsg, sizeof(errMsg), " connection failed, status code = %d", c);
		if (c == MQTT_DISCONNECTED) snprintf(errMsg, sizeof(errMsg), " disconnected, status code = %d", c);
		if (c == MQTT_CONNECT_BAD_PROTOCOL) snprintf(errMsg, sizeof(errMsg), " bad protocol, status code = %d", c);
		if (c == MQTT_CONNECT_BAD_CLIENT_ID) snprintf(errMsg, sizeof(errMsg), " bad client id, status code = %d", c);
		if (c == MQTT_CONNECT_UNAVAILABLE) snprintf(errMsg, sizeof(errMsg), " unavailable, status code = %d", c);
		if (c == MQTT_CONNECT_BAD_CREDENTIALS) snprintf(errMsg, sizeof(errMsg), " bad credentials, status code = %d", c);
		if (c == MQTT_CONNECT_UNAUTHORIZED) snprintf(errMsg, sizeof(errMsg), " unauthorized, status code = %d", c);
		SerialPrint(errMsg);
		char sslErr[100];
		// getLastSSLError() returns 0 when no error was recorded; BearSSL's own error
		// codes are positive, so a "< 0" test would miss them. Check for != 0 instead.
		if (wifi_client.getLastSSLError(sslErr, sizeof(sslErr)) != 0)
		{
			char sslMsg[128];
			snprintf(sslMsg, sizeof(sslMsg), " , SSL error : %s", sslErr);
			SerialPrintln(sslMsg);
		}
		else SerialPrintln();
		SerialPrintln();
		SerialPrintln("-> Retrying MQTT connection in 3 seconds...");
		delay(3000);  // delay() yields on the ESP8266, so the watchdog is serviced while waiting
	}
	retries++;
}
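The nine status-code branches in the loop can also be written as a single lookup with a default case, so every possible state() value produces a defined message. This is a hypothetical helper, not part of PubSubClient; the literal values mirror the MQTT_* constants in PubSubClient.h only so the snippet compiles without the library, and in the sketch itself you would write case MQTT_CONNECTION_TIMEOUT: and so on.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: maps PubSubClient::state() to a short description.
// Literal values mirror the MQTT_* constants in PubSubClient.h so this compiles
// without the library; in a sketch, use the named constants instead.
const char* mqttStateName(int state) {
  switch (state) {
    case -4: return "connection timeout";  // MQTT_CONNECTION_TIMEOUT
    case -3: return "connection lost";     // MQTT_CONNECTION_LOST
    case -2: return "connection failed";   // MQTT_CONNECT_FAILED
    case -1: return "disconnected";        // MQTT_DISCONNECTED
    case 0:  return "connected";           // MQTT_CONNECTED
    case 1:  return "bad protocol";        // MQTT_CONNECT_BAD_PROTOCOL
    case 2:  return "bad client id";       // MQTT_CONNECT_BAD_CLIENT_ID
    case 3:  return "unavailable";         // MQTT_CONNECT_UNAVAILABLE
    case 4:  return "bad credentials";     // MQTT_CONNECT_BAD_CREDENTIALS
    case 5:  return "unauthorized";        // MQTT_CONNECT_UNAUTHORIZED
    default: return "unknown state";       // any other value still gets a defined message
  }
}

// Formats the message in one place instead of one snprintf per status code.
// Returns snprintf's result: the length the full message needs.
int formatMqttState(char* buf, std::size_t len, int state) {
  return std::snprintf(buf, len, " %s, status code = %d", mqttStateName(state), state);
}
```

In the sketch, the whole if-chain would then reduce to char errMsg[64]; formatMqttState(errMsg, sizeof(errMsg), mqtt_client.state());.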


Debug print:

-> MQTT connecting ... 
--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Soft WDT reset
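One common way to find out what is blocking is to print a line immediately before and after each suspect call (WiFi connect, TLS connect, MQTT connect). If the board resets, the last "begin" line names the call that never returned. The helper below is a hypothetical sketch of that idea, not part of any library; the clock and the log sink are passed in so the logic can be checked off-device. On the ESP8266 they would presumably wrap millis() and Serial.println() followed by Serial.flush(), so each line is actually transmitted before a reset can cut it off.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical tracing helper (not part of PubSubClient or the ESP8266 core).
// Logs "begin <label>" before running step() and "end <label> after N ms" after.
// If the device resets inside step(), the "begin" line is the last one printed,
// which identifies the call that blocked.
template <typename Step>
auto traceStep(const std::string& label,
               const std::function<uint32_t()>& nowMs,
               const std::function<void(const std::string&)>& log,
               Step step) -> decltype(step()) {
  log("begin " + label);
  const uint32_t start = nowMs();
  auto result = step();  // step must return a value (bool/int), as connect() calls do
  const uint32_t elapsed = nowMs() - start;  // unsigned math survives millis() wraparound
  log("end " + label + " after " + std::to_string(elapsed) + " ms");
  return result;
}
```

Wrapping the network client's connect() and mqtt_client.connect() separately in such a helper separates a hang in the TLS handshake from a hang in the MQTT handshake.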

lafrank, Oct 16 '25 13:10

You might try pubsubclient3. This library (PubSubClient) is missing a yield() call in connect().

hmueller01, Oct 16 '25 17:10
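The missing yield() refers to a wait loop that spins until data arrives or a timeout expires. On the ESP8266, a loop that never yields starves the system tasks, and the software watchdog eventually resets the chip. Below is a minimal sketch of a timeout-bounded wait that yields on every iteration; it illustrates the pattern and is not the code from pubsubclient3. The ready(), nowMs() and yieldFn() hooks are hypothetical; on the device they might wrap client.available(), millis() and yield().

```cpp
#include <cstdint>
#include <functional>

// Hypothetical helper, not PubSubClient code: waits until ready() is true or
// timeoutMs elapses, calling yieldFn() on every iteration so a cooperative
// scheduler (e.g. the ESP8266 system tasks and soft watchdog) keeps running.
bool waitWithYield(const std::function<bool()>& ready,
                   const std::function<uint32_t()>& nowMs,
                   const std::function<void()>& yieldFn,
                   uint32_t timeoutMs) {
  const uint32_t start = nowMs();
  while (!ready()) {
    if (nowMs() - start >= timeoutMs) {
      return false;  // timed out; the caller decides how to report it
    }
    yieldFn();  // give the system a chance to run instead of busy-spinning
  }
  return true;
}
```

Comparing nowMs() - start against the timeout with unsigned arithmetic keeps the check correct across millis() wraparound. This only helps while the code is merely waiting; a call that hangs inside a library (such as a TLS handshake) never reaches the yield.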

Thanks @hmueller01 for pointing that out. I tried that library, but the issue is the same. I don't understand why connect() never returns with a socket timeout or some other error, and a WDT reset occurs instead. It might be that some underlying component is blocking. The MQTT client is built on top of a BearSSL secure WiFi client (TLS is required by the Azure IoT Hub client), so the blocking could happen there. What makes it really weird is that in 3 out of 10 attempts the same code works and the connection is made. I have been investigating this for weeks now and am running out of ideas...

lafrank, Oct 18 '25 15:10

I see, that's unfortunate. Does your network client connect on its own? I need to set the fingerprint, so I can't use PubSubClient's internal network connect; instead, I connect to the broker beforehand:

WiFiClientSecure m_net_client;
PubSubClient m_mqtt_client(m_net_client);

  INFOF_P("%s: Attempting connection to server %s:%s" LF, __func__, mqtt_host, mqtt_port);
  m_net_client.setFingerprint(mqtt_fingerprint);
  if (m_net_client.connect(mqtt_host, atoi(mqtt_port)) == 0) {
    INFOF_P("%s: Connection failed!" LF, __func__);
  } else {
    INFOF_P("%s: Successfully connected." LF, __func__);
  }

and later ...

  if (m_mqtt_client.connect(host_name, mqtt_user, mqtt_pass, topic, 1, true, "offline", true)) {
    INFOF_P("%s: connected using PubSubClient" LF, __func__);

  } else {
    // m_mqtt_client.connect() failed
    int mqtt_state = m_mqtt_client.state();
    ERRORF_P("%s: mqttReconnect failed, rc=%d" LF, __func__, mqtt_state);
  }

hmueller01, Oct 19 '25 12:10
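Pre-connecting the secure client works because, as far as I can tell from the PubSubClient source, connect() only opens the network connection itself when the underlying client is not already connected; a TLS session opened beforehand (after setFingerprint()) is simply reused. The following is a simplified, hypothetical model of that check, with a Transport interface standing in for Arduino's Client class; it is not the library's code.

```cpp
#include <cstdint>

// Simplified model, not the real Arduino Client or PubSubClient classes.
struct Transport {
  virtual ~Transport() = default;
  virtual bool connected() = 0;
  virtual int connect(const char* host, uint16_t port) = 0;  // 1 on success, Arduino-style
};

// Opens the transport only if it is not already open. A TLS session set up
// beforehand (e.g. after setFingerprint()) is reused as-is.
bool ensureTransport(Transport& net, const char* host, uint16_t port) {
  if (net.connected()) {
    return true;  // pre-connected: no handshake happens inside the MQTT connect
  }
  return net.connect(host, port) == 1;
}
```

A side benefit of this split is diagnostic: the TLS handshake and the MQTT handshake become two separate calls, so a hang can be attributed to one or the other.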

I managed to narrow it down: wifi_client.connect() is what causes the WDT reset.

I am using Azure IoT Hub, which supports TLS Maximum Fragment Length negotiation, so I set the client buffer size to 512 bytes to reduce memory consumption. Even so, whether the BearSSL library manages to complete the call seems to be unpredictable.

See https://forum.arduino.cc/t/soft-wdt-reset-on-second-call-to-wificlientsecure-connect-help/694398

It looks like my code has grown to the point where the remaining free heap (~20 kB) makes BearSSL crash unpredictably. I need to either simplify the code to free up more heap, or switch to an ESP32; the latter is a real pain in my case, since I am using my own purpose-built PCB with the ESP-12E chip :-(
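Since the crash seems to depend on available memory, one defensive option is to check the heap right before the TLS handshake and skip (and log) the attempt when memory is clearly too low. This is a hypothetical sketch: on an ESP8266 the two inputs would come from ESP.getFreeHeap() and ESP.getMaxFreeBlockSize(), and the thresholds in HeapBudget are placeholders you would have to measure for your own build, not documented BearSSL requirements. It cannot prevent memory corruption; it only avoids attempts that are likely to fail.

```cpp
#include <cstdint>

// Hypothetical pre-flight check before starting a TLS handshake.
// Thresholds are placeholders to be measured per build, not BearSSL figures.
struct HeapBudget {
  uint32_t minFreeHeap;   // total free bytes required
  uint32_t minFreeBlock;  // largest contiguous free block required
};

// freeHeap / maxFreeBlock would come from ESP.getFreeHeap() and
// ESP.getMaxFreeBlockSize() on an ESP8266.
bool heapAllowsTls(uint32_t freeHeap, uint32_t maxFreeBlock, const HeapBudget& budget) {
  // Both must pass: a fragmented heap can have enough bytes in total
  // but no single block large enough for the TLS buffers.
  return freeHeap >= budget.minFreeHeap && maxFreeBlock >= budget.minFreeBlock;
}
```

Logging both values (and ESP.getHeapFragmentation()) on every attempt would also show whether the failing 7 out of 10 attempts correlate with a low or fragmented heap.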

So I would say this is not an issue with the PubSubClient library. The Soft WDT reset is somewhat misleading here and can't be fixed with carefully placed yield() calls; it appears to be memory corruption happening inside BearSSL.

I will update this thread if I get any tangible results.

lafrank, Oct 24 '25 12:10