OpenBK7231T_App Possible MQTT issue on initial flash/default config - could point to a cause of long term stability issues?

trafficstars

I just updated from an ancient dev version to Build on Oct 6 2022 13:15:37 version 1.12.88 Chipset BK7231T

The device died quickly (drops from wifi after a minute or two) until I setup the MQTT server, and now it's been stable for >20 mins.

So the implication is that the MQTT server connect retries kill it somehow.

The default server is on 192.168.0.x, whereas my lan is 192.168.1.x, so the serve rIP should have been unreachable.

This will not be an issue for stable lan and mqtt server, but could be the source of odd behaviour over time in less stable environments?

Oct 07 '22 07:10 btsimonh

@valeklubomir can you see it?

for the record, I've been using T chips without MQTT server online for very long time with no issues. It must have started after @valeklubomir changes...

Oct 07 '22 07:10 openshwprojects

@openshwprojects I have not seen this behavior. But this issues is reported on T version. Which could also benefit from LWIP update. Once we confirm the update is better. I run yesterday my device from initial flashing. Erased entire flash, and configured device from scratch. It was not killed at any time in the process.

Oct 07 '22 09:10 valeklubomir

I feel that we should remove the default 192.168.0.113 from mqtt host setting. The user would have to enable mqtt by putting in the correct host.

Oct 07 '22 17:10 iprak

as a sticking plaster for first flash, yes. But if it's highlighting a general issue on MQTT reconnect which could be a cause for some reported failures.... it's a clue we should investigate.

Oct 08 '22 08:10 btsimonh

So I got my T device to hang, I had been experimenting with invalid mqtt host and got log as following and then it hung. I only had access to web log. Funny thing is that the device was still responding to button press, only the web end points were stuck. I restarted it and then it did not happen even thought I let it run for hours like that.

Info:MAIN:Time 353, free 104392, MQTT 0(31), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 354, free 104392, MQTT 0(31), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 355, free 104392, MQTT 0(31), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 356, free 104392, MQTT 0(31), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:MQTT:mqtt_userName demo
mqtt_pass demo
mqtt_clientID enbrighten1
mqtt_host 192.168.0.61:1883
Info:MQTT:Connect error in mqtt_client_connect - code: -10 (4?A)
Info:MAIN:Time 357, free 104392, MQTT 0(32), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 358, free 104392, MQTT 0(32), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 359, free 104392, MQTT 0(32), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 360, free 104088, MQTT 0(32), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:GEN:dhcp=0 ip=192.168.1.211 gate=192.168.1.2 mask=255.255.255.0 mac=c8:47:8c:00:00:00 
Info:MAIN:Time 361, free 104392, MQTT 0(32), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 362, free 104392, MQTT 0(32), bWifi 1, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 363, free 103784, MQTT 0(32), bWifi 1, secondsWithNoPing -1, socks 3/38
Info:NTP:Seconds since Jan 1 1900 = 3874236945
Info:NTP:Unix time = 1665248145
Info:MAIN:Time 364, free 104392, MQTT 0(32), bWifi 1, secondsWithNoPing -1, socks 2/38

Oct 09 '22 00:10 iprak

@iprak Info:MQTT:Connect error in mqtt_client_connect - code: -10 (4?A) ERR_ISCONN = -10
Means that connection to MQTT server is active. I am not sure how is it possible. LWIP stack is wrongly pressuming active open socket. I am not sure if LWIP mqtt.c apps is holding this connection also. May be server did send out some data to device and network forwared this to device identified by MAC/IP. LWIP could somehow refreshed the conenctions. ????? But this situation I tried to solve with action on -10 with attempt to request LWIP to close connection.

Strange is also "-10 (4?A)" because it should show: "-10 (ERR_ISCONN)". Wrong build ????? Write error during flashing???? I am not sure if OTA image is protected with CRC, MD5 or SHA type of checksum.

Oct 09 '22 09:10 valeklubomir

a) The logging is incorrect which I am fixing in a small merge.

addLogAdv(LOG_INFO,LOG_FEATURE_MQTT,"Connect error in mqtt_client_connect - code: %d (%s)\n", res);

b) And yes mqtt_disconnect is being done in that case.

c) I noted this comment

    /* Initiate client and connect to server, if this fails immediately an error code is returned
      otherwise mqtt_connection_cb will be called with connection result after attempting
      to establish a connection with the server.
      For now MQTT version 3.1.1 is always used */

which explains the behavior in W600 where similar setup gives this log

Info:MAIN:Time 32463, free 30296, MQTT 0(2950), bWifi 1, secondsWithNoPing -1, socks 2/8
Info:MAIN:Time 32464, free 30296, MQTT 0(2950), bWifi 1, secondsWithNoPing -1, socks 2/8
Info:MQTT:mqtt_connection_cb: Disconnected, reason: 256
Info:MAIN:Time 32465, free 30296, MQTT 0(2950), bWifi 1, secondsWithNoPing -1, socks 2/8
Info:MQTT:mqtt_userName hass

W600 is using lwip 2.0.3 and T is based on 2.0.2. I could not tell any enhancement in this regards from https://git.savannah.nongnu.org/cgit/lwip.git/log/?h=STABLE-2_0_3_RELEASE

d) Another interesting thing in my log before the hang was that the socket count went up and then down. I could not make that happen.

e) Finally, my T device ran straight for 10 hrs trying to connect and no hang. I even adjusted the every second to every 500ms at some point but nothing changed.

Oct 09 '22 11:10 iprak

@iprak a) Sorry I missed this. This may be cause of crashing/freezing/problems when failed to connect. call stack should contain pointer to string. Without setting correct pointer, function uses random data left in stack. Which may result in undefined behavior.

Oct 09 '22 14:10 valeklubomir

@iprak wow that's a great find, thank you! Can @btsimonh check if it fixes that?

Oct 09 '22 14:10 openshwprojects

I feel that there might be more to all this... maybe in W600.

I am now testing setup with no mqtt host defined in which case T (on 1.12.96) has been running fine.

But W600 gets hung after 2nd attempt. I am getting direct serial logging for it and have added more prints statements to addLogAdv to see what might be happening.

Oct 09 '22 15:10 iprak

logMemory.log just wraps around so this would get handled, right? Here head starts with 4046 and a string of size 83 is being added.

	addLogAdv %sTime %i, free %d, MQTT %i(%i), bWifi %i, secondsWithNoPing %i, socks %i/%i

	addLogAdv after malloc
	addLogAdv len=83 Info:MAIN:Time 60, free 30296, MQTT 0(5), bWifi 1, secondsWithNoPing 0, socks 2/8<CR>

Info:MAIN:Time 60, free 30296, MQTT 0(5), bWifi 1, secondsWithNoPing 0, socks 2/8<CR>
	Updating logMemory head=4046
	logMemory updated
	ALL DONE before free Info:MAIN:Time 60, free 30296, MQTT 0(5), bWifi 1, secondsWithNoPing 0, socks 2/8<CR>

At some point, I get this in the log. Here the regular logging is taking place but tmp has become #<ENQ>. It seems as if os_malloc returned some random location.

** mqtt_host empty after free<CR>
	addLogAdv %sTime %i, free %d, MQTT %i(%i), bWifi %i, secondsWithNoPing %i, socks %i/%i

	addLogAdv after malloc
	addLogAdv len=4 #<ENQ>
#<ENQ>	Updating logMemory head=1428
	logMemory updated

Here is my modified addLogAv

    tmp = (char*)os_malloc(tmp_len);
    if (tmp != NULL)
    {
				printf("\taddLogAdv after malloc\n");
        memset(tmp, 0, tmp_len);
        t = tmp;
        BaseType_t taken = xSemaphoreTake( logMemory.mutex, 100 );
    
        if(feature == LOG_FEATURE_RAW) 
        {
            // raw means no prefixes
        } else {
            strncpy(t, loglevelnames[level], (tmp_len-(3+t-tmp)));
            t += strlen(t);
            if (feature < sizeof(logfeaturenames)/sizeof(*logfeaturenames))
            {
                strncpy(t, logfeaturenames[feature], (tmp_len-(3+t-tmp)));
                t += strlen(t);
            }
        }

				va_start(argList, fmt);
        vsnprintf(t, (tmp_len-(3+t-tmp)), fmt, argList);
        va_end(argList);
        if (tmp[strlen(tmp)-1]=='\n') tmp[strlen(tmp)-1]='\0';
        if (tmp[strlen(tmp)-1]=='\r') tmp[strlen(tmp)-1]='\0';

        int len = strlen(tmp); // save 3 bytes at end for /r/n/0
        tmp[len++] = '\r';
        tmp[len++] = '\n';
        tmp[len] = '\0';

	printf("\taddLogAdv len=%i %s\n", len, tmp);

Oct 09 '22 15:10 iprak

@iprak @openshwprojects Why do you always remove all dynamic allocations and replace with static memory, rather searching for real cause where memory overflow occurs, because replacing with static memory reduces temporarily, but problem not is solved. Dynamic allocation is more sensitive and in this cases may help solving by catching the issue easier. I know that someone somewhere wrote that dynamic allocation is risky, but how long since it was published? And may be was it not already solved, but not removed the notice. There is still couple hundred cases whem malloc is used on OS, drivers, LWIP (both versions), apps,....

Oct 10 '22 10:10 valeklubomir

static memory -> no heap fragmentation. heap fragmentation -> certain death over a period of time. Ideally, in an embedded system, you would have NO regular malloc in normal use. (i.e. outside initialisation, or OTA or other terminal code.). Yes, there are many mallocs in many places in the 'OS' code - and where possible, we should mitigate their use - e.g. with custom memory allocation routines rather than the general ones.

Oct 10 '22 10:10 btsimonh

@btsimonh That's all old, out of date point of view. I have build applications, which have been running for month, without restart. Heap fragmentation was never issue. Every time when app crashed and dynamic memory was involved it was due to some code reached read/write outside of allocated RAM space. Which can also occur within static memory. It is just observation. But more important is that. Using static memory in multi threading system is more riskier, is my opinion. You need to use mutex and it is again dynamic allocation.

Oct 10 '22 10:10 valeklubomir

@valeklubomir one of the reasons is that some of the platforms (W800 and maybe W600, I guess?) are using old heap_2 algorithm from RTOS instead of new heap_4.c algorithm and Heap_2 is not able to merge adjacent blocks https://cdmana.com/2022/03/202203310754244532.html someone would need to iterate all our platforms and make sure correct heap is used - the heap4

The second reason is that I've been debugging a strange crash few months ago on N platform (but maybe your LWIP fix fixed that) and I noticed that it happened after MQTT crash and free memory went suddenly down to the point that malloc crashes. In this rare scenario, it's better to have AT LEAST LOGGING using a non-malloc version so we AT LEAST get log output to know what happened.

To say it differently - I'd say that non-malloc logging is more predictable and safe

"n multi threading system is more riskier, " - logging is having a mutex already

Oct 11 '22 05:10 openshwprojects

OpenBK7231T_App OpenBK7231T_App copied to clipboard

Possible MQTT issue on initial flash/default config - could point to a cause of long term stability issues?

OpenBK7231T_App
OpenBK7231T_App copied to clipboard