easyMesh
after about an hour the mesh breaks apart ...
after about an hour the mesh breaks apart and devices start only sending messages to themselves ... I think it is related to the timestamps that I believe roll over after 71 minutes which is right about the time the mesh breaks apart ...
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707
startHere: Received from 796707 msg=Hello from node 796707

startHere: Received from 12593318 msg=Hello from node 12593318
startHere: Received from 12593318 msg=Hello from node 12593318
startHere: Received from 12593318 msg=Hello from node 12593318
startHere: Received from 12593318 msg=Hello from node 12593318
startHere: Received from 12593318 msg=Hello from node 12593318
startHere: Received from 12593318 msg=Hello from node 12593318
startHere: Received from 12593318 msg=Hello from node 12593318
I finally have been able to try easymesh today. Interesting. I have also found that it falls apart in time. I have not done any time checks. I think I have had problems before that point. Individual units dropping out of the mesh and taking a while connecting again. But I'm not sure about it or why.
I have it running on five ESP12 modules. I have a 4x20 LCD connected to each and am sending information there. I'm showing the number of connections and the current channel number. Also the unit's ID number (so I know who it is) and the message and who it was received from.
Too late to do more tonight.
OK, this is interesting. I have had to do resets to the units when the communications fell apart. And as sfranzyshen has outlined, the units would be receiving messages from themselves, along with a lot more frequent transmissions.
This afternoon I had to leave suddenly and I didn't have time to shut down my computer or turn off the ESPs. Four hours later and I'm back. And everything seems to be working. I don't know if everything is perfect but it isn't obvious that anything is wrong. It's hard to tell as I had been in the middle of adding more diagnostic routines. One being a switch/case function that takes the unit IDs and provides a string associated with the ID. A lot nicer than looking at the MAC address.
So I still don't quite know what is going wrong but I'm still looking at it.
`Received from msg=103 msg=Hi from 103 F9E12D9C
Received from msg=103 msg=Hi from 103 FA395F36
Received from msg=103 msg=Hi from 103 FAC0AE00
Received from msg=103 msg=Hi from 103 FB668983
Received from msg=103 msg=Hi from 103 FBD422F9
Received from msg=103 msg=Hi from 103 FC2EBF8F
Received from msg=103 msg=Hi from 103 FCD7BFA1
Received from msg=103 msg=Hi from 103 FD25C6DA
Received from msg=103 msg=Hi from 103 FD9CCC2E
Received from msg=103 msg=Hi from 103 FDF5F84A
Received from msg=103 msg=Hi from 103 FE53EA10
Received from msg=103 msg=Hi from 103 FEEA7BD1
Received from msg=103 msg=Hi from 103 FF4BC512
Received from msg=103 msg=Hi from 103 FFA12C82
Received from msg=103 msg=Hi from 103 FFA2970E
Exception (29): epc1=0x40206794 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000
ctx: cont sp: 3fff0900 end: 3fff0bc0 offset: 01a0
stack>>> 3fff0aa0: 3fff0a80 3fff0af0 3fff27b0 40206794
3fff0ab0: 3fff145c 0000003f 0000003e 00d8a536
3fff0ac0: 00d8a536 00000008 3ffefb8c 00000008
3fff0ad0: 3fff0b80 3fff27b0 3ffef8a8 40206a54
3fff0ae0: 3fff3304 00000000 3ffef8a8 3ffefb40
3fff0af0: 3fff1414 0000003f 0000003e 4010053d
3fff0b00: 00d8a536 3ffef8a8 3fff0b30 00f1fef6
3fff0b10: 00000008 3fff3304 ffea0b40 00000000
3fff0b20: 3fff0b80 00000008 3ffef8a8 40206b46
3fff0b30: 3ffe84f4 00000008 3ffef9bc 40204a7d
3fff0b40: 3fff27b0 3ffef8a4 3fff0b80 3fff3304
3fff0b50: 00ff0000 ff000000 3fff0b80 00080ff6
3fff0b60: 0089bd6e 3fff0b80 3ffef8a8 40206502
3fff0b70: 00000000 3ffef8a4 3ffef8a8 402026d9
3fff0b80: 3fff3304 0000000f 0000000b 00000000
3fff0b90: 00000000 00000000 00000001 3ffefb8c
3fff0ba0: 3fffdad0 00000000 3ffefb85 40204e98
3fff0bb0: feefeffe feefeffe 3ffefba0 40100718
<<<stack<<<
ets Jan 8 2013,rst cause:2, boot mode:(1,7)
ets Jan 8 2013,rst cause:4, boot mode:(1,7)
wdt reset
[unreadable serial output during reboot]
setDebugTypes 0x3
0x2 init():
0x2 apInit(): Starting AP with SSID=ESPmesh15859446 IP=192.168.246.1 GW=192.168.246.1 NM=255.255.255.0
0x2 DHCP server started
0x2 AP tcp server established on port 5555
0x2 stationInit():
startHere: New Connection, adopt=1
Received from msg=102 msg=Hi from 107 A3E18E
Received from msg=102 msg=Hi from 103 CD80BB
Received from msg=102 msg=Hi from 102 EF2C09
`
This must be system_get_time() (as used by getNodeTime()) overflowing.
That example actually doesn't take the overflow into account, so maybe it is just the example that stops working, not the mesh itself?
No dude, the uint32_t getNodeTime() fails in nodesync messaging. The time overflows after 71 minutes. I deleted all the timesync functions in this code; the problem is in the nodesync communication. I'm testing now.
Seven modules. Started going crazy fast before it crashed. Here is what I got.
it appears as if the fix listed in https://github.com/Coopdis/easyMesh/issues/13 fixes the problem with the nodes sending themselves messages ... and the fix listed in https://github.com/Coopdis/easyMesh/issues/8 fixes the corrupt data problem ... but timeouts and wdt resets still plague the code ... I believe the getNodeTime() is responsible for the wdt resets with this line
uint32_t ret = system_get_time() + timeAdjuster;
If system_get_time() + timeAdjuster ... exceeds uint32_t ... boom
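For illustration, here is a desktop-compilable sketch of what that addition does at the boundary. `nodeTime` is a stand-in written for this example, not the library function, and the inputs are arbitrary. In C++, unsigned arithmetic wraps modulo 2^32 rather than trapping, so the sum itself simply comes out as a small number; whether that leads to a reset would depend on how the wrapped value is used afterwards.

```cpp
#include <cstdint>

// Stand-in for getNodeTime(): local microsecond counter plus mesh offset.
// The result is a uint32_t, so the sum is reduced modulo 2^32.
constexpr uint32_t nodeTime(uint32_t localMicros, uint32_t timeAdjuster) {
    return static_cast<uint32_t>(localMicros + timeAdjuster);
}
```

With `localMicros = 4294967290` and `timeAdjuster = 10`, the function returns 4, because 4294967300 exceeds 2^32 by 4.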
I took an example mentioned by muratdemirtas and removed all of the timesync stuff ... and just left the nodesync in ... I then started two nodes, NodeA and NodeB. NodeA starts first, scans the network and finds no other nodes ... NodeB starts second, scans the network, finds NodeA and makes a STA connection ... Now here is where things are different. Since NodeA didn't find an AP during the scan, a timer is started that calls startStationScan() to rescan the wifi again in SCAN_INTERVAL (10000) [and it never gets turned off again, so it is called over and over again even if a connection is made] ... BUT NodeB never has a timer set ... so scanning stops, the wifi connection is established, and several nodesync messages are exchanged successfully ... However, NodeA does not have a STA connection and keeps scanning wifi for an AP ... it is during one of these scans that the AP reaches its NODE_TIMEOUT (3000000 //uSecs) for NodeB's TCP connection and drops it ... NodeB then sees the lost TCP connection and calls meshDisconCb() ... then wifi_station_disconnect() ... that kicks off connectToBestAP() ... and a wifi STA scan and connection happen again ... then everything repeats.
STA:
0x20\0x09meshRecvCb(): data={"dest":10469232,"from":10417291,"type":5,"subs":[]} fromId=10417291
0x10\0x09handleNodeSync(): with 10417291
0x10\0x09handleNodeSync(): valid NODE_SYNC_REQUEST 10417291 sending NODE_SYNC_REPLY
0x20\0x09sendMessage(conn): conn-chipId=10417291 destId=10417291 type=6 msg=[]
0x20\0x09Sending to 10417291-->{"dest":10417291,"from":10469232,"type":6,"subs":[]}<--
0x20\0x09meshRecvCb(): lastRecieved=215338692 fromId=10417291
0x8\0x09meshDisconCb(): 0x8\0x09Station Connection! Find new node. local_port=17370
0x8\0x09wifiEventCb(): EVENT_STAMODE_DISCONNECTED
0x8\0x09connectToBestAP():0x8\0x09connectToBestAP(): no nodes left in list, rescanning
0x4\0x09stationStatus Changed to STATION_IDLE
0x8\0x09manageConnections(): dropping 10417291 ESPCONN_CLOSE
0x8\0x09closeConnection(): conn-chipId=10417291
0x8\0x09-->scan started @ 227302978<--
0x8\0x09stationScanCb():-- > scan finished @ 229429473 < --
AP:
0x10\0x09manageConnections(): start nodeSync with 10469232
0x10\0x09startNodeSync(): with 10469232
0x20\0x09sendMessage(conn): conn-chipId=10469232 destId=10469232 type=5 msg=[]
0x20\0x09Sending to 10469232-->{"dest":10469232,"from":10417291,"type":5,"subs":[]}<--
0x20\0x09meshRecvCb(): data={"dest":10417291,"from":10469232,"type":6,"subs":[]} fromId=10469232
0x10\0x09handleNodeSync(): with 10469232
0x10\0x09handleNodeSync(): valid NODE_SYNC_REPLY from 10469232
0x20\0x09meshRecvCb(): lastRecieved=198334261 fromId=10469232
0x8\0x09-->scan started @ 199169317<--
0x10\0x09manageConnections(): start nodeSync with 10469232
0x10\0x09startNodeSync(): with 10469232
0x20\0x09sendMessage(conn): conn-chipId=10469232 destId=10469232 type=5 msg=[]
0x20\0x09Sending to 10469232-->{"dest":10469232,"from":10417291,"type":5,"subs":[]}<--
0x8\0x09stationScanCb():-- > scan finished @ 201307844 < --
0x8\0x09\0x09found : Mesh10469232, -19dBm0x8\0x09 MESH_PRE< ---0x8\0x09
0x8\0x09\0x09found : Sasquatch Sighting, -75dBm0x8\0x09
0x8\0x09\0x09Found 1 nodes with _meshPrefix = "Mesh"
0x8\0x09connectToBestAP():0x8\0x09connectToBestAP(): no nodes left in list, rescanning
0x8\0x09manageConnections(): dropping 10469232 NODE_TIMEOUT last=198334261 node=202334800
0x8\0x09closeConnection(): conn-chipId=10469232
0x8\0x09meshDisconCb(): 0x8\0x09AP connection. No new action needed. local_port=5555
0x8\0x09wifiEventCb(): EVENT_SOFTAPMODE_STADISCONNECTED
After changing NodeA to NOT perform rescans (disabled the timer) ... the NODE_TIMEOUT is never reached and everything stays connected and exchanging nodesync messages .. indefinitely ... my conclusion is we need to handle scans better ... or not timeout if in a scan ??? .
here's the diff between the original and what I did here for testing ... https://github.com/sfranzyshen/easyMesh/compare/master...sfranzyshen:no-timing?expand=1 and here is the branch ... https://github.com/sfranzyshen/easyMesh/tree/no-timing
UPDATE: With the NODE_TIMEOUT set to 4 sec ... I still had 9 timeouts happen in over 9 hrs ... with nothing going on except nodesync messages being handed back and forth ... so I have set it up to 10 sec ... so far ( couple hours) no timeouts ... so my guess is it's more about how efficient or speedy the code is ... than messages being lost or dropped ... we just need a better handshake mechanism ... or/and at very least not fail so hard at the first sign of timeout ...
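One way to "not fail so hard" at the first timeout is to allow a few missed periods, probing the peer in between, before closing the connection. The sketch below is only an illustration of that idea: `ConnHealth`, `Action`, `onMessage` and `checkConn` are hypothetical names, and what a "probe" would be (for example another nodesync request) is left open.

```cpp
#include <cstdint>

// Hypothetical per-connection bookkeeping; not part of easyMesh.
struct ConnHealth {
    uint32_t periodStart = 0;  // node time (us) when the current grace period began
    uint8_t  missed = 0;       // consecutive periods with no message
};

enum class Action { None, Probe, Drop };

// Call from the receive callback: any message resets the miss counter.
void onMessage(ConnHealth& c, uint32_t now) {
    c.periodStart = now;
    c.missed = 0;
}

// Call periodically. Returns Probe after a silent period and Drop only
// once maxMissed silent periods have passed in a row.
// The subtraction form keeps the check correct across counter rollover.
Action checkConn(ConnHealth& c, uint32_t now, uint32_t timeout, uint8_t maxMissed) {
    if (now - c.periodStart <= timeout) return Action::None;
    c.periodStart = now;       // start the next grace period
    if (++c.missed >= maxMissed) return Action::Drop;
    return Action::Probe;
}
```

With `timeout = 3000000` and `maxMissed = 3`, a peer has to stay silent for three consecutive periods before it is dropped, and a single message in between starts the count over.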
I found a problem in the manageConnections() function in the easyMeshConnection.cpp file ... after stripping all of the timesync stuff from the code and just running the connection and nodesync code I still get STA disconnects (dropped by the AP) for NO apparent reason ... even when the timeout is set as high as 10 sec ... so I made changes to account for the clock rollover ... still testing, see http://www.esp8266.com/viewtopic.php?f=6&t=11849&start=36#p57102
easyMesh-no-timing
I have it all up and running. I had to change all my getNodeID to getChipId. I had grown attached to getChipId but I think I will get over it. (maybe)
The network seems to be a little harder to start. It seems that if a node is having a hard time connecting it just doesn't. (take with a grain of salt) I did have five connected and the other two connected but they would not join the five. Or I wasn't patient enough, I ended up resetting those modules.
I have had all seven connected and one had dropped out. But soon reconnected. :)
One thing was odd. I display the system node count and the stations connected on the LCD. The connection count was higher than the node count. Maybe the connection count was old or the node count wasn't updated yet. I don't feel those are what it was but I'm not worried about it.
I manually injected broadcast packets via the serial-USB connection to the five that are still connected to my computer and all modules received the packet. All good. Two I have remote and don't have serial connections to so I rely on the LCD.
I'll try and keep my hands off and let it run all night.
Not so good. It looked like I had 2 + 2 + 1 + 1 + 1 node counts for the 7 modules. It seems that the network breaks up and is happy to not bother trying to join with others.
I did look through the serial history and it still looks like rollover causes wdt resets. Not every time. I ended up resetting three and then building connections happened again, with modules I had not reset.
I didn't have time to look through the details of this version so I don't know what the limitations are.
try my codes for 71 minutes. https://github.com/muratdemirtas/ESP8266_MQTT_MESH
Don't touch the MQTT prefix; you must use it in mesh mode.
I updated it, so you can install.
Can I ask which routing protocol you are using? https://en.wikipedia.org/wiki/List_of_ad_hoc_routing_protocols
Same with this code. I'm fixing errors when I'm free
I will look at this when I get home from work in 9+ hours.
the no-timing build is built from the original code ... plus the fixes discussed in this thread only ... so NO major changes have been done ... I ran the code for 9 hours on two nodes ... NodeA starts first and doesn't try to connect to anything ... NodeB starts and connects as a STA to NodeA's AP ... they handle nodesync messages back and forth and nothing else ... The STA only dropped once (in 9 hours) ... but it was correctly in a TIMEOUT state ... so the rollover problem seems to be gone ... at least in the nodesync
manageConnections(): dropping 10417291 now= 2054447821 - last= 2051447248 ( 3000573 ) > timeout= 3000000
but when I run this code with the AP constantly scanning the wifi ... I get more timeouts and at least one wdt reset was observed ... So I am now convinced that the wdt reset and remaining timeouts are being caused by wifi scanning and/or os_timer stuff ... I have noticed that the timer seems to stay set across resets ... I am thinking about splitting the call to connectToBestAP() out of the stationScanCb() function so that program flow is returned faster (so that other things can get done ... like processing nodesync messages ...) also ... I'm going to eliminate the os_timer and use a loop counter instead ... the timer interrupt is firing off at a bad time ...
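Replacing the timer with something polled from the main loop could look roughly like the sketch below. It checks elapsed time on each pass rather than counting loop iterations, which is an assumption about what is wanted here; `RescanPoller` and its members are hypothetical names, not easyMesh code.

```cpp
#include <cstdint>

// Hypothetical helper: decides from the main loop when a rescan is due,
// so no timer interrupt is involved.
struct RescanPoller {
    uint32_t intervalUs;       // time between scans, in microseconds
    uint32_t lastScan = 0;     // node time of the last scan request
    bool     enabled = false;  // false once a STA connection exists

    explicit RescanPoller(uint32_t interval) : intervalUs(interval) {}

    void arm(uint32_t now) { lastScan = now; enabled = true; }
    void disarm()          { enabled = false; }

    // Call once per loop(); returns true when a scan should be started.
    // The subtraction keeps the check valid across counter rollover.
    bool due(uint32_t now) {
        if (!enabled || now - lastScan < intervalUs) return false;
        lastScan = now;
        return true;
    }
};
```

Because the decision is made at a known point in the loop, the scan can only start between other pieces of work, and `disarm()` gives a clear way to stop rescanning once a connection is made.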
@BlackEdder If I had to choose from which routing protocol we are using here ... I would go with Table-driven (proactive) routing.
One thing I noticed was that the scan count shows that it goes through 14 channels. Should it? A little time may be gained if the higher channels were dropped. Not that I expect that it would make any significant difference.
@RudyFiero In my devel branch ... I have taken this into account ... I only scan the mesh channel and mesh ssid ... ignoring everything else ... and it speeds the scan up by a lot!
@muratdemirtas your code is going to drop connections because of the clock rollover in the manageConnections() function ... change the equation to be rollover safe ... The trick is to always calculate the time difference, and not compare the two time values: instead of doing (last + timeout < now), do (now - last > timeout). So let's say last is 4294967290 (just before rollover), and now is 10 (just after rollover). Then (now - last) is actually 16 (not -4294967280), since the result is calculated as an unsigned long (which can't be negative, so it rolls around itself).
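The two forms can be put side by side in a small standalone sketch. The function names are made up for this example; the 3000000 timeout is the NODE_TIMEOUT value quoted earlier in the thread.

```cpp
#include <cstdint>

// Naive form: misfires when last + timeout wraps past 2^32.
constexpr bool timedOutNaive(uint32_t now, uint32_t last, uint32_t timeout) {
    return static_cast<uint32_t>(last + timeout) < now;
}

// Difference form: elapsed time is computed modulo 2^32, so it stays
// correct as long as less than one full rollover period has passed.
constexpr bool timedOutSafe(uint32_t now, uint32_t last, uint32_t timeout) {
    return static_cast<uint32_t>(now - last) > timeout;
}
```

Take `last = 4294967290` and `timeout = 3000000`. At `now = 4294967295`, only 5 microseconds later, the naive form already reports a timeout because `last + timeout` has wrapped to 2999994, while the difference form sees an elapsed time of 5 and keeps the connection. At `now = 10`, just after rollover, the difference form sees 16.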
I said try my code for 71 minutes. https://github.com/muratdemirtas/ESP8266_MQTT_MESH
and
you get the getNodeTime overflow.
I will fix it.
All nodes must talk within 3 seconds. lastseen - getnodetime() is not good. I know.
Hey Scott, I just loaded your last Dev and I get a compiling error.
->error: 'struct meshConnectionType' has no member named 'chipId'
C:\Users\Rudy\Documents\Arduino\libraries\easyMesh-devel\src\easyMeshConnection.cpp: In member function 'void easyMesh::manageConnections()':
C:\Users\Rudy\Documents\Arduino\libraries\easyMesh-devel\src\easyMeshConnection.cpp:60:126: error: 'struct meshConnectionType' has no member named 'chipId'
debugMsg( CONNECTION, "manageConnections(): dropping %d now= %u - last= %u ( %u ) > timeout= %u \n", connection->chipId, nowNodeTime, connLastRecieved, nowNodeTime - connLastRecieved, nodeTimeOut );
^
C:\Users\Rudy\Documents\Arduino\libraries\easyMesh-devel\src\easyMeshConnection.cpp: In static member function 'static void easyMesh::meshRecvCb(void*, char*, short unsigned int)':
C:\Users\Rudy\Documents\Arduino\libraries\easyMesh-devel\src\easyMeshConnection.cpp:336:127: error: 'struct meshConnectionType' has no member named 'chipId'
staticThis->debugMsg( COMMUNICATION, "meshRecvCb(): lastRecieved=%u fromId=%d\n", receiveConn->lastRecieved, receiveConn->chipId );
oops! ... fixed
I just renamed it to nodeID. It compiles so I think it is OK.
What I am seeing so far. It works better than the last one. It recovers better but there still is a lot of dropping of nodes.
As I have said before, I am testing with seven modules. Three are currently on/near my desk, within one meter of each other. The next one is four meters away. The next is about five meters away but in a slightly different direction from the four-meter one. The next one is ten meters away. The final one is probably 15 meters away. RSSI values range from -36 to -77, with more around -57. What is odd are RSSI values of 311 - 309. This will often be followed a few seconds later by a scan, but not always. Sometimes the signal gets back into a normal range.
About half an hour ago I was getting a lot of traffic, say about ten per second. Hard to say. The norm tends to be an average of one per second. While this was happening I just watched it. It was at the second furthest node so I could only see what was happening on its LCD. I didn't have any serial terminals enabled so I didn't log it. But with this large increase in messages I saw the node count go up to eight. A minute later the message volume went back to normal and the node count went back to seven.
I should log this with time stamps but I don't think I will get to it tonight. I forgot to mention, the five closer modules are connected via usb-serial.
The firmware I am using is the simple example with lcd routines to display messages From and Received (in a shorter form). The RSSI, the node count, and the number of clients connected. I also have a server running with websockets to control a relay. This works :) but I never know who I'm connected to so I don't know what IP to use.
It's going nuts again. I said it was like ten per second; no, faster than that. It was cycling through messages, as if there was a buffer of messages and it was rotating around in a loop through the buffer. Oh yeah, I also have a serial entry routine so that on the serial connected modules I can insert my own message and have it broadcast. I do 106 XXXXXXXXXXXXXXX or 107 TTTTTTTTTTTTT, a pattern for each different module so that I can easily tell it apart from the normal automated messages. And when it went nuts I inserted some of those and I then saw it repeating every (checking now) ... I'll just include it below.
I stopped it by resetting a couple of modules and now they are all back to normal. Behaving.
muratdemirtas - I wasn't able to look at your stuff. I took a quick peek but I ran out of time.
... this is the core problem with easyMesh ...
OK, so after stripping this code down to the bare bones ... minimum application layer ... just connection and nodesync code ... with two nodes ... I still reach timeouts on the AP during wifi scanning ... resulting in the STA disconnecting and reconnecting ... shaking everything up ... generating a lot of messages ... eventually overloading the sendQueue ... leading to wdt resets ... some nodes recover wdt resets ... but not always
... if I disable wifi scanning on the AP node ... I do not reach timeouts ...
I have pushed these changes (hack) up to the no-timing branch if anyone else wants to experiment with it ... it's here ... https://github.com/sfranzyshen/easyMesh/tree/no-timing ... If you want to change one node (the intended AP node) to not scan the network, see the code in easyMeshSTA.cpp ... and just un-remark the idle line ...
    if ( staticThis->_meshAPs.empty() ) {  // no meshNodes left in most recent scan
        // debugMsg( GENERAL, "connectToBestAP(): no nodes left in list\n");
        // wait 5 seconds and rescan;
        debugMsg( CONNECTION, "connectToBestAP(): no nodes left in list, rescanning\n");
        // os_timer_setfn( &_scanTimer, scanTimerCallback, NULL );
        // os_timer_arm( &_scanTimer, SCAN_INTERVAL, 0 );
        _lastScanned = staticThis->getNodeTime();
        _scanStatus = RESCAN;
        // _scanStatus = IDLE; //un-remark this to disable rescanning over and over on AP ... for test
        return false;
    }
I understand why scanning needs to be done. One reason is to see if there are other parts of the network on other channels, so that they may become joined. My understanding is that scanning is a disconnect of whatever channel you were on so that you can view other channels. And those disconnects cause broken communications. I don't know enough about TCP, whether it will recover the packets lost during a scan. In theory yes but how much time is the maximum before it fails? What if there were other collisions on top of the scan time?
Again, I don't know the details of where things are and what is allowable before breakage. But one of my thoughts, if scanning delays are an issue , is that we don't need to scan all channels at once. Once a network has been established we can peek at other channels over a period of time, not requiring all channels to be looked at in the same instance. Just a thought on a potential optimization.
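The "peek at other channels over a period of time" idea could be driven by a small round-robin picker like the sketch below. `ChannelRotor` is a hypothetical name, and how the chosen channel is handed to the actual scan call is deliberately left out.

```cpp
#include <cstdint>

// Hypothetical round-robin channel picker; not part of easyMesh.
// Each call yields one channel, so a full sweep is spread over many
// short scans instead of one long one.
class ChannelRotor {
public:
    ChannelRotor(uint8_t first, uint8_t last)
        : first_(first), last_(last), next_(first) {}

    // Returns the channel to scan now and advances to the following one,
    // wrapping back to the first channel after the last.
    uint8_t next() {
        uint8_t ch = next_;
        next_ = (next_ >= last_) ? first_ : static_cast<uint8_t>(next_ + 1);
        return ch;
    }

private:
    uint8_t first_;
    uint8_t last_;
    uint8_t next_;
};
```

Constructed as `ChannelRotor(1, 14)`, it would hand out channels 1 through 14 in turn and then start again at 1, leaving the caller free to return to the mesh channel between calls.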
I'll have a look at the no-timing version when I get home.
@sfranzyshen dude you can not do anything for dropping after 71 minutes in mesh network without deleting 32 bit all system_get_time() based functions.
read this from easymesh mainpage
uint32_t easyMesh::getNodeTime( void )
Returns the mesh timebase microsecond counter. Rolls over 71 minutes from startup of the first node.
2^32 = 4,294,967,296 microseconds ≈ 4,295 seconds ≈ 71.6 minutes
    if ( connection->lastRecieved + NODE_TIMEOUT < getNodeTime() ) {
        debugMsg( CONNECTION, "manageConnections(): dropping %d NODE_TIMEOUT last=%u node=%u\n", connection->chipId, connection->lastRecieved, getNodeTime() );
        connection = closeConnection( connection );
        continue;
    }
After that the getNodeTime() count resets, but the problem is the last-received value.
Look at an example: last received = 4,294,967,295, timeout = 3000, but getNodeTime() is 200 :)
I'm not clear on what you mean "you can not anything for dropping mesh network without deleting all system_get_time based functions."
Seems like a missing word or two.
@RudyFiero my english not good. :(
@muratdemirtas
    uint32_t nowNodeTime;
    uint32_t nodeTimeOut = NODE_TIMEOUT;
    uint32_t connLastRecieved;
    SimpleList<meshConnectionType>::iterator connection = _connections.begin();
    while ( connection != _connections.end() ) {
        nowNodeTime = getNodeTime();
        connLastRecieved = connection->lastRecieved;
        // The trick is to always calculate the time difference, and not compare the two time values.
        if ( nowNodeTime - connLastRecieved > nodeTimeOut ) {
            debugMsg( CONNECTION, "manageConnections(): dropping %d now= %u - last= %u ( %u ) > timeout= %u \n", connection->chipId, nowNodeTime, connLastRecieved, nowNodeTime - connLastRecieved, nodeTimeOut );
            connection = closeConnection( connection );
            continue;
        }
working for me past 71 minutes ;)
@sfranzyshen
When you're using 5+ nodes, the esp8266 chokes while calculating (nowNodeTime - connLastRecieved > nodeTimeOut) (same with easymesh). This causes wdt resets on some esp8266s. The calculation must be removed.
@muratdemirtas are you sure this is a "load" issue? I have esp's do this type of calculation with millis() and micros() all the time ... and keep up with other things going ... but I am intrigued as to how to handle this differently ... if you have any ideas for eliminating the NODE_TIMEOUT i'm open for suggestions ... ? until then the equation i am using ... works past 71 minutes ... so that's not my problem anymore ... also I do not agree that the wdt resets are happening because of the (Now - Last > Timeout) equation ... but rather it is coming from the sendQueue SimpleList ... it's backing up in the sendPackage() function ...
    if ( connection->sendReady == true ) {
        blah blah
    }
    else {
        connection->sendQueue.push_back( package );  // <<-- This is line 89
    }
The wdt reset exception is actually pointing at simplelist ...
Exception (29):
epc1=0x40204a23 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000
ctx: cont
sp: 3fff04b0 end: 3fff07b0 offset: 01a0
Decoding 1 results
0x40204a23: SimpleList<String>::AllocOneBlock(bool) at /home/user/Arduino/libraries/SimpleList/SimpleList.h line 230
: (inlined by) SimpleList<String>::push_back(String) at /home/user/Arduino/libraries/SimpleList/SimpleList.h line 77
: (inlined by) easyMesh::sendPackage(meshConnectionType*, String&) at /home/user/Arduino/libraries/easyMesh/src/easyMeshComm.cpp line 89 //<<-- Here is the point of failure
also the code in the above message will still work with the timeAdjuster ...
Serial.println( ( uint32_t ) ( 10 - 4294967290 ) ); // 16
Serial.println( ( uint32_t ) ( 4294967290 + 10 ) ); // 4
So NOW my problem is how to handle the messaging ... during scans .
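One possible way to stop the sendQueue from growing without bound while the radio is busy is to cap its length. The sketch below is a desktop-style illustration using the standard library rather than SimpleList; `BoundedQueue` is a hypothetical class, and discarding the oldest package is a policy choice that does lose messages.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical capped queue; easyMesh itself queues Strings in a SimpleList.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Adds a package. If the queue is full, the oldest entry is discarded
    // first. Returns false whenever something had to be thrown away.
    bool push(const std::string& pkg) {
        if (capacity_ == 0) { ++dropped_; return false; }
        bool evicted = false;
        if (q_.size() >= capacity_) {
            q_.pop_front();
            ++dropped_;
            evicted = true;
        }
        q_.push_back(pkg);
        return !evicted;
    }

    // Removes the oldest package into out; returns false if empty.
    bool pop(std::string& out) {
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop_front();
        return true;
    }

    std::size_t size() const    { return q_.size(); }
    std::size_t dropped() const { return dropped_; }

private:
    std::size_t capacity_;
    std::size_t dropped_ = 0;
    std::deque<std::string> q_;
};
```

The `dropped()` counter gives a cheap way to see, from a debug message, how often the queue overflowed during a scan.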
@muratdemirtas "my english not good. :("
Much better than my German. (only other language I understand, but only a little)
Since we are tied to the esp8266 (for now), we only have one radio to work with. The STA and AP are always set to the same channel: if the STA changes channel to connect to another AP ... its AP channel is also changed. So when we scan all 14 wifi channels, we are switching the radio away from the current channel (the one we are messaging over) to the other channels to scan for APs ... when done ... we return to our current (mesh) channel and try to catch up with the incoming messages we missed ... and the outgoing messages that were queued up ... it's during this period in the protocol that things don't always catch up ... timing out connected STAs ... overloading the sendQueue (on all connected nodes) ... and it all breaks apart.

As of right now the protocol dictates that any node connected as a STA to another node's AP ... doesn't re-scan the wifi network ... yet if a node doesn't have a STA connection to another node's AP ... it will re-scan the wifi network every 5 sec (SCAN_INTERVAL) ... looking for a node that is not already part of the mesh to connect to. This means that at least one node of any easymesh mesh will perpetually be in this scanning loop ... doomed to fail. So we need to add some controls into the protocol to handle messaging during wifi scanning. I feel that once this problem is addressed ... we will finally be able to create something stable. Here is a list of mechanisms I'm playing with to try and address this issue ... all feedback is welcome ...
- limit scans to mesh wifi channel & ssid (already in devel branch)
- scan one channel at a time ... returning back to the mesh channel in between each scan
- stop outgoing message sending before scans
- handle timeouts (incoming) messages during scans
- notify other nodes about scans
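The "handle timeouts during scans" item could be approached by not counting off-channel time against a connection. The sketch below is one hedged way to do that; `ScanAwareTimeout` and its methods are hypothetical, and the numbers in the note after it are taken from the AP log earlier in this thread.

```cpp
#include <cstdint>

// Hypothetical bookkeeping for one connection; names are illustrative.
struct ScanAwareTimeout {
    uint32_t timeoutUs;          // e.g. NODE_TIMEOUT
    uint32_t lastReceived = 0;   // node time (us) of the last message
    uint32_t scanStart = 0;      // node time when the current scan began
    bool     scanning = false;

    explicit ScanAwareTimeout(uint32_t t) : timeoutUs(t) {}

    void onMessage(uint32_t now)   { lastReceived = now; }
    void onScanStart(uint32_t now) { scanStart = now; scanning = true; }

    // Credit the time spent off-channel back to the connection.
    void onScanDone(uint32_t now) {
        lastReceived += now - scanStart;
        scanning = false;
    }

    bool timedOut(uint32_t now) const {
        if (scanning) return false;             // never drop mid-scan
        return now - lastReceived > timeoutUs;  // rollover-safe form
    }
};
```

In the AP log above, the last message arrived at 198334261, the scan ran from 199169317 to 201307844, and the drop happened at 202334800, which is 4000539 microseconds of silence against a 3000000 timeout. Crediting the 2138527 microseconds of scan time leaves 1862012, so with this scheme that particular drop would not have occurred.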
Yes. That was what I was thinking of. The way I see it, if we have been attached to a network and been operating for a while, we have no reason to be scanning on different channels. I think that only when we power up and are trying to find a network should we be looking at all channels.
I'm not quite sure why scanning is taking place, currently. I have had modules one foot away from other modules and it is trying to connect elsewhere, and I don't see why.
One thought. If we are searching on other channels, it should be done in an orderly fashion. We should notify the other nodes on the network to not send messages for the scan duration. Of course this will cause problems if there are a number of nodes connected to the scanning node, as no message can go through it. I think that when a scan is done from a networked module, it should only scan one channel, then get back as soon as possible and resume normal operation.
I don't remember what I have written before but your list matches what I thought would be a preferable method.
The other thing I was wondering is what happens when we have a web page we need to serve to a mobile device. We shouldn't be breaking connections by scanning. I see scanning as very bad for a network. In an established network it should be done sparingly, maybe only by non-AP nodes, in order to reduce disruption. Not an ideal thing but it might be workable.
OK I went back and read your comment again. And again I think we see things the same way. That scanning is a major problem for stability.
Another thing that should be considered is AP upgrades. I have often seen nodes connected to an AP with a poor signal while there is another node right beside them that they are talking to indirectly through the poorer node. And here is where it makes my head spin. I think there should be node connection upgrades as long as there is a significant improvement. Just upgrading because you found someone better shouldn't be done unless there is a significant difference, but even then it should not be done if the current connection is decent. If there are nodes connected to the one considering the upgrade, then it should require a good reason, since it could disrupt existing communications.
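The upgrade rules described here can be written down as a small decision function. This is only a sketch of those rules: `UpgradePolicy` and `shouldSwitchAp` are hypothetical names, and the threshold values used in the note below are placeholders, not measurements.

```cpp
// Hypothetical thresholds for deciding whether to move to a better AP.
struct UpgradePolicy {
    int decentRssiDbm;      // at or above this, the current link is good enough
    int minGainDb;          // improvement required before switching
    int extraGainIfParent;  // additional gain required when other nodes
                            // are connected through this one
};

// Returns true only if the current link is below "decent" and the
// candidate is better by the required margin.
constexpr bool shouldSwitchAp(int currentRssi, int candidateRssi,
                              int connectedClients, UpgradePolicy p) {
    return currentRssi < p.decentRssiDbm
        && candidateRssi - currentRssi
               >= p.minGainDb + (connectedClients > 0 ? p.extraGainIfParent : 0);
}
```

With a policy of `{-65, 10, 10}`, a node at -57 dBm stays put even if a -36 dBm AP is visible, a leaf node at -77 dBm moves to a -57 dBm AP, and a node at -77 dBm that has two clients of its own needs the full 20 dB gain before it moves.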
Does anyone want to develop this topology with me?
@muratdemirtas I assume you are talking about your project: https://github.com/muratdemirtas/ESP8266_MQTT_MESH
Are there any particular issues you need help with?
I can run my mesh system for 10+ hours with 5 NodeMCUs. My new aim is to port this library to Raspberry Pi with Qt C++. I don't need help, thx. I deleted the MQTT functions; MQTT will run on embedded Linux.
https://scontent-frt3-1.xx.fbcdn.net/t31.0-8/14884634_193379897779767_324231978280270728_o.jpg
I looked at what you had and because of the MQTT I didn't want to try it. It is something I have not used yet and I don't have any accounts that I can use yet. I do think it is a desirable component but it is not something I wanted to have issues with when trying to verify the mesh operation. If you have something that did not need the MQTT then I would like to have a look at it and maybe test it.