Long term stability
I have experienced a number of issues attempting to run v0.4.2 simultaneously as an NTRIP server and a TCP server, which may be related. To rule out power fluctuations, power is supplied to the simpleRTK2B simultaneously and independently via both USB sockets.
-
when running continuously, esp32-xbee disconnects both NTRIP and TCP every 20865 seconds. This results in a loss of coverage, eg because RTKLIB STRSVR, which connects to the esp32-xbee socket, takes 10 seconds to time out while retrying and a further 10 seconds after reconnecting to get up and running with useful data.
-
esp32-xbee is also experiencing "panic" resets. I have attached logs and core_dumps from a couple of recent occasions. In the second instance I have included some heap_info outputs to see whether this was caused by a stack overflow. In each instance you can see effect no.1 happening 5 hrs 50 mins after the TCP socket was first set up.
-
after a few days esp32-xbee seems to lose wifi connectivity, sometimes needing a full reset (>5 secs on the boot button) to get running again. Because of issues no.1 and no.2 above I've not investigated whether this is a problem with my wifi or with the XBee.
Are any of these issues addressed in the new v0.4.3 just posted?
core_dumpand log2.zip core_dumpandlog1.zip
PS I am not sure what purpose uart_log(buffer, n) serves in log.c in the current implementation, but I wonder whether it should be called regardless of whether xRingbufferSend returns an error when the ring buffer gets full.
PPS Just a suggestion - could there be an option for a fixed wifi IP (rather than DHCP) to ensure the socket server always comes up on the same address?
UPDATE:
Now it is disconnecting and reconnecting even more frequently, c.4 hours and then c.2 hours - see extended log ... espxbeelogpanicagaincontinued.txt
UPDATE:
Just in case TWO output streams was a bit much, I turned off the socket server, but I am still getting panic resets every few hours (and the disconnecting and reconnecting from the caster). Does the core_dump indicate the cause of the "panic"?
UPDATE:
Running with NO server and NO client enabled seems (so far) to have cured the panic resets, but "httpd_txrx: httpd_sock_err: error in recv : 113" messages (also errno 104) are still coming up periodically. These are what has been happening all along each time esp32-xbee disconnected the TCP server socket and NTRIP, so it is probably not a coincidence.
Unfortunately it doesn't say which socket (!) so I guess it is the socket listening on port 80 for www.
No idea why errors should be occurring, but I have checked it is not a local network issue because disconnecting Wifi produces many different logs.
Question: Is there something in the httpd_sock_err recovery code which causes ALL sockets to be reset? Or something which is corrupting all the handles on the stack? If so that could explain issue no.1.
PS Now going to test with just socket server enabled to see what happens.
Hi Andrew, Thank you for all your research! I will be back from vacation tomorrow and will look through the logs to figure out a solution. Let me know if you have any new information.
UPDATE:
Running with just the socket server enabled still results in panic resets.
Here is a log and 2 core_dumps.
CONCLUSION
- "Panic" resets
These occur when either server is enabled or both. The only stable solution seems to be when no output is enabled (!). [NB I have not tried it long term with clients or caster enabled]
- Socket disconnections
I have noticed that rtk2go.com lists mountpoint ne36 as using "NTRIP_ESP32_XBee_Ntrip_Server/1.0" but achieving uninterrupted periods of 23hrs+ so I think this is my issue not a feature of the code.
To check that, I swapped out the esp32-xbee for an RS232 XBee module hard-wired to an HP thin client (Windows XP running STRSVR in the background) connected to the same wifi. This too loses connection to rtk2go at random intervals (min 5 minutes, max 7hrs+), so I think this is a network issue - maybe a feature of the CGNAT used by EE. Maybe the intervals are now random rather than the predictable 20865 seconds because it depends on total network traffic. Perhaps these disconnections and reconnections are what is upsetting the code and causing the panics?
With the wired RS232 connection the solution is reasonably robust, as the stream recovers in 13 seconds, so this is less of an issue than the esp32 panic resets, which take longer, or the wifi problems, which are terminal.
- Losing wifi connectivity
This is less frequent but the most inconvenient because it requires manual intervention. I have no clue what is going on. The logs above include a couple of instances with different symptoms.
Thanks for offering to look into this - I hope this gives you enough to narrow down the cause.
NOTE
In other circumstances esp32-xbee running the same version of the NTRIP server can run continuously long-term - see the extract from the rtk2go.com status report below. I guess there is just some limitation in my setup which it does not tolerate well.
Base Station Mount Point Details: ne36
Station details for stream: ne36
Located at: Donavon, Canada
NTRIP Agent: NTRIP_ESP32_XBee_Ntrip_Server/1.0
...
Recent Uptime Stats and Plots
Last Restart: 5 Days 14:08 Up(21/46) (5 Days 14:08 (HH:MM) 100.0% overall) is 46th connection
For some reason all of your core dumps are giving me this error, so I can't get anything useful out of them. You didn't happen to compile your own v0.4.2 binaries? The ELF file probably needs to match the core dump.
```
espcoredump.py v0.4-dev
WARNING: Skip task's (3ffc6690) stack 456 bytes @ 0x3ffc6690. (Reason: Can not add overlapping region [3ffc6690..3ffc6857] to ELF file. Conflict with existing [3ffc6690..3ffc67f3].)
ERROR: Growing up stacks are not supported for now!
ERROR: Failed to create corefile!
```
The httpd_txrx errors are ECONNRESET (104, Connection reset by peer) and EHOSTUNREACH (113, No route to host), which don't seem too serious.
WiFi and panics are still a mystery to me from those logs unfortunately. I'll release a new version today with some minor fixes, would be good to see if there is any improvement.
Thanks again for all the research!
No, I am using the binaries which came pre-loaded from Ardusimple (the only reason I looked at the code was to see if I could get any clue as to what was going wrong).
But googling for that error found someone who claimed to have solved the growing up stacks error ...
"Got this figured out. If I remove the first four bytes (the magic number) it all works fine! Looks like the flash loader removes these automatically and the raw loader doesn't expect them."
Does this help get any clues from the core_dumps (but as far as I can tell the first 4 bytes are the file length not a magic number so perhaps not!)?
Otherwise I'll look out for your new binaries (0.4.4?) and let you know when I get a chance to swap the ESPXbee back in.
Tried uploading v0.4.4 as per the guide. I am using the POWER+XBEE port on the simpleRTK2B to upload. How is this wired? Is it necessary to disable all UART2 output to avoid a conflict?
I didn't do so at first because the instructions didn't ask me to do that, but I found that turning off UART2 was the only way to get a green FINISH box as pictured below. Perhaps it is worth clarifying this in the firmware update instructions.
BUT when I restart the ESP32 XBee it still reports v0.4.2!
Any ideas what I am doing wrong (eg is it correct to have the 32MBit box ticked when 128Mbit is detected?)

Yes you are correct, I've updated the guide with a warning about that.
That is very unusual... were you ever able to successfully complete an update? The flash size doesn't matter. Could you try flashing this? Same locations as above, just no www.bin. blink.zip
No, I never succeeded. I got the green FINISH box several times as above, but no change in the XBee. I couldn't find any kind of log to see what's what.
blink.zip is no different. Still comes back as 0.4.2 when power-cycled or reset.
NB I am using download tool v3.6.8
Found a console window in the background. It says:

```
======
CONNECT BAUD: 115200
============
....Efuse CONSOLE_DEBUG_DISABLE is already burned.
Uploading stub...
Running stub...
Stub running...
is stub and send flash finish
```
This alone doesn't seem incorrect. Could it be because you have not checked the boxes beside the bin files in the flasher tool?
It was! v0.4.4 loaded now thanks. Will see how it runs.
I am sorry but it is still panic resetting (at least twice already).
See attached (also some new messages from ntrip_server_task)
Are these core_dumps readable/useful?
Still getting odd behaviour (see log) and panic resets (do you need any more core dumps?)
I don't know how to investigate the cause of the panic resets, but I suspect much of the odd behaviour is because the wifi router connects over a 4G link (EE) rather than copper/fibre (I guess 4G is more common in rovers than bases). So I have been looking at ntrip_server.c to see if it could be made less sensitive to temporary dropouts. I am not familiar with ESP/FreeRTOS, but I assume it already includes generous retries/timeouts in socket write() etc.
BUT I have noticed that when ANY error occurs relaying the UART stream to the TCP socket you call destroy_socket straight away (setting sock = -1 etc) and any further UART data is discarded (and server_keep_alive is no longer zeroed). BUT the keep-alive loop does not know that; it keeps looping for 10s and only eventually breaks out when it tries to send a newline to socket -1.
Do we need that 10s delay in this instance? Could we change the keep-alive loop from while(true) to while(sock != -1) so that it drops out after <1 second if the socket has gone, and enters the reconnection phase (via the retry_delay in the outer while(true) loop, which stops us hammering the caster anyway)?
For that matter, once streaming has begun would we ever want to send newline keep-alive characters in the middle of a binary RTCM stream (and if ever it was triggered, would that confuse the caster)?
Also I have noticed that the uart_handler starts relaying as soon as the socket is connected (ie whenever sock != -1) and does not wait for the NTRIP logon dialog to complete! This is probably contrary to the NTRIP protocol and may confuse the login if things happen in the wrong order.
Log to date attached with added comments status reports etc.
I have been watching this thread ever since the RTK2go.com NTRIP Caster (use-snip.com) was mentioned on the site. A couple of basic NTRIP design issues to be aware of are mentioned below as the code is improved.
> For that matter once streaming has begun would we ever want to send newline keep alive characters in the middle of a binary RTCM stream (and if ever it was triggered would that confuse the caster?).
The presence of a newline or return character is not allowed in an RTCM 3.x message stream and will cause message-loss issues with casters and client devices. Do not do this. [Aside: some older RTCM 2.x message stream sources do contain \r\n, but this is also incorrect; some NTRIP client devices can cope, others cannot.] If you are looking at the raw binary data, the first char of most RTCM3 messages can be detected by 0xD3.
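As a concrete illustration of that last point, here is a minimal sketch (not from the esp32-xbee code; `rtcm3_payload_len` is a hypothetical helper) of recognising the start of an RTCM3 frame: the 0xD3 preamble is followed by 6 reserved bits and a 10-bit payload length, with a 3-byte CRC24Q after the payload.

```c
#include <stddef.h>
#include <stdint.h>

#define RTCM3_PREAMBLE 0xD3

/* Return the payload length of an RTCM3 frame starting at buf,
 * or -1 if buf does not look like a frame start.
 * Frame layout: 0xD3 preamble, 6 reserved bits (normally zero),
 * 10-bit payload length, payload, 3-byte CRC24Q (not checked here). */
static int rtcm3_payload_len(const uint8_t *buf, size_t n)
{
    if (n < 3 || buf[0] != RTCM3_PREAMBLE)
        return -1;
    if (buf[1] & 0xFC)  /* reserved bits set: probably not a frame start */
        return -1;
    return ((buf[1] & 0x03) << 8) | buf[2];
}
```

A stray `\n` (0x0A) fails the preamble test immediately, which is why injected keep-alive newlines desynchronise clients that frame on 0xD3.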
> and does not wait for the NTRIP logon dialog to complete! Probably it is contrary to NTRIP protocol and may confuse the login if things happen in the wrong order?
Ideally the 200 OK return is sent back by the caster to the connecting socket (the NTRIP client or server), and then binary data transmission begins. It may be of value to know that the original NTRIP protocol was derived in part from the simple 'shoutcast' protocol. More than a few NTRIP clients do not strictly follow this, so the SNIP caster (like many other caster designs) can cope either way. The 'error reply' for NTRIP (Rev1) is extremely simple: if anything goes wrong with the initial request, the caster sends the 'caster table' back and disconnects. In the above case, were \r\n to be found in the data stream, the NTRIP clients using that data would not be able to reliably decode the messages.
The only other general remark I think needs to be made is that it is best that any NTRIP Server does not connect to the remote NTRIP Caster unless it in fact has some message content to be sent. Otherwise, (using SNIP as a typical example), the connection is made, no further data is ever sent, the Caster notices this and after some delay (10~15 seconds) disconnects the connection. Then the NTRIP Server connects again and this cycle continues forever (in the case of SNIP certain abuse logic will in time ban the IP from connecting for a variable period of time). The takeaway is that this code should confirm there is data to be sent before establishing the connection in the first place. [Aside, RTKLIB has this issue as well, although the RTK Explorer version has added a test for the presence of data.]
That is very illuminating thank you.
In the context of SimpleRTK2b with ZED-F9P I think in practice ...
-
Once streaming has started, the newline in the 10 sec timeout loop will never be triggered if the RTCM stream is coming in every second. In this scenario perhaps it is sufficient to keep the loop but have it exit when it times out (or earlier, if sock becomes -1), because the caster is going to disconnect "after some delay (10~15 seconds)" anyway.
-
I don't use it, but the ZED-F9P has a survey-in mode for bases which can be relocated, and I guess the UART may be silent for a few minutes while that happens(?). Perhaps this is what the newline keep-alive was intended for (ie between the text-mode connection and the binary starting). If so I can see the logic, but as @DavidKelleySCSC points out it is probably not best practice; also I fear a few minutes of newlines may confuse SNIP's clever auto-set caster table feature if there is no real data to parse in the first couple of minutes of the connection.
So perhaps a way forward is:
(a) for ntrip_server_task to pause before opening the socket and wait until there is some UART data;
(b) for ntrip_server_uart_handler to discard data unless ntrip_server_task has set a flag - which it would set just after sending the "successfully connected" log and reset just prior to the "disconnected" log. If necessary, also have the handler set another flag if data is received while sock == -1, so as to trigger opening the socket as in (a);
(c) trim the keep-alive loop thus:

```c
while (sock != -1) {
    if (server_keep_alive >= NTRIP_KEEP_ALIVE_THRESHOLD) break;
    server_keep_alive += NTRIP_KEEP_ALIVE_THRESHOLD / 10;
    vTaskDelay(pdMS_TO_TICKS(NTRIP_KEEP_ALIVE_THRESHOLD / 10));
}
```
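Points (a) and (b) could be modelled as a tiny state machine. A hedged sketch in plain C follows - the struct and function names here are hypothetical, and the real code would use the existing sock variable plus FreeRTOS primitives rather than plain booleans:

```c
#include <stdbool.h>

/* Hypothetical connection state mirroring points (a) and (b):
 * data is only forwarded after the NTRIP logon completes, and UART
 * data arriving while disconnected merely requests a (re)connect. */
struct ntrip_state {
    int  sock;          /* -1 while disconnected */
    bool logon_done;    /* set after the "successfully connected" log */
    bool want_connect;  /* set when data arrives while disconnected */
};

/* Called by the UART handler for each chunk of data.
 * Returns true if the chunk should be forwarded to the caster. */
static bool on_uart_data(struct ntrip_state *s)
{
    if (s->sock == -1) {
        s->want_connect = true;  /* (a): only connect once data exists */
        return false;            /* (b): discard until logon completes */
    }
    return s->logon_done;
}
```

This keeps the two protocol rules (no data before logon, no connection without data) in one place instead of being implicit in the relay path.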
NB brief disconnections and reconnections, although preferably avoided, don't seem to upset the rover's lock (but I haven't confirmed that by having it drive a tractor yet) - only longer interruptions like the panic resets or wifi issues do.
Today's panic reset. Is it possible to determine the cause of the panic and/or which task was active when it happened to see if there is a pattern?
esp32_xbee_v0.4.4_core_dump_2020-01-19_10_02_49_79465b9f.zip
Had achieved over 19 hours uninterrupted connection to rtk2go, then another panic reset ... esp32_xbee_v0.4.4_core_dump_2020-01-23_19_50_53_79465b9f.zip
@DavidKelleySCSC Thank you for this information. I have implemented a system now similar to @AndrewR-L's suggestion to deal with this (bb131ab).
Before the initial connection it will wait for some data on UART; then, once connected, there will be a task similar to the previous keep-alive that waits for 10 seconds of no data to (un)set the data-ready flag. It will never disconnect from the caster itself, but if the caster chooses to kick the server it will not attempt to reconnect without new data being sent.
@AndrewR-L Unfortunately all of these core dumps continue to give me the same error, and I still don't get the same issue with my own core dumps. I'm going to increase the core dump partition size just in case, but from what I've read, if the partition size were the issue the core dump simply wouldn't be saved.
> BUT I have noticed that when ANY error occurs relaying a UART stream to the TCP socket you call destroy_socket straight away (setting sock=-1 etc) and any further UART stream is dumped (and server_keep_alive is no longer zeroed). BUT the keep alive loop does not know that and keeps looping for 10s and only eventually breaks out when it tries to send newline to sock no -1.
> Also I have noticed that the uart_handler starts relaying as soon as the socket is connected (ie whenever sock!=-1) and does not wait for the NTRIP logon dialog to complete! Probably it is contrary to NTRIP protocol and may confuse the login if things happen in the wrong order?
Implemented in server/client (c3dc326). Thanks again for all your investigation! Hopefully this will solve some of the panics.
Still haven't come across anything obvious that would have caused all the panics but there were a lot of small fixes in the last release so it would be great if you could let me know how that performs for you.
Great, thanks, I'll try it out and let you know. The last time v0.4.4 reset itself out of the blue it said "reset reason: INTERRUPT_WATCHDOG" which was a new one to me.
It is a shame you couldn't read the core_dumps - I can't think what I am doing differently. Hopefully there will be no more "panic resets" with v0.5. What does an ESP "panic" indicate generally - is it a hardware thing (eg overheating) or a software thing (eg memory issues)?
PS thanks for picking up the wifi suggested enhancements.
https://docs.espressif.com/projects/esp-idf/en/latest/api-guides/fatal-errors.html
This has the full list; it's almost certainly a software issue. The ESP32 has the added complexity of having two cores, so usually problems are caused by memory issues.
Gentlemen, sounds like good progress, congratulations. If AndrewR-L can tell me the mountPt name that is being used, I will set up some extra monitoring on the RTK2go.com SNIP NTRIP Caster to record the connection times and help you validate things.
@DavidKelleySCSC that is very kind. It is on NR152QB, still using the old firmware for now, but I'll swap it to the new issue at the weekend.
On that topic, was rtk2go.com supposed to be uncontactable for a few hours leading up to around 15:00 (UTC) last Saturday (25th), or was that an issue at this end? Either way it demonstrated that the server did gracefully escalate its retry interval when it couldn't contact the mountpoint: 1 sec - 2 sec - 5 sec - 10 sec - 15 sec - 30 sec and so on, eventually ending up with hourly retries.
Got it recording now in case detailed records are needed. You can always get a real time status report (when it is connected) with: http://rtk2go.com:2101/SNIP::MOUNTPT?NAME=NR152QB
Here is a quick plot from SNIP showing the past 4 weeks; you can see a fair number of up/down events as well as the two recent periods when RTK2go was offline. [One came from a Microsoft bug-check reboot, not sure why. The other was quite odd: all IPs on the machine appeared open from the console yet no traffic would flow. We have seen this twice on a Windows 7 machine in the past.] (Click to see larger)

What you really cannot see from the above is that there are several reconnect events that seemed to me to take way too long (minutes rather than seconds); I will keep an eye on these and report what is seen.
Now, ~24 hours later, there is no sign of odd drop-outs and restarts. While it is too early to say for sure, it looks much improved.