mongoose-os icon indicating copy to clipboard operation
mongoose-os copied to clipboard

Update to be thread safe and add multi-core support (ie ESP32)

Open tripflex opened this issue 6 years ago • 46 comments

Mongoose OS only currently supports running ESP32 on single core, basically leaving the second core completely unused.

Yes i've seen the thread at: https://forum.mongoose-os.com/index.php?p=/discussion/comment/2356/#Comment_2356

And ticket #253

I opened this issue as technically this is an issue that only single core devices are supported, and multi-core devices that are supported ... only a single core is supported on them.

By closing ticket #253 does that mean that Cesanta has no plans for ever supporting multi-core devices?

While I understand that currently MOS is not thread safe, eventually over time one would assume that new devices will become available, many of them will have multi-core ... and even if it's not an ESP32, i'm sure you guys will want to add new devices ... but does that mean that even though support may be added, that they will be crippled by only supporting a single core?

tripflex avatar Feb 15 '19 22:02 tripflex

I haven't read this fully but it seems like the esp-idf supports the two threads... they could at least get two cores running and pin all the mongoose functions to one core. This could possible let a user add additional independent task to the 2nd core to start with?

https://docs.espressif.com/projects/esp-idf/en/latest/api-guides/freertos-smp.html

rtrue983 avatar Feb 25 '19 21:02 rtrue983

Then again, this may go back to making mongoose thread safe as the first step and let users create their own task with less worry about threading and interacting with mongoose in those threads

rtrue983 avatar Feb 25 '19 21:02 rtrue983

I'm also very interested in this.

markterrill avatar Mar 27 '19 23:03 markterrill

At very least it would be very helpfull to have thread-safe LOG() macro

Harvie avatar Oct 18 '20 11:10 Harvie

it is, actually. logging is thread-safe. the only problem is sometimes the task it's logging from doesn't have enough stack... *printf calls are quite expensive in terms of stack.

rojer avatar Oct 18 '20 11:10 rojer

@rojer Do you think the stack running out is the case in the following backtrace? I can't really tell.

Reading symbols from /fw.elf...done.
Remote debugging using 127.0.0.1:1234
0x400880df in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
156	        *((int *) 0) = 0;
#0  0x400880df in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
#1  0x40088264 in abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:171
#2  0x400828ee in lock_acquire_generic (lock=0x3ffae9d8, delay=4294967295, 
    mutex_type=4 '\004')
    at /opt/Espressif/esp-idf/components/newlib/locks.c:143
#3  0x40082a0c in _lock_acquire_recursive (lock=0x3ffae9d8)
    at /opt/Espressif/esp-idf/components/newlib/locks.c:171
#4  0x4014e389 in _fwrite_r (ptr=0x3ffc8d58, buf=<optimized out>, size=1, 
    count=24, fp=0x3ffae980) at ../../../.././newlib/libc/stdio/fwrite.c:168
#5  0x4014e404 in fwrite (buf=0x3ffc8b40, size=1, count=24, fp=0x3ffae980)
    at ../../../.././newlib/libc/stdio/fwrite.c:211
#6  0x4017b2a0 in cs_log_print_prefix (level=LL_WARN, file=<optimized out>, 
    ln=<optimized out>) at src/common/cs_dbg.c:86
#7  0x400850b0 in siterRF_RX_cb (pqh=<optimized out>, cb_arg=<optimized out>)
    at /home/harvie/Temp/firmware-github/display-lvgl/deps/siterRF/src/siterRF.cpp:210
#8  0x40084cec in pq_task (arg=0x3ffb634c <siter_pqh>)
    at /home/harvie/Temp/firmware-github/display-lvgl/deps/pq/src/pq.c:33
Detaching from program: /fw.elf, Remote target

i am using stack size of (MGOS_TASK_STACK_SIZE_BYTES / STACK_SIZE_UNIT) i've even tried to increase it 10 times and it still crashes the same way

Harvie avatar Oct 18 '20 11:10 Harvie

Ok, sorry. The problem was in directly calling printf() rather than LOG(). Unfortunately i need printf() for debugging, since LOG is too slow for my purposes :-/ But i guess i will have to figure this out...

Harvie avatar Oct 18 '20 11:10 Harvie

No. I take that back. It actualy happens in " cs_log_print_prefix() " call...

Harvie avatar Oct 18 '20 11:10 Harvie

    if (!xPortCanYield()) {
        /* In ISR Context */
        if (mutex_type == queueQUEUE_TYPE_RECURSIVE_MUTEX) {
            abort(); /* recursive mutexes make no sense in ISR context */
        }

code around line 143. looks like you're trying to log from an ISR handler? this is not going to work.

rojer avatar Oct 18 '20 12:10 rojer

from ISR you can use mgos_cd_printf that will print to UART

rojer avatar Oct 18 '20 12:10 rojer

Just checked. I am not loging from ISR handler...

Harvie avatar Oct 18 '20 13:10 Harvie

But FreeRTOS scheduler does switch between tasks in ISR, so one task preempting others tasks is basicaly ISR for the other tasks.

Harvie avatar Oct 18 '20 13:10 Harvie

Now after some testing i am pretty sure that this problem happens when i call LOG(...) from freertos task other than main mongoose task. probability is increased when running on the same freertos priority as main mongoose task and 1000Hz tickrate, since context switches are more frequent that way due to time slicing. Perhaps then there is more chance that two tasks call LOG() at once.

Harvie avatar Oct 18 '20 18:10 Harvie

yet, somehow xPortCanYield() ends up being false. is the trace always the same as above?

rojer avatar Oct 18 '20 20:10 rojer

Yes

Harvie avatar Oct 19 '20 07:10 Harvie

I have GPIO ISR handler, which sends job to queue which is later handled by separate task. So it might be possible that after the GPIO ISR freertos scheduler immediately jumps to different task which is receiving the queue, because it gets unblocked by that ISR sending to queue.

Harvie avatar Oct 19 '20 07:10 Harvie

To be clear, in the ISR i am using this macro after sending job to the queue if queue gives me *pxHigherPriorityTaskWoken == true.

#define YIELD_FROM_ISR(should_yield)        \
  {                                         \
    if (should_yield) portYIELD_FROM_ISR(); \
  }

Harvie avatar Oct 19 '20 19:10 Harvie

I think it would make sense for mongoose loging to use FreeRTOS stream buffers https://www.freertos.org/RTOS-stream-buffer-example.html

That way it would be possible to log to such buffer from anywhere in code. Multiple tasks and ISRs. Stream buffer would then synchronize everything to make it thread safe.

Harvie avatar Oct 19 '20 19:10 Harvie

BTW what about mgos_azure_send_d2c_msg() (and respective mqtt api)? Is that thread safe? I am afraid that it will have same issues as LOG()...

Harvie avatar Oct 26 '20 00:10 Harvie

Until it gets fixed, i came up with temporary workaround:

#include "FreeRTOS.h"
#include "task.h"
#include "mgos_core_dump.h" //mgos_cd_printf() prototype
#include "common/cs_dbg.h"

#ifndef __FILENAME__
/// We don't want to log full paths, so we only use the filename
#define __FILENAME__ (strrchr("/" __FILE__, '/') + 1)
#endif //__FILENAME__

/**
 * mgos macro `LOG()` is not thread safe, so we have this instead.
 * unfortunately it can currently log only to UART...
 */
#define PQ_LOG(l, x)                                  \
  do {                                                \
    if(cs_log_level >= l)  {                          \
      mgos_cd_printf("%.24s:%d:%.16s:\t\t", __FILENAME__, __LINE__, pcTaskGetTaskName(NULL)); \
      mgos_cd_printf x;                               \
      mgos_cd_printf("\n");                           \
    }                                                 \
  } while (0)

(unfortunately this does not play very well with power management enabled https://github.com/cesanta/mongoose-os/issues/556 )

Harvie avatar Oct 26 '20 11:10 Harvie

From what i read here https://github.com/espressif/esp-idf/issues/4838 it feels like the 115200 should work with REF_TICK.

So perhaps we should try it and always use REF_TICK unless user specifies higher baudrate?

Harvie avatar Oct 26 '20 22:10 Harvie

BTW what is expected behaviour of LOG() in ISR? It should probably look like this:

LOG mutex is unlocked, ISR can LOG() LOG mutex is locked, ISR cannot LOG() and therefore should ignore the LOG() call

Even better version:

There is dedicated buffer for ISRs to LOG into. It is dumped to LOG() as soon as it gets unlocked.

What currently happens is

No matter if LOG is locked or not, calling LOG() in ISR causes panic

Harvie avatar Oct 27 '20 09:10 Harvie

LOG is not meant to be used in ISR.

On Tue 27 Oct 2020, 09:03 Tomas Mudrunka, [email protected] wrote:

BTW what is expected behaviour of LOG() in ISR? It should probably look like this:

LOG mutex is unlocked, ISR can LOG() LOG mutex is locked, ISR cannot LOG() and therefore should ignore the LOG() call

What currently happens is

No matter if LOG is locked or not, calling LOG() in ISR causes panic

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cesanta/mongoose-os/issues/480#issuecomment-717093511, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEBW2YJSWHBVCMK4KJDIH3SM2EHNANCNFSM4GX2J5MA .

rojer avatar Oct 27 '20 10:10 rojer

LOG is not meant to be used in ISR.

Ok, that is not really important for me. Just saying it can be done if needed :-) I will try to write some minimal code needed to reproduce the bug i experience when calling LOG() from multiple tasks, so you can take a look at it.

Harvie avatar Oct 27 '20 10:10 Harvie

I am still strugling to isolate the bug. Might mgos_ints_disable() / mgos_ints_enable() have effect on triggering of this bug by any chance?

Harvie avatar Oct 27 '20 12:10 Harvie

O yeah! I got it. Happens when calling LOG() with ints disabled. At least in some cases...

Remote debugging using 127.0.0.1:1234
0x4008727b in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
156	        *((int *) 0) = 0;
#0  0x4008727b in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
#1  0x40087400 in abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:171
#2  0x40082802 in lock_acquire_generic (lock=0x3ffae9d8, delay=4294967295, 
    mutex_type=4 '\004')
    at /opt/Espressif/esp-idf/components/newlib/locks.c:143
#3  0x40082920 in _lock_acquire_recursive (lock=0x3ffae9d8)
    at /opt/Espressif/esp-idf/components/newlib/locks.c:171
#4  0x4013f59d in _fwrite_r (ptr=0x3ffc8380, buf=<optimized out>, size=1, 
    count=24, fp=0x3ffae980) at ../../../.././newlib/libc/stdio/fwrite.c:168
#5  0x4013f618 in fwrite (buf=0x3ffc8220, size=1, count=24, fp=0x3ffae980)
    at ../../../.././newlib/libc/stdio/fwrite.c:211
#6  0x40169f00 in cs_log_print_prefix (level=LL_INFO, file=<optimized out>, 
    ln=<optimized out>) at src/common/cs_dbg.c:86
#7  0x400e6c06 in task_cb (pvParameters=<optimized out>)
    at /home/harvie/Temp/firmware-github/log-crash-test/src/main.c:21
Detaching from program: /fw.elf, Remote target

Harvie avatar Oct 27 '20 13:10 Harvie

i bet it can. mgos_ints_disable() translates to portENTER_CRITICAL on freertos, you shouldn't be logging from inside a critical section.

rojer avatar Oct 27 '20 13:10 rojer

Can you please add some asserts checking if we are in critical section, so this becomes immediately obvious when somebody runs to this limitation again?

As well as puting big warning to LOG documentation https://mongoose-os.com/docs/mongoose-os/api/core/cs_dbg.h.md that it should not be called from ISR or with disabled interrupts.

//Isolated case of code causing the crash :-)
#include "mgos.h"

enum mgos_app_init_result mgos_app_init(void)
{
	mgos_ints_disable();
	LOG(LL_INFO,("Log from task %d", 456));
	mgos_ints_enable();

	return MGOS_APP_INIT_SUCCESS;
}

Harvie avatar Oct 27 '20 13:10 Harvie

well, there are a lot of things you shouldn't be doing with ints disabled, logging is just one fo them - in fact, the set of thinjgs you can safely do with ints disabled is very limited. i guess it's worth documenting somewhere, though

rojer avatar Oct 27 '20 13:10 rojer

Anyway. I am satisfied with this conclusion, it is no problem to move LOG() out of the critical section. if i only knew week ago... I might have save myself lots of headache :-)

BTW what about stuff like MQTT? Can that be considered thread safe? What parts of mongoose cause biggest troubles regarding thread safety? So far stuff seems workable.

Hopefully i will soon release my module implementing parallel event loops, works well, just needs bit of polishing... Stay tuned :)

Harvie avatar Oct 27 '20 13:10 Harvie