micropython-async icon indicating copy to clipboard operation
micropython-async copied to clipboard

ESP32 exception BREAK instr when running with async v3

Open bhcuong2008 opened this issue 3 years ago • 51 comments

Hi,

I've faced issue with async code. I've tested 2 cases: loop with synchronous code, and loop with async code V3 as follow on ESP32 without PSRAM. I use the latest MicroPython version 1.15.

1. Synchronous code:

When running this function, it could run well 24/7 without any exceptions. Sleep in 1ms.

def _ctask_fn(arg):
   try:
      while True:
         _ccall(arg['_cfn'], arg)
         time.sleep_ms(arg['sleep'])
   except Exception as e:
      print('[_ctask_fn] %s' % str(e))
   finally:
      return

2. Async code

Also sleep in 1ms. But run in tens of seconds, or some minutes (2, 3), it occurs exceptions. It's rather random. Most of them were Guru Meditation Error: Core 1 panic'ed (Unhandled debug exception) Debug exception reason: BREAK instr

arg['_cfn'] is just a callback function, do nothing for test.

If sleep in 50 or 100ms, it's running rather long without exceptions. I've not tested with several hours/days.

async def _ctask_afn(arg):
   try:
      while True:
         _ccall(arg['_cfn'], arg)

         if 'sleep' in arg:
            await asyncio.sleep_ms(arg['sleep'])
         elif 'event' in arg:
           await arg['event'].wait()
           arg['event'].clear()
         elif 'tsf' in arg:
            await arg['tsf'].wait()

   except Exception as e:
      print('[_ctask_afn] %s' % str(e))
   finally:
      return

I really dont know why. It's just a normal/minimal async code. Pls guide me how to debug it. I think that problem is at async scheduler side.

Thanks.

bhcuong2008 avatar May 18 '21 02:05 bhcuong2008

A 1ms sleep is not a realistic way to use uasyncio. In practice the delay will be much longer than that, because other tasks will be scheduled for execution during the delay. That said, a delay of zero is entirely conventional as a means of temporarily yielding to other tasks, so a 1ms value should definitely not cause an error.

I haven't encountered the error you're seeing despite having done a lot of uasyncio development on ESP32. If you believe there is a bug in uasyncio I suggest you construct a minimal test case and raise an issue against MicroPython.

peterhinch avatar May 18 '21 12:05 peterhinch

I also tried with 0. It causes exception quicker. Other tries were 5ms, 10ms, 20ms. All caused exceptions, but longer to happen.

bhcuong2008 avatar May 18 '21 12:05 bhcuong2008

I can only reiterate: please produce a minimal test case enabling the maintainers (and myself) to reproduce the fault, and raise an issue if you are convinced of a bug in uasyncio.

peterhinch avatar May 18 '21 12:05 peterhinch

@peterhinch , here is the minimal code you could try it. In REPL, pls define functions as below and run async task. On my chip, exceptions occur randomly, maybe after some seconds to some minutes (2-5 minutes). My chip runs with ESP-IDF v3.3.5, MicroPython v1.15, uasync v3.

import uasyncio as asyncio

def _cfn():
   pass

async def _ctask_afn():
   try:
      while True:
         _cfn()        
         await asyncio.sleep_ms(0)        
   except Exception as e:
      print('[_ctask_afn] %s' % str(e))
   finally:
      return

asyncio.run(_ctask_afn())

Update: I add minimal sync code too. Sync code will be running without any issues.

import time

def test():
   while 1:
      _cfn()
      time.sleep_ms(0)

test()

Thank u.

bhcuong2008 avatar May 18 '21 13:05 bhcuong2008

@peterhinch , I see that uasyncio has operations related to push/pop task to/from task queue, implemented in moduasyncio.c

Could there be some corruptions during these operations due to high speed yield?

bhcuong2008 avatar May 19 '21 02:05 bhcuong2008

I've tested on a Raspberry Pi Pico and your code sample runs perfectly. I'll hook out an ESP32 and try that.

peterhinch avatar May 19 '21 09:05 peterhinch

Thank u for ur effort. Hope that it has the same issue as mine. Otherwise I could not debug the internal code of module uasyncio.

I've just begun my learning in MicroPython and uasyncio.

bhcuong2008 avatar May 19 '21 09:05 bhcuong2008

OK, I'm running it on the ESP32 reference board and so far there is no exception. I have run it both with and without instantiating a WiFi connection. I'll leave it running and report back if it fails.

I strongly doubt that the fault is in uasyncio, which has been extensively tested by myself and others. I'm starting to suspect your hardware. The BREAK error does not sound like a MicroPython error: perhaps it is emitted by FreeRTOS?

peterhinch avatar May 19 '21 10:05 peterhinch

uasyncio code almost is in Python language, in folder extmod/uasyncio. If sync code run well, then everything in Python language will be fine. Only Task, TaskQueue code is in C, file moduasyncio.c

This sample code is very very basic. I could not imagine what this code has impact on my hardware, having module ESP32-WROOM-32.

When running with sleep time 100ms, it's ok. But when decreasing it to 50, 20, 10, 5, 2, 1, 0ms, exceptions happening. 50ms still causes exceptions, but longer.

Could u give me some guides or test cases to debug this case, especially Task, TaskQueue in moduasync.c to uncover bugs inside or whatever causing exceptions.

Updated: I tesed some cases with mp_stack_set_limit. Normally it set mp_stack_set_limit(stackLen - 1024). But when increasing to 2048, 3072, it help longer to cause exceptions. But increasing more, no help any more.

bhcuong2008 avatar May 19 '21 10:05 bhcuong2008

There is nothing more I can do, I simply can't replicate this fault.

I suggest you ask in the forum - it's possible someone else may have seen this error. But I would double-check your hardware. Many ESP32 errors are caused by voltage drops on the USB cable. There is also a lot of cheap Chinese kit around that isn't well designed: is your module a real Espressif WROOM or a Chinese clone?

peterhinch avatar May 19 '21 10:05 peterhinch

My module and my hardware run my app well over 1.5 years. Now I integrate MicroPython into it. It could run sync code MicroPython well, even in MP thread. Only when run simple async code to test then these exceptions happening.

This is my sync code test. It run very fast and well.

MicroPython

bhcuong2008 avatar May 19 '21 11:05 bhcuong2008

Your code is still running here on the reference board after over two hours. I don't see how I can help when I can't replicate the fault.

peterhinch avatar May 19 '21 12:05 peterhinch

Upon on your precious result and keyword, I've found interesting results. I'm testing it for over a day to validate my judgement.

Tomorrow (GMT+7), I will share my detail results.

@peterhinch , Thank u very much.

bhcuong2008 avatar May 20 '21 17:05 bhcuong2008

here is the minimal code you could try it. In REPL, pls define functions as below and run async task. On my chip, exceptions occur randomly, maybe after some seconds to some minutes (2-5 minutes). My chip runs with ESP-IDF v3.3.5, MicroPython v1.15, uasync v3.

@bhcuong2008 are you building custom firmware? How are you running IDF v3 with MicroPython v1.15? IDF v3 is no longer supported, only IDFv4.

When you get an exception, it would help if you could post the full output on the terminal (the trace back) and also provide the .elf file that was programmed to the board. Then we can see exactly where it crashed.

dpgeorge avatar May 21 '21 04:05 dpgeorge

Firstly, I share my result during 2 days.

Based on testing of @peterhinch, I think that it's a simple application with main thread of Micropython, with/wo Wifi connection. This is a keyword for me to explore the bug.

My app is full, and enough complex. It is in Arduino environment. I build MicroPython 1.15 as a library to my app. It's built with ESP-IDF v3.3.5.

Basically my app has some heavy FreeRTOS threads as follow:

  1. FreeRTOS task, Arduino main loop, priority 1, IO & CPU bound, especially heavy IO bound
  • This loop processes 1 SSL connection
  • 2 MQTT connections to 2 MQTT brokers
  • Websocket servers
  1. FreeRTOS task, MQTT loop connection, IO bound, heavy & blocking task, priority 1, using Arduino PubSubClient. Run every 100ms.
  • It handles MQTT connect. When brokers are down, this task becoming blocking for many seconds.
  • I run MPY while 1 broker connected, 1 broker is down to make it blocking task.
  1. FreeRTOS task, Math calculation loop, CPU bound, priority 3, run every 10ms.

Initially, before report to the forum, MPY thread runs with priority 1. I let it default as in micropython esp32 port. When run async task with sleep_ms 0, 1, 2, 5, 10, 20, even 50ms, the system easily broken, especially 0,1,2 ms. More stack size, it survives more, a little longer.

Then, based on peter result, I think MPY thread not having enough time to process due to sharing with task 1, 2 with the same priority. So I changed MPY thread priority to 2. Now MPY thread has more time to process. It could survive longer, in hours, from 1 to 8 hours, depending on sleep_ms 1, 2, 5, 10. 10ms helps survive longer, up to ~8h, then broken.

Let MPY thread priority higher caused an issue. That is asyncio scheduler loop taking all the time. It makes task 1, 2 with lower priority, run incorrectly, due to not enough time to process, especially sleep_ms 0. Then I add taskYIELD to 3 functions of task queue, in file extmod/uasyncio.c, peek, pop_head, push_head. It means that every task queue operation must do a taskYIELD to give time to lower tasks.

Then lower tasks could run properly, except SSL connection, due to SSL requiring heavy time to process. So I did not test with sleep_ms 0 any more. This is the point to be consider. sleep_ms 0 is a yield to async task. But it dont give lower FreeRTOS tasks time to process. My fix is just workaround.

Next test was that I let MPY thread priority 3, equal to task 3. It was broken easily, quickly. I think the reason is that it shared time with task 3. So my conclusion is that MPY thread running async task must not be shared with heavy task with the same priority. FreeRTOS switching tasks will cause memory corruption, and break. MPY thread priority must be lower or higher.

My last test was with sleep_ms 10, it could survive longest among 1, 2, 3, 5ms sleep_ms, up to ~ 8 hours, then break.

One big note during my test is that if MPY stack size is low (5KB), it could survive about 1-2hour for 1, 2ms sleep_ms. If more stack_size, 8KB, it could survive longer (2-3h), then break. Lower sleep_ms (1, 2, 3ms) would cause exception quicker. Higher sleep_ms (10, 20, 50) has longer time survival. Among my tests so far, sleep_ms 100 is rather safe.

I think these exceptions could be reproduce easily with these steps:

  1. Create a FreeRTOS thread with CPU-bound or IO-bound, about 5-20ms calculation time with priority 1
  2. Run MPY thread with the same priority 1.
  3. Run my minimal MPY async sample code above

I had some trace back before, but not useful. I will find it and post it in later comment.

After all, I think asyncio (task/taskqueue) has issue with allocating/release heap memory.

bhcuong2008 avatar May 21 '21 17:05 bhcuong2008

@dpgeorge, I have learnt from MPY esp32 port in v1.14, v1.15 together with Lobo, to make my own environment to build mpy as a library to use in my main app Arduino.

ESP-IDF v4+ is not enough stable for my app. My app is with IDF 3.3.5. I want to use the latest MPY 1.15.

bhcuong2008 avatar May 21 '21 17:05 bhcuong2008

Here is coredump from ESP32 of break exception. In this case, MPY thread priority is 1.

coredump_break.txt

And this is espcoredump output from coredump above.

espcoredump_break.txt

Every break exceptions or related exceptions may have different paths. Some has related to pair-heap as in https://github.com/micropython/micropython/issues/7289

But most of them are BREAK instr.

bhcuong2008 avatar May 21 '21 17:05 bhcuong2008

Main MPY thread run with these info, stack_size 8KB, heap 42KB, mp_stack_set_limit(stack_size - 1024);

Async loop and its tasks is run inside another thread of MPY for easy interacting with REPL.

System has enough memory while running.

mpy_sys

bhcuong2008 avatar May 21 '21 17:05 bhcuong2008

Another suggestion while I debug this case is in extmod/utime_mphal.c, time_sleep_ms should not check

STATIC mp_obj_t time_sleep_ms(mp_obj_t arg) {
    mp_int_t ms = mp_obj_get_int(arg);
    if (ms > 0) {
        mp_hal_delay_ms(ms);
    }
    return mp_const_none;
}

It should call directly mp_hal_delay_ms(ms) to allow python code do yield when calling time.sleep_ms(0).

STATIC mp_obj_t time_sleep_ms(mp_obj_t arg) {
    mp_int_t ms = mp_obj_get_int(arg);
    mp_hal_delay_ms(ms);
    return mp_const_none;
}

bhcuong2008 avatar May 21 '21 17:05 bhcuong2008

Thanks @bhcuong2008 for the details. A very brief summary of the above would be: you have a complex system that works without a problem. Then integrating MicroPython as a subsystem leads to crashes. The initial reaction to this would be that MicroPython is not integrated properly. For example, it could be that the GC is not tracing all root pointers and that leads to corruption of the MP heap.

Did you run the MicroPython test suite against your integration of MP? I would strongly recommend to do this because the test suite (especially the stress tests) can help show if there any integration issues.

Here is coredump from ESP32 of break exception.

Sorry but these dumps are not enough. You need the backtrace of the stack of the broken task, the list of nested function calls that lead to the crash.

Async loop and its tasks is run inside another thread of MPY for easy interacting with REPL

So the uasyncio code you posted is actually run in _thread.start_new_thread(asyncio.run, (_ctask_afn,))?

To make further progress on this issue you'll need to provide complete instructions to let other people reproduce the issue, including what hardware to use, what source code, how to build, and how to run (and verify that you can reproduce the issue by following these instructions). Ideally you would find a way to reproduce the issue using an easily available piece of hardware (eg ESP32 module) and code based on MicroPython (not your own system).

dpgeorge avatar May 21 '21 23:05 dpgeorge

@dpgeorge , I have problem with only async run. When run python sync code, there's no problem as my above code,

import time

def test():
   while 1:
      _cfn()
      time.sleep_ms(0)

test()

This code is blocking code because time.sleep_ms(0) return immediately. Or you could see my above Arduino terminal, that run MPY code very quickly, in many days without corruption.

I've not run test suites yet. I will try it. But with sync code above, I think MPY integration is good.

About backtrace, it's not useful, as follow:

Guru Meditation Error: Core  1 panic'ed (Unhandled debug exception)
Debug exception reason: BREAK instr 
Core 1 register dump:
PC      : 0x400803c0  PS      : 0x00060a36  A0      : 0x00000000  A1      : 0x3ffd9ac0  
A2      : 0x3ffda0f0  A3      : 0x00000019  A4      : 0x3ffcf670  A5      : 0x3ffd9cdc  
A6      : 0x00000002  A7      : 0x3ffd9c90  A8      : 0x00060023  A9      : 0x3ffb3364  
A10     : 0x00000000  A11     : 0x00000000  A12     : 0x8008aef0  A13     : 0x3ffce720  
A14     : 0x00000000  A15     : 0x3ffd9220  SAR     : 0x00000014  EXCCAUSE: 0x00000001  
EXCVADDR: 0xffffffd0  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0x00000000  

ELF file SHA256: 0000000000000000

Backtrace: 0x400803c0:0x3ffd9ac0 0x7ffffffd:0x00000000

So the uasyncio code you posted is actually run in _thread.start_new_thread(asyncio.run, (_ctask_afn,))?

I've run MPY thread in 3 ways:

  1. _thread.start_new_thread(_cthread, ())
async def _cthread_main():
   try:
      tasks = []
      for t in _ctasks:
         if 'tsf' in t:
            t['tsf_set'] = _tsf_set

         t['status'] = 1
         tasks.append(asyncio.create_task(_ctask_afn(t)))

      await asyncio.gather(*tasks, return_exceptions=True)

   except Exception as e:
      print('[_cthread_main] %s' % str(e))
   finally:
      return

def _cthread():
   try:
      if not len(_ctasks):
         return

      asyncio.run(_cthread_main())
   except Exception as e:
      print('[_cthread] %s' % str(e))
   finally:
      return
  1. run with MPY C-API by mp_call_function_2 for _thread.start_new_thread
  2. use my custom thread creation MPY C-API based on code of _thread.start_new_thread

Ideally you would find a way to reproduce the issue using an easily available piece of hardware (eg ESP32 module) and code based on MicroPython (not your own system).

I will try to do this. I make my own board. It's simple enough.

Thank @dpgeorge

bhcuong2008 avatar May 22 '21 03:05 bhcuong2008

I've just done 2 important tests to get more detail of this bug.

  1. Ideally you would find a way to reproduce the issue using an easily available piece of hardware (eg ESP32 module) and code based on MicroPython (not your own system).
  • I used ESP32 port of MPY 1.14, 1.15, with all defaults settings. I just add 2 more tasks (CPU bound, using delay ). Task 1 has the same priority 1 as MPY thread. Task 2 with priority 3. MPY has a thread to run my async task as above.
void vTask1(void *pvParameters) {
    while(1) {
        ets_delay_us(5000000);
    }
    vTaskDelete(NULL);
}

void vTask2(void *pvParameters) {
    while(1) {
        vTaskDelay(10 / portTICK_PERIOD_MS);
        ets_delay_us(5000);
    }
    vTaskDelete(NULL);
}
  • My system could run well over 2hours. So my assumption of CPU-bound not correct.
  1. The 2nd test is that I'm back to my original Arduino app.
  • I disable all TCP connections, including 1 Websocket client SSL, Websocket server, 2 MQTT.
  • Async task with asyncio.sleep_ms(0)
  • MPY thread with priority 1, same as main loop arduino.
  • Then my app could run well over tens of minutes without quick corruption as before.

So I see that there's relationship between MPY asyncio and TCP connections.

@dpgeorge , @peterhinch, or others, Could u help me to write a simple code based on MPY ESP32 port. This simple code is websocket client SSL connection to websocket server. And doing some traffics every 5 seconds between device and server. This test used to validate exceptions due to interaction between asyncio engine and TCP connections. I often write code on Arduino. It has some difficulties for me to write more code on pure ESP-IDF. So please help me do it.

Update: after some tests and waiting, it's also failed although having no connections :( I will try to test more.

Thank you so much.

bhcuong2008 avatar May 22 '21 11:05 bhcuong2008

Here is backtrace from another exceptions, related to async tasks.

0x401aa8ca: mp_execute_bytecode at /home/cuongbh/esp32/arduino_build/lib_335/components/micropython/py/vm.c line 1436
0x401079db: WebSocketsServerCore::handleClientData() at D:\CloudFERMI\software\arduino-1.8.9_esp_idf3.3.5_cdev\myLib\libraries\arduinoWebSockets-2.3.5\src\WebSocketsServer.cpp line 705
0x40107a7e: WebSocketsServerCore::loop() at D:\CloudFERMI\software\arduino-1.8.9_esp_idf3.3.5_cdev\myLib\libraries\arduinoWebSockets-2.3.5\src\WebSocketsServer.cpp line 958
0x40107a9a: WebSocketsServer::loop() at D:\CloudFERMI\software\arduino-1.8.9_esp_idf3.3.5_cdev\myLib\libraries\arduinoWebSockets-2.3.5\src\WebSocketsServer.cpp line 969
0x400e6433: loop() at D:\CloudFERMI\dev\ESP32\Arduino\cDev\cDev_Gen/cDev_Gen.ino line 400
0x4010fac4: loopTask(void*) at D:\CloudFERMI\software\arduino-1.8.9_esp_idf3.3.5_cdev\hardware\espressif\esp32\cores\esp32\main.cpp line 28
0x4008acde: vPortTaskWrapper at /home/cuongbh/esp32/arduino_build/lib_335/esp-idf/components/freertos/port.c line 143

bhcuong2008 avatar May 22 '21 13:05 bhcuong2008

I have problem with only async run. When run python sync code, there's no problem as my above code,

import time

def test():
   while 1:
      _cfn()
      time.sleep_ms(0)

test()

This code is blocking code because time.sleep_ms(0) return immediately. Or you could see my above Arduino terminal, that run MPY code very quickly, in many days without corruption.

I've not run test suites yet. I will try it. But with sync code above, I think MPY integration is good.

That synchronous test above is very simple and does not stress much of MicroPython (eg it doesn't allocate heap memory). The async test will stress more of MicroPython, and when it crashes that does not conclude that uasyncio is the place with the bug. You need to test other things, and running the test suite will help with that.

Could u help me to write a simple code based on MPY ESP32 port.

No, sorry, I do not have time to help with doing that.

Here is backtrace from another exceptions, related to async tasks.

What is line 1436 of py/vm.c in the version of MP that you use? What was the fault (including register dump) associated with this backtrace? To help diagnose the issue you need to provide all of this information.

dpgeorge avatar May 24 '21 00:05 dpgeorge

@dpgeorge , it's Load Prohibited. I forgot capture register dump. I only have this. When it occurs again, I will capture it.

mpy_exceptions

Line 1436 in original file here, https://github.com/micropython/micropython/blob/master/py/vm.c if (mp_obj_is_subclass_fast(MP_OBJ_FROM_PTR(((mp_obj_base_t*)nlr.ret_val)->type), MP_OBJ_FROM_PTR(&mp_type_StopIteration)))

I dont modify any in source c of MPY in /py

Great news is that, after many failed tests, I create a brand new Arduino project, with very simple code to run MPY. And it's running for nearly 10 hours. I'm waiting for it to run over 24 hours to make sure that the integration is good or not.

Over many tests, I think it may be related to Wifi stack. I will update more in next days.

bhcuong2008 avatar May 24 '21 03:05 bhcuong2008

@dpgeorge , by the way could u suggest me what tests to be run

https://github.com/micropython/micropython/tree/master/tests

In my custom build, there is only MPY core, to be used for data processing engine. No wifi, bluetooth, peripherals, networking, file system, IO,...Here are my total modules:

mpy_sys

Thank you.

bhcuong2008 avatar May 24 '21 06:05 bhcuong2008

by the way could u suggest me what tests to be run

Run all the tests/basics/*.py and tests/stress/*.py tests (ones that aren't supported by your implementation will be skipped automatically). If you have a REPL try this:

$ ./run-tests.py --device /dev/<usb-serial-device> -d basics stress

dpgeorge avatar May 24 '21 06:05 dpgeorge

Here is my new results in this morning.

  1. MPY async run in brand new Arduino project with minimal code to run MPY. No exceptions. It could run over 31 hours, from 01:27:45 24/05/2021 to 09:09:22 25/05/2021 (I must stop it to make another exception test with WifiClient).

  2. I add a small Wifi and WifiClient code, then exception occurs. There may be some conflict between lwip select and MPY async. I dont know whether there are any relationships between "lwip select" and "MPY module uselect" used in asyncio engine.

  • This is wificlient connect test code. mpy_arduino_wificlient

  • This is the exception mpy_arduino_wificlient_exception

  • This is WifiClient.cpp, connect function source code mpy_arduino_wificlient_cpp

  1. In pure ESP-IDF environment, I still dont know how to write code as above Arduino code to show exception due to interaction between lwip select and MPY asyncio. So I can't show here yet.

  2. For MPY test suites, I try to frozen it to run the test, but build failed. I included 1 folder stress, and file run-tests.py to my frozen dir. So I couldn't try the test suites.

bhcuong2008 avatar May 25 '21 04:05 bhcuong2008

* This is the exception

Can you show the disassembly of the .ELF file at the location of the exception 0x400803c0?

dpgeorge avatar May 25 '21 05:05 dpgeorge

@dpgeorge , it's like this, _DoubleExceptionVector

40080340 <_UserExceptionVector>:
40080340:	13d100        	wsr.excsave1	a0
40080343:	068d45        	call0	40086c18 <_xt_user_exc>
	...

400803c0 <_DoubleExceptionVector>:
400803c0:	004140        	break	1, 4
400803c3:	02a002        	movi	a0, 2
400803c6:	13e800        	wsr.exccause	a0
400803c9:	067c45        	call0	40086b90 <_xt_panic>
400803cc:	003200        	rfde
	...

Disassembly of section .iram0.text:

Full file dump is large, 13.3MB.

bhcuong2008 avatar May 25 '21 06:05 bhcuong2008

Finally I could reproduce exceptions on original MPY ESP32 port. But it's not BREAK instr exception. I use MPY 1.15 with ESP32 idf 3.3 by using ESP32 port folder in MPY 1.14, then used in MPY 1.15. There are no modifications. Just copy the whole folder esp32 port in MPY 1.14, to MPY 1.15 port folder. Delete existing esp32 port in MPY 1.15 before copying. Follow the same instructions in README.md to run make with 2 basic steps:

(1) Make python virtual-env, then active it (2) Define env variable ESPIDF, include xtensa-esp32-elf/bin to PATH:

mpy_esp32_port_exception

Exception will occur in from 1 to 2 hours. For me, it was about 1h. I tested 3 times. All had the same exception.

Here are the steps:

  1. Add 2 sdk config in boards/sdkconfig.base, to have the same settings as in my arduino project. CONFIG_FREERTOS_HZ=1000 CONFIG_INT_WDT_TIMEOUT_MS=800

  2. add file _cinit.py to folder modules _cinit.py.txt

  3. Use my main.c main.c.txt

  • Also need to modify these wifi settings in main.c #define EXAMPLE_WIFI_SSID "yourSSID" #define EXAMPLE_WIFI_PASS "yourWifiPass"

Update: I add 2 client connect tasks. It help exception occur quicker. For me, it's about 40 minutes. main.c.txt

bhcuong2008 avatar May 25 '21 15:05 bhcuong2008

I got again exception at file py/vm.c line 1436 as before. This time I capture its detail info.

arduino_mpy_exception_1436

I also often see exception at file py/vm.c line 1471 like this one. arduino_mpy_exception_1471

bhcuong2008 avatar May 25 '21 16:05 bhcuong2008

If you can't freeze the test suite, have you tried running it with mpremote?

peterhinch avatar May 26 '21 04:05 peterhinch

I dont know it before. I just try it. It could go to REPL. But failed to do other things, such cat, mount, run,.. like below: This tool is really useful, especially mount.

(build-venv) cuongbh@ubuntu-srv02:~/micropython/micropython-1.15-esp32-1.14/ports/esp32$ mpremote cat _cinit.py
b'OK\r\nMPY: soft reboot\nraw REPL; CTRL-B to exit\r\n>'
Traceback (most recent call last):
  File "/home/cuongbh/micropython/micropython-1.14/ports/esp32/build-venv/bin/mpremote", line 8, in <module>
    sys.exit(main())
  File "/home/cuongbh/micropython/micropython-1.14/ports/esp32/build-venv/lib/python3.8/site-packages/mpremote/main.py", line 990, in main
    pyb.enter_raw_repl()
  File "/home/cuongbh/micropython/micropython-1.14/ports/esp32/build-venv/lib/python3.8/site-packages/mpremote/pyboard.py", line 343, in enter_raw_repl
    raise PyboardError("could not enter raw repl")
mpremote.pyboard.PyboardError: could not enter raw repl

My device could go to raw/normal mode as below:

(build-venv) cuongbh@ubuntu-srv02:~/micropython/micropython-1.15-esp32-1.14/ports/esp32$ mpremote
Connected to MicroPython at /dev/ttyUSB0
Use Ctrl-] to exit this shell

>>> 
>>> help()
Welcome to MicroPython!

For online docs please visit http://docs.micropython.org/

Control commands:
  CTRL-A        -- on a blank line, enter raw REPL mode
  CTRL-B        -- on a blank line, enter normal REPL mode
  CTRL-C        -- interrupt a running program
  CTRL-D        -- on a blank line, exit or do a soft reset
  CTRL-E        -- on a blank line, enter paste mode

For further help on a specific object, type help(obj)
>>> 
MicroPython v1.15 on 2021-04-27; cThings IoT with ESP32
Type "help()" for more information.
>>> 
raw REPL; CTRL-B to exit
>
MicroPython v1.15 on 2021-04-27; cThings IoT with ESP32
Type "help()" for more information.
>>> 

bhcuong2008 avatar May 26 '21 05:05 bhcuong2008

If you navigate to the tests directory and issue mpremote mount . you should be able to run the test suite as @dpgeorge suggested.

peterhinch avatar May 27 '21 03:05 peterhinch

@peterhinch , I could mount it into device. After that, I changed to dir /remote, and run command as ur guide, but failed

>>> os.chdir('/remote')
>>> os.listdir('.')
['run-internalbench.py', 'io', 'cmdline', 'import', 'internal_bench', 'run-multitests.py', 'pyb', 'micropython', 'run-tests-exp.py', 'net_inet', 'run-perfbench.py', 'inlineasm', 'cpydiff', 'extmod', 'run-natmodtests.py', 'qemu-arm', 'perf_bench', 'run-tests-exp.sh', 'multi_net', 'pybnative', 'net_hosted', 'stress', 'thread', 'unicode', 'unix', 'esp32', 'basics', 'misc', 'run-tests.py', 'feature_check', 'jni', 'float', 'wipy', 'multi_bluetooth', 'README']
>>> run-tests.py --device /dev/ttyUSB0 -d stress
Traceback (most recent call last):
  File "<stdin>", line 1
SyntaxError: invalid syntax

Could u show me how to run it. Thank u.

bhcuong2008 avatar May 27 '21 12:05 bhcuong2008

You are right, it won't run under mpremote, firstly because the filename run-tests.py isn't a valid Python name. But even after correcting this, it failed with an allocation error. My bad.

Running the tests as Damien suggests did work. It is designed to run the tests remotely so there is no need for frozen bytecode. You simply navigate to the tests directory and issue the command. In my case I did a basic check on a Pyboard 1.1 without making any changes to the Pyboard:

[adminpete@capybara]: /mnt/qnap2/data/Projects/MicroPython/micropython/tests
$ ./run-tests.py --device /dev/ttyUSB0
pass  basics/0prelim.py
pass  basics/andor.py
pass  basics/annotate_var.py
pass  basics/array1.py

[REAMS OF OUTPUT OMITTED]

pass  unix/ffi_float2.py
pass  unix/time.py
795 tests performed (20851 individual testcases)
795 tests passed
57 tests skipped: builtin_help builtin_next_arg2 builtin_range_binop class_inplace_op2 io_buffered_writer memoryview_itemsize namedtuple_asdict sys_getsizeof cmd_parsetree repl_words_move framebuf1 framebuf16 framebuf2 framebuf4 framebuf8 framebuf_subclass machine_timer uasyncio_threadsafeflag ucryptolib_aes128_ctr urandom_extra urandom_extra_float urandom_seed_default ure_debug ure_groups ure_span ure_sub ure_sub_unmatched vfs_basic vfs_blockdev vfs_fat_fileio1 vfs_fat_fileio2 vfs_fat_finaliser vfs_fat_more vfs_fat_mtime vfs_fat_oldproto vfs_fat_ramdisk vfs_fat_ramdisklarge vfs_lfs vfs_lfs_corrupt vfs_lfs_error vfs_lfs_file vfs_lfs_mount vfs_lfs_mtime vfs_lfs_superblock vfs_posix vfs_userfs math_factorial_intbig resource_stream heap_locked import_mpy_invalid import_mpy_native_gc import_mpy_native_x64 schedule sys_settrace_features sys_settrace_generator sys_settrace_loop extra_coverage
[adminpete@capybara]: /mnt/qnap2/data/Projects/MicroPython/micropython/tests

In your case, assuming this passes, I'd repeat with the extra args Damien recommended.

I think running the test suite is a vital step in checking your integration.

peterhinch avatar May 28 '21 03:05 peterhinch

@peterhinch , thank u so much for your guide. I've just run it successful on my custom build. All passed.

mpy_custom_test_suites_result.txt

Over many tests I think that my asyncio issue may be related to lwip stack. It may cause heap corruption, affect to MPY. This happens to original MPY build, not only to my custom build, when integrating lwip to main program.

bhcuong2008 avatar May 28 '21 04:05 bhcuong2008

@bhcuong2008 I'm trying to reproduce the issue based on your instructions above in https://github.com/peterhinch/micropython-async/issues/60#issuecomment-847952592 . I have some questions:

  1. what exact version of the ESP IDF do you use?
  2. what exact version of the toolchain do you use (xtensa-esp32-elf-gcc --version)?
  3. do you have any network clients running which connect to the esp32 (does HOST_IP_ADDR need to point to a valid host)?

dpgeorge avatar May 30 '21 12:05 dpgeorge

@dpgeorge , Here are my info of my test on original MPY project.

  1. ESP IDF version:
  • I got it from READ.md instruction of MPY 1.14 repo: Supported git hash (v3.3): 9e70825d1e1cbf7988cf36981774300066580ea7
  1. xtensa-esp32-elf/bin/xtensa-esp32-elf-gcc --version xtensa-esp32-elf-gcc (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a) 5.2.0

  2. No network clients connect to ESP32.

  • HOST_IP_ADDR is not valid host in the test to make lwip stack retrying to connect. It's just random IP address in your network. Maybe "192.168.1.10"

bhcuong2008 avatar May 30 '21 14:05 bhcuong2008

Thanks @bhcuong2008 , I can now build it and it's running. But no failures yet.

dpgeorge avatar May 30 '21 14:05 dpgeorge

I think that exception may have already occurred. If not, pls check async task is running or not. In REPL, if _ctasks['status'] = 2, then the async task is running.

bhcuong2008 avatar May 30 '21 17:05 bhcuong2008

It had a watchdog timeout after about 40 minutes, but that doesn't give any useful information about why. Is this expected?

dpgeorge avatar May 30 '21 23:05 dpgeorge

I ran it some more and had 2 non-wdt exceptions. The first was:

>> [1] Connecting to 192.168.20.100:1337 with timeout: 5000ms
>> [2] Connecting to 192.168.20.100:1337 with timeout: 5000ms
E (970417) mpy_test: select returned due to timeout 5000 ms for fd 54
Connection failed! Waiting 2 seconds before retrying...
E (970424) mpy_test: select returned due to timeout 5000 ms for fd 55
Connection failed! Waiting 2 seconds before retrying...
>> [1] Connecting to 192.168.20.100:1337 with timeout: 5000ms
>> [2] Connecting to 192.168.20.100:1337 with timeout: 5000ms
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC      : 0x400ed5b6  PS      : 0x00060530  A0      : 0x8011358b  A1      : 0x3ffb7630
A2      : 0x3ffaff64  A3      : 0x00000000  A4      : 0x3ffaffb0  A5      : 0x00000000
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x800ec512  A9      : 0x3ffdf590
A10     : 0x000ee048  A11     : 0x3f407fa4  A12     : 0x3ffce350  A13     : 0x0000ffff
A14     : 0x3ffdf740  A15     : 0x3ffdf6a0  SAR     : 0x00000019  EXCCAUSE: 0x0000001c
EXCVADDR: 0x000ee048  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0x00000000

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x400ed5b6:0x3ffb7630 0x40113588:0x3ffb7670 0x4009a7ae:0x3ffb76a0

Rebooting...

The address 0x400ed5b6 points to an invalid location within mp_execute_bytecode().

The second exception was right at start up:

ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:5016
load:0x40078000,len:10600
load:0x40080400,len:5684
entry 0x400806bc
***ERROR*** A stack overflow in task mp_task has been detected.
abort() was called at PC 0x40091b70 on core 0

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x4009183b:0x3ffca8e0 0x40091b59:0x3ffca900 0x40091b70:0x3ffca920 0x4009b572:0x3ffca
940 0x4009d028:0x3ffca960 0x4009cfde:0x00000000

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:5016
load:0x40078000,len:10600
load:0x40080400,len:5684
entry 0x400806bc
***ERROR*** A stack overflow in task mp_task has been detected.
abort() was called at PC 0x40091b70 on core 0

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x4009183b:0x3ffca8e0 0x40091b59:0x3ffca900 0x40091b70:0x3ffca920 0x4009b572:0x3ffca940 0x4009d028:0x3ffca960 0x4009cfde:0x00000000

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:5016
load:0x40078000,len:10600
load:0x40080400,len:5684
entry 0x400806bc
MicroPython v1.15-dirty on 2021-05-31; ESP32 module with ESP32
Type "help()" for more information.
>>> Setting WiFi configuration SSID <SSID>...Waiting for AP connection...
SYSTEM_EVENT_STA_START
SYSTEM_EVENT_STA_GOT_IP
SYSTEM_EVENT_STA_GOT_IP6
IPv6: FE80::DAA0:1DFF:FE46:E2C0
Connected to AP

So on power up it reset twice before running the wifi connection code.

I also noticed that most times after reset the thread started by _cinit.py (namely _cthread()) never starts, for some strange reason. It needs multiple power cycles before this thread starts running correctly.

So I think there's a deep problem with the integration of MicroPython, or how FreeRTOS is working in this set up.


I don't have further time to investigate this problem as it currently stands, because there are too many unknowns and conflicting components (it's definitely not supported to copy v1.14 esp32 into v1.15 MicroPython core).

The way forward here would be for someone to find a reproduction of the issue on v1.15 (or the latest master) of MicroPython, using IDF v4.2 and cmake. Then it would be considered a real bug to be fully investigated.

dpgeorge avatar May 31 '21 01:05 dpgeorge

@dpgeorge , Yes, these random exceptions same as mine on custom build. I will run this test on original MPY 1.15 IDF v4.2. Thank you for your test.

bhcuong2008 avatar May 31 '21 01:05 bhcuong2008

@dpgeorge, @peterhinch After many observations during my code developing and tests, I think that the problem is due to heap fragmented because of push/pop task very often on the heap. Memory corruption only happening when using wifi and socket, these 2 may use heap memory a lot.

So I want to implement task queues in C++ advanced list data types, so that it could reserve buffer to avoid allocate/free heap memory too often. But I dont understand pair_heap well, including functions mp_pairheap_meld, mp_pairheap_pairing, mp_pairheap_delete.

Could u give me some guides on this replacement steps so that I could code it my app Arduino C++ to test again my above assumption.

Thank you so much.

bhcuong2008 avatar Jul 15 '21 06:07 bhcuong2008

So I want to implement task queues in C++ advanced list data types, so that it could reserve buffer to avoid allocate/free heap memory too often.

The pairing heap implementation does not allocate any memory, it reuses the nodes and links them to each other when doing push/pop. Memory is only allocated when a new asyncio task is created.

dpgeorge avatar Jul 27 '21 05:07 dpgeorge

@dpgeorge thank you for your precious info.

Now I have the same issue when run sync code in loop. I'm doing many tests to uncover something. Many exceptions have backtrace, py/vm.c, line 1436. And EXCVADDR: 0x00000000.

if (mp_obj_is_subclass_fast(MP_OBJ_FROM_PTR(((mp_obj_base_t*)nlr.ret_val)->type), MP_OBJ_FROM_PTR(&mp_type_StopIteration)))

I think when my exception occurs, nlr.ret_val pointer pointing to NULL, so access it ->type, make double exception. I want to add a check NULL value of nlt.ret_val right before line 1436. But I dont know how to do it properly if it's really NULL. Pls guide me.

Thank you very much.

bhcuong2008 avatar Jul 29 '21 09:07 bhcuong2008

I also got exception here, file py/objfun.c, line 288, 289, also EXCVADDR: 0x00000000

mp_vm_return_kind_t vm_return_kind = mp_execute_bytecode(code_state, MP_OBJ_NULL);
mp_globals_set(code_state->old_globals);

I think that code_state allocation failed. I dont know how to do it properly when code_state NULL. Pls guide me.

UPDATE. Sometimes I got this:

assertion "ATB_GET_KIND(block) == AT_HEAD" failed: file "/home/cuongbh/esp32/arduino_build/lib_335/components/micropython/py/gc.c", line 601, function: gc_free
abort() was called at PC 0x4013b8df on core 1

bhcuong2008 avatar Jul 29 '21 11:07 bhcuong2008

Here is my sync code test. It has the same exceptions as the previous async code.

This could run over 12 hours without any issue with period = 0ms.

x = {}
def test_exception2(period):   
   while 1:
      try:
         x['_cntButton'] = 0 if '_cntButton' not in x or x['_cntButton'] >= N[0] else x['_cntButton'] + 1
         time.sleep_ms(period)
      except Exception as e:
         print(str(e))

But, this one could last tens of seconds to several minutes with period 0, 5, 10, 50, 100ms, then exceptions happens. DObj is a Python class inherited from dict, class DObj(dict).

o = DObj()
def test_exception2(period):   
   while 1:
      try:
         o._cntButton = 0 if not hasattr(o, '_cntButton') or o._cntButton >= N[0] else o._cntButton + 1
         time.sleep_ms(period)
      except Exception as e:
         print(str(e))

These codes just paste to REPL and run. N is a list. N[0] contains max counter.

I wonder why there are too much differences!

bhcuong2008 avatar Jul 30 '21 03:07 bhcuong2008

Should we check every memory allocation by MPY within its specified heap buffer for ESP32?

In the datasheet, it says that its valid range is from 0x3fxxxxxx - 0x6xxxxxxx.

https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/fatal-errors.html This CPU exception happens when application attempts to read from or write to an invalid memory location. The address which was written/read is found in EXCVADDR register in the register dump. If this address is zero, it usually means that application attempted to dereference a NULL pointer. If this address is close to zero, it usually means that application attempted to access member of a structure, but the pointer to the structure was NULL. If this address is something else (garbage value, not in 0x3fxxxxxx - 0x6xxxxxxx range), it likely means that the pointer used to access the data was either not initialized or was corrupted.

bhcuong2008 avatar Jul 30 '21 04:07 bhcuong2008