esp-idf icon indicating copy to clipboard operation
esp-idf copied to clipboard

ASSERT_PARAM(-218959118 0), in arch_main.c at line 343 (IDFGH-6126)

Open ChrisEAlfred opened this issue 3 years ago • 99 comments

Opening - same as https://github.com/espressif/esp-idf/issues/6517. Still occurs in 4.3.1

ChrisEAlfred avatar Nov 01 '21 05:11 ChrisEAlfred

Same here, it happens to me on 4.2.2.

It is less frequent than it used to be, but it still happens

ASSERT_PARAM(-218959118 0), in arch_main.c at line 343
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0). 

Core  0 register dump:
PC      : 0x40088b64  PS      : 0x00060434  A0      : 0x801395d0  A1      : 0x3ffbf8f0  
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x60008048  A5      : 0x00000000  
A6      : 0x00000004  A7      : 0x3ffbdcb8  A8      : 0x80088b64  A9      : 0x3ffbf8d0  
A10     : 0x0000f2f2  A11     : 0x00000037  A12     : 0x00000014  A13     : 0xffffffff  
A14     : 0x00000000  A15     : 0xfffffffc  SAR     : 0x00000004  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x40088a9c  LEND    : 0x40088aa3  LCOUNT  : 0x00000000  
Core  0 was running in ISR context:
EPC1    : 0x401ccdf3  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x40088b64

Backtrace:0x40088b61:0x3ffbf8f0 0x401395cd:0x3ffbf910 0x40019fb5:0x3ffbf930 0x40046683:0x3ffbf960 0x40047515:0x3ffbf980 0x4008bbb1:0x3ffbf9a0 0x4008b805:0x3ffbf9c0 0x4008df12:0x3ffbf9e0 0x4008f18f:0x3ffbfa00 0x4008645d:0x3ffbfa20 0x401399f5:0x3ffb5f10 0x40139fd5:0x3ffb5f30 0x4009487e:0x3ffb5f60
Core  1 register dump:
PC      : 0x4009296d  PS      : 0x00060d34  A0      : 0x80092ad5  A1      : 0x3ffbcc00  
A2      : 0x3ffc76a8  A3      : 0x00060d20  A4      : 0x800949c4  A5      : 0x3ffbcb20  
A6      : 0x00000000  A7      : 0x3ffbc500  A8      : 0x00000015  A9      : 0x00000001  
A10     : 0x000001a4  A11     : 0x00000000  A12     : 0x00060d20  A13     : 0x00060d23  
A14     : 0x00000001  A15     : 0x3ffbc500  SAR     : 0x0000001d  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

Backtrace:0x4009296a:0x3ffbcc00 0x40092ad2:0x3ffbcc20 0x40092cf0:0x3ffbcc40 0x4009487e:0x3ffbcc60

alexmantaut avatar Nov 01 '21 06:11 alexmantaut

@Alvin1Zhang This issue has been reported quite long time ago, and at least 3 people reported that the fix does not really fix the issue. https://github.com/espressif/esp-idf/issues/6517#issuecomment-891186177

AxelLin avatar Nov 01 '21 06:11 AxelLin

This ongoing issue is the reason we removed the ESP32 from a product we're developing.

jmaha avatar Nov 01 '21 07:11 jmaha

Thanks for reporting the issue. The crash was because after the controller ran out of memory, malloc failure happens and assert is triggered. The issue is usually caused by the slow processing of the application layer, which caused the controller memory could not be released in time. We did some optimization before, however, if the latest IDF version still has the issue, please refer to the following suggestions and give it a try.

  1. Reduce time-consuming operations at the application layer, such as a large number of log printing.
  2. Increase the CPU speed, for example, increase the CPU frequency to 240Mhz. If dual-core is supported, run bluedroid host and Controller on different cores.
  3. If you are scanning at the same time: Try to reduce controller memory consumption, for example, by making the following changes in the configuration: (1) Change BLE adv report flow control number to 20 and BLE adv lost event threshold value to 100 (2) Disable BLE full scan feature supported (3) Enable BLE queue congestion check (4) Disable Report adv data and scan response individually when BLE active scan (5) Enable duplicate Scan when setting scan parameters to filter out duplicate broadcast packets

Meanwhile, we tried but unfortunately did not reproduce the issue yesterday. Could you please provide a tiny demo with the steps of reappearance? We will keep looking into this. Thanks.

Tiana, Espressif

TianaESP avatar Nov 03 '21 06:11 TianaESP

@TianaESP we have a similar error but 1) is not possible as we need the tasks todo the stuff they need todo. 2) Update on that side we already run at 240hz more is not possible 3) We can certainly try that but we are not using bluedroid we are using the nimble stack which seems to also use libbtdm in which this assertion seems to happen.

May i ask what is the particular reason why there is not source for this part? As that would make debugging easier. (Not much of a fan of re'ing such stuff or reading xtensa assembly.. )

We can reporduce the issue with a run that takes around +3 hours. Not sure if there can be a minimal repro made as you guy's would still require the hardware on which we do the tests. We can supply stacktraces and registers if that may help. Or if you wish we can certainly setup a remote session or something in order to get this working again. This is basically a show stopper for our entire project so if this can't be fixed we have to move to a diffrent RTOS and do a lot of the work again.

Platin21 avatar Nov 05 '21 08:11 Platin21

I'm not to sure if the controller memory is the problem as we have 4096 allocated to it so that is plenty or do you mean there is a leak inside the controller task? That would be very unfortnunate.

Platin21 avatar Nov 05 '21 08:11 Platin21

Meanwhile, we tried but unfortunately did not reproduce the issue yesterday. Could you please provide a tiny demo with the steps of reappearance? We will keep looking into this. Thanks.

@TianaESP

Just to clarify about your comments above.

So you should be able to reproduce it by

  • Reduce CPU speed
  • Disable dual-core
  • Add more logging

In additional, this issue does not happen immediately. The device can run for hours then it suddenly hit assert. So it's really difficult for application developers to know what could be wrong.

AxelLin avatar Nov 05 '21 08:11 AxelLin

@AxelLin in our case we have dual-core running, we have the highest frequnecy set, we have it still happening. Our host controller has 4096 as a buffer in memory.

Platin21 avatar Nov 05 '21 08:11 Platin21

@Platin21 Can you please provide the specific commit id of IDF and describe what the host task did?

TianaESP avatar Nov 05 '21 09:11 TianaESP

@TianaESP We are using version 4.3.1 and 4.4 but it was the same on ealier builds. We didn't want to go full master as we are not sure about the stability of the other features that we use with that. Are there any recent changes that should have fixed this?

Platin21 avatar Nov 05 '21 10:11 Platin21

@TianaESP I can certainly provide you a trace not sure if that will help alot though. We don't have JTAG on the board so debugging is somewhat difficult.

(3332242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3332242) task_wdt:  - IDLE (CPU 1)
(3332242) task_wdt: Tasks currently running:
(3332242) task_wdt: CPU 0: IDLE
(3332242) task_wdt: CPU 1: btController
(3332242) task_wdt: Print CPU 0 (current core) backtrace


Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820 

(3332242) task_wdt: Print CPU 1 backtrace


Backtrace:0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40 

(3332399) UDS: Read data id fe20
(3332406) UDS: Read data id fe1f
(3332425) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3332426) UDP SERVER: JoDaHomas
(3332430) UDP SERVER: Waiting for data
(3337242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3337242) task_wdt:  - IDLE (CPU 1)
(3337242) task_wdt: Tasks currently running:
(3337242) task_wdt: CPU 0: IDLE
(3337242) task_wdt: CPU 1: btController
(3337242) task_wdt: Print CPU 0 (current core) backtrace


Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820 

 (3337242) task_wdt: Print CPU 1 backtrace


Backtrace:0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40 

(3337321) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3337321) UDP SERVER: JoDaHomas
(3337325) UDP SERVER: Waiting for data
(3337647) UDS: Read data id fe20
(3337653) UDS: Read data id fe1f
(3342242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3342242) task_wdt:  - IDLE (CPU 1)
(3342242) task_wdt: Tasks currently running:
(3342242) task_wdt: CPU 0: IDLE
(3342242) task_wdt: CPU 1: btController
(3342242) task_wdt: Print CPU 0 (current core) backtrace


Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820 

 (3342242) task_wdt: Print CPU 1 backtrace


Backtrace: 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40 

(3342356) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3342357) UDP SERVER: JoDaHomas
(3342360) UDP SERVER: Waiting for data
(3342928) UDS: Read data id fe20
(3342934) UDS: Read data id fe1f
(3347242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3347242) task_wdt:  - IDLE (CPU 1)
(3347242) task_wdt: Tasks currently running:
(3347242) task_wdt: CPU 0: IDLE
(3347242) task_wdt: CPU 1: btController
(3347242) task_wdt: Print CPU 0 (current core) backtrace


Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820 

(3347242) task_wdt: Print CPU 1 backtrace


Backtrace:0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40 

(3347382) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3347383) UDP SERVER: JoDaHomas
(3347388) UDP SERVER: Waiting for data
(3348169) UDS: Read data id fe20
(3348178) UDS: Read data id fe1f
ASSERT_PARAM(-218959118 0), in arch_main.c at line 343                                               <==== The reported ASSERT
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0). 

Core  0 register dump:
PC      : 0x4018bf52  PS      : 0x00060334  A0      : 0x800d4d40  A1      : 0x3ffbc7e0  
A2      : 0x00000000  A3      : 0x00000001  A4      : 0x00000014  A5      : 0x000000a5  
A6      : 0x000000a5  A7      : 0x00060023  A8      : 0x801132ae  A9      : 0x3ffbc7c0  
A10     : 0x00000000  A11     : 0x00000001  A12     : 0x00000014  A13     : 0x000000a5  
A14     : 0x000000a5  A15     : 0x00060023  SAR     : 0x0000001d  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  


Backtrace:0x4018bf4f:0x3ffbc7e00x400d4d3d:0x3ffbc800 0x40093907:0x3ffbc820 


Core  1 register dump:
PC      : 0x40089689  PS      : 0x00060f34  A0      : 0x80173c7c  A1      : 0x3ffbeec0  
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x60008048  A5      : 0x00000000  
A6      : 0x00000004  A7      : 0x3ffc21ec  A8      : 0x80089684  A9      : 0x3ffbeea0  
A10     : 0x00000000  A11     : 0x00000037  A12     : 0x00000014  A13     : 0xffffffff  
A14     : 0x00000000  A15     : 0xfffffffc  SAR     : 0x00000004  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x400895bc  LEND    : 0x400895c3  LCOUNT  : 0x00000000  

(Edit removed color escape sequences)

And as i said ealier we have no problem spending some time on this bug as this is pretty important for us to work.

Platin21 avatar Nov 05 '21 10:11 Platin21

@TianaESP And here is the decoded addresses sadly still some ??'s:

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x4018bf4f:0x3ffbc7e00x400d4d3d:0x3ffbc800 0x40093907:0x3ffbc820
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x40093907: #idf path#/freertos/tasks.c:3823

CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40089686:0x3ffbeec00x40173c79:0x3ffbeee0 0x40019fb5:0x3ffbef00 0x400466c3:0x3ffbef30 0x4008c772:0x3ffbef50 0x4008c45d:0x3ffbef70 0x4008eb6a:0x3ffbef90 0x4008fde7:0x3ffbefb0 0x40083299:0x3ffbefd0 0x40174091:0x3ffe7c20 0x40174659:0x3ffe7c40
0x40089686: ??:?
0x40019fb5: ??:0
0x400466c3: ??:0
0x4008c772: ??:?
0x4008c45d: ??:?
0x4008eb6a: ??:?
0x4008fde7: intc.c:?
0x40083299: #idf path#/freertos/port/xtensa/xtensa_vectors.S:1105
0x40174091: ??:?
0x40174659: ??:?

Platin21 avatar Nov 05 '21 10:11 Platin21

In this case it was running for around 3 hours which is for our use case not that uncommon. And as i said we are willing to take some time to debug the issue but for us it's not a clear case. And we can only see that libbtdm as inside the controller task we have a failure.

Platin21 avatar Nov 05 '21 10:11 Platin21

After this the device has a Guru Mediation Error and Restarts which kills the BLE Connection and makes it worthless for the customer.

Platin21 avatar Nov 05 '21 10:11 Platin21

1.Coud you give the commit id of the idf you are using? 2 Could you describe the demo you are using? We have a demo which the device do scan , adv and connect to four other devices. But we run about one day and could not reproduce the issue.

skyESP avatar Nov 05 '21 12:11 skyESP

@skyESP 1: rel4.4 https://github.com/espressif/esp-idf/tree/release/v4.4 rel4.3 https://github.com/espressif/esp-idf/tree/release/v4.3 This report was from a run with version 4.4 (should be exact commit)

2: Just a normal BLE process with RPA Enabled. As in start with adv then do a connect then we have GATT Attributes that we retrieve every 5 seconds. Our process also does do a scan for sensors but that is not active during the tests.

We also have a UDP Echo server running which also gets polled in a 5 second interval which is running over Wifi the BLE and UDP Requests have a offset of 2.5 Seconds so that they don’t interfere.

We also have a Additional Task for retrieving CAN Messages via one of the Uart‘s where we also poll each 5 seconds

(Edit) We have just a single device connected that does the polling (Samsung A22 Phone, Tested with other devices same outcomes)

Platin21 avatar Nov 05 '21 12:11 Platin21

@Platin21 1.During the test, Does the Adv is still continue? If it does,what's the frequence? 2. Do you use bluedroid or nimble as your host layer? 3.Could you try to run the wifi and bt on different core. For example wifi 0 core ,bt 1 core.

skyESP avatar Nov 08 '21 02:11 skyESP

@skyESP

  1. No it doesn’t continue
  2. As i already said we use nimble
  3. What part of bt? The host our task or the controller? We switched some stuff already around and changed priorities but that made no notable change..

Platin21 avatar Nov 08 '21 06:11 Platin21

And no we have no option to switch to bluedroid as that would not fit in memory anymore.

Platin21 avatar Nov 08 '21 06:11 Platin21

@Platin21 OK,Thanks for your information. We are trying to reproduce the issue.

For the third question. Just the controller of bt . You can do as following. We have menuconfig to chose running on core0 or core1. run idf.py menuconfig For wifi: (Top) → Component config → Wi-Fi → WiFi Task Core ID For bt (Top) → Component config → Bluetooth → Bluetooth controller(ESP32 Dual Mode Bluetooth) → The cpu core which bluetooth controller run

Another two questions 1.What's the priority of the task for retrieving CAN Messages ? Does it higher than controller task? 2.From your describe you just use polling way not using interrupt for handling the CAN Messages , Am I right?

skyESP avatar Nov 08 '21 08:11 skyESP

@skyESP Let me check if that changes anything and i get back to you.

  1. 20 from max 25. The BLE one is also slightly evalated i think it’s 15 there.
  2. It’s done via Interrupt, Polling is only used for retrieving the uds messages

Platin21 avatar Nov 08 '21 10:11 Platin21

OK, You set the priority of the controller task lower than the task to handle the can message. Here please do two tests.

  1. Set the controller task higher than the task to handle the CAN message.If still have problem please do the step 2. 2.Reduce the load or stop the the task to handle the CAN message and check if the issue is still here.

I suspect that the controller task is blocked by other task or the interrupt.

skyESP avatar Nov 08 '21 11:11 skyESP

@skyESP where can one change the controller priority? Seems to be not a option inside Component config -> Bluetooth -> Bluetooth Controller. Or what is the default level there?

  1. Isn’t really a fix but for testing sure i can do that.

Platin21 avatar Nov 08 '21 12:11 Platin21

@Platin21 1.You can change the priority in the code where you create the task. 2.From our experience the controller should not be blocked by other tasks or interrupt for long time. Through I do not know how long and how much work you did in your task,so I advance you to do this to test and verify my suspection. If it works,this can be a fix.

skyESP avatar Nov 09 '21 01:11 skyESP

@skyESP you mean the task that handles the BLE so neither the controller nor the host?

I would first need to know under which value it needs to be to not take priority.

Platin21 avatar Nov 09 '21 07:11 Platin21

@Platin21 hi, 1.You had better keep the priority of controller and host task as our demo example. 2.The task you created yourself should lower than the controller and host ble task. The priority should be: ble controller task > ble host task > application task ( build by your self) 3.If this still can not work, can you give us your code?

skyESP avatar Nov 10 '21 03:11 skyESP

  1. I didn’t change anything there what i was not sure about is if our task should have a higher prio than the can one
  2. This was the case anyways i think meanwhile i dunno what the default prio of these other two tasks is.
  3. Possible but that needs to be elaborated in a Email. And i suspect it makes more sense that we setup something as i already said and you maybe can guess that the pcb on which the esp sits is custom. @skyESP

Platin21 avatar Nov 10 '21 06:11 Platin21

@skyESP so back to the priorities the host task seems to run on 21 and the LL task seems to run on 24 so our can task is acutally already lower in priority. And to the other tasks on the system all of them run with priority 0 so should not be possbile for it to interfere at all then.

Platin21 avatar Nov 10 '21 11:11 Platin21

hi,@Platin21 Sorry for replay late. Yes,Priority 0 of other tasks should not interfere them.
For the example ble_ibeacon demo in our IDF sdk. The controller task is 23,hci host task is 22, BTU task is 20 ,BTC task is 19.

skyESP avatar Nov 10 '21 11:11 skyESP

Yeah that makes it clearer now. We will setup something tomorrow and i will leave my response here. But yeah it sounds very promising. We are now switching the priorities to be under a value of 18. Would be a huge step for us forward if that works out.

Platin21 avatar Nov 10 '21 13:11 Platin21