ASSERT_PARAM(-218959118 0), in arch_main.c at line 343 (IDFGH-6126)
Opening - same as https://github.com/espressif/esp-idf/issues/6517. Still occurs in 4.3.1
Same here, it happens to me on 4.2.2.
It is less frequent than it used to be, but it still happens.
ASSERT_PARAM(-218959118 0), in arch_main.c at line 343
Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).
Core 0 register dump:
PC : 0x40088b64 PS : 0x00060434 A0 : 0x801395d0 A1 : 0x3ffbf8f0
A2 : 0x00000001 A3 : 0x00000000 A4 : 0x60008048 A5 : 0x00000000
A6 : 0x00000004 A7 : 0x3ffbdcb8 A8 : 0x80088b64 A9 : 0x3ffbf8d0
A10 : 0x0000f2f2 A11 : 0x00000037 A12 : 0x00000014 A13 : 0xffffffff
A14 : 0x00000000 A15 : 0xfffffffc SAR : 0x00000004 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x40088a9c LEND : 0x40088aa3 LCOUNT : 0x00000000
Core 0 was running in ISR context:
EPC1 : 0x401ccdf3 EPC2 : 0x00000000 EPC3 : 0x00000000 EPC4 : 0x40088b64
Backtrace:0x40088b61:0x3ffbf8f0 0x401395cd:0x3ffbf910 0x40019fb5:0x3ffbf930 0x40046683:0x3ffbf960 0x40047515:0x3ffbf980 0x4008bbb1:0x3ffbf9a0 0x4008b805:0x3ffbf9c0 0x4008df12:0x3ffbf9e0 0x4008f18f:0x3ffbfa00 0x4008645d:0x3ffbfa20 0x401399f5:0x3ffb5f10 0x40139fd5:0x3ffb5f30 0x4009487e:0x3ffb5f60
Core 1 register dump:
PC : 0x4009296d PS : 0x00060d34 A0 : 0x80092ad5 A1 : 0x3ffbcc00
A2 : 0x3ffc76a8 A3 : 0x00060d20 A4 : 0x800949c4 A5 : 0x3ffbcb20
A6 : 0x00000000 A7 : 0x3ffbc500 A8 : 0x00000015 A9 : 0x00000001
A10 : 0x000001a4 A11 : 0x00000000 A12 : 0x00060d20 A13 : 0x00060d23
A14 : 0x00000001 A15 : 0x3ffbc500 SAR : 0x0000001d EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000
Backtrace:0x4009296a:0x3ffbcc00 0x40092ad2:0x3ffbcc20 0x40092cf0:0x3ffbcc40 0x4009487e:0x3ffbcc60
@Alvin1Zhang This issue was reported quite a long time ago, and at least 3 people have reported that the fix does not really fix the issue. https://github.com/espressif/esp-idf/issues/6517#issuecomment-891186177
This ongoing issue is the reason we removed the ESP32 from a product we're developing.
Thanks for reporting the issue. The crash happens because the controller runs out of memory: a malloc inside the controller fails and the assert is triggered. This is usually caused by slow processing at the application layer, which prevents controller memory from being released in time. We made some optimizations earlier; however, if the latest IDF version still shows the issue, please try the following suggestions.
- Reduce time-consuming operations at the application layer, such as heavy log printing.
- Increase the CPU speed, for example raise the CPU frequency to 240 MHz. If dual core is supported, run the Bluedroid host and the controller on different cores.
- If you are scanning at the same time, try to reduce controller memory consumption by making the following configuration changes (see the sketch below): (1) Change the BLE adv report flow control number to 20 and the BLE adv lost event threshold to 100. (2) Disable "BLE full scan feature supported". (3) Enable "BLE queue congestion check". (4) Disable "Report adv data and scan response individually when BLE active scan". (5) Enable duplicate scan when setting scan parameters, to filter out duplicate broadcast packets.
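Put together as an sdkconfig fragment, the above might look as follows. The symbol names are assumptions based on the usual v4.x ESP32 menuconfig options; please verify them in your own menuconfig before relying on them:

# Assumed IDF v4.x Kconfig symbols - verify in menuconfig.
CONFIG_ESP32_DEFAULT_CPU_FREQ_240=y
CONFIG_ESP32_DEFAULT_CPU_FREQ_MHZ=240
# (1) adv report flow control number and lost-event (discard) threshold
CONFIG_BTDM_BLE_ADV_REPORT_FLOW_CTRL_SUPP=y
CONFIG_BTDM_BLE_ADV_REPORT_FLOW_CTRL_NUM=20
CONFIG_BTDM_BLE_ADV_REPORT_DISCARD_THRSHOLD=100
# (2) disable the BLE full scan feature
# CONFIG_BTDM_CTRL_FULL_SCAN_SUPPORTED is not set
# (3) enable the BLE queue congestion check (Bluedroid option)
CONFIG_BT_BLE_HOST_QUEUE_CONG_CHECK=y
# (4) do not report adv data and scan response individually
# CONFIG_BT_BLE_ACT_SCAN_REP_ADV_SCAN is not set
# (5) filter duplicate broadcast packets in the controller
CONFIG_BTDM_BLE_SCAN_DUPL=y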
Meanwhile, we tried but unfortunately could not reproduce the issue yesterday. Could you please provide a tiny demo with the steps to reproduce it? We will keep looking into this. Thanks.
Tiana, Espressif
@TianaESP we have a similar error, but: (1) is not possible, as we need the tasks to do the work they have to do. (2) Update on that side: we already run at 240 MHz; more is not possible. (3) We can certainly try that, but we are not using Bluedroid; we are using the NimBLE stack, which seems to also use libbtdm, in which this assertion seems to happen.
May I ask what the particular reason is that there is no source for this part? That would make debugging easier. (I'm not much of a fan of reverse-engineering such things or reading Xtensa assembly...)
We can reproduce the issue with a run that takes around 3+ hours. I'm not sure a minimal repro can be made, as you would still need the hardware on which we run the tests. We can supply stack traces and registers if that helps, or, if you wish, we can set up a remote session to get this working again. This is basically a showstopper for our entire project, so if it can't be fixed we will have to move to a different RTOS and redo a lot of the work.
I'm not too sure the controller memory is the problem, as we have 4096 allocated to it, which should be plenty. Or do you mean there is a leak inside the controller task? That would be very unfortunate.
Meanwhile, we tried but unfortunately could not reproduce the issue yesterday. Could you please provide a tiny demo with the steps to reproduce it? We will keep looking into this. Thanks.
@TianaESP
Just to clarify regarding your comments above:
So you should be able to reproduce it by (a config sketch for such a build follows below):
- Reducing the CPU speed
- Disabling dual core
- Adding more logging
In addition, this issue does not happen immediately. The device can run for hours and then suddenly hit the assert, so it's really difficult for application developers to know what could be wrong.
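For completeness, such a worst-case stress build might use an sdkconfig fragment like the following; the symbols are the standard v4.x ones, but please verify them in menuconfig:

# Assumed IDF v4.x Kconfig symbols for a stress/repro build.
CONFIG_ESP32_DEFAULT_CPU_FREQ_80=y
CONFIG_ESP32_DEFAULT_CPU_FREQ_MHZ=80
CONFIG_FREERTOS_UNICORE=y
CONFIG_LOG_DEFAULT_LEVEL_VERBOSE=y
CONFIG_LOG_DEFAULT_LEVEL=5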
@AxelLin in our case we have dual core running and the highest frequency set, and it still happens. Our host controller has 4096 as a buffer in memory.
@Platin21 Can you please provide the specific commit ID of the IDF you are using and describe what the host task does?
@TianaESP We are using versions 4.3.1 and 4.4, but it was the same on earlier builds. We didn't want to go to full master, as we are not sure about the stability of the other features we use. Are there any recent changes that should have fixed this?
@TianaESP I can certainly provide you a trace, though I'm not sure it will help a lot. We don't have JTAG on the board, so debugging is somewhat difficult.
(3332242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3332242) task_wdt: - IDLE (CPU 1)
(3332242) task_wdt: Tasks currently running:
(3332242) task_wdt: CPU 0: IDLE
(3332242) task_wdt: CPU 1: btController
(3332242) task_wdt: Print CPU 0 (current core) backtrace
Backtrace: 0x40112F9E:0x3FFBE9A0 0x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
(3332242) task_wdt: Print CPU 1 backtrace
Backtrace: 0x40085055:0x3FFBEFB0 0x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
(3332399) UDS: Read data id fe20
(3332406) UDS: Read data id fe1f
(3332425) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3332426) UDP SERVER: JoDaHomas
(3332430) UDP SERVER: Waiting for data
(3337242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3337242) task_wdt: - IDLE (CPU 1)
(3337242) task_wdt: Tasks currently running:
(3337242) task_wdt: CPU 0: IDLE
(3337242) task_wdt: CPU 1: btController
(3337242) task_wdt: Print CPU 0 (current core) backtrace
Backtrace: 0x40112F9E:0x3FFBE9A0 0x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
(3337242) task_wdt: Print CPU 1 backtrace
Backtrace: 0x40085055:0x3FFBEFB0 0x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
(3337321) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3337321) UDP SERVER: JoDaHomas
(3337325) UDP SERVER: Waiting for data
(3337647) UDS: Read data id fe20
(3337653) UDS: Read data id fe1f
(3342242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3342242) task_wdt: - IDLE (CPU 1)
(3342242) task_wdt: Tasks currently running:
(3342242) task_wdt: CPU 0: IDLE
(3342242) task_wdt: CPU 1: btController
(3342242) task_wdt: Print CPU 0 (current core) backtrace
Backtrace: 0x40112F9E:0x3FFBE9A0 0x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
(3342242) task_wdt: Print CPU 1 backtrace
Backtrace: 0x40085055:0x3FFBEFB0 0x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
(3342356) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3342357) UDP SERVER: JoDaHomas
(3342360) UDP SERVER: Waiting for data
(3342928) UDS: Read data id fe20
(3342934) UDS: Read data id fe1f
(3347242) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
(3347242) task_wdt: - IDLE (CPU 1)
(3347242) task_wdt: Tasks currently running:
(3347242) task_wdt: CPU 0: IDLE
(3347242) task_wdt: CPU 1: btController
(3347242) task_wdt: Print CPU 0 (current core) backtrace
Backtrace: 0x40112F9E:0x3FFBE9A0 0x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
(3347242) task_wdt: Print CPU 1 backtrace
Backtrace: 0x40085055:0x3FFBEFB0 0x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
(3347382) UDP SERVER: Received 9 bytes from 192.168.0.68:
(3347383) UDP SERVER: JoDaHomas
(3347388) UDP SERVER: Waiting for data
(3348169) UDS: Read data id fe20
(3348178) UDS: Read data id fe1f
ASSERT_PARAM(-218959118 0), in arch_main.c at line 343 <==== The reported ASSERT
Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).
Core 0 register dump:
PC : 0x4018bf52 PS : 0x00060334 A0 : 0x800d4d40 A1 : 0x3ffbc7e0
A2 : 0x00000000 A3 : 0x00000001 A4 : 0x00000014 A5 : 0x000000a5
A6 : 0x000000a5 A7 : 0x00060023 A8 : 0x801132ae A9 : 0x3ffbc7c0
A10 : 0x00000000 A11 : 0x00000001 A12 : 0x00000014 A13 : 0x000000a5
A14 : 0x000000a5 A15 : 0x00060023 SAR : 0x0000001d EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000
Backtrace: 0x4018bf4f:0x3ffbc7e0 0x400d4d3d:0x3ffbc800 0x40093907:0x3ffbc820
Core 1 register dump:
PC : 0x40089689 PS : 0x00060f34 A0 : 0x80173c7c A1 : 0x3ffbeec0
A2 : 0x00000001 A3 : 0x00000000 A4 : 0x60008048 A5 : 0x00000000
A6 : 0x00000004 A7 : 0x3ffc21ec A8 : 0x80089684 A9 : 0x3ffbeea0
A10 : 0x00000000 A11 : 0x00000037 A12 : 0x00000014 A13 : 0xffffffff
A14 : 0x00000000 A15 : 0xfffffffc SAR : 0x00000004 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x400895bc LEND : 0x400895c3 LCOUNT : 0x00000000
(Edit: removed color escape sequences)
And as I said earlier, we have no problem spending some time on this bug, as it is pretty important for us that this works.
@TianaESP And here are the decoded addresses; sadly there are still some ??s:
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0 0x400D4D3D:0x3FFBC800 0x40093907:0x3FFBC820
0x40112f9e: #idf path#/esp_system/task_wdt.c:191 (discriminator 3)
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x400d4d3d: #idf path#/esp_system/freertos_hooks.c:63
0x40093907: #idf path#/freertos/tasks.c:3823
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40085055:0x3FFBEFB00x40083299:0x3FFBEFD0 0x40174091:0x3FFE7C20 0x40174659:0x3FFE7C40
0x40085055: #idf path#/esp_system/crosscore_int.c:92
0x40174091: ??:?
0x40174659: ??:?
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x4018bf4f:0x3ffbc7e00x400d4d3d:0x3ffbc800 0x40093907:0x3ffbc820
0x4018bf4f: #idf path#/hal/esp32/include/hal/cpu_ll.h:183
0x40093907: #idf path#/freertos/tasks.c:3823
CONSOLE>xtensa-esp32-elf-addr2line.exe -e binary.elf -p -a 0x40089686:0x3ffbeec00x40173c79:0x3ffbeee0 0x40019fb5:0x3ffbef00 0x400466c3:0x3ffbef30 0x4008c772:0x3ffbef50 0x4008c45d:0x3ffbef70 0x4008eb6a:0x3ffbef90 0x4008fde7:0x3ffbefb0 0x40083299:0x3ffbefd0 0x40174091:0x3ffe7c20 0x40174659:0x3ffe7c40
0x40089686: ??:?
0x40019fb5: ??:0
0x400466c3: ??:0
0x4008c772: ??:?
0x4008c45d: ??:?
0x4008eb6a: ??:?
0x4008fde7: intc.c:?
0x40083299: #idf path#/freertos/port/xtensa/xtensa_vectors.S:1105
0x40174091: ??:?
0x40174659: ??:?
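A side note on the decodes above: the pasted backtraces contain fused tokens such as 0x3FFBE9A00x40083299 (a space lost when copying the log), and addr2line silently skips those, which is why 0x40083299 never appears in the decoded output. An illustrative way to re-split a pasted backtrace before decoding (any equivalent search-and-replace works; the sample backtrace here is just the first one from above, truncated):

CONSOLE>python -c "import re,sys; print(re.sub(r'(0x[0-9a-fA-F]{8})(?=0x)', r'\1 ', sys.argv[1]))" "Backtrace:0x40112F9E:0x3FFBE9A00x40083299:0x3FFBE9D0 0x4018BF4F:0x3FFBC7E0"

The remaining ?? entries are to be expected: addresses around 0x4001xxxx-0x4004xxxx live in the ESP32 mask ROM, and the others most likely fall inside the closed-source controller library (libbtdm_app.a), for which no line information is available.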
In this case it was running for around 3 hours, which is not uncommon for our use case. And as I said, we are willing to take some time to debug the issue, but for us it's not a clear case: all we can see is that the failure happens inside libbtdm, in the controller task.
After this the device hits a Guru Meditation Error and restarts, which kills the BLE connection and makes it worthless for the customer.
1. Could you give the commit ID of the IDF you are using? 2. Could you describe the demo you are using? We have a demo in which the device scans, advertises, and connects to four other devices, but we ran it for about one day and could not reproduce the issue.
@skyESP 1: release/v4.4 https://github.com/espressif/esp-idf/tree/release/v4.4 and release/v4.3 https://github.com/espressif/esp-idf/tree/release/v4.3. This report was from a run with version 4.4 (should be the exact commit).
2: Just a normal BLE process with RPA enabled: we start with advertising, then connect, and then we have GATT attributes that are retrieved every 5 seconds. Our process can also scan for sensors, but that was not active during the tests.
We also have a UDP echo server running over Wi-Fi, which also gets polled at a 5-second interval. The BLE and UDP requests have an offset of 2.5 seconds so that they don't interfere.
We also have an additional task for retrieving CAN messages via one of the UARTs, where we also poll every 5 seconds.
(Edit) We have just a single device connected that does the polling (a Samsung A22 phone; tested with other devices, same outcome).
@Platin21 1. During the test, does the advertising still continue? If it does, what's the frequency? 2. Do you use Bluedroid or NimBLE as your host layer? 3. Could you try to run Wi-Fi and BT on different cores, for example Wi-Fi on core 0 and BT on core 1?
@skyESP
- No, it doesn't continue.
- As I already said, we use NimBLE.
- Which part of BT: the host, our task, or the controller? We already switched some things around and changed priorities, but that made no notable change.
And no, we have no option to switch to Bluedroid, as that would no longer fit in memory.
@Platin21 OK, thanks for the information. We are trying to reproduce the issue.
For the third question: just the BT controller. You can do it as follows; there are menuconfig options to choose between core 0 and core 1. Run idf.py menuconfig, then:
- For Wi-Fi: (Top) → Component config → Wi-Fi → WiFi Task Core ID
- For BT: (Top) → Component config → Bluetooth → Bluetooth controller (ESP32 Dual Mode Bluetooth) → The cpu core which bluetooth controller run
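In the generated sdkconfig this would end up roughly as below; the symbol names are assumptions based on the v4.x options, so please verify them in your menuconfig:

# Assumed IDF v4.x Kconfig symbols - Wi-Fi on core 0, BT controller on core 1.
CONFIG_ESP32_WIFI_TASK_PINNED_TO_CORE_0=y
CONFIG_BTDM_CTRL_PINNED_TO_CORE_1=y
CONFIG_BTDM_CTRL_PINNED_TO_CORE=1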
Another two questions: 1. What's the priority of the task that retrieves the CAN messages? Is it higher than the controller task? 2. From your description you use polling, not an interrupt, to handle the CAN messages; am I right?
@skyESP Let me check if that changes anything and I'll get back to you.
- 20 out of a maximum of 25. The BLE one is also slightly elevated; I think it's 15 there.
- It's done via interrupt; polling is only used for retrieving the UDS messages.
OK, so you set the priority of the controller task lower than the task that handles the CAN messages. Please do two tests:
1. Set the controller task higher than the task that handles the CAN messages. If you still have the problem, do step 2.
2. Reduce the load of, or stop, the task that handles the CAN messages and check whether the issue is still there.
I suspect that the controller task is blocked by another task or by an interrupt.
@skyESP where can one change the controller priority? It doesn't seem to be an option inside Component config → Bluetooth → Bluetooth Controller. And what is the default level there?
- It isn't really a fix, but for testing, sure, I can do that.
@Platin21 1. You can change the priority in the code where you create the task. 2. In our experience the controller should not be blocked by other tasks or interrupts for a long time, though I do not know how long and how much work you do in your tasks, so I advise you to run this test to verify my suspicion. If it works, this can be a fix.
@skyESP you mean the task that handles the BLE, so neither the controller nor the host?
I would first need to know below which value it has to stay so that it does not take priority.
@Platin21 hi, 1. You had better keep the priorities of the controller and host tasks as in our demo examples. 2. Tasks you create yourself should be lower than the controller and host BLE tasks; the priority order should be: BLE controller task > BLE host task > application tasks (built by yourself), as sketched below. 3. If this still does not work, can you give us your code?
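For illustration, creating an application task below the BLE stack priorities could look like the following minimal sketch; the priority value 17, the task body, the stack size, and the core choice are assumptions for this thread's setup, not values from an official example:

/* Sketch: an application (CAN/UDS polling) task kept below the BLE stack. */
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

#define APP_CAN_TASK_PRIO 17  /* below host (~21) and controller (~23/24) */

static void can_task(void *arg)
{
    for (;;) {
        /* ... poll and handle CAN/UDS messages ... */
        vTaskDelay(pdMS_TO_TICKS(5000));  /* 5 s interval, as described above */
    }
}

void app_start_can_task(void)
{
    /* Pinned to core 0, assuming the BT controller runs on core 1. */
    xTaskCreatePinnedToCore(can_task, "can", 4096, NULL, APP_CAN_TASK_PRIO, NULL, 0);
}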
- I didn't change anything there; what I was not sure about is whether our task should have a higher priority than the CAN one.
- That was the case anyway, I think; meanwhile I don't know what the default priority of those other two tasks is.
- Possible, but that would need to be arranged by email. And I suspect it makes more sense that we set something up, as I already said; as you can probably guess, the PCB the ESP sits on is custom. @skyESP
@skyESP so back to the priorities: the host task seems to run at 21 and the LL task seems to run at 24, so our CAN task is actually already lower in priority. As for the other tasks on the system, all of them run at priority 0, so it should not be possible for them to interfere at all.
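One way to check the effective priorities at runtime is a small dump along these lines (a sketch: it needs CONFIG_FREERTOS_USE_TRACE_FACILITY=y and CONFIG_FREERTOS_USE_STATS_FORMATTING_FUNCTIONS=y, and dump_task_priorities is a hypothetical helper name, not an IDF API):

/* Sketch: list task names, states, priorities and stack high-water marks. */
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

void dump_task_priorities(void)
{
    static char buf[1024];  /* large enough for a modest number of tasks */
    vTaskList(buf);         /* one line per task: name state prio stack num */
    printf("Name          State  Prio  Stack  Num\n%s", buf);
}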
hi @Platin21,
Sorry for the late reply.
Yes, priority 0 for the other tasks should not interfere with them.
For the ble_ibeacon example demo in our IDF SDK: the controller task is 23, the HCI host task is 22, the BTU task is 20, and the BTC task is 19.
Yeah, that makes it clearer now. We will set something up tomorrow and I will leave my response here. It sounds very promising. We are now switching our task priorities to be under 18; it would be a huge step forward for us if that works out.