[BUG] STM32H5 __start failure when debugging and flash banks are swapped
Description / Steps to reproduce the issue
I am running into issues when loading and debugging the basic H5 NSH configuration with the flash banks swapped on the Nucleo-H563ZI. Has anyone run into similar issues on this or other STM32 architectures?
This effectively prevents debugging on H5 MCUs where the bank swap option bit is active. This could be a debugger issue, but it occurs with both JLink and STLink, so I want to check whether something in the H5 architecture could be causing it.
- Board: Nucleo-H563ZI
- Configuration: nucleo-h563zi:nsh
- GCC: arm-none-eabi-gcc 13.3.1 20240614
- Host OS: Ubuntu 24.04 container on Ubuntu 24.04 host (WSL2)
Steps to reproduce
1. Connect the device to STM32CubeProgrammer.
2. Mass erase the flash memory for a clean slate.
3. Under Option bytes, select SWAP_BANK and click Apply.
4. Reset the board by removing power and reconnecting it. (The bank swap only takes effect after a reset.)
5. In NuttX, configure the board: ./tools/configure.sh -l nucleo-h563zi:nsh
6. Start programming and debugging. This succeeds and behaves as expected.
7. Stop the debug session.
8. Program and debug a second time. Execution reaches the beginning of __start(), but fails in the code described below.
9. The only way to restore the ability to debug is to connect to STM32CubeProgrammer and mass erase the flash. Steps 5-8 can then be repeated, and the same thing happens.
I have tried both JLink + JLink server and STLink + OpenOCD (ST fork); the same thing occurs with both.
Interestingly, if after step 7 I simply reset the device and let it run instead of programming and debugging again, I get a NuttShell prompt and it seems to run properly.
Also, if I kill the debug session when it reaches the beginning of __start() in step 8, without continuing, and then reset the device and let it run without the debugger, I get a NuttShell prompt and it seems to run properly. However, as soon as I continue debugging past the beginning of __start() on the second or any later load/debug, it breaks permanently, even when resetting without the debugger. The only way to recover is to wipe the flash memory.
Where the issue occurs
I have narrowed down the issue to occurring in this step at the beginning of __start():
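```c
/* Copy the initialized .data section from its load address in flash
 * (_eronly) to its run address in SRAM (_sdata up to _edata).
 */
for (src = (const uint32_t *)_eronly,
     dest = (uint32_t *)_sdata; dest < (uint32_t *)_edata;
    )
  {
    *dest++ = *src++;
  }
```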
With a breakpoint on stm32_clockconfig() (which runs immediately after the for loop in question), the breakpoint hits during the first debug session but never during the second.
Note: I went back quite a way (all the way to 9078ffa4), to before this block was moved to the top of __start. All code before it executed successfully, including stm32_clockconfig(), but execution failed immediately after this block, just as it does now.
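To make the loop's dependencies explicit, here is a minimal host-side sketch of the same copy. It assumes nothing about the H5 beyond what the loop above shows: the names fake_flash_image, fake_sram_data, copy_data_section and data_copy_bytes are hypothetical, and plain arrays stand in for flash and SRAM. The point it illustrates is that the loop reads exactly (_edata - _sdata) bytes starting at _eronly, so every flash address it touches should fall inside that window.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the linker-script symbols: in the real
 * startup code, _eronly is the load address of .data in flash and
 * _sdata/_edata bound .data in SRAM. Here plain arrays play both roles. */
const uint32_t fake_flash_image[4] =
{
  0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u
};

uint32_t fake_sram_data[4];

/* Same loop shape as __start(): step dest from sdata to edata, copying
 * one word at a time from the flash image starting at eronly. */
void copy_data_section(const uint32_t *eronly, uint32_t *sdata,
                       uint32_t *edata)
{
  const uint32_t *src;
  uint32_t *dest;

  for (src = eronly, dest = sdata; dest < edata; )
    {
      *dest++ = *src++;
    }
}

/* Number of flash bytes the loop reads: exactly the size of .data in RAM.
 * An inverted range (end below start) points to a bad map or linker script. */
size_t data_copy_bytes(const uint32_t *sdata, const uint32_t *edata)
{
  assert(edata >= sdata);
  return (size_t)(edata - sdata) * sizeof(uint32_t);
}
```

Calling copy_data_section(fake_flash_image, fake_sram_data, fake_sram_data + 4) fills all four RAM words, and data_copy_bytes reports 16 bytes read from the flash image.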
Log files
These debugger output logs are from the failing (second) attempt in the steps described above: gdb-server.log, dbg_console.log
Notes
If there are any other things that I am missing that would help diagnose this issue, please let me know and I will do my best to get those ASAP.
On which OS does this issue occur?
[OS: Linux]
What is the version of your OS?
Ubuntu 24.04
NuttX Version
master
Issue Architecture
[Arch: arm]
Issue Area
[Area: Drivers]
Host information
Output of make host_info: host_info.log
Verification
- [x] I have verified before submitting the report.
@stbenn does PR #16422 close this issue?
You can set up a PR to close an issue automatically: just use "closes #123" or "fixes #123" in the Summary.
@acassis No, this issue is separate from the flash progmem driver and unrelated to the PR. I just discovered this bug while trying to work on progmem with banks swapped.
The bug reported here occurs regardless of whether the progmem driver is used. It appears that some interaction between NuttX, SWD/JTAG, and/or the hardware breaks NuttX while debugging with the flash banks swapped.
I am also having issues debugging the Nucleo-G0B1RE (MCU: STM32G0B1RET6U). The behavior differs from the H5: the G0B1 fails to boot even once if the banks are swapped. I can dig into the logs etc. on this if that would help.
I verified that dual bank works as expected on the STM32H745ZI Nucleo board, and it behaves correctly when banks are swapped.
@raiden00pl, I know you have a lot of experience with STM32 chips on NuttX. Have you experienced anything similar to this before? All I can think of is that maybe something is wrong with the linker scripts on the H5 and G0B1, but nothing is jumping out at me.
Maybe this issue is related to the additional security features present in the Cortex-M33 (TrustZone)? I'm not familiar with the STM32H5, but in my experience with other Cortex-M33 parts, security settings can cause various errors that are difficult to track down. Are you able to verify whether any exception occurs during the crash, such as a SecureFault?
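One way to answer this from the debugger is to read the Cortex-M fault status registers right after the crash, for example with GDB's x/wx 0xE000ED28 (CFSR) and x/wx 0xE000ED2C (HFSR), and decode the values. The host-side sketch below does only the decoding; the function names are hypothetical and the bit table is a subset of the Armv8-M definitions. It does not cover the SecureFault status register (SFSR).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One named bit in a fault status register. */
struct fault_bit
{
  uint32_t mask;
  const char *name;
};

/* Subset of CFSR: MMFSR in bits 7:0, BFSR in 15:8, UFSR in 31:16. */
static const struct fault_bit g_cfsr_bits[] =
{
  { 1u << 0,  "IACCVIOL" },
  { 1u << 1,  "DACCVIOL" },
  { 1u << 7,  "MMARVALID" },
  { 1u << 8,  "IBUSERR" },
  { 1u << 9,  "PRECISERR" },
  { 1u << 10, "IMPRECISERR" },
  { 1u << 15, "BFARVALID" },
  { 1u << 16, "UNDEFINSTR" },
  { 1u << 17, "INVSTATE" },
  { 1u << 18, "INVPC" },
  { 1u << 19, "NOCP" },
  { 1u << 20, "STKOF" },
  { 1u << 24, "UNALIGNED" },
  { 1u << 25, "DIVBYZERO" },
};

/* HFSR: FORCED means a configurable-priority fault escalated to HardFault. */
static const struct fault_bit g_hfsr_bits[] =
{
  { 1u << 1,  "VECTTBL" },
  { 1u << 30, "FORCED" },
  { 1u << 31, "DEBUGEVT" },
};

/* Write the names of all recognized set bits, comma-separated, into out
 * (truncated to fit outlen). Returns how many recognized bits are set. */
static size_t decode_bits(uint32_t value, const struct fault_bit *bits,
                          size_t nbits, char *out, size_t outlen)
{
  size_t count = 0;
  size_t i;

  assert(out != NULL && outlen > 0);
  out[0] = '\0';

  for (i = 0; i < nbits; i++)
    {
      if ((value & bits[i].mask) != 0)
        {
          if (count > 0)
            {
              strncat(out, ",", outlen - strlen(out) - 1);
            }

          strncat(out, bits[i].name, outlen - strlen(out) - 1);
          count++;
        }
    }

  return count;
}

size_t decode_cfsr(uint32_t cfsr, char *out, size_t outlen)
{
  return decode_bits(cfsr, g_cfsr_bits,
                     sizeof(g_cfsr_bits) / sizeof(g_cfsr_bits[0]),
                     out, outlen);
}

size_t decode_hfsr(uint32_t hfsr, char *out, size_t outlen)
{
  return decode_bits(hfsr, g_hfsr_bits,
                     sizeof(g_hfsr_bits) / sizeof(g_hfsr_bits[0]),
                     out, outlen);
}
```

For example, a CFSR value of 0x00008200 decodes to "PRECISERR,BFARVALID", which would mean a precise bus fault with a valid address in BFAR (0xE000ED38).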
@raiden00pl Thanks for the suggestion! I have not used any of the TZ features and they are all disabled, but that doesn't mean they aren't causing an issue. I briefly looked into them and nothing jumped out as a likely cause, but I could have missed something.
For some more info that I have gathered in the last week or two:
On the H5, it appears to be a "Double Fault", judging by the GDB logs. Looking further, I see a flash double ECC error occur during the copy from flash to RAM in the start function. Strangely, the address recorded for the error is (or should be) outside the bounds of the copy.
When the fault occurs, FLASH_ECCDETR::ADDR_ECC = 0x3300 (the least significant 16 bits of the address where the first double ECC error occurred). However, the map suggests the memory copy should correspond to these ranges:
- FLASH: [0x08032f24, 0x080332a8)
- SRAM: [0x20000000, 0x20000384)
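A quick arithmetic check of these numbers, as a small C sketch. The macro and function names are hypothetical; the values are the ones reported here, and reading ADDR_ECC as the low 16 bits of the faulting address follows the description above. The flash range spans 0x384 bytes, so the SRAM destination should span the same length and end at 0x20000384.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the report: the half-open flash range read by the
 * .data copy, and the low 16 bits of the first double-ECC address
 * reported in FLASH_ECCDETR::ADDR_ECC. */
#define DATA_LMA_START 0x08032f24u
#define DATA_LMA_END   0x080332a8u /* exclusive */
#define ECC_ADDR_LO16  0x3300u

/* Bytes copied from flash; the SRAM destination should span the same. */
uint32_t copy_length(void)
{
  return DATA_LMA_END - DATA_LMA_START;
}

/* True if some address in [start, end) has low 16 bits equal to lo16.
 * Comparing only the low halfword is unambiguous only when the whole
 * range sits in one 64 KiB window, which the second assert checks. */
bool lo16_in_range(uint32_t start, uint32_t end, uint16_t lo16)
{
  uint32_t candidate;

  assert(end > start);
  assert((start >> 16) == ((end - 1u) >> 16));

  candidate = (start & 0xffff0000u) | lo16;
  return candidate >= start && candidate < end;
}
```

With the reported values, copy_length() is 0x384 (900 bytes), and the 0x3300 halfword lands 0x58 bytes past the exclusive end of the copy, so lo16_in_range() returns false. Since only the low 16 bits are recorded, this check alone cannot tell which 64 KiB window (or which bank) the ECC error was actually in.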
This leads me to think that maybe the flash isn't being erased properly before it is written during debug when the banks are swapped, or that some hardware mechanism or bug is involved.
On the G0B1RE, debug immediately fails and does not even get to __start().
Could there be something wrong with the linker scripts of these chips that would cause this behavior when the banks are swapped? I am not very familiar with linker scripts, and both boards I added the linker scripts for (H5 and G0B1RE) fail during debug when the banks are swapped. I want to believe this is a coincidence, but I would like to verify it if possible.
Hi @stbenn
I do not have relevant experience with NuttX specifically regarding your issue, but I can offer some input based on my experience.
If it also happens when you are not debugging, my input does not apply.
I see that you report this issue only when debugging.
I've run into a similar problem on a project. My conclusion was that in an embedded environment everything is linked statically (everything is placed in memory at build time, not loaded at runtime as in Linux).
I also assume that you are using the bank swap functionality for firmware updates: load an image into the other flash bank, then reset the MCU. After the MCU restarts, everything is shifted (to the new bank), but the compiled binary (I assume an ELF file) still holds the original addresses, so you end up placing breakpoints at the wrong addresses. If my assumption is correct, we either cannot debug the target after a firmware update, or we need to come up with a solution for the memory shift between banks.
Another thing I've seen is that by default the flash bank that is not in use is locked: accessing it causes a MemManage fault if an MPU region is configured over that bank, or a HardFault if the MPU is not used.
Again, these assumptions only hold if the device runs fine on the swapped bank without the debugger attached.
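To make the bank-mapping argument concrete, here is a minimal sketch of how a CPU-visible flash address maps to a physical bank with and without the swap. The layout (two 1 MiB banks starting at 0x08000000, as on a 2 MiB part like the STM32H563ZI) and the function name physical_bank are assumptions for illustration, not taken from NuttX or the reference manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for a 2 MiB dual-bank part: two 1 MiB banks, with the
 * lower half of the flash address space backed by bank 1 by default and
 * by bank 2 when the swap option bit is set. Check the reference manual
 * before relying on these values. */
#define FLASH_BASE_ADDR 0x08000000u
#define FLASH_BANK_SIZE 0x00100000u

/* Returns the physical bank (1 or 2) that backs a CPU-visible flash
 * address, or 0 if the address is outside the assumed flash region. */
int physical_bank(uint32_t addr, bool swapped)
{
  uint32_t offset;
  int bank;

  if (addr < FLASH_BASE_ADDR ||
      addr >= FLASH_BASE_ADDR + 2u * FLASH_BANK_SIZE)
    {
      return 0;
    }

  offset = addr - FLASH_BASE_ADDR;
  bank = (offset < FLASH_BANK_SIZE) ? 1 : 2;

  /* Swapping exchanges the two banks at the same CPU addresses. */
  return swapped ? 3 - bank : bank;
}
```

Under this model, an ELF address such as 0x08032f24 stays the same; what changes is which physical bank backs it (bank 1 normally, bank 2 when swapped). Whether breakpoints end up at wrong addresses therefore depends on whether the image was linked and programmed for the same mapping that is active when the debugger attaches.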