esp-hal

Writing to flash freezes the whole system on ESP32-S3

Open florianL21 opened this issue 3 months ago • 28 comments

Bug description

When trying to write to flash through esp_storage on the ESP32S3 while both cores are active, the whole system freezes.

This is 100% reproducible with the provided example.

It seems to be some combination of having the WIFI stack and the second core running when trying to write to the flash. It seems to be a bit timing sensitive but with the provided example it usually happens within a maximum of 3 erase calls.

Commenting out lines 66 to 68 in main.rs makes all calls to erase succeed, and the system never freezes.

To Reproduce

  1. git clone https://github.com/florianL21/esp_storage-bug.git
  2. Run cargo run --release
  3. Observe the logs: the system freezes after a few writes to flash at the latest.

This is my log for reference:

     Running `espflash flash --monitor --chip esp32s3 --flash-size=32mb --flash-freq 80mhz --flash-mode qio --partition-table=partition_table.csv target/xtensa-esp32s3-none-elf/release/test-setup`
[2025-09-02T20:31:47Z INFO ] 🚀 A new version of espflash is available: v4.0.1
[2025-09-02T20:31:47Z INFO ] Serial port: '/dev/ttyACM0'
[2025-09-02T20:31:47Z INFO ] Connecting...
[2025-09-02T20:31:48Z INFO ] Using flash stub
Chip type:         esp32s3 (revision v0.1)
Crystal frequency: 40 MHz
Flash size:        32MB
Features:          WiFi, BLE
MAC address:       ********
Partition table:   partition_table.csv
App/part. size:    439,024/2,097,152 bytes, 20.93%
[2025-09-02T20:31:48Z INFO ] Segment at address '0x0' has not changed, skipping write
[2025-09-02T20:31:48Z INFO ] Segment at address '0x8000' has not changed, skipping write
[00:00:03] [========================================]     280/280     0x10000
[2025-09-02T20:31:52Z INFO ] Flashing has completed!
Commands:
    CTRL+R    Reset chip
    CTRL+C    Exit

ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x15 (USB_UART_CHIP_RESET),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x40379f6d
0x40379f6d - <u32 as core::ops::bit::BitAndAssign>::bitand_assign
    at ******/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ops/bit.rs:719
SPIWP:0xee
Octal Flash Mode Enabled
For OPI Flash, Use Default Flash Boot Mode
mode:SLOW_RD, clock div:1
load:0x3fce3818,len:0x16f8
load:0x403c9700,len:0x4
load:0x403c9704,len:0xc00
load:0x403cc700,len:0x2eb0
entry 0x403c9908
I (33) boot: ESP-IDF v5.1-beta1-378-gea5e0ff298-dirt 2nd stage bootloader
I (33) boot: compile time Jun  7 2023 08:07:32
I (34) boot: Multicore bootloader
I (38) boot: chip revision: v0.1
I (42) boot.esp32s3: Boot SPI Speed : 80MHz
I (47) boot.esp32s3: SPI Mode       : SLOW READ
I (52) boot.esp32s3: SPI Flash Size : 32MB
I (57) boot: Enabling RNG early entropy source...
I (63) boot: Partition Table:
I (66) boot: ## Label            Usage          Type ST Offset   Length
I (73) boot:  0 nvs              WiFi data        01 02 00009000 00006000
I (81) boot:  1 phy_init         RF data          01 01 0000f000 00001000
I (88) boot:  2 factory          factory app      00 00 00010000 00200000
I (96) boot:  3 otadata          OTA data         01 00 00210000 00002000
I (103) boot:  4 app0             OTA app          00 10 00220000 00300000
I (111) boot:  5 app1             OTA app          00 11 00520000 00300000
I (118) boot:  6 storage          Unknown data     01 83 00820000 00a00000
I (126) boot: End of partition table
I (130) boot: Defaulting to factory image
I (135) esp_image: segment 0: paddr=00010020 vaddr=3c000020 size=128c8h ( 75976) map
I (161) esp_image: segment 1: paddr=000228f0 vaddr=3fc917a0 size=0216ch (  8556) load
I (164) esp_image: segment 2: paddr=00024a64 vaddr=40378000 size=097a0h ( 38816) load
I (178) esp_image: segment 3: paddr=0002e20c vaddr=00000000 size=01e0ch (  7692) 
I (180) esp_image: segment 4: paddr=00030020 vaddr=42020020 size=4b2ach (307884) map
I (261) boot: Loaded app from partition at offset 0x10000
I (261) boot: Disabling RNG early entropy source...
INFO - vendor id    : 0d (AP)
INFO - dev id       : 02 (generation 3)
INFO - density      : 03 (64 Mbit)
INFO - good-die     : 01 (Pass)
INFO - Latency      : 01 (Fixed)
INFO - VCC          : 00 (1.8V)
INFO - SRF          : 01 (Fast Refresh)
INFO - BurstType    : 01 (Hybrid Wrap)
INFO - BurstLen     : 01 (32 Byte)
INFO - Readlatency  : 02 (10 cycles@Fixed)
INFO - DriveStrength: 00 (1/1)
INFO - 8388608 bytes of PSRAM
INFO - Embassy initialized!
INFO - Core 1 spawning tasksINFO
 - esp-wifi configuration EspWifiConfig { rx_queue_size: 5, tx_queue_size: 3, static_rx_buf_num: 10, dynamic_rx_buf_num: 32, static_tx_buf_num: 0, dynamic_tx_buf_num: 32, ampdu_rx_enable: true, ampdu_tx_enable: true, amsdu_tx_enable: false, rx_ba_win: 6, max_burst_size: 1, country_code: "CN", country_code_operating_class: 0, mtu: 1492, tick_rate_hz: 100, listen_interval: 3, beacon_timeout: 6, ap_beacon_timeout: 300, failure_retry_cnt: 1, scan_method: 0 }
INFO - IPv4: DOWN
WARN - esp_wifi_internal_tx 12290
INFO - System is still running
INFO - System is still running
INFO - System is still running
INFO - System is still running
WARN - !Going to write to flash. This may freeze the system!

Expected behavior

System does not freeze. Maybe the erase returns with an error if something went wrong.

Environment

  • Target device: ESP32-S3-WROOM-2-N32R16V
  • esp-hal 1.0.0-rc.0
  • esp-storage 0.7.0
  • esp-hal-embassy 0.9.0
  • embassy-executor 0.7.0

florianL21 avatar Sep 02 '25 20:09 florianL21

You're not allowed to write to flash whilst the other core is executing from it. You have to suspend the other core first

Dominaezzz avatar Sep 02 '25 21:09 Dominaezzz

Is this documented somewhere and I missed it? I also have an example where I park the second core before writing to flash and indeed it does not crash most of the time. But I cannot seem to make it work reliably, possibly because I have no way of knowing if the second core is already suspended or not.

florianL21 avatar Sep 02 '25 21:09 florianL21

Also, if this is the case why does it not freeze if I get rid of the WIFI stack on the first core? Theoretically this hasn't changed that the second core is executing from flash while the first one is writing, right?

florianL21 avatar Sep 02 '25 22:09 florianL21

> Is this documented somewhere and I missed it?

Seems it's not yet and it should get added to the docs.

> Also, if this is the case why does it not freeze if I get rid of the WIFI stack on the first core? Theoretically this hasn't changed that the second core is executing from flash while the first one is writing, right?

It's a bit more complex: in that example the second core doesn't do much, and if the first core isn't doing much either (i.e. not running wifi), there is a chance that most of the code runs from cache and things happen to work. Not reliably, but you can get lucky.

Adding wifi to the picture changes that (as does anything more complex running on one of the cores).

bjoernQ avatar Sep 03 '25 06:09 bjoernQ

I see. That seems logical.

But this raises another question for me:

If I am supposed to park the second core how can I know that it actually stopped running? In my project where I ran into this issue I am calling the park_core function to park the second core but it seems that the core may still keep running for a bit longer after that function returns. I didn't find any interface in the system module that I can seemingly use to check if core 1 is running or parked. So does this mean that I am supposed to implement some detection mechanism myself? This can ofc be done but it seems suboptimal to me.

In general I am a bit surprised to see such behavior as neither of the two APIs I am using (start_app_core or any of the esp_storage APIs) are unsafe, yet I can still cause the whole system to go into a seemingly undefined state.

florianL21 avatar Sep 03 '25 07:09 florianL21

Ideally we should solve this inside esp-storage. Unfortunately, that is hard to do generally - code doesn't HAVE to be running from flash, and in that case the core doesn't have to be parked... We can likely add a marker type that you, the user, would have to select - UnsafeNoPark, ParkOtherCore, or maybe some other strategy if we can come up with one.

If parking the core really does return before the core stops running, that should be considered a bug.

bugadani avatar Sep 03 '25 08:09 bugadani

Just to be very clear, the park_core call returning before the core is actually parked was just an assumption of mine. I could not find any statements in the documentation that detail whether this function is blocking or not. I assumed that it returns sooner as simply doing:

park_core()
flash.write()
unpark_core()

still freezes the system in some cases.

florianL21 avatar Sep 03 '25 08:09 florianL21
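The park / write / unpark sequence above can be sketched on the host as a small helper that always unparks the other core again, whether the flash operation succeeded or returned an error. This is only an illustration: `CoreControl`, `with_other_core_parked` and `RecordingCores` are hypothetical names invented for this sketch, and nothing here touches real esp-hal or esp-storage APIs.

```rust
/// Hypothetical stand-in for whatever controls the second core.
/// This is NOT the esp-hal API, just a trait for the sketch.
pub trait CoreControl {
    fn park_other_core(&mut self);
    fn unpark_other_core(&mut self);
}

/// Runs `op` while the other core is parked and unparks it afterwards,
/// regardless of whether `op` returned `Ok` or `Err`.
pub fn with_other_core_parked<C, T, E>(
    cores: &mut C,
    op: impl FnOnce() -> Result<T, E>,
) -> Result<T, E>
where
    C: CoreControl,
{
    cores.park_other_core();
    let result = op();
    cores.unpark_other_core();
    result
}

/// Test double that records the order of park/unpark calls.
#[derive(Default)]
pub struct RecordingCores {
    pub events: Vec<&'static str>,
}

impl CoreControl for RecordingCores {
    fn park_other_core(&mut self) {
        self.events.push("park");
    }

    fn unpark_other_core(&mut self) {
        self.events.push("unpark");
    }
}
```

On real hardware the closure would contain the erase or write call. The sketch says nothing about whether parking takes effect immediately, which is the open question in this part of the thread.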

> In general I am a bit surprised to see such behavior as neither of the two APIs I am using (start_app_core or any of the esp_storage APIs) are unsafe, yet I can still cause the whole system to go into a seemingly undefined state.

Strictly speaking, unsafe has to do with rust rules being broken, not necessarily the hardware or environment rules.

In safe rust, you can still have deadlocks, memory leaks (my favourite) and race conditions.

> Still freezes the system in some cases

Out of curiosity does it still freeze if you add a sleep after parking?

Dominaezzz avatar Sep 03 '25 08:09 Dominaezzz
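To illustrate the point about safe Rust, here is a minimal sketch in plain std Rust (nothing esp-hal specific) that leaks memory through an `Rc` reference cycle without a single `unsafe`. The `Node` type and `leak_cycle` function are made up for this example.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

/// A node that may point at another node.
pub struct Node {
    next: RefCell<Option<Rc<Node>>>,
}

/// Builds two nodes that point at each other, then drops the local handles.
/// Each node is still kept alive by the other, so neither is ever freed.
/// Returns a weak handle so the caller can observe that the node survived.
pub fn leak_cycle() -> Weak<Node> {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(Some(Rc::clone(&a))) });
    *a.next.borrow_mut() = Some(Rc::clone(&b));
    Rc::downgrade(&a)
}
```

After `leak_cycle` returns, the weak handle still reports one strong reference: the allocation is unreachable from the program but was never released.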

> Strictly speaking, unsafe has to do with rust rules being broken, not necessarily the hardware or environment rules.

Yes you are right ofc. But in other parts of the HAL these mechanics are used more by convention to signal to the user to be careful with using these APIs. At least this is how I interpreted the park_core function being unsafe.

> Out of curiosity does it still freeze if you add a sleep after parking?

Yes it does, but only in about 20% of cases. I currently have a delay of 50ms after calling park_core. I figured that this should be plenty.

Maybe I can try to modify my example code to provoke this behaviour. I will give an update here once I have something.

florianL21 avatar Sep 03 '25 09:09 florianL21

I tried to reproduce the part about park_core not parking the core immediately and cannot reproduce that.

I can easily get to the point where not calling it freezes everything, but when I call it immediately before calling erase (like in your example) it works 100% of the time for me. I'm curious about a reproducer showing that behavior.

bjoernQ avatar Sep 03 '25 14:09 bjoernQ

Well, I just spent 2h on this and I cannot reproduce it either. You probably already suspected it, but ofc it was something else in my code causing the issue, which made it look as if the core did not park immediately.

So to conclude, you are absolutely correct, parking the second core before the write makes the write succeed 100% of the time.

Thus, the only open point remaining in this issue is that this rule of not writing to flash while the second core is running does not seem to be documented.

The possibility of this being enforced by the esp_storage API would be really great as well but I guess this is not really a priority since nothing is really broken.

Thank you to all of you for your help and insights. Feel free to close this issue if you feel like it should be closed.

If you are open to it I can of course also try my hand at adding this information to the documentation of esp_storage myself and create a PR. If so, just let me know.

florianL21 avatar Sep 03 '25 19:09 florianL21

I guess that is good news :)

> If you are open to it I can of course also try my hand at adding this information to the documentation of esp_storage myself and create a PR. If so, just let me know.

Sure, we are always open to PRs and improving documentation is always a good thing.

I have ideas for what we could do in code, but documenting this is a great first step.

bjoernQ avatar Sep 04 '25 06:09 bjoernQ

I would also be willing to take a shot at implementing something, but I am not that familiar with the conventions and goals of the esp-hal. If you were to tell me what you would envision for improvements in the code I can also implement that. Worst case it just doesn't get merged :)

florianL21 avatar Sep 04 '25 07:09 florianL21

I honestly haven't thought it through, but my initial idea was to check if the other core is running (i.e. not stalled) on the multicore targets and return an error in that case - and have a feature to opt-out of that (e.g. if someone runs everything on the other core from RAM)

Not sure if it's a good idea? (@MabezDev @bugadani) - So probably better to hear others' opinions before poking at it

bjoernQ avatar Sep 04 '25 08:09 bjoernQ

I'd define a set of strategies we can take, and create features for those. It's not obvious what the best approach is:

  • auto-park the core
  • return an error
  • unsafely do nothing because the user only runs code from RAM on the second core or the whole system
  • mask interrupts that run from flash on the other core - this is pretty difficult, but technically possible
  • ?

Because we have several options, I'd like to come up with something extensible. Then, if the target is multi-core, either pick auto-park by default, or require the user to select a strategy.

bugadani avatar Sep 04 '25 08:09 bugadani

Fair point - giving users a choice is always good. The first three options would probably be a good start, and if we use an esp-config enum option for it we should be able to easily add more strategies later 🤔

bjoernQ avatar Sep 04 '25 08:09 bjoernQ

There's also the option of auto suspending the flash chip. https://github.com/esp-rs/esp-hal/discussions/3413#discussioncomment-12931626

Dominaezzz avatar Sep 04 '25 08:09 Dominaezzz

I just sat down and started poking around in the esp-hal code a little and came up with this:

Ground rules

I defined the following priorities for me to start evaluating my options:

  1. Try to not introduce breaking API changes to the current interfaces
  2. Give the user a choice over how a write should be handled when we are in a multi-core system and the second core is active
  3. If possible, make it clear to the user whether a chosen strategy needs special care on their end, which I intend to indicate via unsafe

I would start off by defining 3 strategies for now:

  • Error: Simply return an error when attempting a flash write while the second core is active. I would make this the default to not surprise users with hidden behavior they would not expect
  • AutoPark: Automatically park the second core and un-park it when the write operation is done
  • Ignore: Don't check for the second core. This would be the case where I would like to make it obvious that this is unsafe to use

Multi-core strategy as an enum

I could introduce a new field in the FlashStorage struct which can hold the strategy to use. This would default to the Error strategy when calling FlashStorage::new. The user could then change to a different strategy using some interfaces on the FlashStorage struct. The interface for changing to the Ignore strategy could be marked as unsafe. I would then make the according checks in the FlashStorage::internal_write and FlashStorage::internal_erase functions as this seems to be the correct place to put them so that all other abstractions use them as well.

Pros:

  • Clear signaling to the user of which strategy is used
  • Easy to change strategies
  • Easy to see which strategies need extra care (function marked as unsafe)

Cons:

  • Takes some memory for storing this enum, probably as a runtime global
  • The user could possibly switch the strategy during runtime. Either this has to be prevented somehow, or we could decide that it could even be a valid thing to do.
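
A minimal host-side sketch of the enum option, to make the trade-offs above concrete. It is not the code from the PR: `MockFlash`, `FlashError`, the method names and the `other_core_running` flag (which stands in for a hardware status check) are all assumptions made for this illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MultiCoreStrategy {
    /// Refuse to write while the other core is running (the default).
    Error,
    /// Park the other core for the duration of the write.
    AutoPark,
    /// Do not check the other core at all.
    Ignore,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FlashError {
    OtherCoreRunning,
}

/// Mock storage. `other_core_running` replaces a real hardware check and
/// `park_count` counts how often the other core would have been parked.
pub struct MockFlash {
    strategy: MultiCoreStrategy,
    pub other_core_running: bool,
    pub park_count: u32,
}

impl MockFlash {
    pub fn new(other_core_running: bool) -> Self {
        Self {
            strategy: MultiCoreStrategy::Error,
            other_core_running,
            park_count: 0,
        }
    }

    pub fn strategy(&self) -> MultiCoreStrategy {
        self.strategy
    }

    pub fn use_auto_park(&mut self) {
        self.strategy = MultiCoreStrategy::AutoPark;
    }

    /// # Safety
    /// The caller promises that the other core never executes from flash
    /// while a write or erase is in progress.
    pub unsafe fn use_ignore(&mut self) {
        self.strategy = MultiCoreStrategy::Ignore;
    }

    pub fn write(&mut self, _offset: u32, _data: &[u8]) -> Result<(), FlashError> {
        match self.strategy {
            MultiCoreStrategy::Error if self.other_core_running => {
                Err(FlashError::OtherCoreRunning)
            }
            MultiCoreStrategy::AutoPark if self.other_core_running => {
                // Real code would park, write and unpark here.
                self.park_count += 1;
                Ok(())
            }
            // Other core idle, or the user opted out of the check.
            _ => Ok(()),
        }
    }
}
```

Only the setter for the opt-out strategy is `unsafe`, which matches the "easy to see which strategies need extra care" point, and the strategy can be changed at any time, which is the second con listed above.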

Multi-core strategy as a type parameter

The FlashStorage struct could take a type parameter which would define the used strategy.

Pros:

  • Locks a particular instance of FlashStorage to a single strategy
  • The strategy would be reflected in the type

Cons:

  • It may be hard to mark the unsafe strategies as unsafe in this case
  • Doing it this way usually means more boilerplate and may be less clear for the user, as it needs some traits and some implementers of those traits. I usually find this a bit harder to dig up when going through documentation
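
For comparison, a minimal sketch of the type-parameter option. Again this is an illustration and not the PR's code: `Strategy`, `Plan`, `ErrorIfRunning`, `AutoPark` and `TypedFlash` are hypothetical names, and the boolean passed to `new` stands in for a hardware check.

```rust
use core::marker::PhantomData;

#[derive(Debug, PartialEq, Eq)]
pub struct OtherCoreRunning;

/// What the storage should do for a write.
#[derive(Debug, PartialEq, Eq)]
pub enum Plan {
    WriteDirectly,
    ParkThenWrite,
}

/// A strategy is a type; the decision is made at compile time per instance.
pub trait Strategy {
    fn plan(other_core_running: bool) -> Result<Plan, OtherCoreRunning>;
}

pub struct ErrorIfRunning;
pub struct AutoPark;

impl Strategy for ErrorIfRunning {
    fn plan(other_core_running: bool) -> Result<Plan, OtherCoreRunning> {
        if other_core_running {
            Err(OtherCoreRunning)
        } else {
            Ok(Plan::WriteDirectly)
        }
    }
}

impl Strategy for AutoPark {
    fn plan(other_core_running: bool) -> Result<Plan, OtherCoreRunning> {
        Ok(if other_core_running {
            Plan::ParkThenWrite
        } else {
            Plan::WriteDirectly
        })
    }
}

/// Storage locked to one strategy through its type parameter.
pub struct TypedFlash<S: Strategy> {
    other_core_running: bool,
    _strategy: PhantomData<S>,
}

impl<S: Strategy> TypedFlash<S> {
    pub fn new(other_core_running: bool) -> Self {
        Self { other_core_running, _strategy: PhantomData }
    }

    pub fn plan_write(&self) -> Result<Plan, OtherCoreRunning> {
        S::plan(self.other_core_running)
    }
}
```

The strategy costs no memory at runtime and is visible in the type (`TypedFlash<AutoPark>`), but the trait and its implementers are the extra boilerplate mentioned in the cons.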

Multi-core strategy configuration via feature flags

I assume this is what "if we use an esp-config enum option for it" refers to.

As I said, I am not too familiar with the inner workings of the esp-hal. If this is a proven choice that has worked well in the past then why not.

Pros:

  • No memory needed to store the strategy
  • No additional interfaces for switching between strategies, which could be simpler for the user

Cons:

  • Harder to convey if a used strategy is unsafe

Open questions

  1. In any case I think I need to have access to a CpuControl to detect (maybe for this I will even have to add functions to it) and potentially park cores, correct? I assume simply writing/reading to/from the registers is not a good thing to do even if that would mean that I could get away without making changes to the FlashStorage::new interface.
  2. If I do need access to CpuControl, how should I keep track of it? Should I take ownership of CpuControl via the FlashStorage::new interface? Or should I only take a reference to it? I assume I have to somehow keep track of the CpuControl as having the user pass it to the write function every time is not possible since most interaction with the flash seems to be intended to take place via the embedded-storage trait implementation.
  3. Currently the esp-hal seems to be hard-coded to support exactly 2 cores. I assume there is no real reason to make the implementation generic for N numbers of cores as of right now, correct?
  4. I assume it is possible that the user could write to the flash also from the second core. Is it a good idea to detect the current core and then park the other one in case of the AutoPark strategy?

I already started tinkering around a bit with the code but I am a bit unsure which avenues to explore further and which options I can abandon right away.

I hope I am making at least some sense and would appreciate some feedback/answers/guidance :)

florianL21 avatar Sep 04 '25 19:09 florianL21

Thanks for looking into this!

I think I like the idea of having it changeable at runtime - I don't see a problem if the user can change the strategy after creating FlashStorage.

Currently esp-storage doesn't depend on esp-hal and ideally we shouldn't add it as a dependency, which would mean duplicating some code. Unfortunately, parking a core means writing to two registers, so it's not an atomic operation. Duplicating code is certainly not great, but we already do this in esp-backtrace too, IIRC.

> Currently the esp-hal seems to be hard-coded to support exactly 2 cores. I assume there is no real reason to make the implementation generic for N numbers of cores as of right now, correct?

The public API should deal with the Cpu enum so it's not really limited to two cores. Currently only ESP32 and ESP32-S3 are dual-core. There are upcoming chips - none of them with more than two cores.

> I assume it is possible that the user could write to the flash also from the second core. Is it a good idea to detect the current core and then park the other one in case of the AutoPark strategy?

Yes - that's true.

bjoernQ avatar Sep 05 '25 09:09 bjoernQ
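The "detect the current core, park the other one" idea from the last question boils down to a small mapping. The sketch below uses a local `Cpu` enum as a stand-in; its variant names are an assumption and this is not the esp-hal type itself.

```rust
/// Local stand-in for a two-core CPU identifier (not the esp-hal enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Cpu {
    ProCpu,
    AppCpu,
}

impl Cpu {
    /// The core that has to be parked when `self` performs the flash write.
    pub fn other(self) -> Cpu {
        match self {
            Cpu::ProCpu => Cpu::AppCpu,
            Cpu::AppCpu => Cpu::ProCpu,
        }
    }
}
```

With this in place an auto-park implementation would not need to care which core issued the write: it parks `current.other()` in both cases.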

It might also be worthwhile to get XIP from PSRAM working, #3024

ProfFan avatar Sep 06 '25 20:09 ProfFan

I just made a first draft of what we discussed. It currently lives here: https://github.com/esp-rs/esp-hal/compare/main...florianL21:esp-hal:esp-storage-implement-multi-core-strategies

I ran into a couple of issues mostly related to me not knowing the esp-hal internals very well:

  1. I don't know how I am supposed to access the registers for checking if a core is active or to park a core without creating a dependency on esp-hal
  2. How does the #[cfg(multi_core)] work? I see it used in other crates of esp-hal but I cannot find where this is defined for other crates and it does not seem to "just work"

florianL21 avatar Sep 07 '25 19:09 florianL21

Currently e.g. for esp-backtrace we duplicate functionality: https://github.com/esp-rs/esp-hal/blob/779228ef287e11d7545d50a37e343402123c98a7/esp-backtrace/src/lib.rs#L137-L166 - that's less than ideal of course

You should get the multi_core cfg by adding a dependency on esp-metadata-generated ( https://github.com/esp-rs/esp-hal/blob/779228ef287e11d7545d50a37e343402123c98a7/esp-backtrace/Cargo.toml#L25 ) and using it in build.rs: https://github.com/esp-rs/esp-hal/blob/779228ef287e11d7545d50a37e343402123c98a7/esp-radio/build.rs#L29-L33

bjoernQ avatar Sep 08 '25 13:09 bjoernQ

I just opened a draft PR and would appreciate a preliminary review before I start writing some more extensive documentation.

I tested it locally and on my esp32s3 everything is now working as expected. However, I had to more or less blindly implement the ESP32 side of things as I do not have an ESP32 lying around.

I am also unsure about the way I am detecting if the second core is active, as this seems to be a scenario which needs to be handled explicitly. If someone could have a close look at it, that would be much appreciated:

https://github.com/esp-rs/esp-hal/blob/6987019ce268ae0969ca37637ab0a09ff0b59e6b/esp-storage/src/multi_core.rs#L133-L158

florianL21 avatar Sep 08 '25 19:09 florianL21

I have the same use case (flash write with multi-core). Parking the other core does resolve the flash write issue, but I noticed that the ESP randomly freezes when parking, depending on what’s going on on the parked core. Notably, using SPI2 increases the frequency of freezes. I suppose the other core holds a lock when it is parked, or something like that… For context, I’m using an embassy thread executor on the other core, as in the provided example.

=> park_core does not seem to be a generally reliable solution

=> I have been able to make park_core reliable by first asking the core that is about to be parked to block in a busy loop (this is not my case, but maybe using interrupts requires additional precautions before parking the core?).

Maybe the documentation should explain the preconditions to fulfill on the parked core for a reliable park, but I do not really know what they are!

For the record, my understanding of how ESP-IDF solves flash issues:

All these functions have IRAM_ATTR. The other core is not parked. I quickly tried to mimic the ESP-IDF behavior, but I still have to park the core for reliable flash ops for now.

bouttier avatar Sep 11 '25 13:09 bouttier
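The "ask the other core to block in a busy loop first" workaround was described without code, so here is one way such a handshake could look, sketched with std threads standing in for the two cores. Everything in it (`Quiesce`, `poll`, `while_quiesced`, `demo`) is hypothetical and not taken from any project in this thread. On the real chip the spinning code would also have to run from RAM, and the actual park and flash write would go where the closure is.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

#[derive(Default)]
pub struct Quiesce {
    request: AtomicBool,
    acked: AtomicBool,
}

impl Quiesce {
    /// Called regularly by the second core's task. If a request is pending,
    /// acknowledge it and spin until the writer releases us.
    pub fn poll(&self) {
        if self.request.load(Ordering::Acquire) {
            self.acked.store(true, Ordering::Release);
            while self.request.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            self.acked.store(false, Ordering::Release);
        }
    }

    /// Called by the writer: wait until the other side is spinning, run `op`,
    /// release the other side and wait until it has left the busy loop.
    pub fn while_quiesced<T>(&self, op: impl FnOnce() -> T) -> T {
        self.request.store(true, Ordering::Release);
        while !self.acked.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        let out = op();
        self.request.store(false, Ordering::Release);
        while self.acked.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        out
    }
}

/// Returns the worker's progress counter sampled at the start and at the end
/// of the quiesced section.
pub fn demo() -> (u64, u64) {
    let q = Arc::new(Quiesce::default());
    let counter = Arc::new(AtomicU64::new(0));
    let stop = Arc::new(AtomicBool::new(false));

    let worker = {
        let (q, counter, stop) = (Arc::clone(&q), Arc::clone(&counter), Arc::clone(&stop));
        thread::spawn(move || {
            while !stop.load(Ordering::Acquire) {
                q.poll();
                counter.fetch_add(1, Ordering::Release);
            }
        })
    };

    let observed = q.while_quiesced(|| {
        let before = counter.load(Ordering::Acquire);
        thread::sleep(Duration::from_millis(10)); // stands in for park + write
        (before, counter.load(Ordering::Acquire))
    });

    stop.store(true, Ordering::Release);
    worker.join().unwrap();
    observed
}
```

The two values returned by `demo` are equal because the worker makes no progress between acknowledging the request and being released. That means it cannot be holding a lock or be in the middle of a peripheral transaction when it gets parked.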

@bouttier my PR with the changes was merged yesterday, but in the PR there was a discussion about the possibility that the whole auto-park, flash unlock, flash write, un-park procedure should be placed in a critical section. You could try if that fixes your issue I guess

florianL21 avatar Sep 18 '25 19:09 florianL21

I have updated my code from 1.0.0-rc0 to fb8aa314f68f5b8666260d4d03809fe04b30b873 and tested your auto-park feature, which is working well. I confirm I still have freezes! I get rid of these freezes with the same strategy as before: a dedicated task which blocks the second core before parking.

bouttier avatar Oct 06 '25 21:10 bouttier

> I confirm I still have freezes!

If we could find a reproducer that would be awesome

bjoernQ avatar Oct 07 '25 06:10 bjoernQ

> @bouttier my PR with the changes was merged yesterday, but in the PR there was a discussion about the possibility that the whole auto-park, flash unlock, flash write, un-park procedure should be placed in a critical section. You could try if that fixes your issue I guess

I misread your comment: I thought the last version of your PR added a critical section, but I re-read it and realized that this is not the case. I added critical sections in my code and I confirm it solves all freeze issues.

Why not include a critical section directly alongside the auto-park feature?

I have not been able to make a small and simple reproducer for the freezes… I had freeze issues only when doing an OTA update: writing firmware received over wifi to the flash.

bouttier avatar Oct 14 '25 16:10 bouttier
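
The ordering discussed in the last few comments (critical section around auto-park, flash write, un-park) can be sketched as below. This is a host-side illustration only: the `Platform` trait and its method names are hypothetical hooks, not esp-hal or esp-storage APIs, and `Trace` just records the call order.

```rust
/// Hypothetical platform hooks. On real hardware these would map to
/// entering/leaving a critical section and parking/unparking the other core.
pub trait Platform {
    fn enter_critical(&mut self);
    fn exit_critical(&mut self);
    fn park_other_core(&mut self);
    fn unpark_other_core(&mut self);
}

/// Runs `op` with the other core parked, all inside one critical section,
/// so nothing on the current core can preempt the sequence.
pub fn guarded_flash_op<P: Platform, T>(platform: &mut P, op: impl FnOnce(&mut P) -> T) -> T {
    platform.enter_critical();
    platform.park_other_core();
    let out = op(platform);
    platform.unpark_other_core();
    platform.exit_critical();
    out
}

/// Test double that records the order of calls.
#[derive(Default)]
pub struct Trace(pub Vec<&'static str>);

impl Platform for Trace {
    fn enter_critical(&mut self) {
        self.0.push("enter_critical");
    }

    fn exit_critical(&mut self) {
        self.0.push("exit_critical");
    }

    fn park_other_core(&mut self) {
        self.0.push("park");
    }

    fn unpark_other_core(&mut self) {
        self.0.push("unpark");
    }
}
```

The point of the nesting is that park and unpark both sit inside the critical section, so an interrupt on the writing core cannot run between parking the other core and finishing the flash operation. Whether this is the right default for the auto-park feature is the question left open above.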