Writing to flash freezes the whole system on ESP32-S3
Bug description
When trying to write to flash through esp_storage on the ESP32S3 while both cores are active, the whole system freezes.
This is 100% reproducible with the provided example.
It seems to be some combination of having the WIFI stack and the second core running when trying to write to the flash. It seems to be a bit timing sensitive but with the provided example it usually happens within a maximum of 3 erase calls.
Commenting out lines 66 to 68 in main.rs makes all calls to erase succeed, and the system never freezes.
To Reproduce
- git clone https://github.com/florianL21/esp_storage-bug.git
- Run cargo run --release
- Observe the logs. The system freezes after a few writes to flash at the latest.
This is my log for reference:
Running `espflash flash --monitor --chip esp32s3 --flash-size=32mb --flash-freq 80mhz --flash-mode qio --partition-table=partition_table.csv target/xtensa-esp32s3-none-elf/release/test-setup`
[2025-09-02T20:31:47Z INFO ] 🚀 A new version of espflash is available: v4.0.1
[2025-09-02T20:31:47Z INFO ] Serial port: '/dev/ttyACM0'
[2025-09-02T20:31:47Z INFO ] Connecting...
[2025-09-02T20:31:48Z INFO ] Using flash stub
Chip type: esp32s3 (revision v0.1)
Crystal frequency: 40 MHz
Flash size: 32MB
Features: WiFi, BLE
MAC address: ********
Partition table: partition_table.csv
App/part. size: 439,024/2,097,152 bytes, 20.93%
[2025-09-02T20:31:48Z INFO ] Segment at address '0x0' has not changed, skipping write
[2025-09-02T20:31:48Z INFO ] Segment at address '0x8000' has not changed, skipping write
[00:00:03] [========================================] 280/280 0x10000 [2025-09-02T20:31:52Z INFO ] Flashing has completed!
Commands:
CTRL+R Reset chip
CTRL+C Exit
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x15 (USB_UART_CHIP_RESET),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x40379f6d
0x40379f6d - <u32 as core::ops::bit::BitAndAssign>::bitand_assign
at ******/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ops/bit.rs:719
SPIWP:0xee
Octal Flash Mode Enabled
For OPI Flash, Use Default Flash Boot Mode
mode:SLOW_RD, clock div:1
load:0x3fce3818,len:0x16f8
load:0x403c9700,len:0x4
load:0x403c9704,len:0xc00
load:0x403cc700,len:0x2eb0
entry 0x403c9908
I (33) boot: ESP-IDF v5.1-beta1-378-gea5e0ff298-dirt 2nd stage bootloader
I (33) boot: compile time Jun 7 2023 08:07:32
I (34) boot: Multicore bootloader
I (38) boot: chip revision: v0.1
I (42) boot.esp32s3: Boot SPI Speed : 80MHz
I (47) boot.esp32s3: SPI Mode : SLOW READ
I (52) boot.esp32s3: SPI Flash Size : 32MB
I (57) boot: Enabling RNG early entropy source...
I (63) boot: Partition Table:
I (66) boot: ## Label Usage Type ST Offset Length
I (73) boot: 0 nvs WiFi data 01 02 00009000 00006000
I (81) boot: 1 phy_init RF data 01 01 0000f000 00001000
I (88) boot: 2 factory factory app 00 00 00010000 00200000
I (96) boot: 3 otadata OTA data 01 00 00210000 00002000
I (103) boot: 4 app0 OTA app 00 10 00220000 00300000
I (111) boot: 5 app1 OTA app 00 11 00520000 00300000
I (118) boot: 6 storage Unknown data 01 83 00820000 00a00000
I (126) boot: End of partition table
I (130) boot: Defaulting to factory image
I (135) esp_image: segment 0: paddr=00010020 vaddr=3c000020 size=128c8h ( 75976) map
I (161) esp_image: segment 1: paddr=000228f0 vaddr=3fc917a0 size=0216ch ( 8556) load
I (164) esp_image: segment 2: paddr=00024a64 vaddr=40378000 size=097a0h ( 38816) load
I (178) esp_image: segment 3: paddr=0002e20c vaddr=00000000 size=01e0ch ( 7692)
I (180) esp_image: segment 4: paddr=00030020 vaddr=42020020 size=4b2ach (307884) map
I (261) boot: Loaded app from partition at offset 0x10000
I (261) boot: Disabling RNG early entropy source...
INFO - vendor id : 0d (AP)
INFO - dev id : 02 (generation 3)
INFO - density : 03 (64 Mbit)
INFO - good-die : 01 (Pass)
INFO - Latency : 01 (Fixed)
INFO - VCC : 00 (1.8V)
INFO - SRF : 01 (Fast Refresh)
INFO - BurstType : 01 (Hybrid Wrap)
INFO - BurstLen : 01 (32 Byte)
INFO - Readlatency : 02 (10 cycles@Fixed)
INFO - DriveStrength: 00 (1/1)
INFO - 8388608 bytes of PSRAM
INFO - Embassy initialized!
INFO - Core 1 spawning tasksINFO
- esp-wifi configuration EspWifiConfig { rx_queue_size: 5, tx_queue_size: 3, static_rx_buf_num: 10, dynamic_rx_buf_num: 32, static_tx_buf_num: 0, dynamic_tx_buf_num: 32, ampdu_rx_enable: true, ampdu_tx_enable: true, amsdu_tx_enable: false, rx_ba_win: 6, max_burst_size: 1, country_code: "CN", country_code_operating_class: 0, mtu: 1492, tick_rate_hz: 100, listen_interval: 3, beacon_timeout: 6, ap_beacon_timeout: 300, failure_retry_cnt: 1, scan_method: 0 }
INFO - IPv4: DOWN
WARN - esp_wifi_internal_tx 12290
INFO - System is still running
INFO - System is still running
INFO - System is still running
INFO - System is still running
WARN - !Going to write to flash. This may freeze the system!
Expected behavior
System does not freeze. Maybe the erase returns with an error if something went wrong.
Environment
- Target device: ESP32-S3-WROOM-2-N32R16V
- esp-hal 1.0.0-rc.0
- esp-storage 0.7.0
- esp-hal-embassy 0.9.0
- embassy-executor 0.7.0
You're not allowed to write to flash whilst the other core is executing from it. You have to suspend the other core first
Is this documented somewhere and I missed it? I also have an example where I park the second core before writing to flash and indeed it does not crash most of the time. But I cannot seem to make it work reliably, possibly because I have no way of knowing if the second core is already suspended or not.
Also, if this is the case why does it not freeze if I get rid of the WIFI stack on the first core? Theoretically this hasn't changed that the second core is executing from flash while the first one is writing, right?
Is this documented somewhere and I missed it?
Seems it's not yet and it should get added to the docs.
Also, if this is the case why does it not freeze if I get rid of the WIFI stack on the first core? Theoretically this hasn't changed that the second core is executing from flash while the first one is writing, right?
It's a bit more complex - in that example the 2nd core doesn't do too much and without the first core doing much, too (i.e. not running wifi) there is a chance that a lot of the code runs from cache and things might work - maybe not reliable but you can be lucky.
Adding wifi to the picture changes that (like doing anything more complex on one of the cores)
I see. That seems logical.
But this raises another question for me:
If I am supposed to park the second core how can I know that it actually stopped running?
In my project where I ran into this issue I am calling the park_core function to park the second core but it seems that the core may still keep running for a bit longer after that function returns.
I didn't find any interface in the system module that I can seemingly use to check if core 1 is running or parked.
So does this mean that I am supposed to implement some detection mechanism myself? This can ofc be done but it seems suboptimal to me.
In general I am a bit surprised to see such behavior as neither of the two APIs I am using (start_app_core or any of the esp_storage APIs) are unsafe, yet I can still cause the whole system to go into a seemingly undefined state.
Ideally we should solve this inside esp-storage. Unfortunately, that is hard to do generally - code doesn't HAVE to be running from flash, and in that case the core doesn't have to be parked... We can likely add a marker type that you, the user, would have to select - UnsafeNoPark, ParkOtherCore, or maybe some other strategy if we can come up with one.
That parking the core returns before the core actually stops running (if this really is the case), should be considered a bug.
Just to be very clear, the park_core call returning before the core is actually parked was just an assumption of mine. I could not find any statements in the documentation that detail whether this function is blocking or not. I assumed that it returns sooner as simply doing:
park_core()
flash.write()
unpark_core()
Still freezes the system in some cases
In general I am a bit surprised to see such behavior as neither of the two APIs I am using (start_app_core or any of the esp_storage APIs) are unsafe, yet I can still cause the whole system to go into a seemingly undefined state.
Strictly speaking, unsafe has to do with rust rules being broken, not necessarily the hardware or environment rules.
In safe rust, you can still have deadlocks, memory leaks (my favourite) and race conditions.
Still freezes the system in some cases
Out of curiosity does it still freeze if you add a sleep after parking?
Strictly speaking, unsafe has to do with rust rules being broken, not necessarily the hardware or environment rules.
Yes you are right ofc. But in other parts of the HAL these mechanics are used more by convention to signal to the user to be careful with using these APIs. At least this is how I interpreted the park_core function being unsafe.
Out of curiosity does it still freeze if you add a sleep after parking?
Yes it does, but only about 20% of cases. I currently have a delay of 50ms after calling park_core. I figured that this should be plenty.
Maybe I can try to modify my example code to provoke this behaviour. I will give an update here once I have something.
I tried to reproduce the part about park_core not parking the core immediately and cannot reproduce that.
I can easily get to the point where not calling it freezes everything, but when calling it immediately before calling erase (like in your example) it works 100% of the time for me. I'm curious about a reproducer showing that behavior.
Well, I just spent 2h on this and I cannot reproduce it either. You have probably already suspected it, but ofc it was something else in my code that was causing an issue and made it look as if the core did not park immediately.
So to conclude, you are absolutely correct, parking the second core before the write makes the write succeed 100% of the time.
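The park, write, unpark sequence from this conclusion can be wrapped in a small guard so that the second core is always released again, even on an early return. This is a minimal host-side sketch, not the esp-hal or esp-storage API: the `CoreControl` trait, `ParkGuard`, `with_other_core_parked` and `MockCores` are hypothetical names used only for illustration. On real hardware the two trait methods would forward to the HAL's park/unpark calls.

```rust
/// Hypothetical abstraction over "park / unpark the other core".
/// Not an esp-hal API; on hardware this would wrap the HAL's calls.
pub trait CoreControl {
    fn park_other_core(&mut self);
    fn unpark_other_core(&mut self);
}

/// Parks the other core on creation and unparks it again on drop.
pub struct ParkGuard<'a, C: CoreControl> {
    ctrl: &'a mut C,
}

impl<'a, C: CoreControl> ParkGuard<'a, C> {
    pub fn new(ctrl: &'a mut C) -> Self {
        ctrl.park_other_core();
        Self { ctrl }
    }
}

impl<C: CoreControl> Drop for ParkGuard<'_, C> {
    fn drop(&mut self) {
        self.ctrl.unpark_other_core();
    }
}

/// Runs `op` (for example a flash erase or write) while the other core is parked.
pub fn with_other_core_parked<C: CoreControl, R>(ctrl: &mut C, op: impl FnOnce() -> R) -> R {
    let _guard = ParkGuard::new(ctrl);
    op()
}

/// Mock used only to show the call order on a host machine.
#[derive(Default)]
pub struct MockCores {
    pub parked: bool,
    pub log: Vec<&'static str>,
}

impl CoreControl for MockCores {
    fn park_other_core(&mut self) {
        self.parked = true;
        self.log.push("park");
    }

    fn unpark_other_core(&mut self) {
        self.parked = false;
        self.log.push("unpark");
    }
}
```

With this shape, a call such as `with_other_core_parked(&mut cores, || flash_erase())` (where `flash_erase` stands for whatever flash operation is needed) keeps the parking logic in one place.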
Thus, the only open point remaining in this issue is that this rule of not writing to flash while the second core is running seems to be not documented.
The possibility of this being enforced by the esp_storage API would be really great as well but I guess this is not really a priority since nothing is really broken.
Thank you to all of you for your help and insights. Feel free to close this issue if you feel like it should be closed.
If you are open to it I can of course also try my hand on adding this information to the documentation of esp_storage myself and create a PR. If so, just let me know.
I guess that is good news :)
If you are open to it I can of course also try my hand on adding this information to the documentation of esp_storage myself and create a PR. If so, just let me know.
Sure, we are always open to PRs and improving documentation is always a good thing.
I have ideas what we could do in code but documenting this is a great first step
I would also be willing to take a shot at implementing something, but I am not that familiar with the conventions and goals of the esp-hal. If you were to tell me what you would envision for improvements in the code I can also implement that. Worst case it just doesn't get merged :)
I honestly haven't thought it through, but my initial idea was to check if the other core is running (i.e. not stalled) on the multicore targets and return an error in that case - and have a feature to opt-out of that (e.g. if someone runs everything on the other core from RAM)
Not sure if it's a good idea? (@MabezDev @bugadani) - So probably better to hear others' opinions before poking at it
I'd define a set of strategies we can take, and create features for those. It's not obvious what the best approach is:
- auto-park the core
- return an error
- unsafely do nothing because the user only runs code from RAM on the second core or the whole system
- mask interrupts that run from flash on the other core - this is pretty difficult, but technically possible
- ?
Because we have several options, I'd like to come up with something extensible. Then, if the target is multi-core, either pick auto-park by default, or require the user to select a strategy.
fair point - giving users a choice is always good - probably the first three options would be a good start - and if we use an esp-config enum option for it we should be able to easily add more strategies later 🤔
There's also the option of auto suspending the flash chip. https://github.com/esp-rs/esp-hal/discussions/3413#discussioncomment-12931626
I just sat down and started poking around in the esp-hal code a little and came up with this:
Ground rules
I defined the following priorities for me to start evaluating my options:
- Try to not introduce breaking API changes to the current interfaces
- Give the user a choice over how a write should be handled when we are in a multi-core system and the second core is active
- If possible make it clear to the user whether a certain chosen strategy needs special care from their end which I would intend to indicate via unsafe
I would start off by defining 3 strategies for now:
- Error: Simply return an error when attempting a flash write while the second core is active. I would make this the default to not surprise users with hidden behavior they would not expect
- AutoPark: Automatically park the second core and un-park it when the write operation is done
- Ignore: Don't check for the second core. This would be the case where I would like to make it obvious that this is unsafe to use
Multi-core strategy as an enum
I could introduce a new field in the FlashStorage struct which can hold the strategy to use.
This would default to the Error strategy when calling FlashStorage::new. The user could then change to a different strategy using some interfaces on the FlashStorage struct. The interface for changing to the Ignore strategy could be marked as unsafe.
I would then make the according checks in the FlashStorage::internal_write and FlashStorage::internal_erase functions as this seems to be the correct place to put them so that all other abstractions use them as well.
Pros:
- Clear signaling to the user of which strategy is used
- Easy to change strategies
- Easy to see which strategies need extra care (function marked as unsafe)
Cons:
- Takes some memory for storing this enum, which would probably be a runtime global
- The user could possibly switch the strategy during runtime. Either this has to be prevented somehow, or we could decide that it could even be a valid thing to do.
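To make the enum variant of the proposal more concrete, here is a rough host-side sketch. The strategy names follow the list above, but everything else is an assumption: the setter names, the `FlashError` type, and the simulated `other_core_running` / `park_count` fields exist only so the example can run without hardware, and the `FlashStorage` here is a stand-in, not the real esp-storage type.

```rust
/// How a write should behave when the other core may be running.
/// Variant names follow the proposal; the API shape is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum MultiCoreStrategy {
    /// Refuse the operation while the other core is running (default).
    #[default]
    Error,
    /// Park the other core for the duration of the operation.
    AutoPark,
    /// Do not check at all; the caller guarantees this is fine.
    Ignore,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FlashError {
    OtherCoreRunning,
}

/// Stand-in for the real `FlashStorage`; only the strategy handling is modelled.
pub struct FlashStorage {
    strategy: MultiCoreStrategy,
    /// Simulated hardware state so the sketch can run on a host.
    pub other_core_running: bool,
    /// Counts how often the sketch would have parked the other core.
    pub park_count: u32,
}

impl FlashStorage {
    pub fn new() -> Self {
        Self {
            strategy: MultiCoreStrategy::default(),
            other_core_running: false,
            park_count: 0,
        }
    }

    pub fn set_error(&mut self) {
        self.strategy = MultiCoreStrategy::Error;
    }

    pub fn set_auto_park(&mut self) {
        self.strategy = MultiCoreStrategy::AutoPark;
    }

    /// # Safety
    /// The caller must ensure the other core never executes from flash
    /// while a write or erase is in progress.
    pub unsafe fn set_ignore(&mut self) {
        self.strategy = MultiCoreStrategy::Ignore;
    }

    pub fn write(&mut self, _offset: u32, _bytes: &[u8]) -> Result<(), FlashError> {
        match self.strategy {
            MultiCoreStrategy::Error if self.other_core_running => {
                Err(FlashError::OtherCoreRunning)
            }
            MultiCoreStrategy::AutoPark if self.other_core_running => {
                // Park, write, unpark would happen here on real hardware.
                self.park_count += 1;
                Ok(())
            }
            // Other core idle, or the caller opted out of the check.
            _ => Ok(()),
        }
    }
}
```

Because the strategy is an ordinary field, switching it at runtime is just another setter call; whether that should be allowed is the open question raised in the cons.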
Multi-core strategy as a type parameter
The FlashStorage struct could take a type parameter which would define the used strategy.
Pros:
- Locks a particular instance of FlashStorage to a single strategy
- The strategy would be reflected in the type
Cons:
- It may be hard to mark the unsafe strategies as unsafe in this case
- Doing it this way is usually more boilerplate and maybe not so clear for the user as this would need some traits, and some implementers of those traits. This I usually found is a bit harder to dig up when going through documentation
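For comparison, a type-parameter version might look roughly like the sketch below. `ParkOtherCore` and `UnsafeNoPark` are the marker names floated earlier in the thread; `ErrorIfRunning`, `Action`, `plan_write` and the constructor names are hypothetical. One possible answer to the "hard to mark as unsafe" concern, shown here as an assumption rather than a recommendation, is to make only the constructor of the unchecked variant `unsafe`.

```rust
use std::marker::PhantomData;

/// What the storage would do when the other core is found running.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Fail,
    ParkThenWrite,
    WriteAnyway,
}

/// Hypothetical strategy trait; one zero-sized marker type per strategy.
pub trait MultiCoreStrategy {
    fn on_other_core_running() -> Action;
}

pub struct ErrorIfRunning;
pub struct ParkOtherCore;
pub struct UnsafeNoPark;

impl MultiCoreStrategy for ErrorIfRunning {
    fn on_other_core_running() -> Action {
        Action::Fail
    }
}

impl MultiCoreStrategy for ParkOtherCore {
    fn on_other_core_running() -> Action {
        Action::ParkThenWrite
    }
}

impl MultiCoreStrategy for UnsafeNoPark {
    fn on_other_core_running() -> Action {
        Action::WriteAnyway
    }
}

/// Stand-in for the real storage type; the strategy lives in the type only.
pub struct FlashStorage<S: MultiCoreStrategy = ErrorIfRunning> {
    _strategy: PhantomData<S>,
}

impl FlashStorage<ErrorIfRunning> {
    pub fn new() -> Self {
        Self { _strategy: PhantomData }
    }
}

impl FlashStorage<ParkOtherCore> {
    pub fn new_auto_park() -> Self {
        Self { _strategy: PhantomData }
    }
}

impl FlashStorage<UnsafeNoPark> {
    /// # Safety
    /// The caller must ensure the other core never executes from flash
    /// while a write or erase is in progress.
    pub unsafe fn new_no_park() -> Self {
        Self { _strategy: PhantomData }
    }
}

impl<S: MultiCoreStrategy> FlashStorage<S> {
    /// Decides how a write would proceed, given the other core's state.
    pub fn plan_write(&self, other_core_running: bool) -> Action {
        if other_core_running {
            S::on_other_core_running()
        } else {
            Action::WriteAnyway
        }
    }
}
```

The marker types take no memory, and an instance cannot change strategy after construction, which matches the pros listed above at the cost of the extra trait boilerplate.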
Multi-core strategy configuration via feature flags
I assume this is what "if we use an esp-config enum option for it" refers to.
As I said, I am not too familiar with the inner workings of the esp-hal. If this is a proven choice that has worked well in the past then why not.
Pros:
- No memory needed to store the strategy
- No additional interfaces for switching between strategies, which could be simpler for the user
Cons:
- Harder to convey if a used strategy is unsafe
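A build-time selection could be sketched as below. This uses plain Cargo features with made-up names (`multi-core-auto-park`, `multi-core-ignore`) purely to illustrate the idea; the actual esp-config mechanism mentioned in the thread works differently and is not shown here. With neither feature enabled, the sketch falls back to the `Error` strategy.

```rust
/// Strategy chosen at build time. The feature names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Error,
    AutoPark,
    Ignore,
}

/// Resolved at compile time, so no RAM is needed to store the choice.
pub const fn selected_strategy() -> Strategy {
    if cfg!(feature = "multi-core-auto-park") {
        Strategy::AutoPark
    } else if cfg!(feature = "multi-core-ignore") {
        Strategy::Ignore
    } else {
        // Default when no strategy feature is enabled.
        Strategy::Error
    }
}

/// Constant that the write and erase paths could branch on.
pub const STRATEGY: Strategy = selected_strategy();
```

Nothing in the call site signals that `Ignore` needs extra care, which illustrates the con listed above.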
Open questions
- In any case I think I need to have access to a CpuControl to detect (maybe for this I will even have to add functions to it) and potentially park cores, correct? I assume simply writing/reading to/from the registers is not a good thing to do even if that would mean that I could get away without making changes to the FlashStorage::new interface.
- If I do need access to CpuControl, how should I keep track of it? Should I take ownership of CpuControl via the FlashStorage::new interface? Or should I only take a reference to it? I assume I have to somehow keep track of the CpuControl, as having the user pass it to the write function every time is not possible since most interaction with the flash seems to be intended to take place via the embedded-storage trait implementation.
- Currently the esp-hal seems to be hard-coded to support exactly 2 cores. I assume there is no real reason to make the implementation generic for N numbers of cores as of right now, correct?
- I assume it is possible that the user could write to the flash also from the second core. Is it a good idea to detect the current core and then park the other one in case of the AutoPark strategy?
I already started tinkering around a bit with the code but I am a bit unsure which avenues to explore further and which options I can abandon right away.
I hope I am making at least some sense and would appreciate some feedback/answers/guidance :)
Thanks for looking into this!
I think I like the idea of having it changeable at runtime - I don't see a problem if the user can change the strategy after creating FlashStorage.
Currently esp-storage doesn't depend on esp-hal and ideally, we shouldn't add it as a dependency. Which would mean to duplicate some code - unfortunately parking a core is writing to two registers which means it's not an atomic operation. Duplicating code is certainly not great but we already have this in esp-backtrace, too IIRC.
Currently the esp-hal seems to be hard-coded to support exactly 2 cores. I assume there is no real reason to make the implementation generic for N numbers of cores as of right now, correct?
The public API should deal with the Cpu enum so it's not really limited to two cores. Currently only ESP32 and ESP32-S3 are dual-core. There are upcoming chips - none of them with more than two cores.
I assume it is possible that the user could write to the flash also from the second core. Is it a good idea to detect the current core and then park the other one in case of the AutoPark strategy?
Yes - that's true.
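Picking the core to park based on the core that performs the write is a small mapping. The sketch below uses a local `Cpu` enum as a stand-in for the HAL's type, and `core_to_park` is a hypothetical helper name; on hardware the `current` argument would come from whatever the HAL offers for querying the executing core.

```rust
/// Local stand-in for the HAL's CPU identifier on a dual-core chip.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Cpu {
    ProCpu,
    AppCpu,
}

/// Returns the core that has to be parked when `current` performs the flash write.
pub const fn core_to_park(current: Cpu) -> Cpu {
    match current {
        Cpu::ProCpu => Cpu::AppCpu,
        Cpu::AppCpu => Cpu::ProCpu,
    }
}
```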
It might also be worthwhile to get XIP from PSRAM working, #3024
I just made a first draft of what we discussed. It currently lives here: https://github.com/esp-rs/esp-hal/compare/main...florianL21:esp-hal:esp-storage-implement-multi-core-strategies
I ran into a couple of issues mostly related to me not knowing the esp-hal internals very well:
- I don't know how I am supposed to access the registers for checking if a core is active or to park a core without creating a dependency on esp-hal
- How does the #[cfg(multi_core)] work? I see it used in other crates of esp-hal but I cannot find where this is defined for other crates and it does not seem to "just work"
Currently e.g. for esp-backtrace we duplicate functionality: https://github.com/esp-rs/esp-hal/blob/779228ef287e11d7545d50a37e343402123c98a7/esp-backtrace/src/lib.rs#L137-L166 - that's less than ideal of course
You should get the multi_core cfg by adding a dependency on esp-metadata-generated ( https://github.com/esp-rs/esp-hal/blob/779228ef287e11d7545d50a37e343402123c98a7/esp-backtrace/Cargo.toml#L25 ) and using it in build.rs: https://github.com/esp-rs/esp-hal/blob/779228ef287e11d7545d50a37e343402123c98a7/esp-radio/build.rs#L29-L33
I just opened a draft PR and would appreciate a preliminary review before I start writing some more extensive documentation.
I tested it locally and on my esp32s3 everything is now working as expected. However, I had to more or less blindly implement the ESP32 side of things as I do not have an ESP32 lying around.
I am also unsure about the way I am detecting if the second core is active, as this seems to be a scenario which needs to be handled explicitly. If someone could have a close look at it, that would be much appreciated:
https://github.com/esp-rs/esp-hal/blob/6987019ce268ae0969ca37637ab0a09ff0b59e6b/esp-storage/src/multi_core.rs#L133-L158
I have the same use case (flash write with multi-core). Parking the other core does resolve the flash write issue, but I noticed that the ESP randomly freezes when parking, depending on what is going on on the parked core. I notably noticed that using SPI2 increases the frequency of freezes. I suppose the other core held a lock when it was parked, or something like that… For context, I'm using an embassy thread executor on the other core, as in the provided example.
=> park_core does not seem to be a generally reliable solution
=> I have been able to make park_core reliable by first asking the core that is about to be parked to block in a busy loop (this is not my case, but maybe using interrupts requires additional precautions before parking the core?).
Maybe the documentation should explain the preconditions to fulfill on the parked core for a reliable park, but I do not really know what they are!
For the record, my understanding of how ESP-IDF solves flash issues:
- protect flash operation with spi_flash_disable_interrupts_caches_and_other_cpu
  - disable cache & non-iram interrupts on current core (esp-hal: done in rom code?)
  - call spi_flash_op_block_func on the other core (through esp_ipc_call)
    - disable non-iram interrupts
    - busy loop waiting the end of flash operation
All these functions have IRAM_ATTR. The other core is not parked. I quickly tried to mimic the ESP-IDF behavior, but I still have to park the core for reliable flash ops for now.
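The handshake described above (the other core acknowledges, then spins until the flash operation is done) can be modelled on a host with two threads and two atomic flags. This is only a simplified illustration of the control flow, not ESP-IDF or esp-hal code: `FlashHandshake`, its fields, and `demo` are hypothetical, cache and interrupt handling are left out entirely, and `yield_now` replaces what would be a tight RAM-resident loop on the chip.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Shared flags for the handshake between the writing core and the other core.
#[derive(Default)]
pub struct FlashHandshake {
    /// Set by the writer while a flash operation is pending or in progress.
    op_requested: AtomicBool,
    /// Set by the other core while it spins in its blocking loop.
    other_core_blocked: AtomicBool,
}

impl FlashHandshake {
    /// Writer side: wait until the other core is blocked, run `op`, release it.
    pub fn run_flash_op<R>(&self, op: impl FnOnce() -> R) -> R {
        self.op_requested.store(true, Ordering::SeqCst);
        while !self.other_core_blocked.load(Ordering::SeqCst) {
            thread::yield_now();
        }
        let result = op();
        self.op_requested.store(false, Ordering::SeqCst);
        // Wait until the other core has left its blocking loop.
        while self.other_core_blocked.load(Ordering::SeqCst) {
            thread::yield_now();
        }
        result
    }

    /// Other core side: if an operation is requested, acknowledge and spin
    /// until the writer signals completion.
    pub fn poll_on_other_core(&self) {
        if self.op_requested.load(Ordering::SeqCst) {
            self.other_core_blocked.store(true, Ordering::SeqCst);
            while self.op_requested.load(Ordering::SeqCst) {
                thread::yield_now();
            }
            self.other_core_blocked.store(false, Ordering::SeqCst);
        }
    }
}

/// Host demo: returns whether the "other core" was blocked during the operation.
pub fn demo() -> bool {
    let hs = Arc::new(FlashHandshake::default());
    let stop = Arc::new(AtomicBool::new(false));

    let other = {
        let hs = Arc::clone(&hs);
        let stop = Arc::clone(&stop);
        thread::spawn(move || {
            while !stop.load(Ordering::SeqCst) {
                hs.poll_on_other_core();
                thread::yield_now();
            }
        })
    };

    // The closure stands in for the actual flash write or erase.
    let was_blocked = hs.run_flash_op(|| hs.other_core_blocked.load(Ordering::SeqCst));

    stop.store(true, Ordering::SeqCst);
    other.join().unwrap();
    was_blocked
}
```

The writer only proceeds once the other side has acknowledged, so the operation never overlaps with normal execution on the second thread.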
@bouttier my PR with the changes was merged yesterday, but in the PR there was a discussion about the possibility that the whole auto-park, flash unlock, flash write, un-park procedure should be placed in a critical section. You could try if that fixes your issue I guess
I have updated my code from 1.0.0-rc0 to fb8aa314f68f5b8666260d4d03809fe04b30b873 and tested your auto-park feature, which is working well. I confirm I still have freezes! I get rid of these freezes with the same strategy as before: a dedicated task which blocks the second core before the parking.
I confirm I still have freezes!
If we could find a reproducer that would be awesome
@bouttier my PR with the changes was merged yesterday, but in the PR there was a discussion about the possibility that the whole auto-park, flash unlock, flash write, un-park procedure should be placed in a critical section. You could try if that fixes your issue I guess
I misread your comment: I thought the last version of your PR added a critical section, but I re-read it and realized that this is not the case. I added critical sections in my code and I confirm it solves all freeze issues.
Why not include a critical section directly alongside the auto-park feature?
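The ordering being discussed (critical section around the whole park, write, unpark sequence) can be written down as follows. This is a sketch under assumptions: the `Platform` trait, `guarded_write`, `WriteFailed` and `Recorder` are hypothetical and only record the order of steps on a host; on hardware the hooks would map to a real critical section implementation and the HAL's park/unpark calls.

```rust
/// Hypothetical hooks for the steps involved in a guarded flash write.
pub trait Platform {
    fn enter_critical(&mut self);
    fn exit_critical(&mut self);
    fn park_other_core(&mut self);
    fn unpark_other_core(&mut self);
    fn flash_write(&mut self, offset: u32, data: &[u8]) -> Result<(), WriteFailed>;
}

#[derive(Debug, PartialEq, Eq)]
pub struct WriteFailed;

/// Critical section first, then park, write, unpark, and leave the
/// critical section last, regardless of the write result.
pub fn guarded_write<P: Platform>(p: &mut P, offset: u32, data: &[u8]) -> Result<(), WriteFailed> {
    p.enter_critical();
    p.park_other_core();
    let result = p.flash_write(offset, data);
    p.unpark_other_core();
    p.exit_critical();
    result
}

/// Host-side recorder that logs the order of the steps.
#[derive(Default)]
pub struct Recorder {
    pub events: Vec<&'static str>,
    pub fail_write: bool,
}

impl Platform for Recorder {
    fn enter_critical(&mut self) {
        self.events.push("enter_critical");
    }

    fn exit_critical(&mut self) {
        self.events.push("exit_critical");
    }

    fn park_other_core(&mut self) {
        self.events.push("park");
    }

    fn unpark_other_core(&mut self) {
        self.events.push("unpark");
    }

    fn flash_write(&mut self, _offset: u32, _data: &[u8]) -> Result<(), WriteFailed> {
        self.events.push("write");
        if self.fail_write {
            Err(WriteFailed)
        } else {
            Ok(())
        }
    }
}
```

The point of the ordering is that nothing on the writing core can interrupt the sequence between parking and unparking; whether that is the actual cause of the reported freezes is not established in this thread.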
I have not been able to make a small and simple reproducer for the freezes… I had freeze issues only when doing an OTA update: writing firmware received over Wi-Fi to the flash.