Arduino
PoC for handling Erase WiFi Setting after OTA
Status: FLASH_MAP_SUPPORT has not been properly incorporated.
Issue: Sometimes, after an ESP8266 is reflashed or upgraded, the WiFi does not work. The usual recommendation is then a serial flash with Erase Flash set to include the WiFi settings. I have seen this more often when changing SDK by OTA. We don't have an erase-WiFi option for OTA.
This PR should not be merged. It presents three Proof of Concept solutions. The intent is that one of these solutions would be chosen, developed further, and incorporated in a new PR.
There are 3 cases to consider when the firmware is updated by OTA. The new firmware:
- has the same Flash Configuration as the old.
- has a larger Flash Configuration than the old.
- has a smaller Flash Configuration than the old.
In theory, after an OTA and before a restart, the flash could be erased for case 1. Case 2 is a problem because the size exceeds the size specified in flashchip. That size is used by SPIEraseSector to validate the caller's request, and the call fails when the request is too large. We have to wait for a restart, after which the SDK will have updated the values in flashchip. Case 3 is potentially unsafe because we could be erasing the code that is running.
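The three cases above can be sketched as a small classification helper. This is only an illustration that runs on a host machine; `OtaSizeCase`, `classifyOta`, and `eraseAllowedBeforeRestart` are hypothetical names, not part of the PR, the core, or the SDK.

```cpp
#include <cstdint>

// Hypothetical names for illustration only.
enum class OtaSizeCase { Same = 1, Larger = 2, Smaller = 3 };

// Compare the flash size configured in the new firmware with the old one.
inline OtaSizeCase classifyOta(uint32_t oldConfiguredSize, uint32_t newConfiguredSize) {
    if (newConfiguredSize == oldConfiguredSize) {
        return OtaSizeCase::Same;
    }
    return (newConfiguredSize > oldConfiguredSize) ? OtaSizeCase::Larger
                                                   : OtaSizeCase::Smaller;
}

// Per the reasoning above: only case 1 could, in theory, be erased before
// the restart. Case 2 fails the size validation, case 3 may erase running code.
inline bool eraseAllowedBeforeRestart(OtaSizeCase c) {
    return c == OtaSizeCase::Same;
}
```

For example, going from a 1MB configuration to a 4MB one gives `OtaSizeCase::Larger`, so the erase has to wait until after the restart.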
At app_entry(), the flashchip properties appear to be reset to a default 4MByte flash. Even an ESP8285 reported 4MByte. The value of flashchip->chip_size was changed by the SDK at the 2nd SPIRead. The 1st read was 4 bytes at address 0.
To erase the flash WiFi area for methods 1 and 2, I chose to wait until the SDK has finished its adjustments to the flashchip structure and only then begin erasing WiFi sectors. Method 3 runs before the SDK starts. It temporarily updates flashchip->chip_size, does the erase, and puts everything back before starting the SDK.
Summary of the three methods:
- The first runs after the SDK calls user_init(), when flash code execution is available (no IRAM needed), but it restarts to be sure the SDK does not get confused about its sectors being erased.
- The 2nd method runs as early as possible and requires IRAM. The sectors are erased before the 2nd read is processed. This allows the SDK to start off thinking the sectors were blank at boot.
- The 3rd method, similar to method 2, runs as early as possible, but it turns on flash code execution so more of the initialization code can be moved to flash. It directly modifies the size element in the flashchip structure in ROM data, dRAM. This allows the flash erase to succeed. The original flash size is restored before starting the SDK, with the assumption that the SDK will handle the size change properly. It should be noted that changing only the size value in the flashchip structure is equivalent to what esptool.py does when it flashes the firmware.
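The patch-and-restore idea behind method 3 can be sketched with a scope guard. This is a host-side simulation, not the PR's code: `FakeFlashChip` stands in for the ROM's flashchip structure (only chip_size is modeled), `fakeEraseSector` stands in for the size validation that SPIEraseSector is described as doing, and `ChipSizeOverride` is a hypothetical helper.

```cpp
#include <cstdint>

// Stand-in for the ROM flashchip structure; only chip_size is modeled.
struct FakeFlashChip { uint32_t chip_size; };

constexpr uint32_t kSectorSize = 0x1000;  // 4 KiB flash sector

// Stand-in for the validation step: reject sectors that end beyond chip_size.
inline bool fakeEraseSector(const FakeFlashChip& chip, uint32_t sector) {
    return (static_cast<uint64_t>(sector) + 1) * kSectorSize <= chip.chip_size;
}

// Scope guard: patch chip_size on construction, restore it on destruction.
class ChipSizeOverride {
public:
    ChipSizeOverride(FakeFlashChip& chip, uint32_t tempSize)
        : chip_(chip), saved_(chip.chip_size) {
        chip_.chip_size = tempSize;
    }
    ~ChipSizeOverride() { chip_.chip_size = saved_; }
    ChipSizeOverride(const ChipSizeOverride&) = delete;
    ChipSizeOverride& operator=(const ChipSizeOverride&) = delete;

private:
    FakeFlashChip& chip_;
    uint32_t saved_;
};

// Erase one sector while chip_size is temporarily raised to tempSize.
inline bool eraseWithOverride(FakeFlashChip& chip, uint32_t tempSize, uint32_t sector) {
    ChipSizeOverride guard(chip, tempSize);
    return fakeEraseSector(chip, sector);
}
```

With chip_size at 1MB, a request for the last sector of a 4MB part is rejected. Inside `eraseWithOverride(chip, 0x400000, 0x3FF)` it passes the check, and chip_size is back to its original value when the call returns.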
I also added an example/test sketch for exercising the feature. It is OTAEraseConfig. It also gathers some WiFi signal/connection statistics.
Edited: Added Method 3, corrected some grammar.
> I have seen this more often when changing SDK

Just to clarify: does that only affect OTA to or from SDK 3.x.x, or are some settings (rfcal or credentials format) incompatible even between 2.2.x versions?
@mhightower83 In #6690, I was wondering whether the first bytes in flash are used by the FW or only by and for ourselves. https://github.com/d-a-v/Arduino/blob/nosizeconf/cores/esp8266/Esp.cpp#L298
In the latter case, we'd better rely on getFlashChipRealSize() = FW's spi_flash_get_id(). That would at least fix inconsistencies like the one you've seen with esp8285.
@mcspr Also between 2.x versions, but I don't have a clear view on that. What I know is when everything fails, a full erase can fix the issue.
@mcspr Definitely the latter. Too often I would forget which 2.x FW version I had on a remote device and flash it via OTA, only to have it stop working. I think 191105 might be the worst; it often took an RF CAL erase and a power cycle before it would respond remotely (this relates to a unit with a weak signal). Updated: Added clarification.
The problem with getFlashChipRealSize() = FW's spi_flash_get_id() is that it defines an upper bound that is not followed by SPIRead, SPIWrite, or SPIEraseSector. These ROM functions use the values provided in the flashchip structure to validate each call request. Looking at one of the ROM disassemblies out there, SPIRead compares the flash offset + size of the read with flashchip.chip_size and fails the request if it exceeds flashchip.chip_size.
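The bound check described above can be modeled on a host machine as follows. This is a sketch based only on the description in this thread, not on the actual ROM code; `FakeFlashChip` and `readRequestAccepted` are hypothetical names.

```cpp
#include <cstdint>

// Stand-in for the ROM flashchip structure; only chip_size is modeled.
struct FakeFlashChip { uint32_t chip_size; };

// Model of the described check: the request fails when
// offset + size exceeds flashchip.chip_size.
inline bool readRequestAccepted(const FakeFlashChip& chip, uint32_t offset, uint32_t size) {
    // Use 64-bit arithmetic so offset + size cannot wrap around.
    const uint64_t end = static_cast<uint64_t>(offset) + size;
    return end <= chip.chip_size;
}
```

With chip_size set to the IDE configured 1MB (0x100000), a 4 KiB read at 0xFF000 is accepted, while the same read at 0x3FF000 is rejected even if the physical chip is 4MB.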
In effect, as the SDK starts up, by the time of the 2nd SPIRead call, the values in flashchip have been updated to reflect the information read at flash offset 0. This is the IDE configured value, which may be smaller than getFlashChipRealSize().
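For reference, the configured size at flash offset 0 lives in the 4-byte image header: the upper nibble of the fourth byte encodes the flash size and the lower nibble the flash frequency. A minimal decoding sketch follows; the mapping reflects my understanding of the ESP8266 values used by esptool and should be verified before relying on it. The `-c1` layout variants (0x5, 0x6) are deliberately left out, and `configuredSizeFromHeader` is a hypothetical name.

```cpp
#include <cstdint>

// Decode the configured flash size from byte 3 of the image header.
// Returns 0 for codes this sketch does not handle.
inline uint32_t configuredSizeFromHeader(uint8_t sizeFreqByte) {
    switch (sizeFreqByte >> 4) {          // upper nibble = size code
        case 0x0: return 512u * 1024;
        case 0x1: return 256u * 1024;
        case 0x2: return 1u * 1024 * 1024;
        case 0x3: return 2u * 1024 * 1024;
        case 0x4: return 4u * 1024 * 1024;
        case 0x8: return 8u * 1024 * 1024;
        case 0x9: return 16u * 1024 * 1024;
        default:  return 0;               // unknown or not modeled here
    }
}
```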
That said, I think I noticed a download result that implied the serial download tool was changing the IDE configured value to the real chip size on the fly.
According to your findings, the flashchip structure is initialized by the FW/SDK after the first call of SPIRead().
Do you think it reads the first flash byte to initialize it, or does it get the flash size from the flash chip?
Does spi_flash_get_id() get its result from the flashchip structure?
My concern is about the API calls Esp.getFlashChipSize() vs Esp.getFlashChipRealSize(): what's the use of the first, and can we get rid of it?
It looks like the flashchip structure update is based on the 1st byte read. The updated structure appears to track the IDE configured value.
My take or interpretation so far is that Esp.getFlashChipSize() gives us:
- the IDE configured size,
- which the SDK adopts and uses as THE size for all operations by updating flashchip,
- since flashchip is referenced by the SPIRead, ... type API calls for an upper limit check (at least the ones I have looked at).
- I hesitate to suggest it, but maybe we could refer to this as a virtual chip size.
I see Esp.getFlashChipRealSize() as providing insight into the actual hardware chip size.
> Does spi_flash_get_id() get its result from the flashchip structure?
Using the RTOS SDK for inspiration, if I am reading this right, it would appear the call issues a command to the flash chip to read an ID from the actual chip. https://github.com/espressif/ESP8266_RTOS_SDK/blob/b02ad1477b8657e9224b768f73f9aa9ee5d950ff/components/spi_flash/src/spi_flash_raw.c#L46-L50
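To illustrate how a hardware size can be derived from such an ID, here is a minimal sketch. It assumes the value returned by spi_flash_get_id() carries the JEDEC capacity code in bits 16-23 and that the code is the base-2 logarithm of the size in bytes, which is common but not guaranteed for every flash vendor. `realSizeFromFlashId` is a hypothetical name.

```cpp
#include <cstdint>

// Derive the chip size in bytes from a flash ID, assuming the capacity
// code (log2 of the size) sits in bits 16-23. Returns 0 if the code is
// too large to represent in 32 bits.
inline uint32_t realSizeFromFlashId(uint32_t flashId) {
    const uint32_t capacityCode = (flashId >> 16) & 0xFF;
    if (capacityCode >= 32) {
        return 0;  // shifting by 32 or more would be undefined behavior
    }
    return 1u << capacityCode;
}
```

For instance, an ID of 0x1640EF has capacity code 0x16 (22), which under this assumption gives 4MB.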
What do you think of reading the first byte of flash at boot and then, if it does not reflect what Esp.getFlashChipRealSize() gives, rewriting it?
(edit: that, only if changing the flashchip value in RAM on the fly won't work)
We would have
- no more "virtual" chip size
- no more misconfiguration that would potentially poison the FS
- and (possibly) FS configuration selected by sketch (like what's proposed in #6690)
Currently, I see the following when building and downloading the example sketch with the generic option selected with Flash Size:"1MB(FS:64KB OTA:~470KB)"
Flash information reported after using esptool
Flash Size as reported by:
flashchip->chip_size: 0x01000000, 16777216
ESP.getFlashChipSize: 0x01000000, 16777216
ESP.getFlashChipRealSize: 0x01000000, 16777216
Flash information reported after using OTA
Flash Size as reported by:
flashchip->chip_size: 0x0100000, 1048576
ESP.getFlashChipSize: 0x0100000, 1048576
ESP.getFlashChipRealSize: 0x01000000, 16777216
So currently it appears that esptool is already doing what you are suggesting we do with OTA. It sounds like a good idea and I like it a lot; however, I also think it needs deeper thought. My initial concern is for legacy devices. On too many occasions, I have mismatched build and existing configured sizes.
I am concerned about the sealed-up devices that cannot be reflashed serially. Maybe there is a need to respect the original value of legacy devices. An old device may not expect the upgrade in size. The location of the EEPROM will move. We could copy it; however, if there are objects inside, pointing to places in flash, that could create an issue. And there are some more fancy EEPROM libraries out there in use that may wonder where their data went.
It would help if we could somehow increase the size and keep the old pointer offsets until a sketch could handle the migration to a larger size. Migration here refers to moving the EEPROM, SPIFFS, etc. to their new home.
In case anyone is trying this, there appears to be a problem with the linker not using the erase_config module. It appears in my effort to reduce the code down to what was needed, I took away the one direct reference to this module that caused it to be linked in. :frowning_face:
There is also an issue of Soft WDT when doing a system_restart() with method 1. It appears I have to keep the call to __real_system_restart_local() instead.
I'll try to sprinkle in some printf's; I think I can make those work now, to a degree. Before all the code reduction, I was using a hacked-together deferred print library to get a peek at what was going on during this garbled print period.
It sounds like the PoC has outlived its usefulness if it had any. Shall I Close?
@mhightower83 To be clear, the principle of what you accomplished here is absolutely necessary. What isn't clear is the approach. About putting this in eboot, the idea is interesting, but it must be weighed against the last necessary ota enhancement (discussed in #905 and the resulting proposal in #6538). At this point, I don't know what fits and what doesn't fit in what remains of the eboot sector. I think the tiny bit of logic needed in eboot to handle this would be small enough so that we can fit both, but that needs to be tried.
I'm keen to finalise #6538. Is it worth my working through this PR to see how they might mesh?
@davisonja yes, I think you're the person best suited to assess whether #6538 and this eboot idea would work together. I think we would need:
- #6538 updated
- numbers for remaining space in the eboot sector with and without #6538
- a new PR that implements the eboot alternative to this PR
Points 1 and 2 should be easy enough. With those we will know how much space is left in the eboot sector.
I think that the eboot alternative to this PR would need:
A. absolute minimal new code that goes in eboot, something along the lines of what @earlephilhower said: maybe a new command that has an address/size to erase.
B. all of the address/size calculations done outside of eboot, after the OTA image has been received and written in the empty area, but before the reboot actually happens.
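A rough sketch of what the calculation in point B might look like, under two assumptions that are mine and not settled in this thread: the region to erase is a run of whole sectors at the end of the configured flash, and the number of sectors is supplied by the caller. `EraseCommand` and `makeTailEraseCommand` are hypothetical; no such eboot command exists yet.

```cpp
#include <cstdint>

constexpr uint32_t kSectorSize = 0x1000;  // 4 KiB flash sector

// Hypothetical record handed to eboot: it would only need to erase
// [address, address + size) without doing any calculation itself.
struct EraseCommand {
    uint32_t address;
    uint32_t size;
};

// Build a command covering the last sectorCount sectors of the flash.
// Returns false if the inputs are inconsistent.
inline bool makeTailEraseCommand(uint32_t flashSize, uint32_t sectorCount, EraseCommand& out) {
    const uint64_t bytes = static_cast<uint64_t>(sectorCount) * kSectorSize;
    if (sectorCount == 0 || flashSize % kSectorSize != 0 || bytes > flashSize) {
        return false;
    }
    out.address = flashSize - static_cast<uint32_t>(bytes);
    out.size = static_cast<uint32_t>(bytes);
    return true;
}
```

For a 1MB configuration and 4 sectors, this yields address 0xFC000 and size 0x4000. Keeping all of this arithmetic in the sketch side leaves eboot with nothing but the erase loop.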
I'm not sure who should implement points A or B, I think that's up to you both 😛