Arduino PoC for handling Erase WiFi Setting after OTA

Status: FLASH_MAP_SUPPORT has not been properly incorporated.

Issue: Sometimes when an ESP8266 is reflashed/upgraded, the WiFi does not work. The usual recommendation is then a serial flash with the Erase Flash option set to include the WiFi settings. I have seen this more often when changing SDK by OTA. We don't have an equivalent WiFi settings erase for OTA.

This PR should not be merged. It presents three Proof of Concept solutions. The intent is that one of these solutions would be chosen, developed further, and incorporated in a new PR.

There are 3 cases to consider when the firmware is updated by OTA. The new firmware:

1. has the same Flash Configuration as the old.
2. has a larger Flash Configuration than the old.
3. has a smaller Flash Configuration than the old.

In theory, after an OTA and before a restart, the flash could be erased for case 1. Case 2 is a problem because the new size exceeds the size specified in flashchip. That size is used by SPIEraseSector to validate the caller's request, and the call fails when the request is too large. We have to wait for a restart, after which the SDK will have updated the values in flashchip. Case 3 is potentially unsafe because we could be erasing the code that is running.
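The three cases can be sketched as a small host-side classification. This is only an illustration: `OtaCase`, `classifyOta`, and `canEraseBeforeRestart` are hypothetical names that do not appear in the PR, and the rule encoded is just the reasoning in the paragraph above.

```cpp
#include <cstdint>

// Hypothetical names for illustration; none of these come from the PR.
enum class OtaCase { SameSize, Larger, Smaller };

// Compare the flash size the running firmware was built for with the
// flash size configured in the new OTA image.
constexpr OtaCase classifyOta(uint32_t old_size, uint32_t new_size) {
    return new_size == old_size ? OtaCase::SameSize
         : new_size >  old_size ? OtaCase::Larger
                                : OtaCase::Smaller;
}

// Only the same-size case can, in theory, erase right after the OTA and
// before the restart. A larger size fails the flashchip size check, and a
// smaller size risks erasing code that is still running.
constexpr bool canEraseBeforeRestart(OtaCase c) {
    return c == OtaCase::SameSize;
}
```

For example, going from a 1 MiB build to a 4 MiB build classifies as `OtaCase::Larger`, so the erase has to wait until after the restart.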

At app_entry() flashchip properties appear to be reset to a default 4MByte flash. Even an ESP8285 reported 4MByte. The value of flashchip->chip_size was changed at the 2nd SPIRead by the SDK. The 1st read was to address 0, 4 bytes.

To erase the flash WiFi area for methods 1 and 2, I chose to wait until the SDK has finished its adjustments to the flashchip structure and only then begin erasing the WiFi sectors. Method 3 runs before the SDK starts. It temporarily updates flashchip->chip_size, does the erase, and puts everything back before starting the SDK.

Summary of the three methods:

1. The first method runs after the SDK calls user_init(), when flash code execution is available (no IRAM needed), but it restarts afterward to be sure the SDK does not get confused about its sectors being erased.
2. The 2nd method runs as early as possible and requires IRAM. The sectors are erased before the 2nd read is processed. This allows the SDK to start off thinking the sectors were blank at boot.
3. The 3rd method, like method 2, runs as early as possible, but it turns on flash code execution so more of the initialization code can be moved to flash. It directly modifies the size element of the flashchip structure in ROM data (dRAM). This allows the flash erase to succeed.

For method 3, the original flash size is restored before starting the SDK, with the assumption that the SDK will handle the size change properly. It should be noted that changing only the size value in the flashchip structure is equivalent to what esptool.py does when it flashes the firmware.
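A minimal host-side sketch of the method 3 idea (widen the size, erase, restore) might look like the following. It is not the PR's code: `FlashChipModel` models only the one field discussed here, `eraseSectorModel` stands in for the ROM's SPIEraseSector size check, and the addresses in the demonstration are made up. It needs C++14 or later for the compile-time demonstration.

```cpp
#include <cstdint>

// Host-side stand-in for the ROM's flashchip structure. Only the field the
// discussion mentions is modelled, so this is NOT the real layout.
struct FlashChipModel {
    uint32_t chip_size;
};

constexpr uint32_t kSectorSize = 4096;  // ESP8266 flash sector size

// Mimics the kind of validation described for SPIEraseSector: reject a
// sector that lies beyond the size recorded in flashchip.
constexpr bool eraseSectorModel(const FlashChipModel& chip, uint32_t sector) {
    return sector < chip.chip_size / kSectorSize;
}

// Method 3 in outline: temporarily change chip_size, attempt the erase,
// then restore the old value so the SDK later sees the size it expects.
constexpr bool eraseWithTemporarySize(FlashChipModel& chip,
                                      uint32_t temp_size, uint32_t sector) {
    const uint32_t saved = chip.chip_size;
    chip.chip_size = temp_size;
    const bool ok = eraseSectorModel(chip, sector);
    chip.chip_size = saved;  // put everything back
    return ok;
}

// Compile-time demonstration with illustrative numbers.
constexpr bool demoTemporarySize() {
    FlashChipModel chip{0x100000};                    // flashchip says 1 MiB
    const uint32_t sector = 0x3FC000 / kSectorSize;   // sector in a 4 MiB layout
    const bool before = eraseSectorModel(chip, sector);
    const bool during = eraseWithTemporarySize(chip, 0x400000, sector);
    return !before && during && chip.chip_size == 0x100000;
}
```

The demonstration shows the erase being rejected with the recorded 1 MiB size, accepted while the size is temporarily 4 MiB, and the recorded size being unchanged afterward.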

I also added an example/test sketch for exercising the feature. It is OTAEraseConfig. It also gathers some WiFi signal/connection statistics.

Edited: Added Method 3, corrected some grammar.

Dec 31 '19 04:12 mhightower83

I have seen this more often when changing SDK

Just to clarify, does that only affect OTA to or from SDK 3.x.x? Or are some settings (rfcal or credentials format) incompatible even between 2.2.x versions?

Dec 31 '19 14:12 mcspr

@mhightower83 In #6690, I was wondering whether the first bytes in flash are used by the FW or only by and for ourselves. https://github.com/d-a-v/Arduino/blob/nosizeconf/cores/esp8266/Esp.cpp#L298

In the latter case, we'd better rely on getFlashChipRealSize() = FW's spi_flash_get_id(). That would at least fix inconsistencies like the one you've seen with the ESP8285.

@mcspr Also between 2.x versions, but I don't have a clear view on that. What I know is when everything fails, a full erase can fix the issue.

Dec 31 '19 15:12 d-a-v

@mcspr Definitely the latter. Too often I would forget which 2.x FW version I had on a remote device and flash it via OTA, only to have it stop working. I think 191105 might be the worst; it often took an RF CAL erase and a power cycle before it would respond remotely (this relates to a unit with a weak signal). Updated: Added clarification.

Dec 31 '19 17:12 mhightower83

The problem with getFlashChipRealSize() = FW's spi_flash_get_id() is, it defines an upper bound value that is not followed by SPIRead, SPIWrite, or SPIEraseSector. These ROM functions use the values provided in the flashchip structure to validate each call request. Looking at one of the ROM disassemblies out there, SPIRead compares flash offset + size of the read with flashchip.chip_size and fails the request if it exceeds flashchip.chip_size.
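The check described here can be modelled on the host as follows. This is a sketch of the comparison as read from the disassembly, not the ROM code itself; `requestFits` is a hypothetical name, and the overflow-safe form is a choice made for this example.

```cpp
#include <cstdint>

// Model of the described check: a request is rejected when offset + size
// runs past chip_size. Written so the sum cannot wrap around in 32 bits.
constexpr bool requestFits(uint32_t chip_size, uint32_t offset, uint32_t size) {
    return offset <= chip_size && size <= chip_size - offset;
}
```

With flashchip holding a 1 MiB configured size, `requestFits(0x100000, 0x100000, 4)` is false even if the physical chip is 16 MiB, which is why the real chip size alone does not determine what the ROM functions will accept.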

In effect, as the SDK starts up, by the time of the 2nd SPIRead call, the values in flashchip have been updated to reflect the information read at flash offset 0. This is the IDE configured value which may be smaller than getFlashChipRealSize().

That said, I think I noticed a download result that implied the serial download tool was changing the IDE configured value to the real chip size on the fly.

Dec 31 '19 18:12 mhightower83

According to your findings, the flashchip structure is initialized by the FW/SDK after the first call of SPIRead(). Do you think it reads the first flash byte to initialize it, or does it get the flash size from the flash chip? Does spi_flash_get_id() get its result from the flashchip structure?

My concern is about the API calls Esp.getFlashChipSize() vs Esp.getFlashChipRealSize(): what's the use of the first, and can we get rid of it?

Jan 01 '20 16:01 d-a-v

It looks like the flashchip structure update is based on the 1st byte read. The updated structure appears to track the IDE configured value.

My take or interpretation so far is that Esp.getFlashChipSize() gives us:

the IDE configured size
which the SDK adopts and uses as THE size for all operations by updating flashchip,
since flashchip is referenced by the SPIRead, ... type API calls for an upper limit check (at least in the ones I have looked at).
I hesitate to suggest it, but maybe we could refer to this as a virtual chip size.

I see Esp.getFlashChipRealSize() as providing insight to the actual hardware chip size.

Does spi_flash_get_id() get its result from the flashchip structure?

Using the RTOS SDK for inspiration, if I am reading this right, it would appear the function issues a flash chip command to read an ID from the actual flash chip. https://github.com/espressif/ESP8266_RTOS_SDK/blob/b02ad1477b8657e9224b768f73f9aa9ee5d950ff/components/spi_flash/src/spi_flash_raw.c#L46-L50
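For context, a size can be derived from such an ID roughly as sketched below. This assumes the common JEDEC convention that the third ID byte is log2 of the capacity in bytes, and that the ID is packed with the manufacturer in the low byte; `sizeFromFlashId` is a hypothetical helper name, and the sample IDs are typical Winbond values used only as illustration.

```cpp
#include <cstdint>

// Decode a packed flash ID:
//   bits 0-7   manufacturer
//   bits 8-15  memory type
//   bits 16-23 capacity code, assumed to be log2(size in bytes)
// Returns 0 for capacity codes that would not fit in 32 bits.
constexpr uint32_t sizeFromFlashId(uint32_t id) {
    return ((id >> 16) & 0xFF) < 32 ? uint32_t{1} << ((id >> 16) & 0xFF)
                                    : 0;
}
```

Under these assumptions an ID of 0x1640EF decodes to 4 MiB and 0x1840EF to 16 MiB, independent of whatever size is recorded in flashchip.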

Jan 01 '20 17:01 mhightower83

What do you think of reading the first byte of flash at boot and then, if it does not reflect what Esp.getFlashChipRealSize() gives, rewriting it? (edit: that, only if changing the flashchip value in RAM on-the-fly won't work)

We would have

no more "virtual" chip size
no more misconfiguration that would potentially poison the FS
and (possibly) FS configuration selected by sketch (like what's proposed in #6690)
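The comparison proposed above could be sketched like this on the host. It assumes the size field sits in the upper nibble of the fourth image-header byte and uses the size codes esptool uses for the ESP8266; `sizeFromHeaderByte` and `headerNeedsRewrite` are hypothetical names, and variant codes not listed return 0. It needs C++14 or later because of the `switch` in a `constexpr` function.

```cpp
#include <cstdint>

// Size code from the upper nibble of the 4th image-header byte.
// ASSUMPTION: values follow esptool's ESP8266 table; other codes return 0.
constexpr uint32_t sizeFromHeaderByte(uint8_t byte3) {
    switch (byte3 >> 4) {
        case 0x0: return 512u * 1024;
        case 0x1: return 256u * 1024;
        case 0x2: return 1u * 1024 * 1024;
        case 0x3: return 2u * 1024 * 1024;
        case 0x4: return 4u * 1024 * 1024;
        case 0x8: return 8u * 1024 * 1024;
        case 0x9: return 16u * 1024 * 1024;
        default:  return 0;
    }
}

// Hypothetical policy: flag the header for a rewrite when it disagrees
// with the size the chip itself reports.
constexpr bool headerNeedsRewrite(uint8_t byte3, uint32_t real_size) {
    return sizeFromHeaderByte(byte3) != real_size;
}
```

For a 16 MiB chip, a header byte of 0x20 (1 MiB) would be flagged, while 0x90 (16 MiB) would be left alone.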

Jan 06 '20 12:01 d-a-v

Currently, I see the following when building and downloading the example sketch with the generic option selected with Flash Size:"1MB(FS:64KB OTA:~470KB)"

Flash information reported after using esptool

Flash Size as reported by:
  flashchip->chip_size:     0x01000000, 16777216
  ESP.getFlashChipSize:     0x01000000, 16777216
  ESP.getFlashChipRealSize: 0x01000000, 16777216

Flash information reported after using OTA

Flash Size as reported by:
  flashchip->chip_size:     0x0100000, 1048576
  ESP.getFlashChipSize:     0x0100000, 1048576
  ESP.getFlashChipRealSize: 0x01000000, 16777216

So it appears that esptool is already doing what you are suggesting we do with OTA. It sounds like a good idea and I like it a lot; however, I also think it needs deeper thought. My initial concern is for legacy devices. On too many occasions, I have mismatched the build size and the existing configured size.

I am concerned about the sealed-up devices that cannot be reflashed serially. Maybe there is a need to respect the original value of legacy devices. An old device may not expect the upgrade in size. The location of the EEPROM will move. We could copy it; however, if there are objects inside, pointing to places in flash, that could create an issue. And there are some more fancy EEPROM libraries out there in use that may wonder where their data went.

It would help if we could somehow increase the size and keep the old pointer offsets until a sketch could handle the migration to a larger size. Migration here refers to moving the EEPROM, SPIFFS, etc. to their new homes.

Jan 09 '20 01:01 mhightower83

In case anyone is trying this, there appears to be a problem with the linker not using the erase_config module. It appears in my effort to reduce the code down to what was needed, I took away the one direct reference to this module that caused it to be linked in. 🙁

There is also an issue of Soft WDT when doing a system_restart() with method 1. It appears I have to keep the call to __real_system_restart_local() instead.

I'll try to sprinkle in some printf's; I think I can make those work now to a degree. Before all the code reduction, I was using a hacked-together deferred print library to get a peek at what was going on during this garbled print period.

Jan 29 '20 05:01 mhightower83

It sounds like the PoC has outlived its usefulness, if it had any. Shall I close?

Apr 04 '21 15:04 mhightower83

@mhightower83 To be clear, the principle of what you accomplished here is absolutely necessary. What isn't clear is the approach. About putting this in eboot, the idea is interesting, but it must be weighed against the last necessary OTA enhancement (discussed in #905 and the resulting proposal in #6538). At this point, I don't know what fits and what doesn't fit in what remains of the eboot sector. I think the tiny bit of logic needed in eboot to handle this would be small enough so that we can fit both, but that needs to be tried.

Apr 04 '21 20:04 devyte

I'm keen to finalise #6538. Is it worth my working through this PR to see how they might mesh?

Apr 06 '21 05:04 davisonja

@davisonja yes, I think you're the person best suited to assess whether #6538 and this eboot idea would work together. I think we would need:

1. #6538 updated
2. numbers for remaining space in the eboot sector with and without #6538
3. a new PR that implements the eboot alternative to this PR

Points 1 and 2 should be easy enough. With those we will know how much space is left in the eboot sector.

I think that the eboot alternative to this PR would need:

A. absolute minimal new code that goes in eboot, something along the lines of what @earlephilhower said: maybe a new command that has an address/size to erase.
B. all of the address/size calculations done outside of eboot, after the OTA image has been received and written in the empty area, but before the reboot actually happens.
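To make point B concrete, the calculation done outside eboot might look like the sketch below. Everything here is hypothetical: `EraseRequest` is an invented payload for the proposed command, not an existing eboot structure, and the sketch assumes for illustration that the WiFi/RF settings occupy the last four 4 KiB sectors of the new image's configured flash size.

```cpp
#include <cstdint>

constexpr uint32_t kSectorSize = 4096;   // ESP8266 flash sector size
// ASSUMPTION for illustration only: settings live in the last 4 sectors.
constexpr uint32_t kConfigSectors = 4;
constexpr uint32_t kConfigBytes = kConfigSectors * kSectorSize;

// Hypothetical payload for an eboot "erase" command: just address and size.
struct EraseRequest {
    uint32_t address;
    uint32_t size;
};

// Computed by the sketch after the OTA image is written and before reboot,
// so that eboot itself only has to erase the range it is handed.
// Returns an empty request if the flash size is too small to hold the area.
constexpr EraseRequest wifiConfigEraseRequest(uint32_t new_flash_size) {
    return new_flash_size >= kConfigBytes
        ? EraseRequest{new_flash_size - kConfigBytes, kConfigBytes}
        : EraseRequest{0, 0};
}
```

Under these assumptions, a new image configured for 1 MiB would yield address 0xFC000 and size 0x4000, and eboot would need no knowledge of how those numbers were derived.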

I'm not sure who should implement points A or B, I think that's up to you both 😛

Apr 06 '21 14:04 devyte
