Arduino icon indicating copy to clipboard operation
Arduino copied to clipboard

Ota commands in flash

Open davisonja opened this issue 4 years ago β€’ 47 comments

Currently OTA updates involve the use of (volatile) RTC memory to keep track of update commands across a chip reset. If there is a powerloss part-way through the flashing process the result is likely to be corrupt flash. Ideally these commands would be persisted in something less volatile - like flash.

This PR adds that functionality, and resolves #905.

Update commands are stored in a reserved area of flash that has been inserted just before the start of the FS in flash. This ensures that other things don't end up moving (like sdkwifi pages), that we don't lose FS space, but does cost 2 pages of potential sketch space. This may be an issue on smaller sized devices and it is possible to make the reservation optional - though the last consensus was it was a small enough loss so as not to be an issue.

A pointer to the location of the flash store has been added to the start of the flash, along with a magic number that the bootloader can use to recognise whether or not flash-storage should be checked. In this way it's possible to use the updated bootloader with older sketches without any issues (and means that control over using flash storage should be kept within the sketch).

As it stands the location of the update data hasn't been changed, and so is still in the first 1MB of flash, meaning you have a maximum sketch size half of that (less reserved areas). Once this has received some meaningful testing beyond my boards I anticipate updating the system to allow update data to be stored further on in flash, thus allowing larger OTA updates - assuming it's not already been done; I've not been through the current state beyond what was required to merge in the flash update solution.

boards.txt.py now generates .ld files that should include all the necessary information, and in the right places.

There is now a README.md file in the eboot folder as the start of some descriptions of how it works. It still needs some details added, but realistically all that info is already in the source itself.

I've now tested it on real hardware, which works, after a few fixes. πŸ˜„

davisonja avatar Sep 20 '19 04:09 davisonja

@d-a-v @Androbin a draft PR if you're interested.

davisonja avatar Sep 20 '19 05:09 davisonja

Good useful PR, so a CI that does not fail and the PR goes through would be great...

OpenUAS avatar Sep 20 '19 11:09 OpenUAS

Travis is next on the list, along with some tests on a couple of different esps that I've got.

The other thing I need to check, since I'm listing todo's is ensure I finished the eboot doc changes.

davisonja avatar Sep 20 '19 23:09 davisonja

What would happen if I deployed the current bootloader and had it update itself to the new version later? Would it brick itself because the running code would be overwritten?

Androbin avatar Sep 26 '19 23:09 Androbin

What would happen if I deployed the current bootloader and had it update itself to the new version later? Would it brick itself because the running code would be overwritten?

While I don't have my notes handy at the moment I'm fairly sure that self-updating was supported, if not already (I forget whether the writing in eboot is already totally run-from-RAM) then certainly as an achievable improvement - switching to the flash-using bootloader entirely OTA is an intended use-case.

davisonja avatar Sep 27 '19 00:09 davisonja

Testing this PR can be done by starting the OTA process, then reset the chip in the middle of the final flashing process right ?

d-a-v avatar Oct 03 '19 09:10 d-a-v

Yup, though it's aimed at losing power (a chip reset iirc won't clear rtc) with flash ota commands enabled it takes priority over rtc commands.

On Thu, 3 Oct 2019, 22:34 david gauchard, [email protected] wrote:

Testing this PR can be done by starting the OTA process, then reset the chip in the middle of the final flashing process right ?

β€” You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/esp8266/Arduino/pull/6538?email_source=notifications&email_token=AABX6A2Z5Q4VOAJI77SCAADQMW4CJA5CNFSM4IYSVVSKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAHTMWQ#issuecomment-537867866, or mute the thread https://github.com/notifications/unsubscribe-auth/AABX6A7ZG26DGQV24M6HPSDQMW4CJANCNFSM4IYSVVSA .

davisonja avatar Oct 03 '19 09:10 davisonja

Great changes. I absolutely need a safe updater, the current updater it's really risky in production environment. I'm currently using a fork of @davisonja branch as my platformio SDK.

Any chance having this merged in master?

lrodorigo avatar Apr 12 '20 12:04 lrodorigo

If we have a plan on what's needed to get it into master I'm happy to remerge master in to the pr

On Mon, 13 Apr 2020, 00:40 Luigi Rodorigo, [email protected] wrote:

Great changes. I absolutely need a safe updater, the current updater it's really risky in production environment. I'm currently using a fork of @davisonja https://github.com/davisonja branch as my platformio SDK.

Any chance having this merged in master?

β€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/esp8266/Arduino/pull/6538#issuecomment-612608220, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABX6A2MRCP2GNAQ7REF6RLRMGZDVANCNFSM4IYSVVSA .

davisonja avatar Apr 12 '20 20:04 davisonja

@davisonja glad to hear that :) The plan is to get 2.7.0 out the door, then merge some big PRs that are ready to go, e. g. the gcc 9.x one, then get to this one. Having this ready to test would help a lot. BTW, several things have been added to eboot, so the space available is quite small. If more functionality needs to be added, we may need to go from the current 4KB up to 8KB.

devyte avatar Apr 12 '20 20:04 devyte

Yeah, I saw that, though I've yet to compare the difference the pr made, it's not huge so I'm hopeful it will fit.

Figured I'd wait until we were closer to merging it before bringing things up to date😊

On Mon, 13 Apr 2020, 08:24 Develo, [email protected] wrote:

@davisonja https://github.com/davisonja glad to hear that :) The plan is to get 2.7.0 out the door, then merge some big PRs that are ready to go, e. g. the gcc 9.x one, then get to this one. Having this ready to test would help a lot. BTW, several things have been added to eboot, so the space available is quite small. If more functionality needs to be added, we may need to go from the current 4KB up to 8KB.

β€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/esp8266/Arduino/pull/6538#issuecomment-612671204, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABX6A5L3TTTFNLZAFUSPF3RMIPQ3ANCNFSM4IYSVVSA .

davisonja avatar Apr 12 '20 20:04 davisonja

If I am not making a mistake, the binary used by the OTA it's a merge of the bootloader binary and the application binary. The whole binary is written after the reboot (by eboot) starting at 0x0000.

So the bootloader is overwritten at each OTA update, and if a power-failure occurs during the flashing of the bootloader (first 4KB if I understood correctly), this can lead to a brick.

It is a 'I can live with it' tradeoff, in order to avoid having different binaries, one for OTA and the other one for esp-tool flashing?

It could be a good idea to skip, if not needed (i.e. adding an option to the UpdaterClass), the first 4KB of the image, and instruct eboot to flash it at 0x1000 instead of 0x0000 ?

lrodorigo avatar Apr 16 '20 13:04 lrodorigo

That's a good idea.

However, as things stand (or at least were when I last went through this) it's worth keeping in mind that the various 'md5sum of sketch' calculations include the bootloader. So if you're using those to determine whether to update or not it's important that your OTA file includes the bootloader that matches the one in flash. This is effectively dealt with when you overwrite the bootloader on each update (which is why it's currently still updating it all) but is an obscure gotcha if you never update the bootloader - changes to eboot are infrequent, but will cause endless update loops when they happen if the bootloader isn't updated. Not a big deal, but worth keeping in mind as it may not be obvious that's the fault.

J,

On Fri, Apr 17, 2020 at 1:08 AM Luigi Rodorigo [email protected] wrote:

If I am not making a mistake, the binary used by the OTA it's a merge of the bootloader binary and the application binary. The whole binary is written after the reboot (by eboot) starting at 0x0000.

So the bootloader is overwritten at each OTA update, and if a power-failure occurs during the flashing of the bootloader (first 4KB if I understood correctly), this can lead to a brick.

It's a 'I can live with it' tradeoff in order to avoid having different binaries, one for OTA and the other one for esp-tool flashing?

It could be a good idea to skip, if not needed (i.e. adding an option to the UpdaterClass), the first 4KB of the image, and instruct eboot to flash it at 0x1000 instead of 0x0000 ?

β€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/esp8266/Arduino/pull/6538#issuecomment-614641521, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABX6A5WWCBNIFDUAMVFG7DRM37MPANCNFSM4IYSVVSA .

davisonja avatar Apr 16 '20 23:04 davisonja

My braindump after working with it to add the compressed stuff:

Eboot is very simple and just copies a block or blocks of given size from one part of flash to another.

Right now what the IDE builds is a complete image, including eboot, as one big chunk. So the first thing it does is overwrite eboot and then the following app. A powerfail during eboot write is generally fatal.

You could break that into 2 chunks, one the main app and one eboot, then just not send eboot if you don't want it updated. That's a new build option and changes to elf2bin.py script, but it's not very complicated. As long as you don't send the eboot chunk, you should be safe (assuming the onboard flash doesn't do something weird like randomly erase things when power is taken away during an operation).

Then only eboot updates would be potentially damaging. I don't think you generally care about uploading eboot except when you've got a new chip or a full-flash-erase'd one.

earlephilhower avatar Apr 17 '20 00:04 earlephilhower

You can't leave eboot out without making changes to anything that's using the sketch md5, which includes everything from 0 to end-of-sketch. The current state of (my changes to) eboot reflect the underlying goal of adding features without breaking anything.

It's trivial to ask eboot to not update the bootloader flash page, indeed, from my just-now looking it would appear the range of addresses are still actually determined by the main-code leaving a command for eboot - so changing Updater.cpp to skip over the first page is a matter of changing the source and destination addresses by 0x1000 and leaving everything else as it already is (which won't change anything, let alone break it).

Much safer to prevent bootloader updates in this way - but leaves the potential gotcha I mentioned above.

I went through this as part of the initial bootloader changes (splitting out eboot to only upload the sketch etc). Doing so might be an overall good change to make, but it's slightly wider reaching than just the upload process.

davisonja avatar Apr 17 '20 02:04 davisonja

There's a potentially easier way:

  • always build and send with eboot included in the image as now
  • have eboot compare the first block of flash against the first block of the new image. If they're different, then copy the entire image to 0x0. If identical, skip the first block of the copy, i. e. just copy the sketch to 0x1000.

devyte avatar Apr 17 '20 02:04 devyte

I like that as an added safety measure for regular updates regardless, helps keep us out of brick territory. πŸ‘

I'm not averse to also making it possible to force-skipping of eboot update, if people are keen though. For my purposes the suggestion by @devyte will make it effectively bulletproof for my uses.

davisonja avatar Apr 17 '20 03:04 davisonja

I agree with @devyte . But I think that, in order to have a consistent bootloader command, it could be better to check if the bootloader has changed in the UpdaterClass.

  • The whole binary (bootl + app) is always written at the end side of the application flash "partition" (say starting at _startAddress).
  • The MD5 sum if verified (on the whole binary).
  • The bootloader sector is byte-compared with the new available image (compare [0, 0x1000] with [_startAddress, _startAddress + 0x1000])
  • If the bootloader has changed then eboot is instructed to flash starting at 0x0000, copying from _startAddress
  • If the bootloader has not changed then eboot is instructed to flash starting at 0x1000 copying from _startAddress +0x1000

I apologise if I am missing something... I do not have a so deep UpdaterClass/Eboot knowledge...

lrodorigo avatar Apr 17 '20 08:04 lrodorigo

@devyte's idea looks very good.

This bit is an easy change to Updater.cpp

* The whole binary (bootl + app) is always written at the end side of the application flash "partition" (say starting at `_startAddress`).
* The MD5 sum if verified (on the whole binary).

This bit needs to be at least partially in eboot.c because compressed uploads cannot be read by the main app code:

* The bootloader sector is byte-compared with the new available image (compare `[0, 0x1000]` with `[_startAddress, _startAddress + 0x1000]`)
* If the bootloader has changed then eboot is instructed to flash starting at `0x0000`, copying from `_startAddress`
* If the bootloader has not changed then eboot is instructed to flash starting at `0x1000` copying from `_startAddress +0x1000`

There is already 4K allocated as a decompress buffer (uzlib_flash_read_cb_buff), so it's halfway there.

earlephilhower avatar Apr 17 '20 14:04 earlephilhower

@earlephilhower @devyte Sorry, it doesn't look like I've made any recent comments on this one, but not abandoned.

Originally the PR got sent through and seemingly stalled. I had thought it was being reviewed/tested by other people and then either feedback or merging would happen. The codebase continued to shift under the PR, however, and it got further and further out of date (and more and more complex to resolve as key parts of the codebase had changed).

If there's still interest in getting it merged (and I certainly am keen as I think it provides some useful safeguards for isolated device update) then I'll get it running against the current codebase. It sounds as though it's been assigned to the current milestone, so now is the right time.

davisonja avatar Sep 03 '20 01:09 davisonja

@davisonja some concerns came up that required testing, and nobody was available for testing, so it was pushed back, because merging could have broken stuff. Now we're developing for major release 3, so if stuff breaks it's not so bad, and there's time to fix such cases later anyways.

devyte avatar Sep 03 '20 02:09 devyte

@devyte I hadn't realised there were concerns. I don't believe the original work introduced any breaking changes (not doing so was part of the design goal), but if we're moving to a release for which that's less of a concern that's also fine.

The only real concern at this stage is the other changes that have been made to eboot since this work was originally put together. It's possibly more complicated now - I haven't set about reconciling my changes with the current state yet.

davisonja avatar Sep 03 '20 03:09 davisonja

General question, the original version of this had all the ota-commands-in-flash code wrapped in #ifdefs so it could be turned off, do we want it to still be optional, or just inflict it on everyone? The effective cost is a page of flash where the commands get stored.

davisonja avatar Sep 10 '20 09:09 davisonja

I think a clean implementation free of defines. This is something we want to keep, and it doesn't really make sense to go back. Questions (without reviewing yet): one page of flash or one 4K block? where are the commands being stored?

devyte avatar Sep 10 '20 14:09 devyte

I wouldn't review it yet, still fixing up the changes between when it was written and now :)

Oh, uhm, page vs block, it's the smallest erasable chunk, which I recall was 4k, but that might be a block rather than a page here?

The commands are written into a series of slots in this block. Slots are used in succession being 'consumed' by each command. When the command has been processed the slot is marked as 'used' and the next one used. When all the slots have been used the block is erased and the process starts again. (this is determined at command-write time, so if for any reason it fails, such as losing power, the original firmware remains intact)

Way back there were thoughts on putting other bootloader related info in the block, purely because the command slots can be useful without taking up the entire block - but since we can't erase less than a block at a time and if you're using OTA you're almost certain to need to erase it, it doesn't make sense to share it with entirely arbitrary things.

davisonja avatar Sep 10 '20 22:09 davisonja

The actual flash address was directly after the area allocated for the running program, so it shortened that by the block size.

davisonja avatar Sep 10 '20 22:09 davisonja

For the smaller flash size, e. g. 1MB (let's not discuss 512KB), the max bin size is of the order of 600KB, i. e. about 400KB free space (assuming compression is used and no FS). Using an entire 4KB block to store just the few bits of the command means 1% of the free space. If you do use a FS, that % goes sharply up. There was the idea floated around at some point to use the upper part of a block that is already used, e. g. the wifi config area, or the upper part of the eboot block, or something else. It really doesn't matter much where it gets stored, as long as it's out of the way and well documented for posterity. I don't want to hold up this work even one tiny bit more, so let's keep going as it is with the empty 4K block, but there should be a follow up issue opened to track investigation of that optimization.

devyte avatar Sep 10 '20 23:09 devyte

I suspect smaller flash sizes were the original reason for the conditional inclusion of the flash code. The code increase is fairly trivial, but (for smaller sized devices) the block was more of a problem.

The original design allows for the commands to be stored anywhere (a pointer to them is store in the 1st block with eboot) so the only two considerations are more slots leads to fewer flash erase cycles, and realisitically any data that shares with the eboot slots has a chance of being lost entirely if there's a power failure during the erase of that block. For eboot data it's irrelevant as it's all generated, you just miss out on that update, but that might not be true for other things.

This is the primary reason it's stored outside the eboot block - the code's goal is to allow an update to survive a power-failure (which wipes the RTC bytes), if we erase the eboot block itself you're just as vulnerable as without the code entirely.

davisonja avatar Sep 11 '20 00:09 davisonja

Arguably I should have started here, however, I've just put everything together (now on Mac, to complete the desktop OS trio) and built eboot. While the original version of this fit with room to spare, with the addition of the compressed image stuff, it doesn't.

So our target size is 0x1000 and we have:

features size
compression 0xf10
compression + serial debug 0x1040
compression + flash storage 0x1098
compression + flash + serial debug 0x1318

Hopefully there's some trimming that can be done, but it's a lot closer than I'd originally hoped.

I'm reluctant to increase the size of eboot if we can avoid it (partly due to considerations as mentioned earlier by @devyte). So, I'm wondering what general thoughts are on making the compression stuff a bit easier (as in #ifdef'd) to opt out of which would let people continue to use eboot within 0x1000 (personally I'm much more interested in general in having flash-backed updates than I am compressed images) but potentially having a two-page option that may offer some additional features.

Worst case I think it'd be possible to, with the functionality we have, make eboot two pages, but with the flash-update-commands stored in the second of the two pages and some clever logic to help ensure a failed rewrite of the second page could be recovered from.

davisonja avatar Oct 16 '20 09:10 davisonja

Oh, potentially getting rid of the existing RTC stuff will help, the original approach left that intact to provide a path of least change, but if we were to opt to go with just flash storage, it might all fit. Will explore that next.

davisonja avatar Oct 16 '20 09:10 davisonja