RIOT icon indicating copy to clipboard operation
RIOT copied to clipboard

driver/mtd_spi_nor, pkg/littlefs: improve reliability with corrupted flash (new PR)

Open crasbe opened this issue 1 year ago • 24 comments

Contribution description

This is a takeover of PR #14908, which has been abandoned. The goal is to implement the proposed changes in the original PR to make it mergeable.

The changes to make it mergeable after four years were necessary due to the following commits:

  1. https://github.com/RIOT-OS/RIOT/commit/60eb75da3c97c675e7c6a10edae157ddec78e55a This commit removed the "mtd_write_page" function in favor of the "mtd_write_page_raw" function.

  2. https://github.com/RIOT-OS/RIOT/commit/ea105d34ece9402d9bdff38e31a77abc4abcd74e This commit removed the "mtd_spi_nor_write" function, which was changed in the original PR as well but these changes are not necessary anymore.

Open Tasks:

  • [X] The review from @benpicco ( https://github.com/RIOT-OS/RIOT/pull/14908#pullrequestreview-897554562 ) has to be included in this PR still. My plan was to add a header file to the drivers/mtd_spi_nor folder which would probably be called ~~something like "drivers/mtd_spi_nor/include/mtd_spi_nor_security.h"~~ "drivers/mtd_spi_nor/include/mtd_spi_nor_defines.h".

  • [X] The security bits used by this PR are not universal to all manufacturers of Flash chips, specifically the Flash used by the original author @vincent-d is from Macronix and for example ISSI uses a different register and different bit positions. ~~I'll have to check if they are manufacturer specific or device specific.~~

  • [X] The Kconfig related changes ~~can probably be~~ were deleted, as the Kconfig subsystem has been removed from the mtd_spi_nor driver.

  • [x] ~~There is no timeout for the added functions wait_for_write_enable_cleared and wait_for_write_enable_set, which would get the thread stuck if the chip does not answer.~~ Addressed in #20769

  • [X] ~~The function wait_for_write_complete does not have a timeout either but it counts the attempts and how many times it yielded the thread. This can be used as a basis for a timeout.~~ Addressed in #20769

  • [x] Write tests for the Macronix and ISSI flash security features.

  • [x] Add IS25LP128 and IS25LE01G as DuTs to the test (and possibly more, whatever the magic drawer might supply).

Testing procedure

The PR comes with new tests that utilize the Block Protect functions from NOR Flashs which triggers the Program Fail and Erase Fail flags. The Block Protect bits are not permanent and can be reset easily. The nRF52840DK development board has a suitable flash chip on board and is covered by the test.

All tests should run successfully to verify the PR is working correctly.

Running the test on the nRF52840DK currently requires patching the boards/nrf52840dk/include/boards.h, the "SPI_NOR_F_CHECK_INTEGRITY" flag has to be added to the "NRF52840DK_NOR_FLAGS" define.

change from:
#define NRF52840DK_NOR_FLAGS              (SPI_NOR_F_SECT_4K | SPI_NOR_F_SECT_32K)

to:
#define NRF52840DK_NOR_FLAGS              (SPI_NOR_F_SECT_4K | SPI_NOR_F_SECT_32K | SPI_NOR_F_CHECK_INTEGRITY)

Otherwise the tests will run, but will not test the new data integrity features.

chris@ThinkPias:~/RIOT/tests/drivers/mtd_spi_nor$ make flash term
Building application "tests_mtd_spi_nor" for "nrf52840dk" with CPU "nrf52".

... all the normal build commands...

2024-05-30 00:08:35,710 # Connect to serial port /dev/ttyACM0
2024-05-30 00:08:36,717 # ...
Welcome to pyterm!
Type '/exit' to exit.
2024-05-30 00:08:36,717 # OK (3 tests)

When enabling the ENABLE_DEBUG and ENABLE_TRACE flags in the drivers/mtd_spi_nor/mtd_spi_nor.c file, the debug output should show the following lines:

...
 # mtd_spi_nor_write: ERR: program/erase failed! scur = 20
...
 # mtd_spi_nor_write: ERR: program/erase failed! scur = 60
...
 # OK (3 tests)

This indicates that the security related code path was triggered and if the tests still pass, the program and erase failures happened at the right location in the tests.

Issues/PRs references

This closes #14908.

crasbe avatar Apr 17 '24 09:04 crasbe

@benpicco I think I found a suitable solution to get rid of the Kconfig configuration options and have a versatile option to enable manufacturer dependent security options. The mtd_spi_nor_params_t struct (https://github.com/RIOT-OS/RIOT/blob/master/drivers/include/mtd_spi_nor.h) has a flag field, which is evaluated during compile time and is currently used to signify the page size. However it is a 16-bit wide flag field, so there should be plenty of space to add security flags.

The flags are defined in the global header file https://github.com/RIOT-OS/RIOT/blob/master/drivers/include/mtd_spi_nor.h This would follow the approach that the mtd_spi_nor driver is configured at a board level. One "downside" is that the common Opcode structure in https://github.com/RIOT-OS/RIOT/blob/master/drivers/mtd_spi_nor/mtd_spi_nor_configs.c can not be used in this form anymore, since the security related Opcodes are not shared between manufacturers and even worse, the Opcodes collide. While it would be possible to just ignore the collision and use the right opcode in the code later on, it's not the right approach since it blocks future extension of the code (if anyone wants to use the ACTUAL opcode).

Therefore I would introduce four new structures: two for Macronix with security (1byte and 4byte access) and two for ISSI (1byte and 4byte access). The comment above the structures says that the compiler only adds the selected structure to the memory, so it does not take up any additional space.

I'll probably have the preliminary commit with the proposed changes ready in a couple of hours, that should make everything clearer.

crasbe avatar Apr 17 '24 12:04 crasbe

The static-test codespell complains about these lines in drivers/mtd_spi_nor/mtd_spi_nor.c being a typo (it would like to have WELL instead of WEL):

/* Wait for WEL to be set */
...
/* Wait for WEL to be cleared */

I'm not sure how to fix that? Oh wel... :sweat_smile: Interestingly it does not occur when I run "make static-test" locally.

crasbe avatar Apr 17 '24 14:04 crasbe

Interestingly it does not occur when I run "make static-test" locally.

the static-test checks if the tool that is running a test is installed - so you probably do not have codespell installed.

I'm not sure how to fix that? Oh wel...

it there is a long form (write enable L... what is the L? eg WREL?) or alternative of WEL (STATUS_WEL ?) you could use that -and workaround the spell checker or as it suggest it could be added to the ignored words list.

kfessel avatar Apr 18 '24 20:04 kfessel

Interestingly it does not occur when I run "make static-test" locally.

the static-test checks if the tool that is running a test is installed - so you probably do not have codespell installed. That is the weird part about it: I did install codespell and it did complain about a couple of actual typos, but at the moment the tests are all running without failure:

chris@ThinkPias:~/flashdev-riot/RIOT$ make static-test
./dist/tools/ci/static_tests.sh
...
Running "./dist/tools/codespell/check.sh" ✓
Running "./dist/tools/uncrustify/uncrustify.sh --check" ✓
Command output:

	All files are uncrustified!

Running "./dist/tools/shellcheck/check.sh" ✓

I'm not sure how to fix that? Oh wel...

it there is a long form (write enable L... what is the L? eg WREL?) or alternative of WEL (STATUS_WEL ?) you could use that -and workaround the spell checker or as it suggest it could be added to the ignored words list.

WEL stands for "Write Enable Latch". However I found out that codespell added an inline option to ignore certain words. That would avoid adding a global ignored word (and potentially missing typos of "well" in the future): https://github.com/codespell-project/codespell?tab=readme-ov-file#inline-ignore

crasbe avatar Apr 19 '24 10:04 crasbe

WEL stands for "Write Enable Latch". However I found out that codespell added an inline option to ignore certain words. That would avoid adding a global ignored word (and potentially missing typos of "well" in the future): https://github.com/codespell-project/codespell?tab=readme-ov-file#inline-ignore

citing from https://github.com/codespell-project/codespell?tab=readme-ov-file#ignoring-words

note that spelling errors are case-insensitive but words to ignore are case-sensitive.

i don't think you need to go the inline route (wel would still be detected if WEL was added to the ignore file and if someone writes WEL i got somthing its probably by choice)

kfessel avatar Apr 19 '24 10:04 kfessel

Okay, this was a bit naive of me. Obviously the spellcheck only checks with the ignored word list that is currently in master. So either I have to ignore the failed spell test or open a separate PR with just the updated ignored word list :thinking:

crasbe avatar Apr 19 '24 10:04 crasbe

Okay, this was a bit naive of me. Obviously the spellcheck only checks with the ignored word list that is currently in master. So either I have to ignore the failed spell test or open a separate PR with just the updated ignored word list 🤔

that PR also would get spellchecked i think -> there is a egg and chicken problem

seems like (our)codespell does not follow their doc https://github.com/codespell-project/codespell/pull/3272 (don't know which date our version is

seems like our codespell is case-insensitive in either case

kfessel avatar Apr 19 '24 11:04 kfessel

i read it wrong by case sensitive they mean you need to type it the way they have it in their typo database (usually lowercase) (i don't know why they do it like this) https://github.com/codespell-project/codespell/issues/2451#issuecomment-1321312802

so for WEL to not trigger you need to add wel but then wel and wEl and Wel would also not trigger

kfessel avatar Apr 19 '24 11:04 kfessel

So this was quite a rabbit hole. No matter what I did I could not reproduce the issue... It turns out that Linux Mint had an old version of codespell. Now I installed it via pip and not form the package repository and now it complains about WEL as well, yay.

BUT... the latest Release of codespell does not support the inline ignores yet. It's only implemented in the development version.

A workaround is changing "WEL" to "WEL-Flag" and now the spellcheck passes.

crasbe avatar Apr 19 '24 21:04 crasbe

The last PR adds a new test for the mtd_spi_nor driver which tests the security features. Contrary to what I wrote in the first post, you can trigger the P_FAIL (Program Failed) and E_FAIL (Erase Failed) flags by protecting a block with the Block Protect flags in the status register.

The test is not suuuper neat because I had to copy some internal functions from the mtd_spi_nor driver to have direct access to the register. This seemed the least intrusive way to create the tests.

This is a log from the most relevant test for this PR, checking the security functions with the Block Protect bits.

2024-04-24 19:02:42,301 # ..test_mtd_block_protect: Checking the Block Protect Functions
2024-04-24 19:02:42,357 # mtd_spi_nor_write_page: 0x20000800, 0x200017d0, 0x0, 0x0, 0x10
2024-04-24 19:02:42,360 # mtd_spi_cmd: 0x20000800, 06
2024-04-24 19:02:42,364 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000173f, 1
2024-04-24 19:02:42,369 # mtd_spi_nor: wait device status = 0x3e, waiting WEL-Flag
2024-04-24 19:02:42,374 # mtd_spi_cmd_addr_write: 0x20000800, 02, (000000), 0x200017d0, 16
2024-04-24 19:02:42,379 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000172f, 1
2024-04-24 19:02:42,383 # mtd_spi_nor: wait device status = 0x3c, waiting !WIP
2024-04-24 19:02:42,388 # wait loop 0 times, yield 0 times, total wait 0us
2024-04-24 19:02:42,392 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000173f, 1
2024-04-24 19:02:42,397 # mtd_spi_nor: wait device status = 0x3c, waiting !WEL-Flag
2024-04-24 19:02:42,401 # mtd_spi_cmd_read: 0x20000800, 2b, 0x2000176f, 1
2024-04-24 19:02:42,405 # mtd_spi_nor_write: ERR: page program failed!
2024-04-24 19:02:42,409 # mtd_spi_nor_read: 0x20000800, 0x200017e0, 0x0, 0x10
2024-04-24 19:02:42,415 # mtd_spi_cmd_addr_read: 0x20000800, 03, (000000), 0x200017e0, 16
2024-04-24 19:02:42,420 # mtd_spi_nor_erase: 0x20000800, 0x0, 0x1000
2024-04-24 19:02:42,423 # mtd_spi_cmd: 0x20000800, 06
2024-04-24 19:02:42,427 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000176f, 1
2024-04-24 19:02:42,432 # mtd_spi_nor: wait device status = 0x3e, waiting WEL-Flag
2024-04-24 19:02:42,437 # mtd_spi_cmd_addr_write: 0x20000800, 20, (000000), 0, 0
2024-04-24 19:02:42,441 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000175f, 1
2024-04-24 19:02:42,446 # mtd_spi_nor: wait device status = 0x3c, waiting !WIP
2024-04-24 19:02:42,450 # wait loop 0 times, yield 0 times, total wait 0us
2024-04-24 19:02:42,454 # mtd_spi_cmd_read: 0x20000800, 05, 0x2000176f, 1
2024-04-24 19:02:42,459 # mtd_spi_nor: wait device status = 0x3c, waiting !WEL-Flag
2024-04-24 19:02:42,463 # mtd_spi_cmd_read: 0x20000800, 2b, 0x20001797, 1
2024-04-24 19:02:42,467 # mtd_spi_nor_erase: ERR: erase failed!
2024-04-24 19:02:42,467 # 
2024-04-24 19:02:42,468 # OK (4 tests)
2024-04-24 19:03:01,934 # Exiting Pyterm

These two messages are generated by the new code that was created by @vincent-d:

2024-04-24 19:02:42,405 # mtd_spi_nor_write: ERR: page program failed!
2024-04-24 19:02:42,467 # mtd_spi_nor_erase: ERR: erase failed!

HOWEVER... I do not love the solution with the configuration (setting both Flag and Opcode) in drivers/mtd_spi_nor/mtd_spi_nor_configs.c. On one hand it's unnecessary duplication and on the other hand it is really easy to forget to set one of the two (ask me how I know...). And if you forget to set the correct opcodes, it results in false positive security behaviour. An undefined opcode is set to 0x00 (NOP) and that returns 0x00. This 0x00 is interpreted as "everything is fine" by the security code.

My preferred option would be adding some Pseudomodules, similar to the AT24CXXX driver. However I'm not sure if it would make sense to do it the same way by allocating every namespace that begins with "M25F...", "M25R...", "IS25LP...", "IS25LE...", "SST25V..." etc. A better option would probably be something like "MTD_SPI_NOR_MX", "MTD_SPI_NOR_IS", "MTD_SPI_NOR_SST" for defining the opcode set and "MTD_SPI_NOR_SECURITY" and (in a future expansion) "MTD_SPI_NOR_ECC" for defining the features.

Any other ideas about this are very welcome.

crasbe avatar Apr 24 '24 20:04 crasbe

@benpicco You reviewed the original PR, could you take a quick look at this to give me some guidance on the previous question? You don't have to review this PR yet.

crasbe avatar Apr 25 '24 21:04 crasbe

I implemented the Pseudomodule solution and IMO it is way more elegant than the original idea and scales a lot better. One thing that became clear is that the ISSI flash behaves a lot different than the Macronix chip. The flags in the security register of the Macronix flash reset themselves after a successful read and the ISSI flags remain until cleared with a separate OpCode. Furthermore, it appears like the WEL (Write Enable Latch) Flag on the ISSI is not reset when the program or erase operation was not successful. This is not clear from the datasheet, it only states that the WEL flag is reset on completion of a program or erase operation, not that is has to be successful. Therefore the WEL flag has to be cleared with a dedicated command. When you don't reset the WEL flag, the wait_for_wel_reset function will poll the flag indefinitely.

So one thing to add to this function is trying to clear the flag manually perhaps after a certain number of tries. (As well as adding the timeout).

Therefore it took a lot of testing and fiddling until the ISSI chip was running as expected and the solution with the Pseudomodules proved to be superior in this case because I was able to add the additional OpCodes conditionally.

Unfortunately I did not separate the changes in separate commits because I did not intend them to become this involved...

crasbe avatar Apr 30 '24 18:04 crasbe

I think this PR is already quite extensive and beyond the originally intended scope, so I would like to postpone adding the timeout tasks for now to a separate PR.

crasbe avatar May 06 '24 10:05 crasbe

Is there anything I can do to help with the review process? Perhaps split off the test program (minus the security tests) to a seperate PR?

I could send out an Arduino Uno Shield with a Macronix flash as well for testing with a Nucleo board in case the reviewer does not have an nRF52840DK.

crasbe avatar May 27 '24 12:05 crasbe

@benpicco @fabian18 maybe someone of you feels more confident.

kfessel avatar May 27 '24 17:05 kfessel

Murdock results

:heavy_check_mark: PASSED

41921a7bcf7c3437f64f051923c156baa6887bbd fixup! fixup! fixup! tests/mtd_spi_nor: Added test with security features

Success Failures Total Runtime
10195 0 10195 15m:06s

Artifacts

riot-ci avatar May 29 '24 14:05 riot-ci

I don't really understand why the tests fail now, the changes I did are unrelated to the failures (all except for the busy_wait_us(1000000) in tests/drivers/mtd_spi_nor/main.c).

It would appear that most of the errors come from the fact that on an 8-bit ATmega an int is 16-bit instead of 32-bit on the other 32-bit architectures: https://gcc.gnu.org/wiki/avr-gcc#Type_Layout

I'm not really sure what to do about the busy_wait_us function, the maximum time I could wait here is 65ms, which defeats the purpose. The idea here was that the terminal needs some time to connect and it's annoying to lose half of the test output.

main.c: In function ‘main’:
main.c:255:18: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
     busy_wait_us(1000000);
                  ^

For the time being I'll just delete it.

crasbe avatar May 29 '24 18:05 crasbe

I'm not really sure what to do about the busy_wait_us function, the maximum time I could wait here is 65ms, which defeats the purpose.

Using the CPU to delay a function with a down-counting loop is the most costly option in terms of power consumption and the least accurate option. It is good enough for a brief delay in a driver's init function to avoid having to depend on ztimer just for that - or when desperate (e.g. early in boot before a timer is available).

IMO: When the unsigned type becomes an issue, it is a red flag that a different function should be used to delay execution.

maribu avatar May 29 '24 20:05 maribu

@maribu I absolutely agree with you. As I said, the delay was just a little help for me to catch the actual test output. However, considering that no other tests do it, it was better to delete it.

@kfessel I think I got all of the comments in the incorrect styling and added some more information regarding the test procedure.

Unrelated: I still don't love git 👀 If I had known how this escalates, I would've/should've approached it differently, but oh well. It's a bit tidier now I guess.

crasbe avatar May 29 '24 22:05 crasbe

If you really want to busy wait long you can always loop a busy_wait

for(uint32_t milliseconds; milliseconds; milliseconds--) busy_wait_us(1000);

also for many cpu counting down ( single_threaded / isr disabled ) isn't to bad in precision ( especially many avr-board that have no clock crystal but a cpu crystal and are very predictable counting) setting up a ztimer in up to millisecond is probably worse in precision on slow mcu

kfessel avatar May 30 '24 11:05 kfessel

@crasbe: do you want to tackle the missing timeouts (waiting on status change) (from your TODOs)? want to prepare the interfaces to be able to fail? ignore them (for reason e.g.: if this happens somthing is very wrong and the system is not recoverable)?

kfessel avatar May 30 '24 11:05 kfessel

@crasbe: do you want to tackle the missing timeouts (waiting on status change) (from your TODOs)? want to prepare the interfaces to be able to fail? ignore them (for reason e.g.: if this happens somthing is very wrong and the system is not recoverable)?

I'm not sure what the best way to go forward is here. IMO the timeout features have to potential to be a big can of worms as well, so I would tend to do that in a separate PR as well.

crasbe avatar May 30 '24 18:05 crasbe

This took a bit longer than I anticipated to get back to it, but other things happened.

I think we have an elegant solution now. The code introduces a new "SPI_NOR_F_CHECK_INTEGRITY" flag, that can be set in the mtd_spi_nor_params_t.flag field. The manufacturer is determined by reading the first byte of the JEDEC ID. With the 0x9F (Read JEDEC ID) command it is permissible to just read the first byte and then stop, so we can read the manufacturer ID only.

As suggested, the naming has been changed to "integrity" and the function has been excluded into a dedicated helper function, which reduces duplication (especially with the added manufacturer check).

When thinking about the best solution I noticed that the Pseudomodule approach would've made it impossible to use a Macronix and an ISSI device in one project. This rarely happens, but nevertheless it isn't nice to have such limitations imposed.

Yet to do is the following:

  • [ ] change EIO to a different error value, waiting for input from @benpicco
  • [x] create a tracking issue with all the stuff that's left to do which I don't really want to include in this PR anymore

crasbe avatar Jul 02 '24 17:07 crasbe

I updated the initial message in this PR to reflect the new testing procedure, which (at this moment) requires patching the boards/nrf52840dk/include/board.h file to activate the data integrity checks.

crasbe avatar Jul 02 '24 19:07 crasbe