RIOT icon indicating copy to clipboard operation
RIOT copied to clipboard

mtd_spi_nor | littlefs: improve reliability with corrupted flash

Open vincent-d opened this issue 5 years ago • 10 comments

Contribution description

This PR improves the reliability of mtd_spi_nor and littlefs:

  • mtd_spi_nor: properly checks for WEL enabling/disabling, check for errors when writing/erasing and return a dedicated error code (-EIO).
  • littlefs: catch -EIO and translate it into LFS_ERR_CORRUPT that is internally used to realloc sectors if needed

Testing procedure

Check that littlefs(2) still works (examples/filesystem). This has been tested with flash that had dead sectors. littlefs could detect them and recover.

vincent-d avatar Aug 31 '20 14:08 vincent-d

What's the deal with the Security register and will it be available on all flash devices?

On macronix flashes (at least, I haven't check other vendors actually :/ ), security register contains some flags: Erase fail bit and Program fail bit that are set when erase or program failed (usually meaning the sector is dead). See, for instance (page 83): https://www.macronix.com/Lists/Datasheet/Attachments/7574/MX25L51245G,%203V,%20512Mb,%20v1.3.pdf

vincent-d avatar Aug 31 '20 14:08 vincent-d

Hm I couldn't find such flags on neither GD25Q32C nor AT25SF041

benpicco avatar Aug 31 '20 14:08 benpicco

Hm I couldn't find such flags on neither GD25Q32C nor AT25SF041

d'oh! I'll revisit the patch to enable this with a compile flag

vincent-d avatar Aug 31 '20 15:08 vincent-d

Is there some way to detect this at run-time?

benpicco avatar Aug 31 '20 15:08 benpicco

I guess it could be possible to read the jedec ID. But I'm not sure it's worth it.

vincent-d avatar Aug 31 '20 15:08 vincent-d

ping @benpicco

fjmolinas avatar Feb 02 '21 11:02 fjmolinas

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

stale[bot] avatar Mar 02 '22 11:03 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

stale[bot] avatar Nov 02 '22 05:11 stale[bot]

@benpicco Can we get this reactivated? The project I'm currently working on uses an ISSI flash which has the same security features (however the bit positions in the registers are different and the OpCodes are different as well). This PR would be a good base and reduce the amount of work significantly.

Given the age of the PR, it probably has merge conflicts now. Unfortunately it looks like @vincent-d is not active anymore, his last GitHub activity was in 2022 and there is not much to find about OTAkeys anymore either.

On a different note: I checked the code and there is no timeout for the added functions wait_for_write_enable_cleared and wait_for_write_enable_set, which would get the thread stuck if the chip does not answer. (The function wait_for_write_complete which existed before and still exists in the current code does not have a timeout either, it counts the attempts and how many times it yielded. But there is no timeout either.)

I don't know how this is handled generally in other modules?

crasbe avatar Apr 15 '24 14:04 crasbe

Sure, feel free to take over the PR. There should of course be a timeout, we don't want block the application in case of a hardware error but instead give to opportunity to handle it gracefully.

benpicco avatar Apr 15 '24 14:04 benpicco