stabilizer USB/flash `clear`/`save` behavior w.r.t. default changes

Once things have been saved to flash (also implicitly through e.g. clear), then changes in Default are not propagated. This makes the behavior with an empty flash (that has never seen clear) different from the one after clear. Usual expectation is that clear restores behavior to "what it was before/initially", arguably including transparent propagation/activation of changes to Default with app updates.

The proposed alternative is to use remove_item in current clear and https://github.com/quartiq/stabilizer/issues/860

https://github.com/quartiq/stabilizer/pull/857#discussion_r1540132885

Mar 27 '24 11:03 jordens

@jordens Based on my understanding, if you change a Default in the application and re-run the clear command, you will receive the new Default value. Is this specifically what is not desired?

I'm not entirely seeing the difference between an uninitialized flash and one that is cleared. Yes, it will have initial values in it that may or may not be Default, but the proposed change to use remove_item and the current implementation wouldn't have any different effect on the clear command or operation of the device.

Mar 27 '24 11:03 ryan-summers

I'd see the additional clear required as an work around.

The behavior is different. With the proposed change clear followed by app update also updates to new defaults (like sucessive app updates without ever touching USB).

Mar 27 '24 13:03 jordens

Based on the findings of https://github.com/quartiq/stabilizer/pull/884#discussion_r1644776679, I don't think we can use the remove_item API that sequential-storage provides because the chip cannot reliably write an already-written flash cell due to ECC codes.

Jun 18 '24 17:06 ryan-summers

I talked with Dion (maintainer of sequential-storage) about this and his proposal was to add a feature flag to have the sequential-storage CRC be saved within its own unique flash-word. Then, we could always write this word to zero.

We would need to verify that writing 0 to a flash word also results in all 0s for the ECC bits. The RM says that SECDEED (extended hamming codes) are used for ECC bit calculations. This ultimately relies on the chip using an even-parity scheme I believe. We could try this and see if we hit faults as well, but it might be a bit riskier.

Jun 18 '24 17:06 ryan-summers

If we have HW ECC, then we don't need the additional CRC, right? Does it make sense to change/adapt/fork sequential-storage for HW ECC flash?

Jun 18 '24 18:06 jordens

But regarding the implementation detail of what a key removal should look like: using a magic value (empty, None) to mark removed keys as you suggest sounds fine. Just needs to be unique: Empty vor invalid would only work if it's JSON, right?

Jun 19 '24 08:06 jordens

If we have HW ECC, then we don't need the additional CRC, right? Does it make sense to change/adapt/fork sequential-storage for HW ECC flash?

ECC and CRCs are accomplishing different things. ECC covers a 256-bit (32 byte) flash line against potential flash wear-out and/or marginal cell voltage levels resulting in inconsistent read results. Generally, this means ECC is protecting the flash against physical properties of flash memory.

CRC on the other hand is protecting against data corruption. This operates at a higher level, so if you (intentionally) partially write over CRC'd data (i.e. overlapping elements), you would invalidate the CRC of the object. However, ECC wouldn't defend against this because ECC gets updated whenever flash is written.

But regarding the implementation detail of what a key removal should look like: using a magic value (empty, None) to mark removed keys as you suggest sounds fine. Just needs to be unique: Empty vor invalid would only work if it's JSON, right?

Right now, the data is postcard-serialized. The current implementation is storing a Vec::default(), i.e. &[] in the data. Then, when we fetch the key from storage, we can see that the item deserializes into a slice of length 0, so we can skip using set_key_by_postcard() since there's no data to register with the settings structure.

Jun 19 '24 08:06 ryan-summers

If the intention of the CRC there is to protect against bugs in the code (not HW flash read/write errors/power cuts), then is it useful? Why apply this protection here and not elsewhere? Is it a robust/complete/significant protection?

Regarding the second point, I suspect that postcard might serialize () as just that empty slice and (strictly speaking) make this an ambiguous choice of "deleted".

Jun 19 '24 09:06 jordens

I'm not sure that the CRC in seq-store (there are in fact two different ones) is intended to protect against code bugs. It's prime aim does appears to be "special marker for erased" and "protect against power cut/flash read/write errors". Just like HW ECC.

Jun 19 '24 09:06 jordens

If the intention of the CRC there is to protect against bugs in the code (not HW flash read/write errors/power cuts), then is it useful? Is it a robust/complete/significant protection?

A CRC would indeed protect us against a power cut, as the CRC would be written, half of the element would then be written, but the second half wouldn't. Then, upon reboot, you would notice that the CRC does not match the element and would discard it. CRCs like this are super common in embedded and tend to work well in keeping persistent storage robust.

Why apply this protection here and not elsewhere?

What do you mean? It's applied internally in sequential-storage because of their design patterns. I'm unsure where else we would need them.

Regarding the second point, I suspect that postcard might serialize () as just that empty slice and (strictly speaking) make this an ambiguous choice of "deleted".

We aren't serializing (), but rather &[]. Postcard's wire format says that any variable-length sequences will be prefixed with a varint encoding of length, so an empty slice encodes as [0x00]. Given that we only ever serialize a single sequence into flash, the serial representation will always be [varint length, <sequence data>], so this should always be robust as currently used.

Jun 19 '24 09:06 ryan-summers

A CRC would indeed protect us against a power cut,

As does HW ECC. The point was to figure out whether CRC on top of ECC is important.

Why apply this protection here and not elsewhere? What do you mean?

Why do we need to protect against code bugs in the storage code but not elsewhere? E.g. why don't we add CRC to MQTT payloads to protect against bugs analogously?

Jun 19 '24 10:06 jordens

A CRC would indeed protect us against a power cut,

As does HW ECC. The point was to figure out whether CRC on top of ECC is important.

I don't believe HW ECC would protect you from a device shutdown during a flash write operation in all cases. Take for example an object that spans 64 bytes (512 bits, 2 flash words), followed by a CRC stored after it. Assume that a power loss happens after the first 32-bytes have been successfully programmed, but the last 32-bytes have not. In this case, ECC indicates valid data in all 64-bytes (because there was not a power loss in the middle of an individual word being programmed), but the CRC would fail because the second 32-bytes does not contain the expected data.

Why do we need to protect against code bugs in the storage code but not elsewhere? E.g. why don't we add CRC to MQTT payloads to protect against bugs analogously?

Generally this is done in persistent storage that can be modified across reboots and/or different firmware versions. Say you have version 1 that saves some data with CRCs. Then, a while later you make a firmware version 2 that completely redoes the memory layout and ignores whatever was already there. If someone were to reprogram version 1 firmware after having used version 2, they could potentially load all kinds of corrupt data without knowing it (since version 2 trampled all over whatever was there).

I've had this exact bug happen in a different project where the items being stored were logs. We ended up seeing random striping and strange data in the logs, but thankfully the CRCs were able to indicate to us that the data was corrupted.

This is less necessary for i.e. MQTT payloads because they are RAM based, temporary variables. In some safety critical systems (i.e. medical devices), important RAM-based variables (like radio payloads, MQTT traffic, or active settings) are CRC-protected as a defensive mechanism against unintended memory corruption. This generally is driven by risk analysis and fault-tolerance requirements of the device.

Jun 19 '24 10:06 ryan-summers

the CRC would fail because the second 32-bytes does not contain the expected data.

But for that scenario one doesn't need (and arguably shouldn't use) CRC or ECC. I thought seq-stor uses the usual transaction-based updates which protect against aborts.

across reboots and/or different firmware versions

But CRC isn't a robust protection against these schema changes at all. As you describe it's coincidental that it triggered in your case.

It's correct to say that there needs to be a risk analysis instead of coincidental detection. And I would argue that the faults where CRC in our context would help (in the presence of HW ECC and transactions) aren't crucial and worse, that (a) doubling CRC as a deletion marker is a bad idea and (b) matching CRC provides false assurance that the schema matches. I'd rather have an architecture that at least doesn't fight ECC (https://github.com/quartiq/stabilizer/pull/884#discussion_r1644776679 indicates that the CRC updates are fighting ECC).

Jun 19 '24 12:06 jordens

the CRC would fail because the second 32-bytes does not contain the expected data.

But for that scenario one doesn't need (and arguably shouldn't use) CRC or ECC.

ECC is not an optional feature. It's a property of the flash memory on the STM32H743ZITx chip and is always used. There is no mechanism to opt out of it.

I thought seq-stor uses the usual transaction-based updates which protect against aborts.

What do you mean by "transaction-based updates which protect against aborts"? This doesn't exist when programming flash memory in general. Say you are programming 3 words, where i denotes the initial state and rp is the reprogrammed state.

At start up, you have [i_1, i_2, i_3], where the first two positions are your object and the last position is the CRC
You successfully program the first flash word: [rp_1, i_2, i_3]
Your system then loses power, so the contents of flash are now [rp_1, i_2, i_3]
- Assume ECC in flash memory is fine, since you did not power down while programming a flash word
Upon reboot, you check the memory and see [RP, I, I]. You calculate the CRC of the first two positions [rp_1, i_2] and notice that the CRC does not match the final position i_3, so you discard the item [rp_1, i_2] as invalid.

There's no way you could ever protect yourself against a power off during the flash reprogramming process regardless of software design. There's always a critical time window when flash is in a transitional state. This is a key purpose to the CRC - you can detect if an error happened during the transitional reprogramming state and ignore the entry as a result.

across reboots and/or different firmware versions

But CRC isn't a robust protection against these schema changes at all. As you describe it's coincidental that it triggered in your case.

I don't think It's intended to be perfect. It's a canary that something happened to change the structure of your data that you may not have done intentionally. It is never 100% accurate, but has a sufficiently high detection rate (generally >99%). It was not coincidental in my case, it would have happened with a 99% probability for each element modified (and there were hundreds of elements).

And I would argue that the faults where CRC in our context would help (in the presence of HW ECC and transactions) aren't crucial and worse, that (a) doubling CRC as a deletion marker is a bad idea and (b) matching CRC provides false assurance that the schema matches.

(a) I don't disagree that clearing the CRC as a deletion marker may not be the best idea, but we aren't doing that in our implementation.

(b) I don't think the CRC's primary goal is verifying data schema. Ultimately, we as software developers are responsible for managing schema through firmware versions. It's not been delegated to an external library.

I'd rather have an architecture that at least doesn't fight ECC (#884 (comment) indicates that the CRC updates are fighting ECC).

This only happened because I misunderstood the capabilities of flash on our chip. Our chip does not support writing the same flash location twice, and i was trying that anyways. sequential-storage requires that your flash implement the MultiwriteNorFlash trait, which indicates that it's safe to write a flash location twice.

Jun 19 '24 12:06 ryan-summers

But for that scenario one doesn't need (and arguably shouldn't use) CRC or ECC. ECC is not an optional feature.

I meant neither CRC nor ECC are needed (or typically used) to achieve robustnes in that case.

Transaction-based robustnes against aborts goes like this:

Given several storage pages with some write granularity in "chunks" (the flash words)
And a known initial state EMPTY (page erased)
Choose three distinguishable values as headers in each chunk: EMPTY, DIRTY, COMMIT
Start a transaction by writing DIRTY+payload to the first EMPTY chunk
continue the transaction with writing to further DIRTY+payload chunks
Commit the transaction with a COMMIT+payload chunk
Only ever write a chunk once
When out of EMPTY chunks, defragment/compress/balance with a similar robust (two phase vacate page by duplication/compaction, then erase vacated page) transaction algo.

With this your scenario of an aborted transaction between complete chunk writes is detected (last non-EMPTY chunk is DIRTY) without the need for any of CRC or ECC. ECC covers other mid-chunk aborts and the remaining corruption cases. In any case no need for an additional CRC. And it only needs single writes.

I don't think It's intended to be perfect.

If it doesn't have a clear intention and claims, if it can provide ambiguous validity claims, if it collides with other features (if it weren't for ECC, our flash can be MultiwriteNorFlash, no?), I'd argue that it might be better to revise it, not rely on it, and avoid it.

I don't think the CRC's primary goal is verifying data schema.

It should not be! I understood your anecdote of precisely that scenario as an argument in favor of CRC.

MultiwriteNorFlash

Right. We can't use the clear/remove algo of seq-store. But using an empty value to mark deletion isn't good either as I mentioned before: it collides with the serialization of the unit type.

Jun 19 '24 14:06 jordens

stabilizer stabilizer copied to clipboard

USB/flash `clear`/`save` behavior w.r.t. default changes

stabilizer
stabilizer copied to clipboard