opentitan icon indicating copy to clipboard operation
opentitan copied to clipboard

[rom,rom_ext] During owner firmware selection/verification handle flash ECC errors as non-fatal

Open jettr opened this issue 1 year ago • 17 comments

Description

When ROM_EXT is deciding and verifying which slot of owner firmware should execute, any flash ECC error while reading the potential owner firmware image should be treated similar to a signature failure, i.e. the next slot should be considered. ECC flash errors during the read operation should not reset the chip in this case. Hopefully we can accomplish this with our current hardware.

jettr avatar Feb 13 '24 22:02 jettr

Hi @msfschaffner, marking this for PROD.M2 so that we can triage. The idea here is that during secure boot, we want to be able to avoid triggering a fatal escalation when reading the image to calculate its hash. We rather handle the alert, treat the impacted code partition as corrupted, and try to boot from the next partition. We want to consider this change to avoid getting into a potential boot failure loop.

This under the assumption that the probability of flash corruption with a hamming distance of >2 bits is non-zero; and, the corruption is due to environmental issues and not active attacks.

This change may also be applicable to partitions used to store data.

CC: @zi-v

moidx avatar Feb 14 '24 04:02 moidx

I think that's a good idea. BTW, I guess you mean hamming distance > 1 bits, right?

zi-v avatar Feb 14 '24 04:02 zi-v

I believe the current hardware implementation should support this already.


First some background:

The FLASH_CTRL has two flavors of ECC:

  1. reliablity ECC on 64bit blocks, added after encrypting two 32bit words with PRINCE.
  2. integrity ECC on 32bit words. this is part of the bus end-to-end integrity scheme.

Both ECC flavors are enabled together on region/page boundaries with the ECC_EN bit. The first flavor is reliability ECC that is truly SECDED in the sence that 1bit errors are corrected. 2bit errors lead to an in-band access error, and trigger a fatal alert. The second flavor is for detection only (no SECDED) - although it is not truly end-to-end as in SRAM_CTRL, since the block size of the cipher and the amount of meta data bits available per 64bit block do not allow us to encrypt and store full 7bit ECC tags alongside each 32bit data word. Therefore, the bus integrity ECC is stripped away and a 4b checksum is calculated over 2x32bit words before encryption (see https://opentitan.org/book/hw/ip/flash_ctrl/doc/theory_of_operation.html#flash-ecc-and-icv for more details.). Any mismatch upon readout will trigger a fatal alert and lead to an in-band access error on the bus.

The good part here is that the bus integrity metadata bits have to be recomputed upon sending the data back to the processor, and hence a fatal alert will be contained within the FLASH_CTRL and not propagate into the Ibex LSU. As opposed to fatal alerts within Ibex, a fatal ECC alert will not shut down the FLASH_CTRL. Rather, the FLASH_CTRL will jujst keep on sending alert events to the alert handler until reset.


Now, there are a few ways we can handle the scenario discussed above:

  1. Disable ECC on the region in question when hashing an image during secure boot. This will prevent ECC errors from occurring - but it also disables single bit error correction.

  2. Enable ECC, and filter fatal alerts from FLASH_CTRL in the alert handler to allow firmware to handle them instead of immediately resetting the chip. Note that fatal alerts will however keep on firing until a clean system reset is performed, since they are sticky in FLASH_CTRL. So eventually, the chip should be reset.

  3. Alternative to 2): Enable ECC, and let the alert handler reset the system. The reset reason and alert handler crashdump can then be inspected after rebooting in order to determine that the reset occurred due to a fatal ECC alert in the FLASH_CTRL, and appropriate action can be taken.

Does any of these approaches work for the scenario you had in mind? If yes, I am advocating to keep the current RTL as-is, since any change in this scheme will cause quite a bit of RTL/DV work to make sure it's done properly, and it could alter the security posture of this subsystem.

CC @jdonjdon please let me know if anything does not align with your understanding of FLASH_CTRL.

msfschaffner avatar Feb 15 '24 21:02 msfschaffner

I believe the current hardware implementation should support this already.

First some background:

The FLASH_CTRL has two flavors of ECC:

  1. reliablity ECC on 64bit blocks, added after encrypting two 32bit words with PRINCE.
  2. integrity ECC on 32bit words. this is part of the bus end-to-end integrity scheme.

Both ECC flavors are enabled together on region/page boundaries with the ECC_EN bit. The first flavor is reliability ECC that is truly SECDED in the sence that 1bit errors are corrected. 2bit errors lead to an in-band access error, and trigger a fatal alert. The second flavor is for detection only (no SECDED) - although it is not truly end-to-end as in SRAM_CTRL, since the block size of the cipher and the amount of meta data bits available per 64bit block do not allow us to encrypt and store full 7bit ECC tags alongside each 32bit data word. Therefore, the bus integrity ECC is stripped away and a 4b checksum is calculated over 2x32bit words before encryption (see https://opentitan.org/book/hw/ip/flash_ctrl/doc/theory_of_operation.html#flash-ecc-and-icv for more details.). Any mismatch upon readout will trigger a fatal alert and lead to an in-band access error on the bus.

The good part here is that the bus integrity metadata bits have to be recomputed upon sending the data back to the processor, and hence a fatal alert will be contained within the FLASH_CTRL and not propagate into the Ibex LSU. As opposed to fatal alerts within Ibex, a fatal ECC alert will not shut down the FLASH_CTRL. Rather, the FLASH_CTRL will jujst keep on sending alert events to the alert handler until reset.

Now, there are a few ways we can handle the scenario discussed above:

  1. Disable ECC on the region in question when hashing an image during secure boot. This will prevent ECC errors from occurring - but it also disables single bit error correction.
  2. Enable ECC, and filter fatal alerts from FLASH_CTRL in the alert handler to allow firmware to handle them instead of immediately resetting the chip. Note that fatal alerts will however keep on firing until a clean system reset is performed, since they are sticky in FLASH_CTRL. So eventually, the chip should be reset.
  3. Alternative to 2): Enable ECC, and let the alert handler reset the system. The reset reason and alert handler crashdump can then be inspected after rebooting in order to determine that the reset occurred due to a fatal ECC alert in the FLASH_CTRL, and appropriate action can be taken.

Does any of these approaches work for the scenario you had in mind? If yes, I am advocating to keep the current RTL as-is, since any change in this scheme will cause quite a bit of RTL/DV work to make sure it's done properly, and it could alter the security posture of this subsystem.

CC @jdonjdon please let me know if anything does not align with your understanding of FLASH_CTRL.

@msfschaffner It is same as what I know.

jdonjdon avatar Feb 15 '24 22:02 jdonjdon

Thanks @msfschaffner!

We can explore options 1) & 2). I wasn't sure if this fatal alert was similar to the local fatal alerts. It is good that this is something we can configure in the alert handler.

We'll have to play with escalating to interrupt to see if we can reliably pinpoint to the corrupted access word, otherwise we will have to think about a mitigation scheme. From a reliability point of view, storing the failed flash access address in a register could be helpful in aiding firmware recover from this error.

If unable to pinpoint to the corrupted address, firmware can opt for erasing and re-writing the entire code partition.

I will leave this on M2 until I get a chance to sync with @cfrantz and @jettr.

moidx avatar Feb 16 '24 06:02 moidx

I think Jett's concern was in part that if ROM_EXT uses memory mapped flash accesses, then reading any flash word that has ECC failure will result in a RISC-V trap, which could easily turn into a boot loop. So ROM_EXT should be careful to either only go through the controller IO registers when reading and verifying the two slots of the next stage, or it should have a non-trivial trap handler, which allows it to recover and continue looking at the other slot.

Regards Jes

On Thu, Feb 15, 2024 at 10:04 PM moidx @.***> wrote:

Thanks @msfschaffner https://github.com/msfschaffner!

We can explore options 1) & 2). I wasn't sure if this fatal alert was similar to the local fatal alerts. It is good that this is something we can configure in the alert handler.

We'll have to play with escalating to interrupt to see if we can reliably pinpoint to the corrupted access word, otherwise we will have to think about a mitigation scheme. From a reliability point of view, storing the failed flash access address in a register could be helpful in aiding firmware recover from this error.

If unable to pinpoint to the corrupted address, firmware can opt for erasing and re-writing the entire code partition.

I will leave this on M2 until I get a chance to sync with @cfrantz https://github.com/cfrantz and @jettr https://github.com/jettr.

— Reply to this email directly, view it on GitHub https://github.com/lowRISC/opentitan/issues/21353#issuecomment-1947808686, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADSTVWYMZ3MDWGRYOOKTQS3YT3ZGTAVCNFSM6AAAAABDHIRGH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBXHAYDQNRYGY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jesultra avatar Feb 16 '24 17:02 jesultra

...and by the way, the same goes for how the ROM decides which of the two slots of ROM_EXT to jump to. It should also not use memory mapped reads, or if it does, have a trap handler that allows recovery.

On Fri, Feb 16, 2024 at 9:32 AM Jes Klinke @.***> wrote:

I think Jett's concern was in part that if ROM_EXT uses memory mapped flash accesses, then reading any flash word that has ECC failure will result in a RISC-V trap, which could easily turn into a boot loop. So ROM_EXT should be careful to either only go through the controller IO registers when reading and verifying the two slots of the next stage, or it should have a non-trivial trap handler, which allows it to recover and continue looking at the other slot.

Regards Jes

On Thu, Feb 15, 2024 at 10:04 PM moidx @.***> wrote:

Thanks @msfschaffner https://github.com/msfschaffner!

We can explore options 1) & 2). I wasn't sure if this fatal alert was similar to the local fatal alerts. It is good that this is something we can configure in the alert handler.

We'll have to play with escalating to interrupt to see if we can reliably pinpoint to the corrupted access word, otherwise we will have to think about a mitigation scheme. From a reliability point of view, storing the failed flash access address in a register could be helpful in aiding firmware recover from this error.

If unable to pinpoint to the corrupted address, firmware can opt for erasing and re-writing the entire code partition.

I will leave this on M2 until I get a chance to sync with @cfrantz https://github.com/cfrantz and @jettr https://github.com/jettr.

— Reply to this email directly, view it on GitHub https://github.com/lowRISC/opentitan/issues/21353#issuecomment-1947808686, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADSTVWYMZ3MDWGRYOOKTQS3YT3ZGTAVCNFSM6AAAAABDHIRGH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBXHAYDQNRYGY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jesultra avatar Feb 16 '24 17:02 jesultra

The flash_ctrl register ERR_ADDR can be used to determine the address of a synchronous access error due to ECC. We can probably use this as part of option 2) to register and/or handle the error in the ROM and/or ROM_EXT.

I am assigning this issue to the SiVal Project and SV3 milestone to implement a test to evaluate both options.

This issue will later be tracked as part of the ROM and ROM_EXT milestones.

moidx avatar Feb 16 '24 21:02 moidx

CC @cfrantz and @timothytrippel

moidx avatar Feb 16 '24 21:02 moidx

Do we have ROM_EXT linked for both bank A and B? I would like to try composing an image that uses both slots, and then corrupt one or the other ROM_EXT to see what happens.

I am a little worried about the comment about fatal ECC errors being sticky in FLASH_CTRL. I think it would be good, if our Google firmware running in one slot could notice if the copy in the other slot has degraded due to age of the chip (or double alpha decay), and let ChromeOS know. ChromeOS would then presumably react by re-writing an image to the inactive partition, but then we would have no immediate desire to reset the OT chip, as that typically involves the Chromebook rebooting, and we do not want to put users through random and unnecessary reboots.

jesultra avatar Feb 16 '24 22:02 jesultra

Hi @jesultra, we have a and b images for most SKU targets. See for example:

https://github.com/lowRISC/opentitan/blob/8056b6cd3b44680d6623c065f0f68db649bfcd60/sw/device/silicon_creator/rom_ext/prodc/binaries/BUILD#L7-L15

Yes, I agree with your concern. The test of options 1) and 2) is to make sure we can fulfill the use case as described.

We can also demote the alert to recoverable in hardware, but this would require security evaluation. We can proceed with discussing the security impact in the security working group. In this case I would prefer to favor reliability over security.

CC @johannheyszl, is this something we can cover in the next security WG? Thanks!

moidx avatar Feb 17 '24 00:02 moidx

The discussion is currently about option 2 - 'Run w/ ECC and handle fatal alerts somehow'. Option 1 - 'Disabling ECC while reading for hashing+signature verification' was mentioned earlier but we disregard it because we want to use the single bit error correction, correct?

cc @vogelpi

johannheyszl avatar Feb 19 '24 14:02 johannheyszl

Disabling ECC while reading for hashing+signature verification' was mentioned earlier but we disregard it because we want to use the single bit error correction, correct

That would be my preference yes. Having single bit error correction enabled is highly desirable.

jettr avatar Feb 20 '24 16:02 jettr

In case we decide to demote the flash_ctrl alert severity from fatal to recoverable, can you spawn another issue and put it into the PROD.M3 milestone so that we can take care of the alignment? I don't think it's going to be a dramatic change if that is all that is required, hence does not necessarily have to be in PROD.M2.

msfschaffner avatar Feb 21 '24 20:02 msfschaffner

In case we decide to demote the flash_ctrl alert severity from fatal to recoverable, can you spawn another issue and put it into the PROD.M3 milestone so that we can take care of the alignment? I don't think it's going to be a dramatic change if that is all that is required, hence does not necessarily have to be in PROD.M2.

Just so I make sure I am doing this right: You want me to create a new issue that just tracks the hardware change for flash_ctrl to demote alert severity from fatal to recoverable, and this this issue stays open to track the ROM and ROM_EXT FW changes needed to handle the recoverable alert properly?

jettr avatar Feb 22 '24 18:02 jettr

Yes that is correct. Issues can only be associated with one milestone, hence we keep this one here in Earlgrey ES SV3 for now, and assign the new issue tracking the hardware change to Earlgrey PROD.M3.

msfschaffner avatar Feb 22 '24 20:02 msfschaffner

Thanks for opening issue #21637 for tracking / discussing the option of making this alert recoverable. I'll add my thoughts there.

vogelpi avatar Feb 22 '24 22:02 vogelpi

AFAIU, we are delaying the response to fault attack by waiting until signature failure instead of immediately reacting to it. This raises the question of if anything can go wrong when we continue with the rest of signature verification with modified ROM_ext data (that is how I interpreted @nasahlpa's question in Sec WG).

I think delaying the error response is fine. I will go on a short story about why it would not be a problem from signature’s security perspective.

Formally, the security of a signature scheme is judged against a security model called existential unforgeability under chosen message attack (EUF-CMA). Putting the formality aside, it simply says that it is hard for an attacker to produce any new (m, sig) pair that passes the signature verification, even if the attacker is allowed to ask for and see many valid signature pairs.

Applied to this specific issue, I believe letting attacker modify bytes/words of loaded firmware resembles to giving the attacker limited power in modifying the message. I am guessing we are already keeping the error aside to raise a failure later, but even without it, it should be very hard for attacker to trick signature verification to compute correctly by only faulting flash data. More importantly, with signature verification algorithm there is no secret data to leak, so continuing with the rest of verification algorithm should be fine.

ballifatih avatar Feb 29 '24 22:02 ballifatih

To answer the question from the last paragraph in Fatih's comment. Yes, we do plan on "keeping the error" to raise a failure later. Specifically, when ROM_EXT looks at the user code in the two banks, trying to figure out which one to boot, if during signature verification of one, it encounters unrecoverable ECC errors, it should immediately stop attempting to verify that bank, not even comparing the final signature, and instead proceed with the other bank, which hopefully would produce a signature match, so it can be booted.

This will allow us to recover from e.g. power loss during firmware upgrade, or other situations that may leave one of the banks in a corrupt state that generates ECC errors when attempted read.

I sincerely hope that when the ROM code needs to decide which of the two ROM_EXT banks it should execute, that it will do the same, and tolerate ECC errors.

jesultra avatar Feb 29 '24 22:02 jesultra

thanks @jesultra for stating this so clear:

  • upon flash error, we can immediately stop verifying,
  • but we want to stay functional to try boot the other bank,
  • which would not be the case if flashctrl continues firing alerts :)

johannheyszl avatar Mar 01 '24 07:03 johannheyszl

I just wanted to quickly confirm that we discussed this in the Sec WG meeting on Feb 29, 2024. There were no red flags and there was consensus to move forward with the proposed change.

FYI @msfschaffner

vogelpi avatar Mar 02 '24 01:03 vogelpi

Update: PR #22431 which changes flash_ctrl as proposed here and approved by the Security WG (demote the alert to recoverable in hardware) has been merged. I believe this issue can be closed as well @jettr @moidx .

vogelpi avatar Apr 16 '24 03:04 vogelpi

No, the main part of the work is still to do, unless I am misunderstanding something.

My understanding is this: If the CPU core attempts to read from memory mapped flash with ECC, and the data happens to be corrupt, then it will get a exception/trap of some sort. This will now be recoverable, in that the flash controller and chip does not require a reset, which is good.

However, as ROM code in e.g. rom_verify() tries to evaluate the ROM_EXT headers from the two slots, in order to decice which of them to boot, it currently does something like if (launder32(manifest->security_version) < ...) where manifest is a pointer into memory mapped flash. If a prior update of ROM_EXT in slot A was interrupted due to e.g. power loss, and left the flash in corrupted state, ROM should boot the older ROM_EXT in slot B, however, the ROM code as I can see it would hit a trap while just trying to peek at the header of ROM_EXT in slot A, and end in infinite reset loop, only recoverable through SPI bootstrap. This would mean a bricked Chromebook, that probably needed to have its mainboard replaced via an expensive RMA process.

In my view, we need the ROM code (and ROM_EXT) to be modified in one of two ways, either A. so that it accesses flash only through the register interface, allowing it to handle ECC errors without core trap, or B. enhance its trap handler such that it recognizes that it was in the process of rom_verify(), and somehow return from the trap handler to continue execution and look at the other slot.

In my view, option B is too technically challenging, but option A may require significant code change.

Edit: The issue is not limited to rom_verify(), the very fact that ROM code has a boot_policy_manifest_a_get() method that returns a memory pointer (const manifest_t *)TOP_EARLGREY_EFLASH_BASE_ADDR invites danger, in that any defererencing could cause a trap and infinite boot loop, and if we choose option A, in my opinion, this method must be replaced with one that returns an integer offset into the flash, rather than a pointer.

jesultra avatar Apr 17 '24 17:04 jesultra

Thanks for the feedback @jesultra , this makes sense. I was only looking at this from the hardware / flash_ctrl perspective. We should probably ping @cfrantz and @timothytrippel about the ROM/ROM_EXT part.

vogelpi avatar Apr 18 '24 10:04 vogelpi

After getting up to speed on the above, it seems while the hardware has been altered to fix part of the underlying issue here (the fatal-ness, i.e., "stickiness", of multi-bit ECC errors has been downgraded to non-fatal now), there is still some SW (both ROM and ROM_EXT) fixes required.

I would like to:

  1. summarize pros and cons of @jesultra's two proposed solutions, and
  2. propose an additional (hybrid) approach.

Option 1 (@jesultra's "test-the-waters" approach)

When reading a potential ROM_EXT/Owner FW payload from flash data pages during ROM/ROM_EXT execution (respectively) to perform both manifest checks and hashing for a signature check, we would only access these pages through the flash_ctrl's register interface.

Pros

  • We can attempt a read of potentially (multi-bit) corrupted data without triggering a load access fault.
  • Simple to implement: if corrupted data is encountered, we can handle it in place (clearing the alert source).

Cons

  • This will be slow, and increase boot time (potentially significantly). Every flash word read now requires multiple write and read bus transactions.

Option 2 (@jesultra's "ask for forgiveness later" approach)

The current ROM/ROM_EXT IRQ/exception handlers extremely simple (on purpose) and all trigger a chip shutdown. Instead of this, we enhance the ROM/ROM_EXT exception handlers to recover from multibit ECC errors by adding a "load access fault" exception handler that clears the fatal alert and aborts the current attempt to boot a given FW slot.

Pros

  • We get to keep the current speed of direct memory-mapped flash data reads, maintaining our current boot time.

Cons

  • This is difficult to implement since:
    • as @jesultra points out, there is quite a bit of pointer dereferencing sprinkled around manifest accesses, and
    • handling an ECC error in a portion of the manifest is different than handling ECC error within the code section that is being hashed for signature verification.
  • Exception handling in the ROM/ROM_EXT becomes complex to deal with above.

Option 3 (hybrid approach)

We combine options 1 and 2 above to:

  • access each boot slot's manifests only through the register interface (buffering it into SRAM) to simplify potential data corruption handling in a boot image's manifest fields (since these are small compared to the signed code regions),
  • access each boot slot's code_start through code_end section through the memory mapped flash interface, and add a simple load access fault exception handler in the ROM/ROM_EXT to return a known value (i.e., all ones) for the access

Pros

  • We maintain a majority of the performance gained from reading flash data pages through the memory mapped flash_ctrl interface.
  • We keep the ROM/ROM_EXT exception handling to just handling load access faults on the signed code regions of a boot image.

Cons

  • Will require moderate changes to ROM/ROM_EXT, and additional ROM E2E tests to verify correctness.
  • Will still degrade performance (since SPX+ signatures are quite large), but to a lesser degree than Option 1.

Thoughts? @moidx @cfrantz @jesultra

timothytrippel avatar May 14 '24 05:05 timothytrippel

After further discussion with @cfrantz, the performance hit of option 1 may not be that bad, and is worth measuring.

timothytrippel avatar May 14 '24 15:05 timothytrippel

For option 3, how do we resume the ROM_EXT's selection algorithm of the owner firmware to boot if we encounter an ECC error while reading the code section from the memory map address?

I agree with measuring performance of option 1, it may be worth the performance hit to have the simplicity and robustness.

jettr avatar May 14 '24 15:05 jettr

The idea would be that when copying the code section from flash to the HMAC input FIFO, we would preload a temp value with something that would poison the signature like we currently do for checking the anti-rollback counter.

void rom_main {
...
  for (size_t i = 0; i < CODE_SIZE; ++i) {
      uint32_t temp = UINT32_MAX;
      temp = code_section[i]; // Fault here if multibit ECC error encountered.
      hash_fifo[i] = temp;
  }
...
}

void load_access_fault_handler(void) {
   if (/* mtval is inside flash address region */) {
    return;  // instead of shutdown
   }
   shutdown();
}

This then causes sigverify to fail and the next slot is tried.

Regardless, I am going to write a simple test to compute block flash read speeds through

  1. the memory mapped window, and
  2. the flash_ctrl register/FIFO interface.

timothytrippel avatar May 14 '24 15:05 timothytrippel

That provided code snippet makes me pretty nervous with respect to compiler optimizations. I am pretty sure that the compiler would be free to not set the initial value of temp to UINT_MAX since the value is not observable in the later part of for loop (I am not an expert on compiler optimizations though). Maybe we could set some global volatile variable to communicate the failure.

jettr avatar May 14 '24 16:05 jettr

It is just pseudocode to illustrate the idea, there are several ways we can take of potential compiler optimizations, volatile being one.

timothytrippel avatar May 14 '24 16:05 timothytrippel