trezor-firmware Introduce stack overflow detection by moving stack to the start of RAM

Moved stack to the start of RAM in bootloader & firmware. In case of stack overflow, this causes MemManage fault with properly configured MPU.

Since inside the MemManage handler stack is out of bounds, the error display function does not work, so i changed the handler so it resets the MSP. I think it might be handy to see whether the fault is stack overflow or some other MM, so i check this and show different error on the display inside the handler - not sure it't the right approach though. (this paragraph applies to FW only)

Aug 09 '22 17:08 TychoVrahe

I think this setting stack for uPy does not seem to be handled right in firmware/main.c:

mp_stack_set_limit((char *)&_estack - (char *)&_heap_end - 1024);

Since now stack is under heap.

Aug 09 '22 18:08 hiviah

thanks, fixed in https://github.com/trezor/trezor-firmware/pull/2444/commits/edf84c1694104f6cf0f7297f081fb6ab146c5d14

Aug 09 '22 18:08 TychoVrahe

This layout works, but in my case the error shown when I exhaust stack is HF (hard fault). When stepping through it actuall shows (SO) stack overflow for a fraction of second, but then call to svc_shutdown somehow triggers hard fault so the final message to user is HF instead of SO.

Changing shutdown() to shutdown_privileged() in error_shutdown fixes this in this case, final message is (SO), but this then won't work for stack smashing error.

I don't quite understand why SVC instruction is tripping up and causing HF, maybe because it's called from exception handler and something with banked registers (?). Maybe the is_mode_unprivileged should account that both exception and privileged state are privileged.

stack_underflow_02

The testing function that exhausts stack is just simple pushing into it:

    void push_stack() {
      for(int i=0; i<60000; i++) {
      __asm__ volatile("push {LR}");
      }
    }

Aug 10 '22 15:08 hiviah

Interesting. I was testing the stack overflow in FW before actually entering the unprivileged mode, thus it behaved correctly.

Seems unrelated to changes done in this PR. master behaves the same when MM is triggered.

Aug 10 '22 16:08 TychoVrahe

According to this old post, one of the thing that triggers HF on SVC instruction is that SVC has lower priority than context that calls it, which makes sense.

Since AFAIK we don't change the SVC priority (nor the faults handler's), the SVC has default priority 3, which is lower than Reset, NMI, HardFault, MemManage, BusFault and UsageFault (the last three priorities could be changed). Calling svc from these handlers therefore triggers HF.

Since we can't change the priority of HF/NMI, we need to not call svc in these handlers.

Maybe the is_mode_unprivileged should account that both exception and privileged state are privileged.

This could be a way

Aug 10 '22 17:08 TychoVrahe

Implemented fix in 9490acf814c0d030da4d15a79613b82b90aad9c3 .

Aug 11 '22 08:08 TychoVrahe

and added a mask to account for reserved bits in the register https://github.com/trezor/trezor-firmware/commit/f4e56df5d69465522c7b135b1457b311221a036e

Aug 11 '22 09:08 TychoVrahe

T1 core build is broken

Aug 15 '22 10:08 matejcik

yeah i noticed, working on a fix

Aug 15 '22 10:08 TychoVrahe

pushed: https://github.com/trezor/trezor-firmware/pull/2444/commits/7cc20b08fe28c3967204b49b70c55bd2846b636e fixed heap end setting fixed T1 core build - stack is also moved to the beginning

@hiviah please recheck this, with regard to how the last 8 bytes in RAM are used in T1 bootloader

Aug 15 '22 11:08 TychoVrahe

The T1 fixes seem reasonable, but I can't get core to work properly on this branch or master, there is something wrong with USB - descriptor is published, but data do not flow to/from endpoints.

I've tried old bootloaders that were known to work with T1 core port before (1.8.1 - 1.10.1), but the issue is somewhere else. From debugging the code seems to think that there is no data to read.

Aug 15 '22 15:08 hiviah

Turns out though UART is working and we are getting exceptions in python, which is unrelated to this change most likely.

255281000 session DEBUG Restarting main loop
255338000 trezor.workflow DEBUG setting a new default: <generator>
255341000 trezor.loop DEBUG spawn new task: <generator object 'homescreen' at 20013b90>
255343000 trezor.workflow DEBUG start default: <generator object 'homescreen' at 20013b90>
265186000 trezor.wire DEBUG 0:0 receive: <Initialize>
265188000 trezor.loop DEBUG spawn new task: <generator object 'handle_Initialize' at 200189b0>
265191000 trezor.workflow DEBUG start: <generator object 'handle_Initialize' at 200189b0>
265211000 trezor.loop ERROR exception:
Traceback (most recent call last):
  File "trezor/loop.py", line 186, in _step
  File "apps/base.py", line 130, in handle_Initialize
  File "apps/base.py", line 39, in get_features
RuntimeError: Failed to read vendor header.
265214000 trezor.workflow DEBUG close: <generator object 'handle_Initialize' at 200189b0>
265217000 trezor.workflow DEBUG default already started
265223000 trezor.wire ERROR exception:
Traceback (most recent call last):
  File "trezor/wire/__init__.py", line 341, in _handle_single_message
  File "trezor/loop.py", line 533, in __iter__
  File "trezor/loop.py", line 186, in _step
  File "apps/base.py", line 130, in handle_Initialize
  File "apps/base.py", line 39, in get_features
RuntimeError: Failed to read vendor header.
265226000 trezor.wire DEBUG 0:0 write: Failure
265241000 trezor.loop DEBUG finish: <generator object 'handle_session' at 20013ea0>
265244000 session DEBUG Restarting main loop

Aug 15 '22 15:08 hiviah

OK, the error has something to with build having wrong vendor header, but after workardound works correctly.

So the T1 fix is fine.

Aug 15 '22 15:08 hiviah

Noted the issue with vendor header in build in issue #2451.

Aug 16 '22 12:08 hiviah

trezor-firmware trezor-firmware copied to clipboard

Introduce stack overflow detection by moving stack to the start of RAM

trezor-firmware
trezor-firmware copied to clipboard