trezor-firmware Consistent error handling

This PR proposes a consistent approach to error handling in the C codebase. The proposal includes the following:

Introduction of system-wide error codes (ts_t type) with predefined errors prefixed with TS_ (Trezor status) and built-in fault injection hardening.
A straightforward mechanism/framework for error handling within functions, utilizing goto statements to facilitate cleanup at the end of the function.
Addition of new ensure() macro types to simplify top-level code.
Simplification of STM32 HAL error handling with the hal_status_to_ts() mapping function and the TS_CHECK_HAL_OK() macro.

This PR consists of three commits. Two of them serve as examples: a refactored haptic driver and a refactored SD card driver.

Error handling pattern:

ts_t module_function(int arg1, int arg2) {
  // Initialize the `__status` variable and set it to `TS_OK`.
  // This should be on the first line inside the function (rule 1).
  TS_INIT; 

  // Check arguments to ensure they are greater than 0.
  TS_CHECK_ARG(arg1 > 0);
  TS_CHECK_ARG(arg2 > 0);
  
  // Check the result of another function with a `ts_t` return value.
  // The `status` value is properly propagated.
  // Splitting it into two lines makes debugging easier (rule 2).
  ts_t status = another_function();
  TS_CHECK_OK(status);
  
  // Check the result of an STM32 HAL function returning `HAL_StatusTypeDef`.
  // The HAL_xxx error is properly mapped to TS_xxx and propagated.
  HAL_StatusTypeDef hal_status = HAL_function();
  TS_CHECK_HAL_OK(hal_status);
  
  // Check a generic condition inside the function and propagate the status code
  // if the condition is not fulfilled.
  TS_CHECK(arg1 * arg2 < 1000, TS_ERROR);

  // Alternatively, use the secure variant working with `secbool`.
  secbool ready = usb_ready();
  TS_CHECK_SEC(ready);
  
  // In some cases it's useful to return before the cleanup code, (rule 3: never use `return ...` directly)
  // TS_RETURN;
  
cleanup:
  // Potential cleanup code goes here.
  // ...

  // Return the content of the `__status` variable.
  // In case of no error, it's `TS_OK`.
  TS_RETURN;
}

We can now use the ensure_xxx() set of macros similar to verify_xxx() for handling fatal errors. The ensure_xxx() macros display an error message on the screen and shut down the device as before. (Note: This feature is not yet complete in this PR.)

  // The generic condition must be `true`
  //  TS_CHECK(condition, ts_t) ->  
  ensure(condition, message);

  // Status must be `TS_OK`
  // TS_CHECK_OK(status) -> 
  ensure_ok(status, message);

  // The generic condition (secbool) must be `sectrue`
  // TS_CHECK_SEC(secbool, status) -> 
  ensure_sec(secbool, message);

I understand that this is a completely new approach, which is not very common and requires some time to get used to. However, it is very simple and quickly understandable. I will not merge this until we reach a consensus that this is the right approach. Please feel free to provide any comments or suggestions.

Jul 09 '24 08:07 cepetr

core UI changes device test click test persistence test

T2T1 Model T test(screens) main(screens) test(screens) main(screens) test(screens) main(screens)

T2B1 Safe 3 test(screens) main(screens) test(screens) main(screens) 2724

T3T1 test(screens) main(screens) test(screens) main(screens) test(screens) main(screens)

All main(screens)

core UI changes	device test	click test	persistence test
T2T1 Model T	test(screens) main(screens)	test(screens) main(screens)	test(screens) main(screens)
T2B1 Safe 3	test(screens) main(screens)	test(screens) main(screens)	2724
T3T1	test(screens) main(screens)	test(screens) main(screens)	test(screens) main(screens)
All	main(screens)

Jul 09 '24 08:07 github-actions[bot]

The weird definition and usage of ensure with multiplication which was made against being removed by compiler and generated better barrier assembler code. Hamming weight of sectrue and secfalse was made to be high, while Hamming distance between TS_OK and TS_ERROR is 2 bits. Not entirely sure if highest bit should not be masked to 0 in case it passes through ints and uints (e.g. as a part of some call or other macro from other library).

It might be usable to have macro shorthand for __attribute__((cleanup(func))) which could help with cleanup no matter how the function ends (limited in usability with arrays, but I think for already defined "deinit" functions for HW it might work). This way a cleanup for a function can be used in a sub-block of a function. In this tiny library they seem to make it work also with arrays - https://github.com/Snaipe/libcsptr (kind of C replacement for unique_ptr and shared_ptr)

To me goto inside macro seems very...unexpected by someone who did not read this particular PR. And maybe better name for error: is cleanup: since code branches without error go through error: part without return first.

Jul 09 '24 13:07 hiviah

The weird definition and usage of ensure with multiplication which was made against being removed by compiler and generated better barrier assembler code. Hamming weight of sectrue and secfalse was made to be high, while Hamming distance between TS_OK and TS_ERROR is 2 bits. Not entirely sure if highest bit should not be masked to 0 in case it passes through ints and uints (e.g. as a part of some call or other macro from other library).

Yes, good point. Hamming distance between TS_OK and TS_ERROR is low. There are certainly several ways to improve it. I'll think about it and try to suggest something.

It might be usable to have macro shorthand for __attribute__((cleanup(func))) which could help with cleanup no matter how the function ends (limited in usability with arrays, but I think for already defined "deinit" functions for HW it might work). This way a cleanup for a function can be used in a sub-block of a function. In this tiny library they seem to make it work also with arrays - https://github.com/Snaipe/libcsptr (kind of C replacement for unique_ptr and shared_ptr)

I'm not sure that the __attribute__((cleanup(func))) approach is what we are looking for, although it's interesting. Cleanup code is not just about freeing resources, but also about performing some reverting actions, setting output arguments, etc. This might not be easy to do in a separate function. In some situations, you don't want to call any cleanup code if the function succeeds.

To me goto inside macro seems very...unexpected by someone who did not read this particular PR. And maybe better name for error: is cleanup: since code branches without error go through error: part without return first.

Yes, it's a bit unusual and requires some getting used to.

error or cleanup... I have no problem with either. Both work well, but neither is perfect (finally changed in 7b70f0ae4f0bcc59b6ac9637825a6b59a262bb1b).

Jul 09 '24 15:07 cepetr

I created a document in notion summarizing my concerns. https://www.notion.so/satoshilabs/Unify-error-handling-in-C-aebe0fe5afa1417996329350869835c4

Jul 10 '24 10:07 andrewkozlik

trezor-firmware trezor-firmware copied to clipboard

Consistent error handling

trezor-firmware
trezor-firmware copied to clipboard