error-docs Seeking Best Practices for Error and Warning Handling in Complex Rust Projects

Hi @nrc!

I’m reaching out to get some expert advice on the challenges we’re facing with error and warning handling in our Rust-based project, rustic_core. Our project is relatively complex, and we’re struggling to find the right balance between propagating errors (soft and hard errors), handling warnings, and maintaining a good user experience with clear error messages.

Context of Our Problem

Error Handling:
- In our current setup, we primarily rely on returning Result<T, RusticError> to propagate errors. It's kind of a "god enum" approach, where we convert sub-errors into that god error before handing it over at our API boundary. However, we often find ourselves in scenarios where multiple errors can occur (e.g., batch operations or validation of data collections), and handling only the first error results in lost context.
- We are considering three primary options:
  1. Returning a single error (Result<T, RusticError>)
  2. Returning a list of errors (Result<T, Vec<RusticError>>)
  3. Returning nested Results (Result<Result<T, Vec<RusticSoftError>>, RusticHardError>)
- We're also facing cases where we need to continue execution in the presence of some errors but fail fast in others.
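For readers following along, here is a minimal sketch of the "god enum" setup described above (option 1). The variants and the `parse_all` function are hypothetical stand-ins, not the actual rustic_core definitions. It also shows the lost context: the `?` operator returns on the first bad entry, so later ones are never examined.

```rust
use std::fmt;
use std::io;
use std::num::ParseIntError;

/// Hypothetical stand-in for a "god enum"; not the real rustic_core type.
#[derive(Debug)]
pub enum RusticError {
    Io(io::Error),
    Parse(ParseIntError),
}

impl fmt::Display for RusticError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RusticError::Io(e) => write!(f, "I/O error: {e}"),
            RusticError::Parse(e) => write!(f, "parse error: {e}"),
        }
    }
}

impl std::error::Error for RusticError {}

// Sub-errors are converted into the god enum at the boundary.
impl From<io::Error> for RusticError {
    fn from(e: io::Error) -> Self {
        RusticError::Io(e)
    }
}

impl From<ParseIntError> for RusticError {
    fn from(e: ParseIntError) -> Self {
        RusticError::Parse(e)
    }
}

/// Parses every entry, stopping at the first failure.
pub fn parse_all(inputs: &[&str]) -> Result<Vec<u32>, RusticError> {
    let mut out = Vec::new();
    for s in inputs {
        // `?` converts the `ParseIntError` via `From` and returns early.
        out.push(s.trim().parse::<u32>()?);
    }
    Ok(out)
}
```

With input such as `["1", "x", "y"]`, only the failure for `"x"` is reported; `"y"` is never looked at.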
Warning Handling:
- We need a consistent way to handle warnings. So far, we’ve identified three potential approaches:
  1. Logging warnings locally and not passing them back to the caller.
  2. Returning a boolean flag (is_warn) to indicate if warnings occurred.
  3. Returning a list of warnings to provide detailed information about all non-critical issues that the caller can process.
- We’re trying to decide if warnings should be purely for operational visibility (handled via logging), or if the caller should be made aware of warnings explicitly.
General Pain Points:
- We struggle with missing contextual information in error messages, leaving the end-users without actionable guidance.
- We want to include error codes or links to documentation in error messages for better guidance and debugging.
- In scenarios like async operations, the logging and error handling become more difficult to manage, especially when errors are collected from multiple spawned tasks.
- Finally, we want to reconsider how we handle warnings and errors over function boundaries, thinking we may need to simplify or keep more localized handling without propagating too much information upward.

Questions

Error Propagation:
- When should we prefer returning a single error (e.g., Result<T, RusticError>) vs. returning a list of errors (e.g., Result<T, Vec<RusticError>>)? Are there performance or architectural concerns that we should consider when deciding between these two approaches?
- In complex async operations or batch processing, where multiple errors might occur, what would be the best way to handle error accumulation without losing key context? Is there a common pattern in Rust for handling this elegantly? Like spawning an error handling thread and communicating with it via a channel?
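To make the channel idea in the last question concrete, here is a rough sketch using only std threads and `std::sync::mpsc` (not an async runtime). `TaskError`, `run_tasks`, and the "negative input fails" rule are made up for illustration; each worker sends its result, tagged with a task id, to a single collector.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical per-task error that keeps the task id as context.
#[derive(Debug, PartialEq)]
pub struct TaskError {
    pub task_id: usize,
    pub message: String,
}

/// Runs one worker thread per input and collects successes and failures.
pub fn run_tasks(inputs: Vec<i32>) -> (Vec<i32>, Vec<TaskError>) {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();

    for (task_id, input) in inputs.into_iter().enumerate() {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // Illustrative rule: negative inputs fail, others are doubled.
            let result = if input >= 0 {
                Ok(input * 2)
            } else {
                Err(TaskError {
                    task_id,
                    message: format!("negative input {input}"),
                })
            };
            // A send can only fail if the receiver is gone; ignored here.
            let _ = tx.send((task_id, result));
        }));
    }
    // Drop the original sender so `rx.iter()` ends once all workers finish.
    drop(tx);

    let mut results: Vec<(usize, Result<i32, TaskError>)> = rx.iter().collect();
    // Arrival order is nondeterministic; restore submission order.
    results.sort_by_key(|(id, _)| *id);

    for handle in handles {
        handle.join().expect("worker panicked");
    }

    let mut oks = Vec::new();
    let mut errs = Vec::new();
    for (_, result) in results {
        match result {
            Ok(value) => oks.push(value),
            Err(e) => errs.push(e),
        }
    }
    (oks, errs)
}
```

Whether the failures belong in a `Vec` returned alongside the successes, or inside an `Err`, is exactly the design question being asked here.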
Warnings:
- When handling warnings, would you recommend keeping them local (i.e., logging only) or propagating them back to the caller? Under what circumstances is it better to pass warnings up vs. treating them as internal operational feedback?
- How would you handle situations where a function should continue executing but may want to indicate that warnings occurred (e.g., via an is_warn boolean flag or a list of warnings)? What is the best approach here to maintain simplicity while giving the caller enough control over decision-making?
Async/Concurrency:
- In async tasks and concurrent operations, how do you typically manage error propagation and structured logging, especially when errors are collected from multiple spawned tasks? How can we ensure we get full visibility into errors without complicating error management?
General Best Practices:
- Are there any best practices or patterns you would recommend for error and warning handling that balance performance, code maintainability, and user experience in Rust-based systems?
- How can we maintain a simple API for callers while ensuring we capture all relevant issues (both errors and warnings) during complex or long-running operations?
- We also thought about a nested Result, where the outer Result contains hard errors that lead to aborting the program, while the inner Result contains a list of errors that came up while processing data collections. This is inspired by http://sled.rs/errors

We appreciate any guidance or patterns you’ve found useful in these situations!

Oct 14 '24 23:10 simonsan

@simonsan Hey, sorry for the delay in replying. I have been meaning to write a blog post about this and was hoping I could get that out and just point at it, but I haven't even started and, if I'm going to be honest, it's not going to get done very soon.

Anyway, my perspective on error handling has changed a little bit from when I wrote these docs and I should update them. Here are some notes:

Every project has different constraints and requirements for error handling and there is no single right answer; it's all just another engineering problem with a bunch of trade-offs.
Realistically, error recovery only happens locally. Optimise your error design around that fact. You can just use the foreign error types rather than wrapping them up in your own (since they're not going to be passed very far). You might have shared code for recovery but it's just another library of functions you call directly from close to the error happening rather than bubbling the error all the way up to some central error handler. Likewise, if you want to log the error, just call a function from close to the error site and then panic or whatever, don't bubble the error up to the top to deal with it.
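A small sketch of what local recovery with a foreign error type might look like. The `load_config` function and the default value are hypothetical; the point is that `std::io::Error` is used directly, and the one recoverable case is handled right where it happens.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical default used when no config file exists.
const DEFAULT_CONFIG: &str = "threads = 1";

/// Reads a config file, recovering locally from the "file missing" case.
pub fn load_config(path: &Path) -> io::Result<String> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text),
        // Recover right here: a missing file just means "use defaults".
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(DEFAULT_CONFIG.to_string()),
        // Anything else is passed on as the foreign error, unwrapped.
        Err(e) => Err(e),
    }
}
```

No custom error type is needed, because nothing further up the stack has to tell the cases apart.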
Any error that does need to bubble up a long way is not going to get recovered from, only logged or reported to the user or whatever. If it's impossible to just kill the program or thread, then bubbling up is fine, but acknowledge that you're not going to recover and so the error can be pretty simple (probably just a string). So just use anyhow or something rather than designing a complicated error type.
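As a rough illustration of a "pretty simple" bubbling error, the sketch below uses `Box<dyn Error>` from std as a stand-in for `anyhow::Error` (the function names are made up). Context is attached as text at the failure site, and the caller can only report it.

```rust
use std::error::Error;

/// `Box<dyn Error>` is a std-only stand-in for `anyhow::Error` here.
type AppResult<T> = Result<T, Box<dyn Error>>;

/// Parses a port number, attaching context as plain text on failure.
fn parse_port(raw: &str) -> AppResult<u16> {
    raw.trim()
        .parse::<u16>()
        .map_err(|e| format!("invalid port {raw:?}: {e}").into())
}

/// Hypothetical top-level operation; errors simply bubble up with `?`.
pub fn run(raw_port: &str) -> AppResult<String> {
    let port = parse_port(raw_port)?;
    Ok(format!("listening on port {port}"))
}
```

With anyhow itself, the `map_err` call would typically be replaced by its context helpers, but the shape is the same: one opaque error type, meant for reporting rather than matching.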
This is a bit more complicated for a library crate rather than an app. I would say keep the error structure as simple as possible, don't over-index on recovery which few users will actually do. Keep errors structured so the user is in charge of formatting and reporting errors (don't try and guess their needs).
Be aware that error handling is for unexpected events. Errors in most input are not unexpected events and shouldn't use error handling: detecting, recovering from, and reporting these errors are part of normal execution, and you probably need to design a system more complex than just Result. For example, in a compiler's parser it is good to separate the expected user input errors from real errors; use Result for the latter but not for the former. Adding domain-specific context, recovery in a parser, handling multiple errors, and producing good error messages are all really hard to do within the constraints of Rust error handling, so don't even try. The only advantage of using Result for such errors is the control flow stuff, and although that feels nice at first, it is inevitably a bad trade-off in the long run.
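A minimal sketch of that separation, using a toy "one integer per line" format instead of a real parser. `Diagnostic`, `Parsed`, and `parse_numbers` are hypothetical names. Malformed lines are collected as ordinary output, while `Result` is reserved for the reader itself failing.

```rust
use std::io::{self, BufRead};

/// An expected problem in user input; part of the normal output.
#[derive(Debug, PartialEq)]
pub struct Diagnostic {
    pub line: usize,
    pub message: String,
}

#[derive(Debug, Default)]
pub struct Parsed {
    pub values: Vec<i64>,
    pub diagnostics: Vec<Diagnostic>,
}

/// Reads one integer per line. Malformed lines become diagnostics and
/// parsing continues; only a failing reader is reported through `Err`.
pub fn parse_numbers<R: BufRead>(reader: R) -> io::Result<Parsed> {
    let mut parsed = Parsed::default();
    for (idx, line) in reader.lines().enumerate() {
        let line = line?; // unexpected: the underlying reader failed
        let text = line.trim();
        if text.is_empty() {
            continue;
        }
        match text.parse::<i64>() {
            Ok(value) => parsed.values.push(value),
            Err(e) => parsed.diagnostics.push(Diagnostic {
                line: idx + 1, // 1-based line numbers for reporting
                message: format!("{text:?}: {e}"),
            }),
        }
    }
    Ok(parsed)
}
```

The caller always gets every diagnostic, and formatting them is left to the caller.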

Some specific answers (all of which are very much 'IMO'):

When should we prefer returning a single error (e.g., Result<T, RusticError>) vs. returning a list of errors (e.g., Result<T, Vec<RusticError>>)?

Always single error. If you have multiple errors, it is probably not a true error in the error handling sense of the term, but more just an expected error in user input which should be handled as part of the 'happy path'.

In complex async operations or batch processing, where multiple errors might occur, what would be the best way to handle error accumulation without losing key context?

Basically avoid this at all costs. Handle the error close to where it occurred so you don't need to propagate. Treat errors as a form of the regular output where appropriate. If it's a library crate, let the user handle this; the API should just look like single async functions which might error in a simple way. If you've got complex concurrent futures stuff going on, that is a smell that the library is doing too much orchestration.
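One way to read "treat errors as a form of the regular output" for a batch operation is a per-item report, sketched below. The types and the "empty input is skipped" rule are invented for illustration; nothing is propagated, and the caller decides what a skipped item means.

```rust
/// Outcome of a single item; a skip is data, not an `Err`.
#[derive(Debug, PartialEq)]
pub enum ItemStatus {
    Stored { bytes: usize },
    Skipped { reason: String },
}

#[derive(Debug, PartialEq)]
pub struct ItemReport {
    pub name: String,
    pub status: ItemStatus,
}

/// Processes every `(name, data)` pair and reports on each one.
pub fn process_batch(items: &[(&str, &str)]) -> Vec<ItemReport> {
    items
        .iter()
        .map(|(name, data)| {
            // Illustrative rule: empty inputs are skipped, not fatal.
            let status = if data.is_empty() {
                ItemStatus::Skipped {
                    reason: "empty input".to_string(),
                }
            } else {
                ItemStatus::Stored { bytes: data.len() }
            };
            ItemReport {
                name: name.to_string(),
                status,
            }
        })
        .collect()
}
```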

When handling warnings, would you recommend keeping them local (i.e., logging only) or propagating them back to the caller?

In an app, process them locally or treat them as part of the 'happy path' code rather than an error. In a library, just the latter.

How would you handle situations where a function should continue executing but may want to indicate that warnings occurred (e.g., via an is_warn boolean flag or a list of warnings)?

Warnings should be accumulated somewhere and returned as part of the normal execution flow, not treated as errors.
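A sketch of one possible shape for this, with hypothetical names (`Outcome`, `Warning`, `unique_names`): the function returns its value together with the accumulated warnings, and the `is_warn` flag becomes a derived helper rather than a separate return value.

```rust
/// Non-critical issues the caller may want to inspect.
#[derive(Debug, PartialEq)]
pub enum Warning {
    DuplicateIgnored(String),
}

/// A value plus the warnings gathered while producing it.
#[derive(Debug)]
pub struct Outcome<T> {
    pub value: T,
    pub warnings: Vec<Warning>,
}

impl<T> Outcome<T> {
    /// Stands in for an `is_warn` flag; derived from the list.
    pub fn has_warnings(&self) -> bool {
        !self.warnings.is_empty()
    }
}

/// Deduplicates names, warning about (but not failing on) repeats.
pub fn unique_names(names: &[&str]) -> Outcome<Vec<String>> {
    let mut value: Vec<String> = Vec::new();
    let mut warnings = Vec::new();
    for name in names {
        if value.iter().any(|seen| seen.as_str() == *name) {
            warnings.push(Warning::DuplicateIgnored(name.to_string()));
        } else {
            value.push(name.to_string());
        }
    }
    Outcome { value, warnings }
}
```

Keeping warnings as structured values leaves logging, formatting, or ignoring them up to the caller.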

In async tasks and concurrent operations, how do you typically manage error propagation and structured logging, especially when errors are collected from multiple spawned tasks? How can we ensure we get full visibility into errors without complicating error management?

This is very hard! Let me know if you figure it out :-) Especially for a library rather than an app.

We also thought about a nested Result where the outer Result can contain hard errors that lead to aborting the program.

I would avoid over-engineering your error types. Keep it simple and keep error types just for unexpected errors.

Again, this is just my PoV and it is a rather opinionated one (some would call it extreme). Reasonable people may disagree and the specifics of a project take priority over general principles, however, I think this is a good starting point.

Oct 22 '24 23:10 nrc

I'll chime in to say that I generally agree with Nick's perspective here. Errors are for things that get passed up the call chain 80% of the time or more. Recoverability for errors is not common, and when it's needed you generally would create a small error type indicating the recoverable cases. The rest of the time, in a binary crate, just use anyhow.
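As an illustration of a small error type that only singles out the recoverable case, here is a sketch with invented names (`LockError`, `with_retry`). `Busy` is the one variant a caller is expected to act on; everything else is just a message.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum LockError {
    /// Recoverable: the caller may retry.
    Busy,
    /// Not recoverable; only useful for reporting.
    Failed(String),
}

impl fmt::Display for LockError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LockError::Busy => write!(f, "resource is busy"),
            LockError::Failed(msg) => write!(f, "lock failed: {msg}"),
        }
    }
}

impl std::error::Error for LockError {}

/// Calls `op` with the attempt number, retrying only on `Busy`.
pub fn with_retry<T>(
    max_attempts: u32,
    mut op: impl FnMut(u32) -> Result<T, LockError>,
) -> Result<T, LockError> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            // The single recoverable case is handled here.
            Err(LockError::Busy) if attempt < max_attempts => attempt += 1,
            // Success, a hard failure, or retries exhausted.
            other => return other,
        }
    }
}
```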

I maintain a crate called woah which is intended to be an ergonomic version of Result<Result<T, LocalErr>, FatalErr> as a single enum, but unfortunately the relevant trait, Try, is not stable (and likely won't be stable soon), so while it's ergonomic on nightly Rust builds, it's not very easy to use on stable. You can use it on stable, but you can't apply the ? operator to it.
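For comparison, here is what the plain nested form looks like on stable Rust. This does not use woah or its API; `fetch`, `fetch_or_default`, and the error payloads are hypothetical. The `?` operator peels off the fatal layer, and the local layer is handled in place.

```rust
/// Hard error: the caller is expected to give up.
#[derive(Debug, PartialEq)]
pub struct FatalErr(pub String);

/// Soft error: can be handled close to where it occurs.
#[derive(Debug, PartialEq)]
pub struct LocalErr(pub String);

pub type Layered<T> = Result<Result<T, LocalErr>, FatalErr>;

/// Hypothetical lookup with both kinds of failure.
pub fn fetch(key: &str) -> Layered<u32> {
    match key {
        "" => Err(FatalErr("store unavailable".to_string())),
        "missing" => Ok(Err(LocalErr("no such key".to_string()))),
        k => Ok(Ok(k.len() as u32)),
    }
}

/// Propagates fatal errors with `?` and recovers from local ones.
pub fn fetch_or_default(key: &str) -> Result<u32, FatalErr> {
    // `?` applies to the outer `Result` only.
    let local = fetch(key)?;
    // The inner `Result` is dealt with right here.
    Ok(local.unwrap_or(0))
}
```

The cost shows up in the `Ok(Ok(..))` and `Ok(Err(..))` wrapping, which is the ergonomic gap a dedicated enum aims to close.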

Oct 23 '24 00:10 alilleybrinker