Define error handling strategy
Currently we disregard all errors coming from a request to the API. An ideal outcome of this ticket would be a definition of how basic errors (e.g. server couldn't be reached/network down) should be handled: some kind of universal solution, which would then be implemented in all existing cases of failure and followed by all subsequent PRs.
Not only our API can throw an error, but other components can as well. This is why the component <NuxtErrorBoundary> was introduced. I'm unsure about its effectiveness compared to the Vue lifecycle hook onErrorCaptured, but all these things need to go in here as part of an error-reporting strategy. More on this: https://nuxt.com/docs/getting-started/error-handling#errors-during-the-vue-rendering-lifecycle-ssr-spa
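To make the idea of a "universal solution" more concrete, here is a minimal sketch of a wrapper that turns every request outcome into an explicit result instead of letting failures pass silently. All names (`ApiResult`, `HttpResponseLike`, `failureFromStatus`, `safeRequest`) are hypothetical and not part of the existing codebase, and the grouping of status codes is an assumption.

```typescript
// Minimal shape of a response, so the sketch does not depend on a specific HTTP client.
interface HttpResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

interface ApiFailure {
  ok: false;
  kind: "network" | "server" | "client";
  status?: number;
  message: string;
}

// Every call resolves to one of these shapes, so callers have to look at `ok`.
type ApiResult<T> = { ok: true; data: T } | ApiFailure;

// Assumed grouping: 5xx is a server failure, any other non-success status is a client failure.
function failureFromStatus(status: number): ApiFailure {
  const kind = status >= 500 ? "server" : "client";
  return { ok: false, kind, status, message: `API responded with status ${status}` };
}

// `doRequest` is injected so the wrapper stays independent of the real API client.
async function safeRequest<T>(doRequest: () => Promise<HttpResponseLike>): Promise<ApiResult<T>> {
  let response: HttpResponseLike;
  try {
    response = await doRequest();
  } catch (err) {
    // The request itself failed, e.g. server couldn't be reached or the network is down.
    return { ok: false, kind: "network", message: err instanceof Error ? err.message : String(err) };
  }
  if (!response.ok) {
    return failureFromStatus(response.status);
  }
  return { ok: true, data: (await response.json()) as T };
}
```

A caller would then branch on `result.ok` and, for failures, on `result.kind` to decide what to show or report.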
I see 3 categories of errors, which should be handled and logged differently:
- User relevant errors - the user has done something wrong. These errors should be shown to the user. Logging is not that relevant, because it's nothing we should deal with, but it may be good to have them logged somewhere so we can improve UX by updating user-flows to avoid them. I don't see many cases where this is possible, because we only have limited space for user-input.
- Errors within this application (4-axis https://youtu.be/AdMqCUhvRz8?t=1201 - 4 min). Fatal errors will mostly be Electron internal, but the non-fatal ones are ours to deal with, which I wouldn't care much about as of now. Explicit ones can be logged, and for implicit ones monitoring system resources and user response can help.
- Errors in the API. While also part of non-fatal errors (explicit or implicit), both should IMO not be logged by this instance, but by the API. This includes an API not responding (could be our problem - API is down - or the user's - network is disconnected) or returning a 5xx error-code.
A plan for the strategy of handling all errors: For the 1st category we will need a way to report those errors, which should be discussed with @sclausendk. For the 2nd and 3rd, we need a system to centrally report errors and exceptions but also to monitor abnormal system behavior. Logs of the 2nd category should be distinguishable from those of the 3rd. A set of developers should have access to the mentioned system and, within a given time-frame (e.g. each month), have a look at the errors logged there and define action-items for handling them.
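The three categories above could be expressed roughly like this. This is only a sketch: the `Failure` shape, the `categorize` function and the `clientLogging` table are hypothetical names, and treating every failure that is neither user-caused nor a 5xx/unreachable API as an application error is an assumption.

```typescript
type ErrorCategory = "user" | "application" | "api";

// Hypothetical description of a failure, filled in by whoever catches it.
interface Failure {
  causedByUserInput: boolean;
  apiUnreachable?: boolean;
  httpStatus?: number;
}

function categorize(failure: Failure): ErrorCategory {
  // 1st category: the user has done something wrong.
  if (failure.causedByUserInput) return "user";
  // 3rd category: API not responding or returning a 5xx error-code.
  const isServerError = failure.httpStatus !== undefined && failure.httpStatus >= 500;
  if (failure.apiUnreachable || isServerError) return "api";
  // 2nd category (assumption): everything else is an error within this application.
  return "application";
}

// Whether this client should log the error, following the descriptions above.
const clientLogging: Record<ErrorCategory, "optional" | "yes" | "no"> = {
  user: "optional",   // shown to the user, logging only to improve UX
  application: "yes", // explicit errors can be logged
  api: "no",          // should be logged by the API, not by this instance
};
```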
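As a rough illustration of "centrally report, but keep the 2nd and 3rd category distinguishable", a reporter could tag every entry with its category before handing it to whatever system we decide on. The names `createReporter`, `LogEntry` and `LogSink` are made up for this sketch, and the sink stands in for the real reporting system.

```typescript
type ReportedCategory = "application" | "api";

interface LogEntry {
  category: ReportedCategory;
  message: string;
  timestamp: string;
}

// The sink is whatever system ends up being chosen; here it is just a callback.
type LogSink = (entry: LogEntry) => void;

function createReporter(sink: LogSink) {
  return {
    report(category: ReportedCategory, error: unknown): LogEntry {
      const entry: LogEntry = {
        category, // lets the monthly review filter the 2nd category from the 3rd
        message: error instanceof Error ? error.message : String(error),
        timestamp: new Date().toISOString(),
      };
      sink(entry);
      return entry;
    },
  };
}

// Usage with an in-memory sink, e.g. for local development.
const collected: LogEntry[] = [];
const reporter = createReporter((entry) => collected.push(entry));
reporter.report("application", new Error("player state out of sync"));
reporter.report("api", "API returned 502");
```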
Development-tasks I see to fulfill the plan above:
- Decide on a system for reporting errors and logs and implement it for Web and Electron.
- (low) Find out if there's a way of logging and reporting fatal errors (e.g. macOS has crash-reports)
- (low) Define which system-metrics and user-actions to log to supervise implicit, non-fatal errors.
Next Steps: This is just a set of ideas I have. @kkuepper Would be nice to get your opinion on this.
Something like Bugsnag or Sentry could be nice, but I don't think they are self-hostable if that's a requirement.
I'd like to use ApplicationInsights for logging since we use that in the other projects as well. It probably makes sense that I set this up. Sometimes it makes sense to use a more native solution for crash reporting. E.g. we use AppCenter for our Xamarin apps. That might be necessary in addition to ApplicationInsights, since it can't always be guaranteed that we manage to log to ApplicationInsights.
Do you have some examples for User relevant errors? Do you mean searching for something that doesn't exist and showing a message that there are no results?
> Do you have some examples for User relevant errors?
In general, anywhere the user is able to change something - any kind of input. Search shouldn't have user-related errors (e.g. an empty search result is not an error), but creating a playlist when one with the same name already exists is something we might not want to allow ... Another example is opening a link which cannot be resolved by the router. The above is just a very general statement from my side. You could categorize them as explicit non-fatal errors on the 4-axis, but I still see them as a different responsibility - that of the user.
Should this be discussed at the upcoming frivillighetshelg (volunteer weekend)? Would be nice to get something working (ref #270).
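For the playlist example, a user relevant error could be caught before anything is sent to the API. This is just a sketch: `validatePlaylistName` is a hypothetical helper, and the case-insensitive comparison and the empty-name check are assumptions on my side.

```typescript
// Returns a message to show to the user, or null if the name is acceptable.
function validatePlaylistName(name: string, existingNames: string[]): string | null {
  const normalized = name.trim().toLowerCase();
  if (normalized === "") {
    return "Please enter a playlist name.";
  }
  // Assumption: names that differ only in case or surrounding whitespace count as the same.
  const taken = existingNames.some((existing) => existing.trim().toLowerCase() === normalized);
  return taken ? `A playlist named "${name.trim()}" already exists.` : null;
}
```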
@sifferhans Yes, that's a good idea. I will be joining remotely. Your comment makes it sound like you'll be working on bmm-web as well?
I planned to, but I was needed in the orchestra during the Easter camp, and we're using the frivillighetshelg for practice 😅 So I cannot join this time either, but I am planning to join next FH 👍🏼
Is this still relevant now that we are using Sentry?
@sifferhans I don't think that just by using Sentry, you get your error handling strategy magically answered. Furthermore, there are some types of errors where I don't know whether Sentry can cover them (e.g. crashes of the application or all of the implicit non-fatal errors).
Maybe it would be good to meet and talk a bit about the expected outcome of this issue.
I agree that just saying Sentry is not enough.
The realistic answer is that we're logging certain errors to ApplicationInsights and that Karsten tries to pay attention if something unusual happens. Below is a chart of the past months and the number of "errors" is pretty stable.
There's also a table with counts from February 1st to March 28th if you want to read the actual errors:
The fact is that bmm-web has been in production for almost 1 year and if there were many critical errors, we would have fixed them by now. Therefore I will close this issue and we can rather create more specific issues, e.g. #576.