turnilo
turnilo copied to clipboard
Error handling for timeouts
If druid returns {"error":"could not compute","message":"timeout"}
we should show timeout to user
If some gateway between user and druid returns 504 we should show timeout to user.
I would like to get HTTP status end error message with root cause on UI. Currently I have to use Devtools to inspect response from Turnilo server (or from LB between Turnilo server and client).
How to map errors on Turnilo server? Some proposals:
Timeout error from Druid broker - HTTP 408, 500 or 502 Query errors from Druid broker - HTTP 400 or 500 429 errors from Druid Broker - HTTP 429 or 500 Gateway Timeout (502) from LB in front on Druid broker(s) - HTTP 502 , 500 or 408 ???
Or just map all errors as HTTP 500 with proper error message/details and handle HTTP 500 in a special way on the client (hide the status and show the root cause / details). For all other errors show the HTTP status code with details if available (e.g for HTTP 502 Bad gateway from LB between Turnilo server and client)
All errors handled on Turnilo Server and returned to the clients should be logged (with all available details).
I would like to get HTTP status end error message with root cause on UI
I don't think that's correct. We should show user only actionable information. Right now we have two group of errors:
- Errors after user is encouraged to try again (429 from Druid, any timeout)
- Errors after user should change query or stop trying and report to administrators.
Of course all information should be preserved in logs/sentry for administrator to recover.
I agree that actionable information is better. But at the end you will not be able to figure out what was the reason of the error if the user sends the screenshot with generic error only. How to correlate this screenshot with the logs? So I think that some details on the screen are really needed, it is a help for admin not for end user :) Remember that some errors are network/lb errors - and there is no chance to see those errors in the Turnilo server log.
Let's be more specific, if Druid returns 429, Turnilo server could return 500 with message "Server is overloaded, please try later". This message is totally fine for end user as well. In the scenario when server for all errors returns 500, the actionable message must be provided by the server because the server knows the root cause. The generic message "Server error" on the client is unusable at all.
If LB returns 502 Bad Gateway, we should show exactly this message to the client (perhaps with some hint to refresh the page).
But at the end you will not be able to figure out what was the reason of the error if the user sends the screenshot with generic error only. How to correlate this screenshot with the logs?
Report error to sentry and show report id on screen. Just like we do with frontend errors. I don't know how to correlate logs with reported error, but that's another issue.
Remember that some errors are network/lb errors - and there is no chance to see those errors in the Turnilo server log.
And will be reported as network errors to sentry.
Error messages types on client
- Persistent error: user can't do anything, maybe go back to previous view / home page / current viz with default params
- Transient error: user can retry action
- Retry right away
- Retry "later" because Druid is overloaded.
Bonus task: identify which errors should result in which type
User should get generic/minimal error message. If we have sentry (or something similar) we should present error id. Error details should be visible in hidden accordion (or something?)
Server errors
- Server should always send error details to client
- Server should always log the same error details
- If we have sentry, error details should be added to error report metadata
Bonus task: Format messages in Logger.ts so Kibana can understand levels, stack traces, timestamps etc.