druid
druid copied to clipboard
Enhanced Druid Exceptions for improved user error messages
Description
Proposed is a standardized, more robust form for Druid error messages to upgrade the somewhat ad-hoc approach employed today. The design focuses on the needs of the end user: what went wrong and how can they fix it. The design allows incremental upgrade from the existing code.
This is a first-draft proposal to explore the basic idea of a revised error mechanism. Comments and suggestions encouraged!
Motivation
Druid uses Java exceptions for many different purposes
- The user provided input which is not valid. (An invalid native query, or invalid SQL query, say.)
- The user configured the system incorrectly. (A required directory doesn’t actually exist.)
- Capacity limits (out of memory, reached capacity limits)
- Detecting “should not occur” conditions during development
At present, Druid uses the same form of exception for all these cases. Exceptions tend to have very simple messages which focus on the immediate condition such as “invalid syntax” or “no more slots”. As with all Java applications, developers rely on the stack trace to identify the context in which the error occurs.
Users, however, find Java stack traces unhelpful. Within the last year or two, Druid did work to sanitize errors, stripping off the stack trace and removing unhelpful messages. Unfortulately, the result is often an error of the form “something went wrong”, which does turn out to be helpful to users.
What users want is to know:
- What specific item went wrong?
- What do I need to do to fix the error?
That information cannot be inferred from the current error messages: it is something that we have to think through and add to the error at the point it is thrown.
This proposal discusses how we do this. This design is based on on successfully used in Apache Drill.
Goals
We propose to acknowledge that Druid exceptions serve two entirely different needs: help a developer debug an issue with the code, and help a production user diagnose problems related to their usage of the system. Any given exception may serve both needs, though there will be many more that are primarily for developers.
Proposal - Druid Exception
We propose to create a new exception class, say, DruidException
, which captures exceptions that are likely to be presented to the user. (Those exceptions which are primarily internal can continue to use the existing assortment of exception classes.)
A DruidException
:
- Provides a clear description of the situation from the user’s perspective.
- Provides context (such as the specific node, data source, column or other item) that has a problem.
- Has an error “kind”: user error, config error, capacity limit, internal error.
- Is logged for use in diagnosing problems after the fact
- Is presented to the user as nicely-formatted text without a stack trace
- May include an error code to map to SQL and JDBC error codes.
Proposal - Response Builder
Druid uses REST for (most of) its APIs. Druid is, however, wonderfully creative in the form of responses when an error occurs. Sometimes the error is plain text. Sometimes it is an object with one field. Sometimes is has multiple fields. Let a 100 error responses bloom!
The one draw-back is when someone tries to write a Druid client (such as this one in Python). Each messages is a bit of a problem-child: trial and error is necessary to determine which error format is used for which message. (Some OL messages use plain text some times, JSON other times. Fun!)
The existing mechanism is (sometimes) based on a SanitizableException
class that generates three fields in a REST response:
-
error
- An error “code” such as "SQL query is unsupported" -
errorMessage
- The message from the original exception, which is often unclear. -
errorClass
- ?
Unfortunately, these categories are not well defined or consistently used. (That is one of the issues that lead to this proposal.) In practice, the fields tend to be used inconsistent, and are often null.
To address this, we propose to leverage the proposed Druid Exception to standardize our HTTP response via a "response builder". This idea was prototyped in the Catalog PR. Would be great to roll that out more widely. The basic idea is to have functions to build our standard HTTP responses. For example:
class ResponseBuilder {
public static Response ok() ...
public static Response error(DruidException ex) ...
}
We have to pick an HTTP status. This can come in from the caller:
public static Response error(Response.Status status, DruidException ex) ...
Or, better, we can leverage error types to pick the status automatically. For example, a "UserErrormight return a 400 (bad request), while a
SystemErroror
ConfigError` might return a 500. That way, the code that raises the exception would not pick the status code, only the error type. (Code deep in Druid probably should not try to second-guess the HTTP resources.)
The proposed ResponseBuilder
would translate a DruidException
into a common, uniform error response with fields such as:
-
error
- Brief summary of the error -
message
- Multi-line, detailed description, including context -
class
- One of the major categories: user error, configuration error, etc. -
code
- Optional in the case that we assign specific numeric or string error codes, such as standard SQL error codes.
An example might be:
{
"error":{
"error":"Invalid SQL",
"message":"The table 'foo' is not defined in schema 'druid'."
"context": {
"location": "Line 4, column 23"
},
"class":"user",
"sqlError":"HV00R"
}
}
Style Guide
Error messages presented to users are, essentially, just-in-time documentation. As such we’d want to craft user-visible error messages carefully. To help with this, we should create a simple style guide that explains what a good error message looks like, and common wording.
Older systems used a “message catalog” to separate the text of errors from code. However, more recent Java applications just put the messages directly in the code. Either is fine as long as we can ensure the messages are clear and to-the-point.
Example Code
The following is a crude straw-man example of how the new mechanism may appear in code. The idea is to define a builder that gathers information about the error:
throw DruidException
.userError()
.error(DruidException.INVALID_SQL)
.message(“The table ‘%s’ is not defined in schema ‘%s’.", tableName, schemaName)
.context("location", "Line %d, column %d", node.lineNo, node.columnNo)
.build();
The build()
call creates the exception and logs it. Log level is determined the the error kind. Here, we raise a user exception, which needs only, say, info
logging level. An internalError()
might be logged at the error
level. The builder can offer an exception, a logLevel()
method that lets the developer override the general rules.
Roll-out Plan
This proposal requires two distinct parts:
- Design the
DruidException
mechanism itself. - Incrementally upgrade existing exceptions based on priority (frequency of occurrence.)
Of course, the process is iterative: converting the first few exceptions will certainly identify enhancements we’d like in the exception mechanism.
One thing I would like to see is a globally unique error code associated with error messages. This way it'll be easier for people to google for help. It'll be also helpful to use this to identify frequent errors.
I'm very interested in improving our error reporting, so I thank you for writing this up.
I wanted to note that there is a similar concept to the proposed DruidException in the MSQ task engine. If we introduce a Druid-wide system — which would be great — it'd be good to be able to merge the MSQ task error system into that. It's of course different in the details, but I think we could work that out.
When you get an error from an MSQ task, it looks like the following. There's a full error report (including hostname, stack trace) but there's also a friendly error
section with information that is meant to make sense to an end user. In the code, the model class for the full error report (including hostname, stack trace) is MSQErrorReport
. The model class for the friendly error (the error
key) is MSQFault
.
"errorReport": {
"taskId": "query-97e1513c-6c57-4484-8f01-ab4611af13de",
"host": "ip-a-b-c-d.ec2.internal:8100",
"error": {
"errorCode": "TooManyColumns",
"numColumns": 2003,
"maxColumns": 2000,
"errorMessage": "Too many output columns (requested = 2003, max = 2000)"
},
"exceptionStackTrace": "org.apache.druid.msq.indexing.error.MSQException: TooManyColumns: Too many output columns (requested = 2003, max = 2000)\n\tat org.apache.druid.msq.exec.QueryValidator.validateQueryDef(QueryValidator.java:48)\n\tat org.apache.druid.msq.exec.ControllerImpl.initializeQueryDefAndState(ControllerImpl.java:529)\n\tat org.apache.druid.msq.exec.ControllerImpl.runTask(ControllerImpl.java:346)\n\tat org.apache.druid.msq.exec.ControllerImpl.run(ControllerImpl.java:296)\n\tat org.apache.druid.msq.indexing.MSQControllerTask.run(MSQControllerTask.java:192)\n\tat org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:477)\n\tat org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:449)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n"
}
Thanks for writing this up. We also would like to be able to store these exceptions in the database. We do that already for task failures message for example but there is a limit to the size of error text. With that in mind, I am thinking that we have an error code and a verbose error message associated with that example. And we only need to store the error code in the database but not the verbose error message.
For example, a task failed because there were no valid input records. Now, that can be caused by a bad timestamp spec, a badly configured filter, or because data has multiple values for some column in each row. We don't want to put all these reasons in the database when storing task result status. I can store NO_VALID_INPUT_RECORDS
as the error code and then map it to a nice error message in the application. What do you think?
For example, a task failed because there were no valid input records. Now, that can be caused by a bad timestamp spec, a badly configured filter, or because data has multiple values for some column in each row. We don't want to put all these reasons in the database when storing task result status. I can store
NO_VALID_INPUT_RECORDS
as the error code and then map it to a nice error message in the application. What do you think?
IMO, in this example, NO_VALID_INPUT_RECORDS
is still hard for problem solving. When facing such error code, we still don't know what kind of exception that Druid encounters, is it due to bad timestamp spec or due to badly configured filter or sth else? We still need the error message to figure out what happened. So, error code needs to be more precise.
Even if we split such error code NO_VALID_INPUT_RECORDS
into 3 error codes to match each case, I think we still need to put the error message in the database. Because error message gives exact message to help us address problems.
For example, if a BAD_TIMESTAMP
error is raised, without the exception message, we don't know what the error input is, and we don't know we know where the source is and have no idea where we should fix the problem.
But I think this is out of the scope of the main topic. I totally agree that we should add a global error code to the exception, but how to design an error code is more complicated.
A builder pattern is friendly for developers. It's a nice API.
throw DruidException
.userError()
.error(DruidException.INVALID_SQL)
.message(“The table ‘%s’ is not defined in schema ‘%s’.", tableName, schemaName)
.context("location", "Line %d, column %d", node.lineNo, node.columnNo)
.build();
Take this as an example, I actually don't understand the difference of message
and context
even though I know the literal meaning of these two words. I mean sometimes it's confusion for developers to figure out whether we should put the information in the context
or message
. When to use message
and when to use context
? We can of course put the location
information above in the message
too.
For example, if a BAD_TIMESTAMP error is raised, without the exception message, we don't know what the error input is, and we don't know we know where the source is and have no idea where we should fix the problem.
That info should be stored in the context. The error code will be similar to the class-name of an exception. so you have a BAD_TIMESTAMP error that is equivalent to BadTimestampException in a different world. There is going to be a description of this error code. The developer cannot change that description. This description is what a developer will add to the documentation (https://druid.apache.org/docs/latest/multi-stage-query/reference.html#error-codes). Developers can, however, pass supplement info to the error message.
@gianm, thank you for the suggestion to look at the MSQ errors. The example you showed is very much in line with the idea outlined in the proposal. The format of the example is somewhat MSQ-specific (with row counts, query IDs, etc.) We can reconcile it with the proposal by viewing the MSQ-specific items as the "context" suggested in the proposal.
@FrankChen021, this example may clarify the meaning of "context": it is additional information relevant to the error in question. Different errors have different "extra" context (line number for an SQL error, record number for an ingest error, segment ID for other things, etc.) The context provides a generic way of handling situation-specific information useful to the user.
Making up a quick straw-man example:
throw DruidException
.userError("TooManyColumns")
.error(DruidException.RUNTIME_ERROR)
.message("Too many output columns (requested = %d, max = %d)", colCount, COL_LIMIT)
.context( "numColumns", colCount)
.context("maxColumns", 2000)
.build();
One could image this being serialized to JSON as:
"error": {
"code": "TooManyColumns",
"kind": "Runtime Error",
"message": "Too many output columns (requested = 2003, max = 2000)"
"context": {
"numColumns": 2003,
"maxColumns": 2000
}
},
Using your TooManyColumns
example, I hunted down how things are handled in MSQ today. Looks like there are multiple Fault
classes, with JSON-serializable fields to capture the error-specific information. This works, but may lead to a large number of classes. The DruidException
, by contrast, tries to be a one-size-fits-all error.
This raises the question, what if we want to catch certain exceptions? Generally, a DruidException
would occur when an operation-fatal error occurs that will bubble back up to the user. If we have an "internal" exception that we will handle elsewhere, we'd continue to use "normal" exceptions. And, of course, if it makes sense to have subclasses for different use cases, we could do so.
@abhishekagarwal87, good point about storing errors in a DB. I suppose the format depends on the purpose. If we want general trends, we'd store very course grain. User error count can suggest the need for training. Config errors counts more than 0 mean we've got to look at our system setup. Runtime errors may indicate we're reaching cluster capacity, etc.
If we want to track down issues, then, as @FrankChen021, suggests, we'd need a bit more detail. Perhaps the first 30 characters of the error message (which forces us to start with the important info.)
If we can more clearly characterize the purpose of the DB, and what we want to capture, we can design the DruidException
mechanism to classify errors in a useful way.
@gianm's example points out one more item about errors: some of the information the user wants is not available at the point that the error is thrown. In Gian's example, the query ID and task ID are not likely known to the thing that is computing row layout and hits the column limit. The MSQ (really, an Overlord task) has a standard way to handle such things at the "top" of a task: if stuff fails, create a report with the overall context.
Still, there are things in the "middle" not known to the top level, but not known at the site of the error. For example, consider a bad record. We'd like to know the row that failed (which we probably have) and the query (which MSQ can provide.) But, we'd also like to know the file being ingested, among the 50 for this job. How do we get this mid-level context?
In Drill, we sometimes passed the information down to the error site via an "error context." That works, but is rather unsatisfying: it is extra work needed just because Drill's UserError
is immutable, and logged at the point it is created. We can learn from that and try a different approach: allow the DruidException
to be extended as it bubbles back up. For our bad-record example, making up a simple file parser:
try (Handle handle = openFile(fileName)) {
while (handle.hasMore()) {
parseRow(handle.getLine());
}
} catch (DruidException e) {
throw e.withContext("file name", fileName, "line number", handle.getLineNumber);
}
Now the user gets the specific error ("cannot parse JSON record"), the high-level context (query id) and the mid-level context (the file and line number).
Errors have a point at which they are turned over to the user. That point is the REST request for a query, or the task for MSQ, etc. To make the above work, we'd log the error just before we forward it to the user. This logging would be the perfect place to capture the error for insertion into @abhishekagarwal87's DB (to save the need to parse the logs.)
Extended the proposal with an HTTP ResponseBuilder
to provide a uniform HTTP error response format.
After some discussion, here is an alternative approach to satisfy some additional requirements. This is primarily for discussion to help brainstorm a solution.
- Error messages must be localizable: not so much for language, as to use terminology consistent with a deployment or to redact information not appropriate in a given context.
- Every error must have a unique error code (also required to enable the message catalog).
- Follow the MSQ pattern where possible.
Overview
Given this, we propose that all errors have a unique text error code. We can standardize around a pattern such as "SQL" + "Validation" + "SpecificError" or "MSQ" + "IO" + "SpecificError". Error codes form a hierarchy.
Each error provides a set of named parameters that are interpolated into the error text. For example:
Code: ['SQL'. 'Validation', 'ColumNotFound']
Column: foo
Line: 4
Position: 3
ID: 0000-1111-2222...
Each error instance is given a unique UUID to allow system operators to find the specific error instance in logs that users saw in a UI.
Then, the message catalog can have a variety of messages depending on the level of detail appropriate to given deployment:
For example, for out-of-the-box Druid:
Line [{Line}], Column [{Position]]: Column [{Column}] was not found in any table in the query.
For a deployment when the front end allows the user to select columns, but not edit the SQL:
No such field [{Column}]
For a deployment where the SQL queries are hard-coded, power a dashboard, and only a development team can modify the SQL. The error indicates a bug in the application:
Invalid SQL: contact Support. Case {ID}
The message "translations" would follow the resource bundle mechanism originally used for language translations.
DruidException
To make this work, each application would be a subclass of a DruidException
which has a set of attributes, but not a message:
- Error code
- Event ID
- Parameters, as fields on the class
The text of each message is given in message catalog, represented as a properties file following the Java I18N pattern. For example, for the stock druid.properties
:
SQLValidationColumNotFound=Line [{Line}], Column [{Position]]: Column [{Column}] was not found in any table in the query.
Each deployment can provide a custom file. The file overrides the standard Druid messages. Messages not defined in the override catalog use the Druid version. For example for a deployment for ABC Co., define abc.properties
:
SQLValidationColumNotFound=No such column '{Column}'
Logging
Each exception can be logged. Logging can be set up based on the code hierarchy. For example, on out-of-the-box Druid, SQL errors might not be logged because they are directed at users. Other sites may choose to log everything. Logging of errors uses a site-defined format. Options may include:
- Include the ID or not.
- Include the error code or not.
- Use message or individual fields instead.
For backward compatibility, out-of-the-box may log the error messages formatted from the catalog. Other sides may want messages of the form:
{ID} {Code} {Params}
Example:
0000-1111-2222...: SQL-Validation-ColumNotFound - Line: 4, Position: 3, Column: 'foo'
To make the above work, every exception is a subclass of DruidException
. Each class can represent one or more error codes. The exception includes parameters as fields. The class is Jackson-serializable so it can be passed between nodes. The class provides metadata about the error, the field serialization order for logs, etc.
REST API
The prior proposal suggested returning the details to the client to sort out the correct formatting. This proposal pushes that work to Druid. As a result, the SQL REST API message format undergoes minimal change:
{
"error": "<error code in form X-Y-Z>",
"errorMessage": "<message after message catalog formatting>"
}
Example for the out-of-the-box Druid:
{
"error": "SQL-Validation-ColumNotFound",
"errorMessage": "Line [4], Column [3]: Column [foo] was not found in any table in the query."
}
The same error, using the "sanitized" version above:
{
"error": "SQL-Validation-ColumNotFound",
"errorMessage": "No such column 'foo'"
}
For messages returned from the Broker, we omit the host
and errorClass
fields which tend to reveal information that some systems may wish to redact. The errorClass
would change anyway given the new implementation. This is a chance to not expose implementation classes to the user. The host
, if wanted, can be formatted into the message just like any other field, or redacted if the host is not appropriate in a given situation.
All other REST APIs use ad-hoc formats. To avoid breaking clients, we retain these ad-hoc formats (text, arbitrary JSON, etc.). For each, we use the formatted message from the message catalog.
This approach got the discussion rolling, but we'll likely do something different.