sql icon indicating copy to clipboard operation
sql copied to clipboard

[RFC] Error Report-building Mechanism for PPL Errors

Open Swiddis opened this issue 1 month ago • 2 comments

Problem Statement

End-users currently receive error messages with minimal context about how errors occurred, making it difficult to diagnose and fix issues. When exceptions propagate through multiple layers of the system (parsing → semantic analysis → execution → REST response), contextual information is lost, leaving users with generic error messages that don't provide actionable guidance.

Examples of current pain points:

  • #4872: Error shows Failed to read mapping for index pattern [[Ljava.lang.String;@9acdf8] instead of explaining which indices have incompatible mappings
  • #4869: Error shows "Insufficient resources to run the query" without indicating which resource (memory/CPU), current values, or how to fix it
  • #4896: Error is ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0 with no context on what type of array might have been involved. In this case, we removed the error condition entirely, but debugging took several hours from multiple engineers

Current State

The OpenSearch SQL/PPL plugin uses a simple exception hierarchy where exceptions carry only:

  1. Exception type (e.g., SemanticCheckException)
  2. A single error message string
  3. Optional cause chain (for wrapped exceptions)

Architecture limitations:

  • No context accumulation: As errors bubble up through layers (Parser → Analyzer → Executor → REST), intermediate layers cannot add context without modifying the original exception message
  • Single-point error creation: All context must be known at the point where the exception is first thrown
  • Loss of structured information: Important details like query text, table names, field positions, datasource IDs, and execution context are unavailable to error handlers
  • Generic error responses: The ErrorMessage class can only extract type, reason, and details from the exception's localized message

Example error flow:

[Semantic Analyzer] throw new SemanticCheckException("Field 'x' not found")
  ↓
[Query Service] catch + rethrow (no context added)
  ↓
[REST Handler] ErrorMessage.toString() → {"type": "SemanticCheckException", "details": "Field 'x' not found"}
  ↓
[User] Receives minimal context, must guess which table, datasource, or query caused the issue

Long-Term Goals

  • Rich error context chains: Users should see the full story of how an error occurred, with context added at each layer:

    Error: Field 'timestamp' not found
    -> in table 'logs' (index pattern: 'logs-*')
    -> while analyzing query: "source=logs-* | fields timestamp, message | where status > 500"
    -> at position: line 1, column 25
    -> datasource: 'default-opensearch'
    
  • Actionable error messages: Each error should guide users toward resolution with specific details about what went wrong and how to fix it

  • Structured error information: Enable programmatic error handling with machine-readable error codes and structured context fields

  • Consistent error experience: Unified error handling across SQL, PPL, Calcite, and legacy engines

How confident are we that this solution solves the problem?

High confidence. Similar mechanisms (Rust's anyhow/eyre, Java's Spring NestedExceptionUtils, Python's exception chaining) have proven effective in providing rich error context in existing projects. The OpenSearch SQL plugin's multi-layered architecture works well for context accumulation at each layer.

Proposal

Introduce a Report-building error mechanism inspired by Rust's anyhow/eyre libraries that:

  1. Wraps exceptions with contextual information as they propagate through system layers
  2. Accumulates context without modifying original exception messages
  3. Formats rich error responses with context chains for end-users
  4. Maintains backwards compatibility with existing exception types and error handlers

Core components:

// Report wrapper that accumulates context
public class ErrorReport {
    private final Throwable rootCause;
    private final List<ErrorContext> contextChain;
    
    // Idempotent wrapper: determines whether `e` contains a Report and either returns that report or starts a new one.
    public static ErrorReport wrap(Throwable e);

    public ErrorReport context(String key, Object value);
    public ErrorReport context(ErrorContext ctx);
    public String toDetailedMessage();
    public JSONObject toJSON();
}

// Context information at each layer
public class ErrorContext {
    private final String layer;  // "Parser", "Analyzer", "Executor". Can come from a constants file for consistency.
    private final Map<String, Object> attributes;  // query, table, position, datasource, etc.
}

// Enhanced error response
{
  "status": 400,
  "error": {
    "type": "SemanticCheckException",
    "code": "FIELD_NOT_FOUND",
    "reason": "Invalid Query",
    "details": "Field 'timestamp' not found in table 'logs'",
    "context": [
      {"layer": "semantic_analysis", "table": "logs", "index_pattern": "logs-*"},
      {"layer": "query_parser", "query": "source=logs-* | fields timestamp", "position": {"line": 1, "column": 25}},
      {"layer": "datasource", "datasource_name": "default-opensearch"}
    ],
    "suggestion": "Check that field 'timestamp' exists in indices matching 'logs-*' pattern"
  }
}

This structured output approach makes it straightforward for the front-end to enrich the error presentation, e.g. highlighting the position of the error.

Approach

Phase 1: Core Infrastructure (Non-breaking)

  1. Create Report Wrapper Classes

    • ErrorReport: Wrapper that accumulates context around any Throwable
    • ErrorContext: Structured context at each layer (query, position, datasource, table, etc.)
    • ErrorCode: Enum of machine-readable error codes (FIELD_NOT_FOUND, SYNTAX_ERROR, INDEX_NOT_FOUND, etc.)
  2. Enhance Exception Base Classes

    • Add optional ErrorReport field to QueryEngineException base class
    • Add builder methods: .withContext(key, value), .withQuery(query), .withPosition(line, col)
    • Maintain backwards compatibility: existing constructors work unchanged
  3. Update ErrorMessage Formatting

    • Enhance ErrorMessage.java to extract and format ErrorReport if present
    • Fall back to current behavior for exceptions without reports

Phase 2: Incremental Context Adoption

  1. Add Context at Key Layers (can be done incrementally per component):

    • Parsing layer (PPLSyntaxParser, SQLSyntaxParser): Add query text, position info
    • Semantic analysis (Analyzer): Add table names, field names, datasource context
    • Query execution (QueryService): Add execution engine (Calcite/V2/Legacy), memory usage
    • REST handlers (RestPPLQueryAction, RestSQLQueryAction): Add request ID, user context
  2. Enhance Specific Error Types (prioritize by user impact):

    • Start with errors from GH issues #4872, #4869
    • SemanticCheckException: Add table, field, datasource context
    • MemoryUsageException: Add current/max memory, settings to adjust
    • Index pattern errors: Add which indices matched, mapping differences

Implementation details

// Example: Adding context in semantic analyzer
public class Analyzer {
    public LogicalPlan analyze(UnresolvedPlan plan, AnalysisContext context) {
        try {
            return doAnalyze(plan, context);
        } catch (SemanticCheckException e) {
            throw ErrorReport.wrap(e)
                    .withContext("table", context.getTableName())
                    .withContext("datasource", context.getDatasourceName())
                    .withContext("index_pattern", context.getIndexPattern());
        }
    }
}

// Example: Adding context at REST layer
public class RestPPLQueryAction {
    private void executeQuery(String query, ResponseListener listener) {
        try {
            pplService.execute(query, listener);
        } catch (Exception e) {
            ErrorReport report = ErrorReport.wrap(e)
                .withContext("query", query)
                .withContext("request_id", QueryContext.current().getRequestId())
                .withContext("format", format);

            listener.onFailure(report.toException());
        }
    }
}

// Example: Enhanced error response
public class ErrorMessage {
    protected String fetchDetails() {
        if (exception instanceof QueryEngineException qee && qee.hasReport()) {
            return qee.getReport().toDetailedMessage();
        }
        return exception.getLocalizedMessage();
    }

    private JSONObject getErrorAsJson() {
        JSONObject errorJson = new JSONObject();
        errorJson.put("type", type);
        errorJson.put("reason", reason);
        errorJson.put("details", details);

        // Add context chain if available
        if (exception instanceof QueryEngineException qee && qee.hasReport()) {
            errorJson.put("code", qee.getReport().getErrorCode());
            errorJson.put("context", qee.getReport().getContextChain());
            errorJson.put("suggestion", qee.getReport().getSuggestion());
        }

        return errorJson;
    }
}

Alternatives

1. Structured Exception Messages (Simpler approach)

Instead of wrapping exceptions, enhance exception constructors to accept structured context:

public class SemanticCheckException extends QueryEngineException {
    public SemanticCheckException(String message,
                                  String table,
                                  String field,
                                  String datasource) {
        super(formatMessage(message, table, field, datasource));
    }
}

Pros:

  • Simpler implementation, less infrastructure
  • Backwards compatible if old constructors remain

Cons:

  • Context must be known at exception creation time
  • Each exception class needs custom constructor for its context
  • Intermediate layers can't add context without catching/rethrowing with new exception
  • Harder to extract context programmatically (must parse message string)

2. ThreadLocal Context (MDC-style)

Use ThreadLocal storage to accumulate context throughout request lifecycle:

public class ErrorContext {
    private static final ThreadLocal<Map<String, Object>> context = new ThreadLocal<>();

    public static void addContext(String key, Object value) {
        context.get().put(key, value);
    }
}

Pros:

  • No changes to exception hierarchy
  • Context automatically available to all error handlers
  • Commonly used pattern (SLF4J MDC)

Cons:

  • Risk of context leaking between requests if not properly cleared
  • Harder to track which context is added where
  • Context not attached to specific exception, harder to serialize/deserialize
  • Complex in async execution contexts

3. Status Quo with Better Messages (Minimal change)

Keep current architecture but improve individual error messages:

  • Add more details to existing exception messages
  • Better formatting in ErrorMessage.fetchDetails()
  • Fix specific issues like #4872, #4869 individually

Pros:

  • No architectural changes
  • Can be done incrementally

Cons:

  • Doesn't solve systematic context loss problem
  • Still limited to single error message string
  • No structured error information for programmatic handling
  • Difficult to add context at intermediate layers

Swiddis avatar Dec 09 '25 00:12 Swiddis

Looks good.

One suggestion, we should add queryId in query context.

{ "status": 400, "error": { "type": "SemanticCheckException", "code": "FIELD_NOT_FOUND", "reason": "Invalid Query", "details": "Field 'timestamp' not found in table 'logs'", "context": [ {"layer": "semantic_analysis", "table": "logs", "index_pattern": "logs-"}, {"layer": "query_parser", "query": "source=logs- | fields timestamp", "position": {"line": 1, "column": 25}}, {"layer": "datasource", "datasource_name": "default-opensearch"} ], "suggestion": "Check that field 'timestamp' exists in indices matching 'logs-*' pattern" } }

  • Which component fill details, reasons? In this case, I assume analyzer?
  • Which component compose suggestion?

penghuo avatar Dec 10 '25 18:12 penghuo

Which component fill details, reasons? In this case, I assume analyzer? Which component compose suggestion?

I didn't do a great job of highlighting this. At its core: The Report will carry all error content, and any layer can enrich it with the builder methods. Report generation is a final serialization step.

These sections are available with the report class & can be set by any layer. The type, reason and details fields will probably default to the underlying exception, so we stay compatible with existing frontend error-handling logic.

try {
  // To resolve fields
} catch (SemanticCheckException e) { // A missing field
  String possibleField = lookup(actualField, availableFields);

  throw ErrorReport.wrap(e)
    .atLayer("field_resolution")
    .withDetails(String.format("Failed to resolve field: '%s'", actualField))
    .withSuggestion(String.format("Did you mean: '%s'", possibleField))
    .withContext("index_pattern", context.getIndexPattern());
}

I think there's some wiggle room for exactly how we define the builder interface, I could look into proof-of-concepting some handling of specific issues and testing a few approaches out. For instance, it might be more natural to make the context chain a string array (location here) and have all the structured context data (context here) be part of a single wide object.

/* Field resolution layer */
try {
  // To resolve fields
  throw SemanticCheckException(String.format("Failed to resolve field: '%s'", fieldName))
} catch (SemanticCheckException e) {
  String possibleField = lookup(fieldName, availableFields);

  throw ErrorReport.wrap(e)
    .code(FIELD_NOT_FOUND)
    .location("while resolving fields in the index mapping")
    .suggestion(String.format("Did you mean: '%s'", possibleField))
    .context("index_pattern", indexPatternName)
    .context("position", currentQueryPosition);
}

/* The outer planning process */
try {
  // Some process containing the above & potentially more context
} catch (PlanningException e) {
  throw ErrorReport.wrap(e)
    .location("while planning the query")
    .context("query", queryStr)
    .context("query_id", queryId); // <-- add this at the outermost layer
}

Which renders as something like:

{
  "status": 400,
  "error": {
    "type": "SemanticCheckException",
    "code": "FIELD_NOT_FOUND",
    "reason": "Invalid Query",
    "details": "Failed to resolve field 'foo'",
    "location": [
      "while planning the query",
      "while resolving fields in the index mapping"
    ],
    "context": {
      "index_pattern": "logs-*",
      "position": {"line": 1, "column": 25},
      "query": "source=logs-* | fields foo",
      "query_id": "b6627794-3939-4ac4-8c5b-821ccc400f4f"
    },
    "suggestion": "Did you mean: 'foobar'"
  }
}

For the most part, this is an aesthetic issue & separate from the technical core of the proposal. I'm open to any other preferences. I'll keep my biases open and say that a lot of my aesthetic taste here comes from working with the lovely eyre library.

Swiddis avatar Dec 10 '25 21:12 Swiddis