sql icon indicating copy to clipboard operation
sql copied to clipboard

[RFC] Support permissive mode in PPL

Open penghuo opened this issue 3 months ago • 3 comments

Problem Statement

OpenSearch PPL is strong-schema and fail-fast. When data is dirty or drifting, queries abort instead of degrading gracefully. At scale, this makes analytics and operations brittle.

  • Type/data mismatch. PPL evaluates expressions on strictly typed fields derived from mappings. Mixed representations—e.g., "1" (string), [1] (array) trigger cast/class errors that terminate the query and break dashboards. Operators lack TRY/SAFE semantics, so evaluation cannot proceed null-on-error. Example: source=log* | eval xx=flags+1 fails when any shard holds "1" or [1] in flags; desired behavior is 2 or NULL with an error counter.
  • Schema-missing (dynamic:false) fields are invisible to PPL. PPL relies on mappings for field addressability and typing. When objects are mapped with dynamic:false, their subfields are present in _source but unmapped, so PPL cannot reference them, even though the Query DSL can project them from _source. This creates a usability gap, forces reindex/mapping changes, or pushes users back to DSL. It also pressures teams toward mapping explosion for discoverability. Example: object event with dynamic:false; fields event.user returns 400 in PPL, while DSL with "_source": ["event.user"] succeeds.
  • Explict cast required when access extract field, When using the REX/PARSE/SPATH command to extract numeric values and then performing arithmetic operations on them, the system fails because the extracted values are treated as strings rather than numbers.
  • Cross-index type conflicts. In observability datasets, schemas drift: the same logical field may be string in some indices and long in others (often from JSON/text extraction). When a PPL query spans those indices, the engine effectively picks one variant to type the field, which is non-deterministic and error-prone.

Requirements

  • Coverage: Handle type/data mismatches, dynamic:false invisibility, and cross-index type conflicts under permissive TRY/SAFE semantics.
  • Performance: No performance regression.
  • Reusability: Engine-agnostic design reusable for PPL on other engines, including Spark.

Expected Behaviour

Setting: ppl.query.permissive=true, default is false, true enable permissive mode. Scenario: intA * stringB where intA is mapped as integer and stringB is unmapped. Analyze stage, The planner resolves intA as type INTEGER. Since stringB is unmapped, the planner treats it as ANY (unknown at plan time). Function resolution stage, The engine selects the multiplication function with signature (INTEGER, INTEGER). It automatically inserts safe casts so that both operands conform to the required types. The final expression is *( safe_cast(intA AS INTEGER), safe_cast(stringB AS INTEGER) ). Data scan and load, Rows are read from OpenSearch as Object[]. No premature casting occurs during scanning. Data processing (compiled expression), For each row, the engine attempts to convert intA and stringB to integers. If conversion fails for either operand, the value becomes NULL and an error counter is incremented for telemetry. If both casts succeed, the product is computed. If either cast fails, the result is NULL.


public Object current() {
  final Object[] current = (Object[]) inputEnumerator.current();
  final Integer v1 = current[0] == null ? null : (Integer) new org.apache.calcite.linq4j.function.Function0() {
    public Object apply() {
      try {
        return org.apache.calcite.runtime.SqlFunctions.toInt(current[0]);
      } catch (Exception e) {
        return null;
      }
    }
  }.apply()

  final Integer v2 = current[1] == null ? null : (Integer) new org.apache.calcite.linq4j.function.Function0() {
    public Object apply() {
      try {
        return org.apache.calcite.runtime.SqlFunctions.toInt(current[1]);
      } catch (Exception e) {
        return null;
      }
    }
  }.apply();
  return new Object[] {Integer.valueOf(v1 * v2)};
}

Results, PPL correctly process data and tolerate type/data mismatch. e.g.

intA stringB intA * stringB comments
2 "3" 6
"2" "3" 6 "2" can convert to int, return 6
2 "3a" null 3a can not be cast to int, return null
[2, 3, 4] "3" null array [2,3,4] can not be cast to int, return null

Decision Points

1.Type to use for missing or unknown fields Two models are viable.

  • String (Java String) may require extra parsing when numeric or boolean semantics are needed.
  • Any (Java Object) minimizes unnecessary conversions but demands strict runtime checks. (Preferred)

2.Function resolution for unknown fields Every function has one or more signatures. During resolution, the engine maps input field types, including unknowns, to compatible target types and inserts casts as needed. To handle type and data mismatches, choose a consistent approach:

  • Normalize through strings by applying safe_cast(field AS STRING) followed by safe_cast(STRING AS expected_type) to harmonize arrays, numerics, and booleans.
  • Keep values as Object , always surround inputs with safe_cast(field AS expected_type) so that failures yield NULL instead of exceptions. (Preferred)

3.Data loaded from OpenSearch There are two loading strategies.

  • Convert all values to strings when constructing rows. This simplifies downstream logic but can increase CPU cost for numeric workloads.
  • Preserve native Java types in rows and cast on demand. (Current approach)

4.Output schema publication When mappings are reliable, publish dataset-driven types. When mappings are unavailable or inconsistent, derive the output schema from operator requirements and the inserted casts. Fields affected by TRY semantics should be nullable, and the response should include drift and casting counters.

penghuo avatar Sep 22 '25 21:09 penghuo

Is ppl.query.permissive=true cluster setting? Wondering if there is requirement to enable/disable per query or per index.

ykmr1224 avatar Oct 15 '25 19:10 ykmr1224

Wondering if there is requirement to enable/disable per query or per index.

I would think so -- I don't want to lose performance and safety for all my static queries because our team has one query that's dynamic. On instinct I'd expect a dynamic keyword to be sufficient: dynamic source=big5 | eval ...

Swiddis avatar Oct 27 '25 22:10 Swiddis

Support https://github.com/opensearch-project/sql/pull/4754 on schema-less mode is challenging.

penghuo avatar Dec 15 '25 17:12 penghuo