
[RFC] Support Array type: Calcite Schema-on-Read

Open LantaoJin opened this issue 1 month ago • 3 comments

Problem Statement

PPL with Calcite enabled cannot handle fields with array values. The issue is described in https://github.com/opensearch-project/sql/issues/4173.

Current State

https://github.com/opensearch-project/sql/pull/4909 provided a mitigation that handles those array columns by picking the first value instead of failing, as a short-term solution. But array values are common in OpenTelemetry metrics/logs, so as a long-term solution Calcite should identify whether a field/column is a real array type or not. This information is not provided in the OpenSearch mapping. For example, the mapping in OpenSearch is

{
  "mappings": {
    "properties": {
      "nums": {
        "type": "long"
      },
      "others": {
        "type": "keyword"
      }
    }
  }
}

But a stored document may look like

{ "nums": [1, 2, 3], "others": ["a", "b", "c"] }

Assumption

We assume that a field is either entirely array type or entirely non-array type. Cases where some documents hold arrays and others hold scalars are out of scope for this solution, at least in the OpenTelemetry scenario.

Long-Term Goals

  • The schema in Calcite (a table name (String) with its RelDataType) can recover the correct data type (array type or non-array type) on the fly.
  • The feature can be disabled in favor of a strict-schema mode (forcing use of the existing OpenSearch mapping).
  • When the feature is enabled, there is no significant performance degradation; an overhead of 5% or less is acceptable.

Proposal

Build Calcite table/index schema-on-read for the OpenSearch datasource.

Approach

Background: Currently, Calcite builds its catalog in memory by registering a table and its schema at the user's first query (lazily). During table registration, we send a GetMappingsRequest to the server via the node client and convert the mapping to a RelDataType. Since there is no array type in OpenSearch, the column type of nums is converted to Long and others to String (in the example above).
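The mapping-to-type conversion can be pictured as below (a hypothetical Python stand-in — the real conversion is Java/Calcite code, and the type names here are illustrative only):

```python
# Hypothetical stand-in for the mapping -> RelDataType conversion performed
# at table registration. The real implementation is Java/Calcite.
OPENSEARCH_TO_SQL = {"long": "BIGINT", "keyword": "VARCHAR"}

def column_type(opensearch_type, is_array=False):
    base = OPENSEARCH_TO_SQL[opensearch_type]
    return f"ARRAY<{base}>" if is_array else base

# Today the array flag is unknowable from the mapping, so "nums" registers
# as plain BIGINT:
print(column_type("long"))                  # BIGINT
# With schema-on-read detection it could register as ARRAY<BIGINT>:
print(column_type("long", is_array=True))   # ARRAY<BIGINT>
```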

The new approach includes several steps:

  1. Send a GetMappingsRequest to the server.
  2. When the feature is enabled, send a MultiSearchRequest to the server:
GET /_msearch
{ "index": "test"}
{ "query": {"exists": { "field": "nums" } }, "size": 1, "_source": "nums" }
{ "index": "test"}
{ "query": {"exists": { "field": "others" } }, "size": 1, "_source": "others" }
  3. If the SearchResponse returns an array result, map the Long type to Array<Long>.
  4. If the SearchResponse returns empty, meaning this existing field holds no data yet, resubmit the same SearchRequest for this field in an async scheduler that refreshes the schema (replacing the existing table schema) at a configurable interval (e.g. 10 min, 30 min, 1 hour), until data is returned or the maximum number of refreshes (e.g. 3) is exceeded.
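Step 2 above can be sketched as follows (illustrative Python only — the plugin builds the request in Java via the node client; the helper name is hypothetical):

```python
import json

# Illustrative sketch of step 2: build the NDJSON body for one
# MultiSearchRequest that samples a single document per field, restricted
# to documents where the field actually exists.
def build_msearch_body(index, fields):
    lines = []
    for field in fields:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({
            "query": {"exists": {"field": field}},
            "size": 1,
            "_source": field,
        }))
    return "\n".join(lines) + "\n"  # _msearch bodies must end with a newline

print(build_msearch_body("test", ["nums", "others"]))
```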

Alternative 1: Add an ActionFilter to monitor index actions, such as IndexCreate, IndexDelete, Bulk, etc., rather than the push mode above.

Alternative 2: Store all columns as VARIANT type.

Implementation Discussion

  • The ActionFilter (mentioned in alternative 1) can reduce the frequent async queries, but it may add cost to intercept all actions in the transport layer.
  • The VARIANT solution (mentioned in alternative 2) is a big change that also carries a performance degradation risk, especially on a large dataset.
  • This solution requires fewer changes and causes no significant performance degradation on a large dataset.

— LantaoJin, Dec 11 '25 09:12

> We assume that a field is either entirely array type or entirely non-array type.

The assumption may not hold for log and trace use cases.

> VARIANT solution (mentioned in alternative 2) is a big changing which also has performance downgrade risk, especially on a large dataset.

If the assumption does not hold, we should support fields holding data of any type; this is a very common ask in log/trace. I think VARIANT is the solution we should explore. ClickHouse's Dynamic data type looks promising to me.

— penghuo, Dec 11 '25 22:12

> We assume that a field is either entirely array type or entirely non-array type.

> The assumption may be not true for log and trace use case.

Can you provide some use cases? After reviewing the sample data of jaeger-span-2025-12-11, ss4o_metrics-otel-2025.12.11, and logs-otel-v1-000008, I came up with this assumption; I didn't find a mixed case.

> VARIANT solution (mentioned in alternative 2) is a big changing which also has performance downgrade risk, especially on a large dataset.

> If assumption does not hold. We should support field with any type data. this is very command asking in log/trace. I think VARIANT is solution we should explore. Clickhouse Dynamic data type looks promising to me.

From a technical perspective, VARIANT can be used for dynamic data, but storing all columns as VARIANT type may not be a common solution. Even in ClickHouse, the user must specify which columns are dynamic. Defining all columns as VARIANT type may cause all current operators and functions to behave unexpectedly.

From an ROI perspective, the change in this proposal is small and requires about 1 week to implement; I'm not sure how much effort it would take to support VARIANT (maybe 2 months or longer). BTW, even if we support defining all columns as VARIANT, this feature should be applied at the index level IMO.

Even if the above assumption is not met, we can still treat the schema-on-read feature as the default way to identify true array types in OpenSearch. For mixed-type data, we will keep the current behavior of reading the first element, or read-only access.

A further refinement would be to discover certain columns as mixed types through schema-on-read, allowing us to use VARIANT only for those columns; unfortunately, that judgment cannot be made through simple sampling.

— LantaoJin, Dec 12 '25 02:12

> Can you provide some user cases? After reviewing the sample data of jaeger-span-2025-12-11, ss4o_metrics-otel-2025.12.11, and logs-otel-v1-000008., I came up with this assumption, I didn't find a max case.

OTEL logs/traces define the Attributes data type as map<string, any>; in your example the values may always be ARRAY, but that depends on the user's data.
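For instance (a hypothetical pair of OTEL log records, not taken from the sampled indices), the same attribute key can carry a scalar in one document and an array in another, which would break the all-or-nothing assumption:

```json
{ "attributes": { "client.tag": "blue" } }
{ "attributes": { "client.tag": ["blue", "canary"] } }
```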

> From the perspective ROI, the change in this proposal is less and requires 1 week to implement, not sure how much effort will take to support VARIANT (maybe 2 months or longer).

I think we both agree VARIANT is the long-term solution, but we need a short-term fix, right? I have a couple of questions about the short-term fix.

  1. What are the limitations of the ARRAY support? Only the source and fields commands, with no function support, correct?
  2. What is the latency of running a MultiSearchRequest on an index with 100 / 1000 total fields (top-level and nested)?
  3. Instead of sampling, what if we asked the user to tell us which fields are ARRAY? This info could be stored in index mapping metadata or cluster settings.
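The third option could look like the following (the `array_fields` key name is hypothetical; `_meta` is the standard free-form metadata section of an OpenSearch mapping):

```json
PUT /test
{
  "mappings": {
    "_meta": {
      "array_fields": ["nums", "others"]
    },
    "properties": {
      "nums": { "type": "long" },
      "others": { "type": "keyword" }
    }
  }
}
```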

— penghuo, Dec 12 '25 17:12