Add native support for mvexpand command in PPL
Is your feature request related to a problem?
Currently, OpenSearch SQL does not provide a native way to expand multivalue fields into separate rows/events, similar to Splunk's mvexpand command. This limits the ability to efficiently work with and analyze multivalue fields directly within queries, which is a common use case in log analytics and data exploration.
What solution would you like?
Introduce an mvexpand command (or equivalent SQL function/operator) in OpenSearch SQL that:
- Expands a multivalue field into separate rows, one per value in the field, for each input event/document.
- Supports an optional
limitparameter to restrict how many values per event/document are expanded (similar tomvexpand <field> [limit=<int>]in Splunk). - Handles memory/resource limits similar to Splunk's RAM usage (with configurable limits and clear error messaging if limits are exceeded).
- Works as a streaming/distributable command for performance at scale.
What alternatives have you considered?
- Manual preprocessing of documents to flatten multivalue fields before indexing, which is inflexible and not always possible.
- Using scripting or custom plugins, which increases complexity for users and maintainers.
- Workarounds using nested queries or Painless scripts, which are less intuitive and harder for users familiar with Splunk/SQL semantics.
Do you have any additional context?
This feature would help users migrating from Splunk or other analytics platforms that rely on multivalue field expansion. For reference, see Splunk's mvexpand documentation:
Expands the values of a multivalue field into separate events, one event for each value in the multivalue field. For each result, the mvexpand command creates a new result for every multivalue field.
Example usage in Splunk:
... | mvexpand foo
... | mvexpand foo limit=100
It would be valuable to have similar functionality in OpenSearch SQL to facilitate analytics, dashboarding, and data preparation tasks involving arrays or multivalue fields.
Examples and Edge Cases:
-
Basic Example
- Input: A document with a field
tags: ["error", "warning", "info"] - Query:
SELECT mvexpand(tags) FROM logs - Output:
tags error warning info
- Input: A document with a field
-
With Limit Option
- Input: A document with a field
ids: [1, 2, 3, 4, 5] - Query:
SELECT mvexpand(ids, limit=3) FROM docs - Output:
ids 1 2 3
- Input: A document with a field
-
Multiple Multivalue Fields (Edge Case)
- Input: Document with
a: [10, 20],b: [100, 200] - Query: Should clarify whether expansion is supported on more than one field at a time. If not, document the limitation.
- Input: Document with
-
Empty or Null Arrays (Edge Case)
- Input:
tags: []ortags: null - Output: No rows or a row with null value, depending on desired behavior (should be documented).
- Input:
-
Large Arrays / Memory Limits (Edge Case)
- Input: Document with a field containing thousands of values
- Behavior: Should trigger a warning or error if memory/resource usage exceeds limits, and document how this is handled.
-
Non-Array Fields (Edge Case)
- Input:
tags: "error"(not an array) - Output: Should document if the field is left as-is, converted to a single-value array, or an error is thrown.
- Input:
@srikanthpadakanti do u want to contribute?
@srikanthpadakanti do u want to contribute?
Hello @penghuo Yes. I want to contribute and work in progress.
Before that, a quick follow-up on your suggestion about referencing mvjoin. Just to clarify, mvjoin is a scalar function (doesn’t change row count, just joins array elements as a string), so it plugs in pretty cleanly as an eval/expr function.
But mvexpand is a pipeline operator that explodes a row into multiple rows, so its implementation is more like unnest/explode in SQL systems—needs full-blown physical/logical plan nodes, different chaining, etc. Is there a specific aspect of the mvjoin implementation you wanted me to look at for inspiration? Or was it just for general reference? Let me know. Thanks.