sql icon indicating copy to clipboard operation
sql copied to clipboard

Add native support for mvexpand command in PPL

Open srikanthpadakanti opened this issue 3 months ago • 3 comments

Is your feature request related to a problem? Currently, OpenSearch SQL does not provide a native way to expand multivalue fields into separate rows/events, similar to Splunk's mvexpand command. This limits the ability to efficiently work with and analyze multivalue fields directly within queries, which is a common use case in log analytics and data exploration.

What solution would you like? Introduce an mvexpand command (or equivalent SQL function/operator) in OpenSearch SQL that:

  • Expands a multivalue field into separate rows, one per value in the field, for each input event/document.
  • Supports an optional limit parameter to restrict how many values per event/document are expanded (similar to mvexpand <field> [limit=<int>] in Splunk).
  • Handles memory/resource limits similar to Splunk's RAM usage (with configurable limits and clear error messaging if limits are exceeded).
  • Works as a streaming/distributable command for performance at scale.

What alternatives have you considered?

  • Manual preprocessing of documents to flatten multivalue fields before indexing, which is inflexible and not always possible.
  • Using scripting or custom plugins, which increases complexity for users and maintainers.
  • Workarounds using nested queries or Painless scripts, which are less intuitive and harder for users familiar with Splunk/SQL semantics.

Do you have any additional context? This feature would help users migrating from Splunk or other analytics platforms that rely on multivalue field expansion. For reference, see Splunk's mvexpand documentation:

Expands the values of a multivalue field into separate events, one event for each value in the multivalue field. For each result, the mvexpand command creates a new result for every multivalue field.

Example usage in Splunk:

... | mvexpand foo
... | mvexpand foo limit=100

It would be valuable to have similar functionality in OpenSearch SQL to facilitate analytics, dashboarding, and data preparation tasks involving arrays or multivalue fields.


Examples and Edge Cases:

  1. Basic Example

    • Input: A document with a field tags: ["error", "warning", "info"]
    • Query: SELECT mvexpand(tags) FROM logs
    • Output:
      tags
      error
      warning
      info
  2. With Limit Option

    • Input: A document with a field ids: [1, 2, 3, 4, 5]
    • Query: SELECT mvexpand(ids, limit=3) FROM docs
    • Output:
      ids
      1
      2
      3
  3. Multiple Multivalue Fields (Edge Case)

    • Input: Document with a: [10, 20], b: [100, 200]
    • Query: Should clarify whether expansion is supported on more than one field at a time. If not, document the limitation.
  4. Empty or Null Arrays (Edge Case)

    • Input: tags: [] or tags: null
    • Output: No rows or a row with null value, depending on desired behavior (should be documented).
  5. Large Arrays / Memory Limits (Edge Case)

    • Input: Document with a field containing thousands of values
    • Behavior: Should trigger a warning or error if memory/resource usage exceeds limits, and document how this is handled.
  6. Non-Array Fields (Edge Case)

    • Input: tags: "error" (not an array)
    • Output: Should document if the field is left as-is, converted to a single-value array, or an error is thrown.

srikanthpadakanti avatar Oct 06 '25 15:10 srikanthpadakanti

@srikanthpadakanti do u want to contribute?

penghuo avatar Oct 07 '25 17:10 penghuo

@srikanthpadakanti do u want to contribute?

Hello @penghuo Yes. I want to contribute and work in progress.

Before that, a quick follow-up on your suggestion about referencing mvjoin. Just to clarify, mvjoin is a scalar function (doesn’t change row count, just joins array elements as a string), so it plugs in pretty cleanly as an eval/expr function.

But mvexpand is a pipeline operator that explodes a row into multiple rows, so its implementation is more like unnest/explode in SQL systems—needs full-blown physical/logical plan nodes, different chaining, etc. Is there a specific aspect of the mvjoin implementation you wanted me to look at for inspiration? Or was it just for general reference? Let me know. Thanks.

srikanthpadakanti avatar Oct 10 '25 18:10 srikanthpadakanti

PFA for manual tests

mvexpand_manual_test_results.md

srikanthpadakanti avatar Oct 27 '25 19:10 srikanthpadakanti