sql icon indicating copy to clipboard operation
sql copied to clipboard

[FEATURE] Add mvcombine command in PPL

Open srikanthpadakanti opened this issue 2 months ago • 3 comments

Proposal — Command syntax, options, and precise semantics

Command - mvcombine <field> [delim="<string>"]

Arguments

  • field (required): the field whose values will be combined.
  • delim (optional): join delimiter. If provided the combined values are returned as a single string joined by delim; if omitted the combined values are returned as a multivalue array.

Precise semantics

  • mvcombine operates on an ordered stream of results.
  • For each contiguous run of rows where every field except <field> is equal, emit a single row:
    • non-target fields are preserved (they are identical across the run),
    • <field> becomes:
      • an array of run values (preserving run order) if delim not provided, or
      • a string of run values joined by delim if delim provided.
  • Only consecutive (adjacent) rows are combined. To combine non-adjacent values, results must be sorted/grouped first so that desired values become adjacent.
  • Null/empty target values: preserved in arrays; when delim is used they appear as empty tokens (implementation detail; see open questions).
  • The operator does not re-key/aggregate across non-adjacent rows.

Recommended defaults (proposal)

  • Parameter name: delim (short, common in code); delimiter is acceptable if reviewers prefer explicitness.
  • Return policy: default → multivalue array; delim present → string. This preserves downstream processing flexibility.

Examples

  1. Basic Input: | user | action | id | | joe | login | 1 | | joe | login | 2 | | joe | logout | 3 |

Query: ... | mvcombine id

Output: | user | action | id | | joe | login | [1,2] | | joe | logout | [3] |

  1. With delim Query: ... | mvcombine id delim=";"

Output: | user | action | id | | joe | login | "1;2" | | joe | logout | "3" |

  1. Non-consecutive Input: | user | action | id | | joe | login | 1 | | sam | login | 5 | | joe | login | 2 |

Query: ... | mvcombine id

Output: | user | action | id | | joe | login | [1] | | sam | login | [5] | | joe | login | [2] |


Approach — implementation plan, short-term and long-term

Short-term (implement in PPL runtime)

  • Implement mvcombine as a streaming pipeline operator:
    • Maintain a "current run" buffer keyed by the non-target fields.
    • For each incoming row:
      • If non-target fields match current-run key → append target value.
      • Else → flush current run (emit one row) and start a new run.
    • Flushing converts the run buffer to array or joined string per delim.
  • Memory usage: O(1) per run plus buffer proportional to run size. Add safeguards (configurable caps) for very large runs.

Long-term / optimizations

  • Explore safe pushdown to OpenSearch when grouping keys and ordering align (see Pushdown below).
  • If feasible, add an optimized path that uses server-side aggregations (top_hits) when grouping keys equal adjacency keys and ordering is expressible in DSL.
  • Add config knobs: run-size cap, behavior on overflow (error/warn/truncate).

Limitations to document

  • Deterministic ordering requirement (users must sort prior to mvcombine to get non-current adjacency).
  • Single-field per invocation (recommended). Multiple-field behavior can be approximated by sequential mvcombine calls.
  • Large runs may cause memory pressure — add config and tests.

Implementation discussion

Is this a subset or composition of existing commands?

  • Not directly. GROUP BY/array_agg aggregate across all matching keys regardless of adjacency, so they cannot reproduce adjacency-only semantics without complex windowing or boundary detection.
  • Therefore a native pipeline operator is recommended rather than composing existing commands.

Equivalent SQL statement(s)

  • No simple, single SQL statement matches adjacency-only behavior. Emulation requires:
    • deterministic ordering,
    • row_number/window functions,
    • detecting group boundaries via lag/lead,
    • computing group ids,
    • aggregating per group id.
  • This emulation is verbose and less efficient, especially in a streaming/distributed PPL runtime.

Pushdown to OpenSearch DSL — when possible and when not

  • Safe pushdown conditions:
    • The adjacency grouping key set (all non-target fields) maps exactly to an OpenSearch grouping key (i.e., natural grouping, no dependency on adjacency boundaries that differ from grouping),
    • The desired ordering for combined values can be expressed to OpenSearch (top_hits can return ordered lists).
  • In general pushdown is NOT safe because OpenSearch aggregations aggregate all docs for a key (not adjacency); pushdown is only safe when adjacency equals grouping by key (for example when a precomputed group id exists).
  • When pushdown is safe, server-side aggregation (top_hits or scripted collection) can produce ordered value lists; otherwise run mvcombine in PPL.

Open questions (for maintainers/reviewers)

  1. Parameter name: delim vs delimiter? (I recommend delim.)
  2. Return policy: default array, delim → string (recommended). Agree?
  3. Null/empty handling policy: preserve vs skip?
  4. Memory limits: add ppl.mvcombine.max_values or rely on general PPL limits?
  5. Multiple-field support in single invocation — allow or keep single-field-only? (Recommend single-field-only for initial implementation.)

Acceptance criteria

  • Operator implemented with specified syntax/semantics.
  • Documentation updated with examples and ordering requirement.
  • Tests cover basic behavior, delim, null handling, non-consecutive, and large-run behavior.
  • Configurable safeguards for run size and memory.

Reference

  • mvcombine command reference: https://help.splunk.com/en/splunk-cloud-platform/search/search-reference/10.1.2507/search-commands/mvcombine

srikanthpadakanti avatar Nov 10 '25 03:11 srikanthpadakanti

@srikanthpadakanti Thanks for opening this issue! Could you update it to follow our RFC template and add the missing sections? To align on scope and requirements, please include:

  1. Proposal — Command syntax, options, and precise semantics - for example, your draft uses delimeter, shall we use delim? Is return value array or string?
  2. Approach — Any implementation limitations in PPL; a short-term/long-term plan if applicable.
  3. Implementation discussion
    • Is this command a subset or composition of existing commands?
    • What are the equivalent SQL statement(s)?
    • Can this be pushed down to OpenSearch DSL, and under what conditions?

dai-chen avatar Nov 12 '25 17:11 dai-chen

@srikanthpadakanti Thanks for opening this issue! Could you update it to follow our RFC template and add the missing sections? To align on scope and requirements, please include:

  1. Proposal — Command syntax, options, and precise semantics - for example, your draft uses delimeter, shall we use delim? Is return value array or string?

  2. Approach — Any implementation limitations in PPL; a short-term/long-term plan if applicable.

  3. Implementation discussion

    • Is this command a subset or composition of existing commands?
    • What are the equivalent SQL statement(s)?
    • Can this be pushed down to OpenSearch DSL, and under what conditions?

Hello @dai-chen Thanks — I updated the issue with an RFC-style Proposal / Approach / Implementation section and examples. Recommended defaults: delim, default → array (if delim present → string), preserve nulls, single-field per invocation. Please confirm: delim vs delimiter, null handling policy, and memory-cap behavior.

srikanthpadakanti avatar Nov 14 '25 04:11 srikanthpadakanti

Only consecutive (adjacent) rows are combined ...

Could you confirm this? I don't see it in SPL's documentation. I did quick test and it seems not so.

| makeresults format=json data="[
  {\"ip\":\"A\", \"packets\":10},
  {\"ip\":\"B\", \"packets\":60},
  {\"ip\":\"A\", \"packets\":20},
  {\"ip\":\"C\", \"packets\":30}
]" | mvcombine packets

A - [10, 20]
B - 60
C 30

dai-chen avatar Nov 18 '25 23:11 dai-chen