sql [FEATURE] Add mvcombine command in PPL

Proposal — Command syntax, options, and precise semantics

Command - mvcombine <field> [delim="<string>"]

Arguments

field (required): the field whose values will be combined.
delim (optional): join delimiter. If provided the combined values are returned as a single string joined by delim; if omitted the combined values are returned as a multivalue array.

Precise semantics

mvcombine operates on an ordered stream of results.
For each contiguous run of rows where every field except <field> is equal, emit a single row:
- non-target fields are preserved (they are identical across the run),
- <field> becomes:
  - an array of run values (preserving run order) if delim not provided, or
  - a string of run values joined by delim if delim provided.
Only consecutive (adjacent) rows are combined. To combine non-adjacent values, results must be sorted/grouped first so that desired values become adjacent.
Null/empty target values: preserved in arrays; when delim is used they appear as empty tokens (implementation detail; see open questions).
The operator does not re-key/aggregate across non-adjacent rows.

Recommended defaults (proposal)

Parameter name: delim (short, common in code); delimiter is acceptable if reviewers prefer explicitness.
Return policy: default → multivalue array; delim present → string. This preserves downstream processing flexibility.

Examples

Basic Input: | user | action | id | | joe | login | 1 | | joe | login | 2 | | joe | logout | 3 |

Query: ... | mvcombine id

Output: | user | action | id | | joe | login | [1,2] | | joe | logout | [3] |

With delim Query: ... | mvcombine id delim=";"

Output: | user | action | id | | joe | login | "1;2" | | joe | logout | "3" |

Non-consecutive Input: | user | action | id | | joe | login | 1 | | sam | login | 5 | | joe | login | 2 |

Query: ... | mvcombine id

Output: | user | action | id | | joe | login | [1] | | sam | login | [5] | | joe | login | [2] |

Approach — implementation plan, short-term and long-term

Short-term (implement in PPL runtime)

Implement mvcombine as a streaming pipeline operator:
- Maintain a "current run" buffer keyed by the non-target fields.
- For each incoming row:
  - If non-target fields match current-run key → append target value.
  - Else → flush current run (emit one row) and start a new run.
- Flushing converts the run buffer to array or joined string per delim.
Memory usage: O(1) per run plus buffer proportional to run size. Add safeguards (configurable caps) for very large runs.

Long-term / optimizations

Explore safe pushdown to OpenSearch when grouping keys and ordering align (see Pushdown below).
If feasible, add an optimized path that uses server-side aggregations (top_hits) when grouping keys equal adjacency keys and ordering is expressible in DSL.
Add config knobs: run-size cap, behavior on overflow (error/warn/truncate).

Limitations to document

Deterministic ordering requirement (users must sort prior to mvcombine to get non-current adjacency).
Single-field per invocation (recommended). Multiple-field behavior can be approximated by sequential mvcombine calls.
Large runs may cause memory pressure — add config and tests.

Implementation discussion

Is this a subset or composition of existing commands?

Not directly. GROUP BY/array_agg aggregate across all matching keys regardless of adjacency, so they cannot reproduce adjacency-only semantics without complex windowing or boundary detection.
Therefore a native pipeline operator is recommended rather than composing existing commands.

Equivalent SQL statement(s)

No simple, single SQL statement matches adjacency-only behavior. Emulation requires:
- deterministic ordering,
- row_number/window functions,
- detecting group boundaries via lag/lead,
- computing group ids,
- aggregating per group id.
This emulation is verbose and less efficient, especially in a streaming/distributed PPL runtime.

Pushdown to OpenSearch DSL — when possible and when not

Safe pushdown conditions:
- The adjacency grouping key set (all non-target fields) maps exactly to an OpenSearch grouping key (i.e., natural grouping, no dependency on adjacency boundaries that differ from grouping),
- The desired ordering for combined values can be expressed to OpenSearch (top_hits can return ordered lists).
In general pushdown is NOT safe because OpenSearch aggregations aggregate all docs for a key (not adjacency); pushdown is only safe when adjacency equals grouping by key (for example when a precomputed group id exists).
When pushdown is safe, server-side aggregation (top_hits or scripted collection) can produce ordered value lists; otherwise run mvcombine in PPL.

Open questions (for maintainers/reviewers)

Parameter name: delim vs delimiter? (I recommend delim.)
Return policy: default array, delim → string (recommended). Agree?
Null/empty handling policy: preserve vs skip?
Memory limits: add ppl.mvcombine.max_values or rely on general PPL limits?
Multiple-field support in single invocation — allow or keep single-field-only? (Recommend single-field-only for initial implementation.)

Acceptance criteria

Operator implemented with specified syntax/semantics.
Documentation updated with examples and ordering requirement.
Tests cover basic behavior, delim, null handling, non-consecutive, and large-run behavior.
Configurable safeguards for run size and memory.

Reference

mvcombine command reference: https://help.splunk.com/en/splunk-cloud-platform/search/search-reference/10.1.2507/search-commands/mvcombine

Nov 10 '25 03:11 srikanthpadakanti

@srikanthpadakanti Thanks for opening this issue! Could you update it to follow our RFC template and add the missing sections? To align on scope and requirements, please include:

Proposal — Command syntax, options, and precise semantics - for example, your draft uses delimeter, shall we use delim? Is return value array or string?
Approach — Any implementation limitations in PPL; a short-term/long-term plan if applicable.
Implementation discussion
- Is this command a subset or composition of existing commands?
- What are the equivalent SQL statement(s)?
- Can this be pushed down to OpenSearch DSL, and under what conditions?

Nov 12 '25 17:11 dai-chen

@srikanthpadakanti Thanks for opening this issue! Could you update it to follow our RFC template and add the missing sections? To align on scope and requirements, please include:

Proposal — Command syntax, options, and precise semantics - for example, your draft uses delimeter, shall we use delim? Is return value array or string?

Approach — Any implementation limitations in PPL; a short-term/long-term plan if applicable.

Implementation discussion

Is this command a subset or composition of existing commands?

What are the equivalent SQL statement(s)?

Can this be pushed down to OpenSearch DSL, and under what conditions?

Hello @dai-chen Thanks — I updated the issue with an RFC-style Proposal / Approach / Implementation section and examples. Recommended defaults: delim, default → array (if delim present → string), preserve nulls, single-field per invocation. Please confirm: delim vs delimiter, null handling policy, and memory-cap behavior.

Nov 14 '25 04:11 srikanthpadakanti

Only consecutive (adjacent) rows are combined ...

Could you confirm this? I don't see it in SPL's documentation. I did quick test and it seems not so.

| makeresults format=json data="[
  {\"ip\":\"A\", \"packets\":10},
  {\"ip\":\"B\", \"packets\":60},
  {\"ip\":\"A\", \"packets\":20},
  {\"ip\":\"C\", \"packets\":30}
]" | mvcombine packets

A - [10, 20]
B - 60
C 30

Nov 18 '25 23:11 dai-chen