[FEATURE] Add mvcombine command in PPL
Proposal — Command syntax, options, and precise semantics
Command
- mvcombine <field> [delim="<string>"]
Arguments
- field (required): the field whose values will be combined.
- delim (optional): join delimiter. If provided the combined values are returned as a single string joined by delim; if omitted the combined values are returned as a multivalue array.
Precise semantics
- mvcombine operates on an ordered stream of results.
- For each contiguous run of rows where every field except
<field>is equal, emit a single row:- non-target fields are preserved (they are identical across the run),
<field>becomes:- an array of run values (preserving run order) if delim not provided, or
- a string of run values joined by delim if delim provided.
- Only consecutive (adjacent) rows are combined. To combine non-adjacent values, results must be sorted/grouped first so that desired values become adjacent.
- Null/empty target values: preserved in arrays; when delim is used they appear as empty tokens (implementation detail; see open questions).
- The operator does not re-key/aggregate across non-adjacent rows.
Recommended defaults (proposal)
- Parameter name:
delim(short, common in code);delimiteris acceptable if reviewers prefer explicitness. - Return policy: default → multivalue array;
delimpresent → string. This preserves downstream processing flexibility.
Examples
- Basic Input: | user | action | id | | joe | login | 1 | | joe | login | 2 | | joe | logout | 3 |
Query: ... | mvcombine id
Output: | user | action | id | | joe | login | [1,2] | | joe | logout | [3] |
- With delim Query: ... | mvcombine id delim=";"
Output: | user | action | id | | joe | login | "1;2" | | joe | logout | "3" |
- Non-consecutive Input: | user | action | id | | joe | login | 1 | | sam | login | 5 | | joe | login | 2 |
Query: ... | mvcombine id
Output: | user | action | id | | joe | login | [1] | | sam | login | [5] | | joe | login | [2] |
Approach — implementation plan, short-term and long-term
Short-term (implement in PPL runtime)
- Implement mvcombine as a streaming pipeline operator:
- Maintain a "current run" buffer keyed by the non-target fields.
- For each incoming row:
- If non-target fields match current-run key → append target value.
- Else → flush current run (emit one row) and start a new run.
- Flushing converts the run buffer to array or joined string per
delim.
- Memory usage: O(1) per run plus buffer proportional to run size. Add safeguards (configurable caps) for very large runs.
Long-term / optimizations
- Explore safe pushdown to OpenSearch when grouping keys and ordering align (see Pushdown below).
- If feasible, add an optimized path that uses server-side aggregations (top_hits) when grouping keys equal adjacency keys and ordering is expressible in DSL.
- Add config knobs: run-size cap, behavior on overflow (error/warn/truncate).
Limitations to document
- Deterministic ordering requirement (users must sort prior to mvcombine to get non-current adjacency).
- Single-field per invocation (recommended). Multiple-field behavior can be approximated by sequential mvcombine calls.
- Large runs may cause memory pressure — add config and tests.
Implementation discussion
Is this a subset or composition of existing commands?
- Not directly. GROUP BY/array_agg aggregate across all matching keys regardless of adjacency, so they cannot reproduce adjacency-only semantics without complex windowing or boundary detection.
- Therefore a native pipeline operator is recommended rather than composing existing commands.
Equivalent SQL statement(s)
- No simple, single SQL statement matches adjacency-only behavior. Emulation requires:
- deterministic ordering,
- row_number/window functions,
- detecting group boundaries via lag/lead,
- computing group ids,
- aggregating per group id.
- This emulation is verbose and less efficient, especially in a streaming/distributed PPL runtime.
Pushdown to OpenSearch DSL — when possible and when not
- Safe pushdown conditions:
- The adjacency grouping key set (all non-target fields) maps exactly to an OpenSearch grouping key (i.e., natural grouping, no dependency on adjacency boundaries that differ from grouping),
- The desired ordering for combined values can be expressed to OpenSearch (top_hits can return ordered lists).
- In general pushdown is NOT safe because OpenSearch aggregations aggregate all docs for a key (not adjacency); pushdown is only safe when adjacency equals grouping by key (for example when a precomputed group id exists).
- When pushdown is safe, server-side aggregation (top_hits or scripted collection) can produce ordered value lists; otherwise run mvcombine in PPL.
Open questions (for maintainers/reviewers)
- Parameter name:
delimvsdelimiter? (I recommenddelim.) - Return policy: default array, delim → string (recommended). Agree?
- Null/empty handling policy: preserve vs skip?
- Memory limits: add
ppl.mvcombine.max_valuesor rely on general PPL limits? - Multiple-field support in single invocation — allow or keep single-field-only? (Recommend single-field-only for initial implementation.)
Acceptance criteria
- Operator implemented with specified syntax/semantics.
- Documentation updated with examples and ordering requirement.
- Tests cover basic behavior, delim, null handling, non-consecutive, and large-run behavior.
- Configurable safeguards for run size and memory.
Reference
- mvcombine command reference: https://help.splunk.com/en/splunk-cloud-platform/search/search-reference/10.1.2507/search-commands/mvcombine
@srikanthpadakanti Thanks for opening this issue! Could you update it to follow our RFC template and add the missing sections? To align on scope and requirements, please include:
- Proposal — Command syntax, options, and precise semantics - for example, your draft uses
delimeter, shall we usedelim? Is return value array or string? - Approach — Any implementation limitations in PPL; a short-term/long-term plan if applicable.
- Implementation discussion
- Is this command a subset or composition of existing commands?
- What are the equivalent SQL statement(s)?
- Can this be pushed down to OpenSearch DSL, and under what conditions?
@srikanthpadakanti Thanks for opening this issue! Could you update it to follow our RFC template and add the missing sections? To align on scope and requirements, please include:
Proposal — Command syntax, options, and precise semantics - for example, your draft uses
delimeter, shall we usedelim? Is return value array or string?Approach — Any implementation limitations in PPL; a short-term/long-term plan if applicable.
Implementation discussion
- Is this command a subset or composition of existing commands?
- What are the equivalent SQL statement(s)?
- Can this be pushed down to OpenSearch DSL, and under what conditions?
Hello @dai-chen Thanks — I updated the issue with an RFC-style Proposal / Approach / Implementation section and examples. Recommended defaults: delim, default → array (if delim present → string), preserve nulls, single-field per invocation. Please confirm: delim vs delimiter, null handling policy, and memory-cap behavior.
Only consecutive (adjacent) rows are combined ...
Could you confirm this? I don't see it in SPL's documentation. I did quick test and it seems not so.
| makeresults format=json data="[
{\"ip\":\"A\", \"packets\":10},
{\"ip\":\"B\", \"packets\":60},
{\"ip\":\"A\", \"packets\":20},
{\"ip\":\"C\", \"packets\":30}
]" | mvcombine packets
A - [10, 20]
B - 60
C 30