Proper support for Array/List type
Pinot currently has a multi-value (MV) column type, which is essentially an ordered bag-of-elements data structure.
However, many requests have come in asking for proper array support (see https://github.com/apache/pinot/issues/6083).
This proposal is for a proper array/list type as a separate DataType.
- [x] conversion between MV and Array/List type should be supported (`CAST`)
  - NOTE: as @Jackie-Jiang mentioned, we can directly operate on MV columns since they are ordered arrays
- [ ] array operations should be done on array type, including
  - [ ] predicates (`match_all`, `match_none`, `match_any` or `contains`)
  - [ ] whole array operations producing a single value (`sum`/`min`/`max`/`cardinality`)
  - [ ] element-wise operations (such as `filter`, `transform`)
  - [ ] indexed access (`arr[1]` or `element_at`)
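To make the checklist concrete, here is a minimal Java sketch of the semantics these operations might have on an ordered list of values. `ArrayOpsSketch` and all of its method names are hypothetical and not part of Pinot. Two choices are assumptions rather than anything decided in this issue: `cardinality` is taken to mean the number of elements, and `elementAt` is taken to be 1-based.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, NOT part of Pinot: illustrates intended semantics only.
final class ArrayOpsSketch {
  private ArrayOpsSketch() {}

  // Predicates over the whole array.
  static <T> boolean matchAll(List<T> arr, Predicate<T> p) {
    return arr.stream().allMatch(p);
  }

  static <T> boolean matchAny(List<T> arr, Predicate<T> p) {
    return arr.stream().anyMatch(p);
  }

  static <T> boolean matchNone(List<T> arr, Predicate<T> p) {
    return arr.stream().noneMatch(p);
  }

  static <T> boolean contains(List<T> arr, T value) {
    return arr.contains(value);
  }

  // Whole-array operations producing a single value.
  static long sum(List<Integer> arr) {
    return arr.stream().mapToLong(Integer::longValue).sum();
  }

  // Assumption: cardinality counts elements, duplicates included.
  static int cardinality(List<?> arr) {
    return arr.size();
  }

  // Assumption: 1-based index; returns null when the index is out of range.
  static <T> T elementAt(List<T> arr, int index) {
    return (index >= 1 && index <= arr.size()) ? arr.get(index - 1) : null;
  }

  public static void main(String[] args) {
    List<Integer> arr = List.of(3, 1, 3);
    System.out.println(matchAny(arr, x -> x == 1));
    System.out.println(sum(arr));
    System.out.println(elementAt(arr, 1));
  }
}
```

Order and duplicates are preserved throughout, which matches how MV columns are described in this thread.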
Currently, an MV column is similar to a list:
- The order of elements is preserved
- It may have multiple occurrences of the same element
We may reuse the MV column index to store lists of basic types (no list of lists).
The main challenge here is to support the semantics of list operations.
Updated the description, thanks @Jackie-Jiang. I think we can actually support these on top of MV columns. See the previous issue: https://github.com/apache/pinot/issues/6083.
I think one unsolved problem is how to support element-wise operations such as `filter(mvCol, <filterFunction>)` or `transform(mvCol, <transformFunction>)`.
Do we assume the input to these functions is the base type inside the mvCol? If so, how do we define the functions?
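One possible reading of that question, sketched in plain Java rather than Pinot's function framework: the function argument operates on the base element type of the MV column. `ElementWiseSketch` and its method names are hypothetical, and how such a function would be expressed in a Pinot query is left open here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch, NOT Pinot code: the lambda sees one base-type element at a time.
final class ElementWiseSketch {
  private ElementWiseSketch() {}

  // filter(mvCol, <filterFunction>): keep elements for which the predicate holds, in order.
  static <T> List<T> filter(List<T> mvValues, Predicate<T> keep) {
    List<T> out = new ArrayList<>();
    for (T value : mvValues) {
      if (keep.test(value)) {
        out.add(value);
      }
    }
    return out;
  }

  // transform(mvCol, <transformFunction>): map each element; the result type may differ.
  static <T, R> List<R> transform(List<T> mvValues, Function<T, R> fn) {
    List<R> out = new ArrayList<>(mvValues.size());
    for (T value : mvValues) {
      out.add(fn.apply(value));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(filter(List.of(1, 2, 3, 4), x -> x % 2 == 0));
    System.out.println(transform(List.of("a", "bb"), String::length));
  }
}
```

Under this reading, `transform` can change the element type (for example, strings to their lengths), so the output array type would have to be derived from the function's return type.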
@walterddr @Jackie-Jiang Is there any way we can support a Split function in Pinot that splits a text based on a delimiter and returns an array/MV?
@nizarhejazi I think it is already supported as a scalar function `split`. See `StringFunctions.split()`
Thanks @Jackie-Jiang. Need to add `Split` to the String transformations listed here.
@Jackie-Jiang @walterddr I get back the reference of the string array created by `Split`:
Query: split('a,b,c', ',')
Result: [Ljava.lang.String;@5edb4950
As a result, `Split` does not work with array functions:
Query: arraylength(split('a,b,c', ','))
Result: IllegalArgumentException: The argument of ARRAYLENGTH transform function must be a multi-valued column or a transform function
Query: arrayIndexOfString(split('a,b,c', ','), 'a')
Result: -1
Query: arrayElementAtString(split('a,b,c', ','), 1)
Result: [Ljava.lang.String;@2f9a7d52
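For context, results of the form `[Ljava.lang.String;@...` are what Java's default `Object.toString()` produces for a `String[]`: the array's type and identity hash, not its contents. A small stand-alone illustration follows. The class name is hypothetical, and this shows only the JDK behavior, not where in Pinot the rendering happens.

```java
import java.util.Arrays;

// Stand-alone JDK illustration; ArrayToStringDemo is a hypothetical name.
final class ArrayToStringDemo {
  private ArrayToStringDemo() {}

  // Default rendering: type descriptor plus identity hash, e.g. "[Ljava.lang.String;@<hex>".
  static String defaultRendering(String[] arr) {
    return arr.toString();
  }

  // Content rendering: the elements themselves.
  static String contentRendering(String[] arr) {
    return Arrays.toString(arr);
  }

  public static void main(String[] args) {
    String[] parts = "a,b,c".split(",");
    System.out.println(defaultRendering(parts));
    System.out.println(contentRendering(parts));
  }
}
```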
@nizarhejazi This is a bug in literal-only queries (where there is no need to send the query to the server). When I query a column, it can return an array. But it seems that when I put `', '` as the delimiter, it tries to split on both `','` and `' '`. Will submit a fix.
There are 2 issues:
- split should work on whole separator, fixed in #9650
- array not properly returned, which requires array literal support. Created #9659 to track it
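The first issue can be illustrated with a small plain-Java sketch that contrasts the two behaviors: treating each character of the separator as its own delimiter versus splitting on the separator as a whole. Both helper names are hypothetical. This is not Pinot's implementation and not the change made in #9650, just a model of the difference described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo, NOT Pinot code.
final class SplitSeparatorDemo {
  private SplitSeparatorDemo() {}

  // Every character in separatorChars acts as a delimiter; empty tokens are dropped.
  static List<String> splitOnAnyChar(String input, String separatorChars) {
    List<String> out = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (char c : input.toCharArray()) {
      if (separatorChars.indexOf(c) >= 0) {
        if (current.length() > 0) {
          out.add(current.toString());
          current.setLength(0);
        }
      } else {
        current.append(c);
      }
    }
    if (current.length() > 0) {
      out.add(current.toString());
    }
    return out;
  }

  // The separator is matched as one literal string.
  static List<String> splitOnWholeSeparator(String input, String separator) {
    return Arrays.asList(input.split(Pattern.quote(separator)));
  }

  public static void main(String[] args) {
    String input = "a b, c d";
    System.out.println(splitOnAnyChar(input, ", "));
    System.out.println(splitOnWholeSeparator(input, ", "));
  }
}
```

With the input `a b, c d` and separator `', '`, the per-character variant breaks on the spaces inside each token as well, while the whole-separator variant keeps `a b` and `c d` intact.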