Proper support for Array/List type

Open walterddr opened this issue 2 years ago • 2 comments

Pinot currently has a multi-value (MV) column type, which is essentially an ordered bag-of-elements collection.

However, many requests have come in asking for proper array support (see https://github.com/apache/pinot/issues/6083).

This proposal is for a proper array/list type as a separate DataType.

  • [x] conversion between MV and Array/List type should be supported (CAST)
    • NOTE: as @Jackie-Jiang mentioned, we can directly operate on MV columns since they are ordered arrays
  • [ ] array operations should be done on array type, including
    • [ ] predicates (match_all, match_none, match_any or contains)
    • [ ] whole array operations producing a single value (sum/min/max/cardinality)
    • [ ] element-wise operations (such as filter, transform)
    • [ ] indexed access (arr[1] or element_at)
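
To make the checklist above concrete, here is a minimal sketch of the intended semantics on a plain `int[]` standing in for one MV/array value. The class and method names are hypothetical and are not Pinot APIs. Treating `cardinality` as the element count and `elementAt` as 1-based are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Hypothetical helper, NOT part of Pinot: illustrates the semantics of the
// proposed array operations on a single int[] value.
final class ArrayOpsSketch {
  private ArrayOpsSketch() {}

  // Predicates over the elements of one array value.
  static boolean matchAll(int[] arr, IntPredicate p) { return Arrays.stream(arr).allMatch(p); }
  static boolean matchAny(int[] arr, IntPredicate p) { return Arrays.stream(arr).anyMatch(p); }
  static boolean matchNone(int[] arr, IntPredicate p) { return Arrays.stream(arr).noneMatch(p); }
  static boolean contains(int[] arr, int value) { return matchAny(arr, x -> x == value); }

  // Whole-array operations that reduce the array to a single value.
  static long sum(int[] arr) { return Arrays.stream(arr).asLongStream().sum(); }
  // min/max throw NoSuchElementException on an empty array.
  static int min(int[] arr) { return Arrays.stream(arr).min().getAsInt(); }
  static int max(int[] arr) { return Arrays.stream(arr).max().getAsInt(); }
  // Assumption: cardinality counts elements, duplicates included.
  static int cardinality(int[] arr) { return arr.length; }

  // Indexed access. Assumption: the index is 1-based.
  static int elementAt(int[] arr, int oneBasedIndex) {
    if (oneBasedIndex < 1 || oneBasedIndex > arr.length) {
      throw new IndexOutOfBoundsException("index " + oneBasedIndex + " out of range");
    }
    return arr[oneBasedIndex - 1];
  }
}
```

Element-wise operations (filter, transform) are left out of this sketch because they need a function argument, which raises the design question discussed further down the thread.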

walterddr • Aug 12 '22 16:08

Currently, an MV column is similar to a list:

  • The order of elements is preserved
  • It may contain multiple occurrences of the same element

We may reuse the MV column index to store lists of basic types (no lists of lists).

The main challenge here is supporting the semantics of list operations.

Jackie-Jiang • Aug 12 '22 19:08

Updated the description, thanks @Jackie-Jiang. I think we can actually support these on top of MV columns. See the previous issue: https://github.com/apache/pinot/issues/6083.

I think one unsolved problem is how to support element-wise operations such as filter(mvCol, <filterFunction>) or transform(mvCol, <transformFunction>). Do we assume the input to these functions is the base type inside the mvCol? If so, how do we define the functions?
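
One way to picture the question: if the function argument operates on the base element type, filter and transform reduce to ordinary higher-order functions over the elements. The sketch below is only an illustration in plain Java. `ElementWiseSketch` and its methods are hypothetical names, and it says nothing about how such functions would be expressed in Pinot's query language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch, NOT a Pinot API: element-wise operations where the
// function argument is applied to the base type T of each element.
final class ElementWiseSketch {
  private ElementWiseSketch() {}

  // Keeps the elements that satisfy the predicate; order and duplicates are preserved.
  static <T> List<T> filter(List<T> mvValue, Predicate<? super T> keep) {
    List<T> out = new ArrayList<>();
    for (T element : mvValue) {
      if (keep.test(element)) {
        out.add(element);
      }
    }
    return out;
  }

  // Applies fn to every element; the result has the same length and order as the input.
  static <T, R> List<R> transform(List<T> mvValue, Function<? super T, ? extends R> fn) {
    List<R> out = new ArrayList<>(mvValue.size());
    for (T element : mvValue) {
      out.add(fn.apply(element));
    }
    return out;
  }
}
```

For example, `ElementWiseSketch.filter(Arrays.asList(1, 2, 2, 3), x -> x >= 2)` keeps both occurrences of 2, matching the ordered, duplicates-allowed behavior of MV columns.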

walterddr • Aug 29 '22 17:08

@walterddr @Jackie-Jiang Is there any way we can support a Split function in Pinot that splits a text based on a delimiter and returns an array/MV?

nizarhejazi • Oct 21 '22 21:10

@nizarhejazi I think it is already supported as a scalar function split. See StringFunctions.split()

Jackie-Jiang • Oct 22 '22 07:10

Thanks @Jackie-Jiang. Need to add Split to the String transformations listed here.

nizarhejazi • Oct 23 '22 19:10

@Jackie-Jiang @walterddr I get back the reference of the string array created by Split:

Query: split('a,b,c', ',')
Result: [Ljava.lang.String;@5edb4950

As a result, Split does not work w/ array functions:

Query: arraylength(split('a,b,c', ','))
Result: IllegalArgumentException: The argument of ARRAYLENGTH transform function must be a multi-valued column or a transform function

Query: arrayIndexOfString(split('a,b,c', ','), 'a')
Result: -1

Query: arrayElementAtString(split('a,b,c', ','), 1)
Result: [Ljava.lang.String;@2f9a7d52
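
For context, a result of the form `[Ljava.lang.String;@5edb4950` is what Java's default `Object.toString()` produces for a `String[]`: the JVM type name followed by `@` and a hex hash code. That suggests the array is being stringified without formatting its contents, although where this happens inside Pinot is not shown in this thread. A minimal standalone sketch (the class name is hypothetical):

```java
import java.util.Arrays;

// Standalone illustration, not Pinot code: default array toString() versus
// a content-based rendering of the same array.
final class ArrayToStringSketch {
  private ArrayToStringSketch() {}

  // Default Object.toString(): type name + "@" + hex hash code.
  static String identityString(String[] values) {
    return values.toString();
  }

  // Content-based rendering of the elements.
  static String contentString(String[] values) {
    return Arrays.toString(values);
  }

  public static void main(String[] args) {
    String[] parts = "a,b,c".split(",");
    System.out.println(identityString(parts)); // starts with "[Ljava.lang.String;@", hash varies per run
    System.out.println(contentString(parts));  // [a, b, c]
  }
}
```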

nizarhejazi • Oct 23 '22 20:10

@nizarhejazi This is a bug in literal-only queries (which do not need to be sent to the server). When I query a column, it can return an array. However, it seems that when I use ', ' as the delimiter, it tries to split on both ',' and ' '. Will submit a fix.
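
The difference between the two delimiter interpretations can be sketched in plain Java. This is an illustration of the behavior described above, not the actual Pinot implementation, and both method names are hypothetical. Apache Commons Lang exposes the same distinction as `StringUtils.split` (separator characters treated as a set) versus `StringUtils.splitByWholeSeparator`; whether Pinot relies on those calls is not confirmed by this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT Pinot code: two ways to interpret a multi-character delimiter.
final class SplitSketch {
  private SplitSketch() {}

  // Treats every character of separatorChars as a delimiter and drops empty tokens.
  static List<String> splitOnAnyChar(String input, String separatorChars) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (char c : input.toCharArray()) {
      if (separatorChars.indexOf(c) >= 0) {
        if (current.length() > 0) {
          tokens.add(current.toString());
          current.setLength(0);
        }
      } else {
        current.append(c);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }

  // Treats the separator as one literal string (quoted so regex metacharacters are not special).
  static List<String> splitOnWholeSeparator(String input, String separator) {
    return Arrays.asList(input.split(Pattern.quote(separator)));
  }

  public static void main(String[] args) {
    System.out.println(splitOnAnyChar("a, b,c", ", "));        // [a, b, c]
    System.out.println(splitOnWholeSeparator("a, b,c", ", ")); // [a, b,c]
  }
}
```

With the input 'a, b,c' and the delimiter ', ', the per-character interpretation yields three tokens, while the whole-separator interpretation yields two.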

Jackie-Jiang • Oct 25 '22 00:10

There are two issues:

  1. split should match on the whole separator; fixed in #9650
  2. The array is not returned properly, which requires array literal support. Created #9659 to track it

Jackie-Jiang • Oct 25 '22 21:10