pinot Support multi-valued dimensions in the StarTree index

Currently, the star-tree index can not be used with multi-valued dimensional columns. The problem with supporting multi-valued columns in the split order of the star-tree index is that the behaviour for splitting on multiple values would be require handling all the permutations of the values present in the list, which is not practical.

This ticket is a suggestion to implement support for multi-valued dimensional columns in the split order for the star-tree index, with a simply constraint of only allowing one specific value to be used when splitting for the index.

Suppose we have a table which has the following dimensions:

entityId: String
groupIds: List of strings
status: String

And a query looking like this SELECT COUNT(*) FROM table WHERE entityId = 'ID' AND groupIds CONTAINS 'group_1' AND status = 'ONLINE'

If there are many rows here, this could be a very intense query to resolve. If the star tree index could pre-aggregate on each value in groupId, we could make this lookup much more performant. This is a problem that we are facing today. A group can contain multiple entities, and an entity can be in multiple groups. This many to many relationship means there is a dynamic constraint on our queries.

Some queries will ignore group (that would be the star node), while some need to access the aggregation on a group basis.

The problem with implementing support for this, is that it wouldn't work for a query defined as this: SELECT COUNT(*) FROM table WHERE entityId = 'ID' AND groupIds IN ('group_1', 'group_2') AND status = 'ONLINE'

This behaviour would be undefined, as there is no way to create one aggregation per groupId if we also need to handle all the permutations of possible values in the multi-valued column. This is why I propose that we support multi-valued columns for the star-tree index with the constraint that each value would have an aggregation associated with it, but not allowing it to span a selection of multiple values.

For example..

With the following star-tree index:

"starTreeIndexConfigs": [
        {
          "dimensionsSplitOrder": [
            "entityId",
            "groupIds",
            "status"
          ],
          "skipStarNodeCreationForDimensions": [],
          "functionColumnPairs": [
            "COUNT__*"
          ],
          "maxLeafRecords": 10000
        }
      ],

This would be possible

entity_1 -> count(*) = 500
   entity_1 -> group_1 -> count(*) = (200)
       entity_1 -> group_1 -> ONLINE -> count(*) = 100
       entity_1 -> group_1 -> OFFLINE -> count(*) = 100
   entity_1 -> group_2 -> count(*) = (300)
       entity_1 -> group_2 -> ONLINE -> count(*) = 150
       entity_1 -> group_2 -> OFFLINE -> count(*) = 150

Jul 21 '22 10:07 lksvenoy-r7

Good suggestion! Here are some extra points we need to consider:

In order to split on MV dimension, we need to flatten the record into multiple records, each with one element of the MV entry. This can lead to much higher storage cost
If there are multiple MV dimensions to be split, we need to store all the element combination for the record
When there is no filter on the MV dimension, we must use the star-node to answer the query. Star-node creation is also different for the MV dimension because we cannot simply aggregate all the records under all non-star-node

Jul 21 '22 19:07 Jackie-Jiang

Thanks for the well thought out reply @Jackie-Jiang. I haven't spent much time thinking about the implications of this feature, and these observations are great.

You're right that this could lead to much higher storage cost, perhaps it would be possible to have some type of limit option for creating the permutations? I'm not sure this makes sense, as the behaviour could be unpredictable. There would be no way of knowing which types of multi-valued column values have been indexed, so it would have to be discussed whether this is an acceptable trade-off or not.
What exactly do you mean? Are you saying storing all permutations, or simply one index split per value? If it is the latter as I had written up in my issue, then that makes sense. However, if it's all possible permutations being indexed, then this would probably not be feasible. We would then suffer exponential storage explosion due to all the possible permutations. This is why I am arguing that an acceptable trade-off would be to allow a lookup on simple one value
I hadn't considered the implications on the star-node. This seems like one of the most challenging problems to solve, and it makes it seem like perhaps the star node is not compatible at that level. We will need to think about this one more..

Jul 22 '22 08:07 lksvenoy-r7

We may put the following limitations so that the star-tree does not explode:

Limit the total nodes in the star-tree, and abort the star-tree creation when the total node is reached (this can apply to current star-tree as well)
Allow up to one MV column in the split list. We want this constraint because if we put multiple MV columns in the split list, we have to compute all possible permutations. But this constraint also means we can only solve the queries the filter on a single MV column
We can use star-tree to solve the query when there is only one node hit for the MV column

Jul 22 '22 18:07 Jackie-Jiang

pinot pinot copied to clipboard

Support multi-valued dimensions in the StarTree index

pinot
pinot copied to clipboard