Proposal: Extending merge rollup capabilities

Open davecromberge opened this issue 4 months ago • 1 comments

What needs to be done?

Extend the merge-rollup framework to create additional transformations:

  • dimensionality reduction/erasure
  • varying aggregate behaviour over time

Dimensionality reduction/erasure

Eliminate a particular dimension column's values to allow more rows to aggregate as duplicates.

For example:

| Dimension | Pre-transformation | Post-transformation |
|-----------|--------------------|---------------------|
| Country   | United States      | United States       |
| Device    | Mobile             | Mobile              |
| Browser   | Safari             | Null / Other        |

The above example shows the Browser dimension erased or set to some default value after some time window has passed.
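To illustrate why erasing a dimension lets more rows aggregate, here is a minimal sketch in plain Python. It does not use any Pinot API; the `erase_dimension` and `rollup` helpers, the `views` metric and the sample rows are all hypothetical and only mirror the table above.

```python
from collections import defaultdict

def erase_dimension(row, dimension, default="Other"):
    """Return a copy of the row with one dimension replaced by a default value."""
    erased = dict(row)
    erased[dimension] = default
    return erased

def rollup(rows, dimensions, metric):
    """Sum `metric` over rows that share identical dimension values."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[metric]
    return dict(totals)

# Hypothetical sample data; "views" is an illustrative metric column.
rows = [
    {"country": "United States", "device": "Mobile", "browser": "Safari", "views": 10},
    {"country": "United States", "device": "Mobile", "browser": "Chrome", "views": 5},
    {"country": "United States", "device": "Desktop", "browser": "Chrome", "views": 7},
]
dims = ["country", "device", "browser"]

before = rollup(rows, dims, "views")
after = rollup([erase_dimension(r, "browser") for r in rows], dims, "views")

print(len(before), "rows before erasure,", len(after), "rows after")
```

The two Mobile rows only differ by browser, so once that column is erased they collapse into a single rolled-up row.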

Varying aggregate behaviour over time

Some aggregate values could change precision over time. The multi-level merge functionality can be used to reduce the resolution or precision of aggregates for older segments. This applies primarily to sketches, but could also be used for other binary aggregate types.

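| Sketch         | Pre-transformation | Post-transformation |
|----------------|--------------------|---------------------|
| Theta sketch 1 | 512 KB             | 256 KB              |
| Theta sketch 2 | 400 KB             | 200 KB              |
| Theta sketch 3 | 512 KB             | 256 KB              |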

The above example shows a 2x size reduction on existing sketches, which could be achieved by reducing the lgK value by 1 (that is, halving the nominal number of entries K) as data ages. Be aware that queries spanning several time ranges would then combine sketches of differing precision, which only works where the sketch implementation supports merging sketches built with different lgK values.
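The trade-off can be estimated with some back-of-the-envelope arithmetic. The sketch below assumes that a compact Theta sketch retains at most K = 2^lgK 64-bit hashes (the small preamble is ignored) and that the relative standard error is roughly 1/sqrt(K); both are approximations, and the function names are made up for this example.

```python
import math

def approx_compact_theta_bytes(lg_k):
    # Assumption: at most K = 2**lg_k retained entries of 8 bytes each,
    # ignoring the small serialization preamble.
    return 8 * (1 << lg_k)

def approx_relative_standard_error(lg_k):
    # Assumption: RSE is roughly 1 / sqrt(K) for a sketch in estimation mode.
    return 1.0 / math.sqrt(1 << lg_k)

for lg_k in (16, 15):
    size_kib = approx_compact_theta_bytes(lg_k) / 1024
    rse_pct = 100 * approx_relative_standard_error(lg_k)
    print(f"lgK={lg_k}: ~{size_kib:.0f} KiB, RSE ~{rse_pct:.2f}%")
```

Under these assumptions, going from lgK = 16 to lgK = 15 halves a full sketch from about 512 KiB to about 256 KiB, while the expected relative error grows by a factor of about sqrt(2).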

Why the feature is needed (e.g. describing the use case).

The primary justification for such a feature is more aggressive space saving for historic data. As the merge rollup task processes older time windows, users could eliminate non-critical dimensions, which would result in more documents rolling up into a single aggregate. Similarly, users could sacrifice aggregate accuracy for historic queries in exchange for a smaller storage footprint, especially when dealing with Theta / Tuple sketches, which can be on the order of megabytes at lgK = 16.

Idea on how this may be implemented

Both extensions would require changes to the configuration for the Minion Merge rollup task. In particular, the most flexible approach would be to have a dynamic bag of properties that could apply to each individual aggregation function, where these could be interpreted before rolling up or merging the data.

Dimensionality reduction/erasure

  • applies to “map” phase of the SegmentProcessorFramework.
  • default reducer will function as normal
  • configuration should include:
    • time bucket periods
    • dimension name
    • whether to substitute the column's default value
  • configuration should be part of merge rollup task / segment refresh config, for example:
    • "dimensionName.eliminate.after": "7d"
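As a rough illustration of how such keys might be interpreted, the sketch below parses entries following the `"dimensionName.eliminate.after"` pattern from a flat config map and decides which dimensions to erase for a segment of a given age. This is plain Python, not Pinot code: the helper names, the supported period units (`d`, `h`, `m`) and the extra `someOtherSetting` key are assumptions made for the example.

```python
import re
from datetime import timedelta

_SUFFIX = ".eliminate.after"
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}  # assumed unit letters

def parse_period(text):
    """Parse a period such as '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([dhm])", text)
    if match is None:
        raise ValueError(f"unsupported period: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})

def elimination_rules(task_config):
    """Map each dimension name to the age after which it should be erased."""
    return {
        key[: -len(_SUFFIX)]: parse_period(value)
        for key, value in task_config.items()
        if key.endswith(_SUFFIX)
    }

def dimensions_to_erase(rules, segment_age):
    """Return the dimensions whose threshold the segment age has reached."""
    return sorted(dim for dim, after in rules.items() if segment_age >= after)

# Hypothetical task config; only the first key follows the proposed pattern.
config = {"browser.eliminate.after": "7d", "someOtherSetting": "value"}
rules = elimination_rules(config)
print(dimensions_to_erase(rules, timedelta(days=10)))
```

Keeping the rule in the key name makes the config easy to write by hand, at the cost of needing validation that the prefix is a real dimension column.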

Varying aggregate behaviour over time

  • applies to “map” phase of the SegmentProcessorFramework.
  • configuration could be uniformly applied in a global manner or part of the specific table task config:
    • hard coded parameters for Theta and Tuple sketch lgK (cumbersome)
    • dynamic bag of properties associated with time bucket (hard to validate)
    • not necessary to extend the function name parameter parser
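The "dynamic bag of properties associated with time bucket" option could look something like the following sketch, where each tier pairs a minimum segment age with free-form properties for the aggregation function. The tier boundaries, the `TIERS` structure and `properties_for_age` are hypothetical; only the `lgK` parameter name comes from the discussion above.

```python
from datetime import timedelta

# Hypothetical tiers: (minimum segment age, properties for the aggregation function).
TIERS = [
    (timedelta(days=0), {"lgK": 16}),
    (timedelta(days=30), {"lgK": 15}),
    (timedelta(days=180), {"lgK": 14}),
]

def properties_for_age(age, tiers=TIERS):
    """Pick the properties of the oldest tier whose minimum age has been reached."""
    chosen = {}
    for min_age, props in sorted(tiers, key=lambda tier: tier[0]):
        if age >= min_age:
            chosen = props
    return chosen

print(properties_for_age(timedelta(days=45)))
```

Because the properties are an untyped map, this approach is flexible but, as noted in the list above, hard to validate; a typed lgK setting would be easier to check but would need a new field per sketch type.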

Note: This issue should be treated as a PEP-request.

davecromberge, Oct 25 '24 12:10