pinot icon indicating copy to clipboard operation
pinot copied to clipboard

Do not create dictionary for high-cardinality columns

Open KKcorps opened this issue 3 years ago • 1 comments

  • Disable dicts for JSON and TEXT indexing columns
  • Extend optimizeDictionary config for dimension columns with fixed width as well.

Objective is to reduce the segment size and server space/memory usage.

KKcorps avatar Oct 04 '22 11:10 KKcorps

Codecov Report

Merging #9527 (a5026af) into master (f7c5511) will decrease coverage by 54.16%. The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #9527       +/-   ##
=============================================
- Coverage     70.00%   15.83%   -54.17%     
+ Complexity     4933      175     -4758     
=============================================
  Files          1946     1912       -34     
  Lines        104280   102715     -1565     
  Branches      15808    15624      -184     
=============================================
- Hits          72998    16265    -56733     
- Misses        26157    85253    +59096     
+ Partials       5125     1197     -3928     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 ?
unittests2 15.83% <0.00%> (+0.14%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ment/creator/impl/SegmentColumnarIndexCreator.java 0.00% <0.00%> (-79.61%) :arrow_down:
...ot/segment/spi/creator/SegmentGeneratorConfig.java 0.00% <0.00%> (-82.00%) :arrow_down:
.../apache/pinot/spi/config/table/IndexingConfig.java 0.00% <0.00%> (-90.70%) :arrow_down:
...src/main/java/org/apache/pinot/sql/FilterKind.java 0.00% <0.00%> (-100.00%) :arrow_down:
...ain/java/org/apache/pinot/core/data/table/Key.java 0.00% <0.00%> (-100.00%) :arrow_down:
...in/java/org/apache/pinot/spi/utils/BytesUtils.java 0.00% <0.00%> (-100.00%) :arrow_down:
...n/java/org/apache/pinot/core/data/table/Table.java 0.00% <0.00%> (-100.00%) :arrow_down:
.../java/org/apache/pinot/core/data/table/Record.java 0.00% <0.00%> (-100.00%) :arrow_down:
.../java/org/apache/pinot/core/util/GroupByUtils.java 0.00% <0.00%> (-100.00%) :arrow_down:
...java/org/apache/pinot/spi/trace/BaseRecording.java 0.00% <0.00%> (-100.00%) :arrow_down:
... and 1515 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Oct 04 '22 11:10 codecov-commenter

I can see that for json and text column, we might not want to create dictionary, but for other dimensions, in most cases we still want to create dictionaries, or a lot of indexes cannot be applied. With the current change, for existing users who have optimize dictionary set for metrics, this will automatically apply that to dimensions, which can cause serious regression (inverted index cannot be added). How about adding a config to only apply this to json/text column?

Actually the reason for this change was to introduce this config for dimension columns (after complaints about space amplification and memory usage from users). Json and text index got introduced later in the scope. IMO, what we can do though is then introduce a seperate metric optimizeDictionaryForDimensions but mention the risk with setting this config.

users do have cases where they keep String columns as dimensions but don't really do any filtering on top of them.

KKcorps avatar Nov 02 '22 08:11 KKcorps

In that case, we can keep both optimizeDictionary (apply to both dimensions and metrics) and optimizeDictionaryForMetrics (only apply to metrics) to avoid backward incompatible. I don't see a case where user only want to optimize dimensions but not metrics

Jackie-Jiang avatar Nov 05 '22 05:11 Jackie-Jiang

In that case, we can keep both optimizeDictionary (apply to both dimensions and metrics) and optimizeDictionaryForMetrics (only apply to metrics) to avoid backward incompatible. I don't see a case where user only want to optimize dimensions but not metrics

Makes sense. Incorporated this now.

KKcorps avatar Nov 21 '22 19:11 KKcorps