pinot
pinot copied to clipboard
Do not create dictionary for high-cardinality columns
- Disable dicts for JSON and TEXT indexing columns
- Extend
optimizeDictionaryconfig for dimension columns with fixed width as well.
Objective is to reduce the segment size and server space/memory usage.
Codecov Report
Merging #9527 (a5026af) into master (f7c5511) will decrease coverage by
54.16%. The diff coverage is0.00%.
@@ Coverage Diff @@
## master #9527 +/- ##
=============================================
- Coverage 70.00% 15.83% -54.17%
+ Complexity 4933 175 -4758
=============================================
Files 1946 1912 -34
Lines 104280 102715 -1565
Branches 15808 15624 -184
=============================================
- Hits 72998 16265 -56733
- Misses 26157 85253 +59096
+ Partials 5125 1197 -3928
| Flag | Coverage Δ | |
|---|---|---|
| integration1 | ? |
|
| integration2 | ? |
|
| unittests1 | ? |
|
| unittests2 | 15.83% <0.00%> (+0.14%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...ment/creator/impl/SegmentColumnarIndexCreator.java | 0.00% <0.00%> (-79.61%) |
:arrow_down: |
| ...ot/segment/spi/creator/SegmentGeneratorConfig.java | 0.00% <0.00%> (-82.00%) |
:arrow_down: |
| .../apache/pinot/spi/config/table/IndexingConfig.java | 0.00% <0.00%> (-90.70%) |
:arrow_down: |
| ...src/main/java/org/apache/pinot/sql/FilterKind.java | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...ain/java/org/apache/pinot/core/data/table/Key.java | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...in/java/org/apache/pinot/spi/utils/BytesUtils.java | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...n/java/org/apache/pinot/core/data/table/Table.java | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| .../java/org/apache/pinot/core/data/table/Record.java | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| .../java/org/apache/pinot/core/util/GroupByUtils.java | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...java/org/apache/pinot/spi/trace/BaseRecording.java | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ... and 1515 more |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
I can see that for json and text column, we might not want to create dictionary, but for other dimensions, in most cases we still want to create dictionaries, or a lot of indexes cannot be applied. With the current change, for existing users who have optimize dictionary set for metrics, this will automatically apply that to dimensions, which can cause serious regression (inverted index cannot be added). How about adding a config to only apply this to json/text column?
Actually the reason for this change was to introduce this config for dimension columns (after complaints about space amplification and memory usage from users).
Json and text index got introduced later in the scope.
IMO, what we can do though is then introduce a seperate metric optimizeDictionaryForDimensions but mention the risk with setting this config.
users do have cases where they keep String columns as dimensions but don't really do any filtering on top of them.
In that case, we can keep both optimizeDictionary (apply to both dimensions and metrics) and optimizeDictionaryForMetrics (only apply to metrics) to avoid backward incompatible. I don't see a case where user only want to optimize dimensions but not metrics
In that case, we can keep both
optimizeDictionary(apply to both dimensions and metrics) andoptimizeDictionaryForMetrics(only apply to metrics) to avoid backward incompatible. I don't see a case where user only want to optimize dimensions but not metrics
Makes sense. Incorporated this now.