
Dynamic evaluation of GroupBy Initial Capacity

Open · praveenc7 opened this issue 1 year ago · 8 comments

Summary

For GroupBy queries, the default size of the GroupByResultHolder is set to 10K (or the query limit), which can lead to inefficient resource usage in cases where fewer group-by keys are expected, such as queries with highly selective filters (see the attached screenshot). For example:

select column1, sum(column2) from testTable where column1 in ("123") group by column1 limit 20000

Description

This update dynamically adjusts the initial capacity of the GroupByResultHolder based on the filter predicates for such queries. By aligning the result holder size with the filter, we aim to optimize resource allocation and improve performance for filtered group-by queries.
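To make the idea concrete, here is a minimal sketch of sizing the result holder from the IN/EQ predicate sizes on the group-by columns. This is an illustration only; the class and method names are hypothetical and do not reflect Pinot's actual code.

```java
import java.util.Map;

// Hypothetical sketch: when every group-by column is bounded by an IN/EQ predicate,
// size the result holder by the product of the predicate sizes instead of the 10K default.
public class GroupByCapacityEstimator {
  private static final int DEFAULT_INITIAL_CAPACITY = 10_000;

  public static int estimateInitialCapacity(String[] groupByColumns,
      Map<String, Integer> filterValueCounts) {
    long product = 1;
    for (String column : groupByColumns) {
      Integer valueCount = filterValueCounts.get(column);
      if (valueCount == null) {
        // Column not bounded by the filter: keep the existing default sizing.
        return DEFAULT_INITIAL_CAPACITY;
      }
      product *= valueCount;
      if (product >= DEFAULT_INITIAL_CAPACITY) {
        // No benefit beyond the default, so stop early.
        return DEFAULT_INITIAL_CAPACITY;
      }
    }
    return (int) product;
  }
}
```

For the example query above, the single IN predicate on column1 has one value, so the holder would be sized for one group instead of 10K.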

Testing

We ran performance tests on some production queries and saw a significant improvement in p99 latency.

  1. Performance improvement: We observed a 1.5x improvement in P99.9 latency on a set of queries that benefited from the optimization. Below are sample query formats tested:

SELECT column1, SUM(column2) FROM testTable WHERE column1 IN ("123") GROUP BY column1 LIMIT 20000

SELECT column1, column2, SUM(column3) FROM testTable WHERE column1 IN ("123") AND column2 = '123' GROUP BY column1, column2 LIMIT 20000

The flame graph also looks much more normal (see the attached screenshot).

  2. Impact on queries with unmet conditions: For queries processed by the new logic but ultimately exiting the optimization due to unmatched conditions (e.g., insufficient GROUP BY clauses), no significant latency overhead was observed. The flame graph results showed performance similar to the existing logic:

SELECT column1, column2, SUM(column3) FROM testTable WHERE column1 IN ("123") GROUP BY column1, column2 LIMIT 20000

(Flame graph screenshot attached.)

praveenc7 · Sep 15, 2024

Codecov Report

Attention: Patch coverage is 67.16418% with 22 lines in your changes missing coverage. Please review.

Project coverage is 63.78%. Comparing base (59551e4) to head (325121b). Report is 1256 commits behind head on master.

Files with missing lines                                  Patch %   Lines
...upby/NoDictionaryMultiColumnGroupKeyGenerator.java      25.00%   6 Missing and 3 partials :warning:
...ry/aggregation/groupby/DefaultGroupByExecutor.java      80.00%   1 Missing and 5 partials :warning:
...pby/NoDictionarySingleColumnGroupKeyGenerator.java      33.33%   3 Missing and 1 partial :warning:
...tion/groupby/DictionaryBasedGroupKeyGenerator.java      83.33%   1 Missing and 2 partials :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14001      +/-   ##
============================================
+ Coverage     61.75%   63.78%   +2.02%     
- Complexity      207     1555    +1348     
============================================
  Files          2436     2660     +224     
  Lines        133233   145740   +12507     
  Branches      20636    22308    +1672     
============================================
+ Hits          82274    92954   +10680     
- Misses        44911    45915    +1004     
- Partials       6048     6871     +823     
Flag                       Coverage Δ
custom-integration1        100.00% <ø> (+99.99%) :arrow_up:
integration                100.00% <ø> (+99.99%) :arrow_up:
integration1               100.00% <ø> (+99.99%) :arrow_up:
integration2                 0.00% <ø> (ø)
java-11                     63.73% <67.16%> (+2.02%) :arrow_up:
java-21                     63.67% <67.16%> (+2.04%) :arrow_up:
skip-bytebuffers-false      63.75% <67.16%> (+2.00%) :arrow_up:
skip-bytebuffers-true       63.65% <67.16%> (+35.92%) :arrow_up:
temurin                     63.78% <67.16%> (+2.02%) :arrow_up:
unittests                   63.77% <67.16%> (+2.02%) :arrow_up:
unittests1                  55.47% <67.16%> (+8.58%) :arrow_up:
unittests2                  34.18% <0.00%> (+6.45%) :arrow_up:

Flags with carried forward coverage won't be shown.


codecov-commenter · Sep 15, 2024

These queries are very uncommon, and I don't know if it is worth taking the overhead for all group-by queries. Alternatively, we might just add a query option to override the initial capacity if desired

Jackie-Jiang · Sep 26, 2024

> These queries are very uncommon, and I don't know if it is worth taking the overhead for all group-by queries. Alternatively, we might just add a query option to override the initial capacity if desired.

@Jackie-Jiang Thank you for your input! We currently apply this optimization only to GROUP BY queries with filters, and based on our initial performance testing the overhead was comparable.

However, I see your concern and can certainly add a query option to enable this optimization if desired.

praveenc7 · Sep 29, 2024

Discussed offline. I think a more general problem here is when we group by columns that are only partially (or not at all) constrained in the filter clause but are themselves low cardinality. @praveenc7 could you explore whether it is easy to do something like min(cartesian_product, default_min_value)? The cartesian_product part can be the product of the IN/EQ filter sizes or the column cardinalities. In that sense I feel it would be more generally applicable. In https://github.com/apache/pinot/pull/9420/files we made some of the cardinality accessible; you can take a look.

@jasperjiaguo Looking more closely at the code, it looks like this condition is already handled in DictionaryBasedGroupKeyGenerator: when initializing the size in DefaultGroupByResultHolder, we take min(cardinality_product, default_product), or min(cardinality_product, optimized_value) if our optimization is in place.
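As a rough illustration of that sizing rule (the class, method, and parameter names below are made up, not the actual Pinot identifiers):

```java
// Hypothetical sketch: the result holder capacity is the cardinality product of the
// group-by columns, capped by the default capacity or by the filter-optimized value
// when the optimization applies.
public final class ResultHolderSizing {
  // optimizedCapacity is negative when the filter-based optimization does not apply.
  public static int initialCapacity(long cardinalityProduct, int defaultCapacity,
      int optimizedCapacity) {
    long cap = (optimizedCapacity > 0) ? optimizedCapacity : defaultCapacity;
    return (int) Math.min(cardinalityProduct, cap);
  }
}
```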

praveenc7 · Oct 5, 2024

@praveenc7 I see, thanks for checking! I think the only (small) gap here is that when we have

select ... where A in (a1, a2, a3) group by A,B

Do we currently handle this mixture of bounded + unbounded? Ideally the effective cardinality estimate could be calculated as cardinality_product / card(A) * filterlen(A).

jasperjiaguo · Oct 7, 2024

> @praveenc7 I see, thanks for checking! I think the only (small) gap here is that when we have
>
> select ... where A in (a1, a2, a3) group by A,B
>
> Do we currently handle this mixture of bounded + unbounded? Ideally the effective cardinality estimate could be calculated as cardinality_product / card(A) * filterlen(A).

@jasperjiaguo Agreed, this might create a tighter bound. Currently the cardinality is available for DictionaryBasedGroupKey, so the unbounded case would mostly be applicable to dictionary-based columns; it might need some more changes to support this. I was thinking of doing it in a follow-up PR.
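A minimal sketch of the mixed bounded/unbounded estimate discussed above, assuming hypothetical accessors for per-column dictionary cardinality and filter value counts (not Pinot's actual API): each group-by column contributes its IN/EQ predicate size when the filter bounds it, and its dictionary cardinality otherwise.

```java
import java.util.Map;

// Hypothetical sketch: effective key-space estimate for a mix of filter-bounded and
// unbounded group-by columns, e.g. cardinality_product / card(A) * filterlen(A) when
// only column A is bounded.
public class EffectiveCardinalityEstimator {
  public static long estimate(String[] groupByColumns,
      Map<String, Integer> dictionaryCardinalities,
      Map<String, Integer> filterValueCounts, long maxCapacity) {
    long product = 1;
    for (String column : groupByColumns) {
      Integer bound = filterValueCounts.get(column);
      long perColumn = (bound != null)
          ? bound  // bounded by the filter, e.g. A IN (a1, a2, a3) -> 3
          : dictionaryCardinalities.getOrDefault(column, Integer.MAX_VALUE);  // unbounded
      product *= Math.min(perColumn, maxCapacity);
      if (product >= maxCapacity) {
        return maxCapacity;  // cap early to avoid oversizing and overflow
      }
    }
    return product;
  }
}
```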

praveenc7 · Oct 9, 2024

Let's think about the edge case of

select ... where (A in (a1, a2, a3) or C > c) group by A

jasperjiaguo · Oct 9, 2024

> Let's think about the edge case of
>
> select ... where (A in (a1, a2, a3) or C > c) group by A

@jasperjiaguo Added a condition that checks for a top-level AND, and also added mixed support for both bounded & unbounded columns.
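A hypothetical sketch (not Pinot's actual FilterContext API) of why the top-level AND check matters: IN/EQ bounds are only safe to use when they appear conjunctively at the top level. Under an OR, as in the edge case above, rows outside the IN list can still match, so the bound must be discarded.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative filter traversal: collect per-column value bounds only from the direct
// children of a top-level AND (or from a single top-level predicate).
public class FilterBoundsExtractor {
  enum Op { AND, OR, IN, EQ, RANGE }

  static class FilterNode {
    Op op;
    String column;              // for leaf predicates
    List<String> values;        // for IN / EQ
    List<FilterNode> children;  // for AND / OR
  }

  /** Returns column -> number of allowed values; empty when no safe bound exists. */
  public static Map<String, Integer> extractBounds(FilterNode root) {
    Map<String, Integer> bounds = new HashMap<>();
    if (root == null) {
      return bounds;
    }
    if (root.op == Op.AND) {
      for (FilterNode child : root.children) {
        addLeafBound(child, bounds);  // only direct children of the top-level AND
      }
    } else {
      addLeafBound(root, bounds);     // a single predicate at the top level
    }
    return bounds;
  }

  private static void addLeafBound(FilterNode node, Map<String, Integer> bounds) {
    if (node.op == Op.IN || node.op == Op.EQ) {
      bounds.merge(node.column, node.values.size(), Math::min);
    }
    // OR, RANGE, and nested nodes contribute no bound and are ignored.
  }
}
```

Under this scheme, "(A in (a1, a2, a3) or C > c)" yields no bound for A, so the optimization falls back to the default capacity.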

praveenc7 · Oct 13, 2024