[Draft v2] Another Multi group by optimization
Which issue does this PR close?
Closes #.
Related to https://github.com/apache/datafusion/issues/9403
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Great news! Although this approach is slightly slow on small string, but it is significant better if string is large.
THIS PR
Q0: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 0 iteration 0 took 2873.6 ms and returned 10 rows
Query 0 iteration 1 took 2595.7 ms and returned 10 rows
Query 0 iteration 2 took 2320.4 ms and returned 10 rows
Query 0 iteration 3 took 2269.0 ms and returned 10 rows
Query 0 iteration 4 took 2321.2 ms and returned 10 rows
Q1: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 1 iteration 0 took 2280.9 ms and returned 10 rows
Query 1 iteration 1 took 2679.1 ms and returned 10 rows
Query 1 iteration 2 took 2683.2 ms and returned 10 rows
Query 1 iteration 3 took 2534.8 ms and returned 10 rows
Query 1 iteration 4 took 2804.4 ms and returned 10 rows
Q2: SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;
Query 2 iteration 0 took 28927.7 ms and returned 10 rows
Query 2 iteration 1 took 27570.2 ms and returned 10 rows
Query 2 iteration 2 took 28776.2 ms and returned 10 rows
Query 2 iteration 3 took 32869.2 ms and returned 10 rows
Query 2 iteration 4 took 35559.9 ms and returned 10 rows
Done
Main
Q0: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 0 iteration 0 took 2010.2 ms and returned 10 rows
Query 0 iteration 1 took 1896.5 ms and returned 10 rows
Query 0 iteration 2 took 1707.0 ms and returned 10 rows
Query 0 iteration 3 took 1776.9 ms and returned 10 rows
Query 0 iteration 4 took 1685.8 ms and returned 10 rows
Q1: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 1 iteration 0 took 1695.3 ms and returned 10 rows
Query 1 iteration 1 took 1727.3 ms and returned 10 rows
Query 1 iteration 2 took 1704.5 ms and returned 10 rows
Query 1 iteration 3 took 1832.4 ms and returned 10 rows
Query 1 iteration 4 took 1709.8 ms and returned 10 rows
Q2: SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;
Query 2 iteration 0 took 57890.4 ms and returned 10 rows
Query 2 iteration 1 took 60489.2 ms and returned 10 rows
Query 2 iteration 2 took 59249.8 ms and returned 10 rows
Query 2 iteration 3 took 57315.8 ms and returned 10 rows
Query 2 iteration 4 took 53942.7 ms and returned 10 rows
Done
BTW I hope/plan to use a plane trip on Sunday to prototype the "add a physical optimizer pass that sets types to StringView" approach
I am starting to play with this PR again
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.