datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[Draft v2] Another Multi group by optimization

Open jayzhan211 opened this issue 1 year ago • 5 comments

Which issue does this PR close?

Closes #.

Related to https://github.com/apache/datafusion/issues/9403

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jayzhan211 avatar Jun 18 '24 07:06 jayzhan211

Screenshot 2024-06-18 at 3 14 28 PM

jayzhan211 avatar Jun 18 '24 07:06 jayzhan211

Great news! Although this approach is slightly slow on small string, but it is significant better if string is large.

THIS PR

Q0: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 0 iteration 0 took 2873.6 ms and returned 10 rows
Query 0 iteration 1 took 2595.7 ms and returned 10 rows
Query 0 iteration 2 took 2320.4 ms and returned 10 rows
Query 0 iteration 3 took 2269.0 ms and returned 10 rows
Query 0 iteration 4 took 2321.2 ms and returned 10 rows
Q1: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 1 iteration 0 took 2280.9 ms and returned 10 rows
Query 1 iteration 1 took 2679.1 ms and returned 10 rows
Query 1 iteration 2 took 2683.2 ms and returned 10 rows
Query 1 iteration 3 took 2534.8 ms and returned 10 rows
Query 1 iteration 4 took 2804.4 ms and returned 10 rows
Q2: SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;
Query 2 iteration 0 took 28927.7 ms and returned 10 rows
Query 2 iteration 1 took 27570.2 ms and returned 10 rows
Query 2 iteration 2 took 28776.2 ms and returned 10 rows
Query 2 iteration 3 took 32869.2 ms and returned 10 rows
Query 2 iteration 4 took 35559.9 ms and returned 10 rows
Done

Main

Q0: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
Query 0 iteration 0 took 2010.2 ms and returned 10 rows
Query 0 iteration 1 took 1896.5 ms and returned 10 rows
Query 0 iteration 2 took 1707.0 ms and returned 10 rows
Query 0 iteration 3 took 1776.9 ms and returned 10 rows
Query 0 iteration 4 took 1685.8 ms and returned 10 rows
Q1: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
Query 1 iteration 0 took 1695.3 ms and returned 10 rows
Query 1 iteration 1 took 1727.3 ms and returned 10 rows
Query 1 iteration 2 took 1704.5 ms and returned 10 rows
Query 1 iteration 3 took 1832.4 ms and returned 10 rows
Query 1 iteration 4 took 1709.8 ms and returned 10 rows
Q2: SELECT "UserID", concat("SearchPhrase", repeat('hello', 100)) as s, COUNT(*) FROM hits GROUP BY "UserID", s LIMIT 10;
Query 2 iteration 0 took 57890.4 ms and returned 10 rows
Query 2 iteration 1 took 60489.2 ms and returned 10 rows
Query 2 iteration 2 took 59249.8 ms and returned 10 rows
Query 2 iteration 3 took 57315.8 ms and returned 10 rows
Query 2 iteration 4 took 53942.7 ms and returned 10 rows
Done

jayzhan211 avatar Jun 18 '24 07:06 jayzhan211

BTW I hope/plan to use a plane trip on Sunday to prototype the "add a physical optimizer pass that sets types to StringView" approach

alamb avatar Jun 21 '24 22:06 alamb

I am starting to play with this PR again

alamb avatar Jun 28 '24 20:06 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Aug 28 '24 01:08 github-actions[bot]