datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

POC: Vectorized hashtable for aggregation

Open Rachelint opened this issue 1 year ago • 1 comments

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Rachelint avatar Oct 18 '24 09:10 Rachelint

The latest benchmark numbers:

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.67ms │                 0.68ms │     no change │
│ QQuery 1     │    67.01ms │                65.25ms │     no change │
│ QQuery 2     │   165.14ms │               157.75ms │     no change │
│ QQuery 3     │   181.43ms │               181.83ms │     no change │
│ QQuery 4     │  1566.65ms │              1574.95ms │     no change │
│ QQuery 5     │  1539.79ms │              1532.81ms │     no change │
│ QQuery 6     │    61.01ms │                57.01ms │ +1.07x faster │
│ QQuery 7     │    77.09ms │                73.02ms │ +1.06x faster │
│ QQuery 8     │  1971.64ms │              1762.88ms │ +1.12x faster │
│ QQuery 9     │  1921.59ms │              1903.47ms │     no change │
│ QQuery 10    │   516.35ms │               499.35ms │     no change │
│ QQuery 11    │   590.99ms │               556.80ms │ +1.06x faster │
│ QQuery 12    │  1814.14ms │              1816.26ms │     no change │
│ QQuery 13    │  2956.07ms │              2954.48ms │     no change │
│ QQuery 14    │  2054.42ms │              1940.82ms │ +1.06x faster │
│ QQuery 15    │  1899.87ms │              1873.73ms │     no change │
│ QQuery 16    │  4066.16ms │              3744.25ms │ +1.09x faster │
│ QQuery 17    │  3629.16ms │              3428.06ms │ +1.06x faster │
│ QQuery 18    │  8282.13ms │              7646.27ms │ +1.08x faster │
│ QQuery 19    │   144.20ms │               146.30ms │     no change │
│ QQuery 20    │  3222.65ms │              3224.85ms │     no change │
│ QQuery 21    │  3924.86ms │              3913.65ms │     no change │
│ QQuery 22    │  9144.86ms │              9022.44ms │     no change │
│ QQuery 23    │ 23875.41ms │             23664.41ms │     no change │
│ QQuery 24    │  1123.53ms │              1132.05ms │     no change │
│ QQuery 25    │  1011.03ms │              1002.87ms │     no change │
│ QQuery 26    │  1326.71ms │              1319.49ms │     no change │
│ QQuery 27    │  4666.49ms │              4662.07ms │     no change │
│ QQuery 28    │ 24069.75ms │             24145.85ms │     no change │
│ QQuery 29    │   902.07ms │               890.73ms │     no change │
│ QQuery 30    │  1813.79ms │              1722.40ms │ +1.05x faster │
│ QQuery 31    │  2008.03ms │              1977.28ms │     no change │
│ QQuery 32    │  7369.56ms │              7601.38ms │     no change │
│ QQuery 33    │  9752.79ms │              9742.50ms │     no change │
│ QQuery 34    │  9716.57ms │              9696.95ms │     no change │
│ QQuery 35    │  2760.71ms │              2244.23ms │ +1.23x faster │
│ QQuery 36    │   255.12ms │               241.01ms │ +1.06x faster │
│ QQuery 37    │   158.70ms │               154.80ms │     no change │
│ QQuery 38    │   155.15ms │               153.09ms │     no change │
│ QQuery 39    │   595.64ms │               587.48ms │     no change │
│ QQuery 40    │    57.09ms │                60.69ms │  1.06x slower │
│ QQuery 41    │    53.32ms │                52.81ms │     no change │
│ QQuery 42    │    65.53ms │                65.13ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘

Rachelint avatar Oct 19 '24 11:10 Rachelint

The logic is a bit complex, I plan to finish and do benchmark for it today.

Rachelint avatar Oct 26 '24 03:10 Rachelint

It is happy to see that the vectorized approach is promising!

The rest work is to find why it get q28 slower and fix tests.

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.78ms │                 0.66ms │ +1.18x faster │
│ QQuery 1     │    68.16ms │                67.25ms │     no change │
│ QQuery 2     │   161.83ms │               161.80ms │     no change │
│ QQuery 3     │   186.20ms │               179.59ms │     no change │
│ QQuery 4     │  1581.81ms │              1601.15ms │     no change │
│ QQuery 5     │  1556.95ms │              1504.38ms │     no change │
│ QQuery 6     │    59.36ms │                58.69ms │     no change │
│ QQuery 7     │    79.29ms │                75.72ms │     no change │
│ QQuery 8     │  1945.74ms │              1730.29ms │ +1.12x faster │
│ QQuery 9     │  1958.91ms │              1912.77ms │     no change │
│ QQuery 10    │   519.94ms │               496.55ms │     no change │
│ QQuery 11    │   592.06ms │               573.94ms │     no change │
│ QQuery 12    │  1834.06ms │              1818.58ms │     no change │
│ QQuery 13    │  2963.09ms │              2937.11ms │     no change │
│ QQuery 14    │  2053.23ms │              1959.00ms │     no change │
│ QQuery 15    │  1902.87ms │              1873.43ms │     no change │
│ QQuery 16    │  4042.73ms │              3721.92ms │ +1.09x faster │
│ QQuery 17    │  3649.27ms │              3413.73ms │ +1.07x faster │
│ QQuery 18    │  8286.60ms │              7555.43ms │ +1.10x faster │
│ QQuery 19    │   148.06ms │               143.84ms │     no change │
│ QQuery 20    │  3241.10ms │              3225.91ms │     no change │
│ QQuery 21    │  3879.00ms │              3893.55ms │     no change │
│ QQuery 22    │  9109.29ms │              9306.23ms │     no change │
│ QQuery 23    │ 23670.14ms │             23590.65ms │     no change │
│ QQuery 24    │  1166.75ms │              1126.80ms │     no change │
│ QQuery 25    │   998.82ms │              1006.37ms │     no change │
│ QQuery 26    │  1320.64ms │              1307.47ms │     no change │
│ QQuery 27    │  4772.51ms │              4675.67ms │     no change │
│ QQuery 28    │ 22374.44ms │             24834.06ms │  1.11x slower │
│ QQuery 29    │   902.94ms │               896.11ms │     no change │
│ QQuery 30    │  1829.63ms │              1739.30ms │     no change │
│ QQuery 31    │  2059.05ms │              1951.76ms │ +1.05x faster │
│ QQuery 32    │  7460.76ms │              7609.67ms │     no change │
│ QQuery 33    │  9834.20ms │              9681.51ms │     no change │
│ QQuery 34    │  9759.46ms │              9556.88ms │     no change │
│ QQuery 35    │  2786.28ms │              2258.62ms │ +1.23x faster │
│ QQuery 36    │   257.69ms │               250.87ms │     no change │
│ QQuery 37    │   155.01ms │               151.16ms │     no change │
│ QQuery 38    │   154.50ms │               147.32ms │     no change │
│ QQuery 39    │   635.06ms │               596.76ms │ +1.06x faster │
│ QQuery 40    │    60.37ms │                59.58ms │     no change │
│ QQuery 41    │    53.82ms │                53.27ms │     no change │
│ QQuery 42    │    66.93ms │                63.49ms │ +1.05x faster │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main)                     │ 140139.34ms │
│ Total Time (vectorize-append-value)   │ 139768.84ms │
│ Average Time (main)                   │   3259.05ms │
│ Average Time (vectorize-append-value) │   3250.44ms │
│ Queries Faster                        │           9 │
│ Queries Slower                        │           1 │
│ Queries with No Change                │          33 │
└───────────────────────────────────────┴─────────────┘

Rachelint avatar Oct 27 '24 17:10 Rachelint

I profile to get flamegraph of q28 and found it actually not go the path in this pr. The slower may be due to some enviorment reason in my local, I am re-running the bench on main now.

The last thing is to add some unit tests to cover some conercases found during debugging.

Rachelint avatar Oct 28 '24 10:10 Rachelint

I think this pr is ready now!

Rachelint avatar Oct 30 '24 12:10 Rachelint

Nice -- I am running some benchmarks on this PR

alamb avatar Oct 31 '24 19:10 alamb

🤩

Comparing main_base and vectorize-append-value
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.69ms │                 0.67ms │     no change │
│ QQuery 1     │    69.53ms │                69.81ms │     no change │
│ QQuery 2     │   125.56ms │               125.06ms │     no change │
│ QQuery 3     │   134.49ms │               133.53ms │     no change │
│ QQuery 4     │   986.06ms │              1002.09ms │     no change │
│ QQuery 5     │  1092.56ms │              1111.32ms │     no change │
│ QQuery 6     │    67.90ms │                66.92ms │     no change │
│ QQuery 7     │    82.12ms │                80.79ms │     no change │
│ QQuery 8     │  1381.07ms │              1055.63ms │ +1.31x faster │
│ QQuery 9     │  1376.13ms │              1371.51ms │     no change │
│ QQuery 10    │   466.36ms │               435.31ms │ +1.07x faster │
│ QQuery 11    │   508.85ms │               474.68ms │ +1.07x faster │
│ QQuery 12    │  1245.08ms │              1216.74ms │     no change │
│ QQuery 13    │  1919.50ms │              1746.68ms │ +1.10x faster │
│ QQuery 14    │  1371.00ms │              1238.46ms │ +1.11x faster │
│ QQuery 15    │  1156.13ms │              1123.78ms │     no change │
│ QQuery 16    │  2593.20ms │              2187.34ms │ +1.19x faster │
│ QQuery 17    │  2374.51ms │              2008.88ms │ +1.18x faster │
│ QQuery 18    │  5049.66ms │              4169.48ms │ +1.21x faster │
│ QQuery 19    │   125.17ms │               119.76ms │     no change │
│ QQuery 20    │  1650.50ms │              1642.21ms │     no change │
│ QQuery 21    │  2100.16ms │              2082.53ms │     no change │
│ QQuery 22    │  5039.82ms │              4996.73ms │     no change │
│ QQuery 23    │ 11902.50ms │             11835.25ms │     no change │
│ QQuery 24    │   810.90ms │               797.78ms │     no change │
│ QQuery 25    │   728.72ms │               699.22ms │     no change │
│ QQuery 26    │   868.88ms │               891.65ms │     no change │
│ QQuery 27    │  2627.37ms │              2600.82ms │     no change │
│ QQuery 28    │ 15172.21ms │             15297.59ms │     no change │
│ QQuery 29    │   572.83ms │               571.91ms │     no change │
│ QQuery 30    │  1239.14ms │              1141.57ms │ +1.09x faster │
│ QQuery 31    │  1277.14ms │              1206.06ms │ +1.06x faster │
│ QQuery 32    │  4198.97ms │              3978.68ms │ +1.06x faster │
│ QQuery 33    │  5184.58ms │              5281.99ms │     no change │
│ QQuery 34    │  5334.25ms │              5234.23ms │     no change │
│ QQuery 35    │  1799.37ms │              1351.66ms │ +1.33x faster │
│ QQuery 36    │   319.76ms │               312.16ms │     no change │
│ QQuery 37    │   220.55ms │               220.76ms │     no change │
│ QQuery 38    │   197.20ms │               195.72ms │     no change │
│ QQuery 39    │   810.69ms │               732.04ms │ +1.11x faster │
│ QQuery 40    │    91.81ms │                84.87ms │ +1.08x faster │
│ QQuery 41    │    81.72ms │                77.45ms │ +1.06x faster │
│ QQuery 42    │    94.61ms │                94.34ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                │ 84449.27ms │
│ Total Time (vectorize-append-value)   │ 81065.67ms │
│ Average Time (main_base)              │  1963.94ms │
│ Average Time (vectorize-append-value) │  1885.25ms │
│ Queries Faster                        │         15 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │         28 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2736.57ms │              2712.16ms │     no change │
│ QQuery 1     │   772.15ms │               769.86ms │     no change │
│ QQuery 2     │  1624.12ms │              1639.77ms │     no change │
│ QQuery 3     │   827.73ms │               694.66ms │ +1.19x faster │
│ QQuery 4     │ 12716.33ms │             12364.86ms │     no change │
│ QQuery 5     │ 19095.23ms │             19058.86ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                │ 37772.14ms │
│ Total Time (vectorize-append-value)   │ 37240.16ms │
│ Average Time (main_base)              │  6295.36ms │
│ Average Time (vectorize-append-value) │  6206.69ms │
│ Queries Faster                        │          1 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          5 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.25ms │                 2.27ms │     no change │
│ QQuery 1     │    40.78ms │                38.83ms │     no change │
│ QQuery 2     │    98.26ms │                95.76ms │     no change │
│ QQuery 3     │   108.58ms │               106.87ms │     no change │
│ QQuery 4     │   922.34ms │               917.62ms │     no change │
│ QQuery 5     │   980.37ms │               938.35ms │     no change │
│ QQuery 6     │    38.42ms │                39.18ms │     no change │
│ QQuery 7     │    43.00ms │                43.24ms │     no change │
│ QQuery 8     │  1359.05ms │              1009.35ms │ +1.35x faster │
│ QQuery 9     │  1348.60ms │              1323.55ms │     no change │
│ QQuery 10    │   424.82ms │               406.74ms │     no change │
│ QQuery 11    │   475.00ms │               451.86ms │     no change │
│ QQuery 12    │  1090.45ms │              1106.96ms │     no change │
│ QQuery 13    │  1577.54ms │              1508.58ms │     no change │
│ QQuery 14    │  1259.77ms │              1086.88ms │ +1.16x faster │
│ QQuery 15    │  1086.47ms │              1081.46ms │     no change │
│ QQuery 16    │  2449.02ms │              2030.13ms │ +1.21x faster │
│ QQuery 17    │  2329.89ms │              1911.21ms │ +1.22x faster │
│ QQuery 18    │  5004.89ms │              3989.09ms │ +1.25x faster │
│ QQuery 19    │    97.70ms │                98.64ms │     no change │
│ QQuery 20    │  1474.31ms │              1478.79ms │     no change │
│ QQuery 21    │  1808.23ms │              1810.91ms │     no change │
│ QQuery 22    │  3132.66ms │              3128.20ms │     no change │
│ QQuery 23    │ 10113.15ms │             10129.78ms │     no change │
│ QQuery 24    │   656.18ms │               651.62ms │     no change │
│ QQuery 25    │   518.58ms │               524.60ms │     no change │
│ QQuery 26    │   719.15ms │               725.11ms │     no change │
│ QQuery 27    │  2266.93ms │              2264.89ms │     no change │
│ QQuery 28    │ 14304.45ms │             14598.41ms │     no change │
│ QQuery 29    │   542.95ms │               540.96ms │     no change │
│ QQuery 30    │  1091.63ms │              1000.47ms │ +1.09x faster │
│ QQuery 31    │  1140.61ms │              1070.64ms │ +1.07x faster │
│ QQuery 32    │  4198.09ms │              3928.13ms │ +1.07x faster │
│ QQuery 33    │  5178.24ms │              5025.69ms │     no change │
│ QQuery 34    │  5065.45ms │              5110.93ms │     no change │
│ QQuery 35    │  1869.61ms │              1300.99ms │ +1.44x faster │
│ QQuery 36    │   274.68ms │               270.33ms │     no change │
│ QQuery 37    │   125.61ms │               125.59ms │     no change │
│ QQuery 38    │   150.35ms │               144.38ms │     no change │
│ QQuery 39    │   732.00ms │               689.60ms │ +1.06x faster │
│ QQuery 40    │    57.04ms │                60.17ms │  1.05x slower │
│ QQuery 41    │    47.52ms │                47.56ms │     no change │
│ QQuery 42    │    65.98ms │                64.11ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                │ 76270.60ms │
│ Total Time (vectorize-append-value)   │ 72878.40ms │
│ Average Time (main_base)              │  1773.73ms │
│ Average Time (vectorize-append-value) │  1694.85ms │
│ Queries Faster                        │         10 │
│ Queries Slower                        │          1 │
│ Queries with No Change                │         32 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  203.49ms │               201.33ms │     no change │
│ QQuery 2     │  115.50ms │               120.92ms │     no change │
│ QQuery 3     │  129.08ms │               119.50ms │ +1.08x faster │
│ QQuery 4     │   87.14ms │                83.75ms │     no change │
│ QQuery 5     │  161.47ms │               159.55ms │     no change │
│ QQuery 6     │   48.73ms │                42.26ms │ +1.15x faster │
│ QQuery 7     │  208.57ms │               200.37ms │     no change │
│ QQuery 8     │  157.16ms │               153.09ms │     no change │
│ QQuery 9     │  244.82ms │               247.43ms │     no change │
│ QQuery 10    │  226.83ms │               238.14ms │     no change │
│ QQuery 11    │   86.91ms │                91.82ms │  1.06x slower │
│ QQuery 12    │  128.79ms │               128.85ms │     no change │
│ QQuery 13    │  225.47ms │               219.07ms │     no change │
│ QQuery 14    │   72.15ms │                89.80ms │  1.24x slower │
│ QQuery 15    │  115.99ms │               107.62ms │ +1.08x faster │
│ QQuery 16    │   80.43ms │                74.99ms │ +1.07x faster │
│ QQuery 17    │  218.58ms │               218.94ms │     no change │
│ QQuery 18    │  329.54ms │               321.20ms │     no change │
│ QQuery 19    │  131.86ms │               129.84ms │     no change │
│ QQuery 20    │  136.33ms │               121.60ms │ +1.12x faster │
│ QQuery 21    │  272.83ms │               283.72ms │     no change │
│ QQuery 22    │   66.11ms │                63.99ms │     no change │
└──────────────┴───────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                │ 3447.78ms │
│ Total Time (vectorize-append-value)   │ 3417.79ms │
│ Average Time (main_base)              │  156.72ms │
│ Average Time (vectorize-append-value) │  155.35ms │
│ Queries Faster                        │         5 │
│ Queries Slower                        │         2 │
│ Queries with No Change                │        15 │
└───────────────────────────────────────┴───────────┘

Wow very nice @Rachelint -- I will plan on checking this PR out in more detail tomorrow

alamb avatar Oct 31 '24 20:10 alamb

Amazing!!!

1.24x slower

Worth profiling / checking this regression before merging I think

Dandandan avatar Oct 31 '24 20:10 Dandandan

Worth profiling / checking this regression before merging I think

For my own future self, it is this tpch query:

│ QQuery 14 │ 72.15ms │ 89.80ms │ 1.24x slower │

I think just profiling what is going on would be helpful. I can maybe help over the next few days if no one else has a chance . Here is an example of the kind of ticket we could potentially file: https://github.com/apache/datafusion/issues/13188

alamb avatar Oct 31 '24 23:10 alamb

Screenshot 2024-11-01 at 11 10 46 AM

Parquet read dominates 🤔

jayzhan211 avatar Nov 01 '24 03:11 jayzhan211

Amazing!!!

1.24x slower

Worth profiling / checking this regression before merging I think

🤔 I plan to try to check it today or tomorrow, I suspect if it is the low cardinality case that many repeated rows exist in the input cols?

Rachelint avatar Nov 01 '24 06:11 Rachelint

Tpch q14 doesn't seem to run through the change of this PR -- groupBy is empty in AggregateExec , I also doesn't see any print out in VectorizedGroupValuesColumn. I think this change is not the reason of slowdown 🤔

query TT
explain select
            100.00 * sum(case
                             when p_type like 'PROMO%'
                                 then l_extendedprice * (1 - l_discount)
                             else 0
            end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
    lineitem,
    part
where
        l_partkey = p_partkey
  and l_shipdate >= date '1995-09-01'
  and l_shipdate < date '1995-10-01';
----
logical_plan
01)Projection: Float64(100) * CAST(sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END) AS Float64) / CAST(sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount) AS Float64) AS promo_revenue
02)--Aggregate: groupBy=[[]], aggr=[[sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN __common_expr_1 ELSE Decimal128(Some(0),38,4) END) AS sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END), sum(__common_expr_1) AS sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)]]
03)----Projection: lineitem.l_extendedprice * (Decimal128(Some(1),20,0) - lineitem.l_discount) AS __common_expr_1, part.p_type
04)------Inner Join: lineitem.l_partkey = part.p_partkey
05)--------Projection: lineitem.l_partkey, lineitem.l_extendedprice, lineitem.l_discount
06)----------Filter: lineitem.l_shipdate >= Date32("1995-09-01") AND lineitem.l_shipdate < Date32("1995-10-01")
07)------------TableScan: lineitem projection=[l_partkey, l_extendedprice, l_discount, l_shipdate], partial_filters=[lineitem.l_shipdate >= Date32("1995-09-01"), lineitem.l_shipdate < Date32("1995-10-01")]
08)--------TableScan: part projection=[p_partkey, p_type]
physical_plan
01)ProjectionExec: expr=[100 * CAST(sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END)@0 AS Float64) / CAST(sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@1 AS Float64) as promo_revenue]
02)--AggregateExec: mode=Final, gby=[], aggr=[sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)]
05)--------ProjectionExec: expr=[l_extendedprice@0 * (Some(1),20,0 - l_discount@1) as __common_expr_1, p_type@2 as p_type]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------HashJoinExec: mode=Partitioned, join_type=Inner, on=[(l_partkey@0, p_partkey@0)], projection=[l_extendedprice@1, l_discount@2, p_type@4]
08)--------------CoalesceBatchesExec: target_batch_size=8192
09)----------------RepartitionExec: partitioning=Hash([l_partkey@0], 4), input_partitions=4
10)------------------CoalesceBatchesExec: target_batch_size=8192
11)--------------------FilterExec: l_shipdate@3 >= 1995-09-01 AND l_shipdate@3 < 1995-10-01, projection=[l_partkey@0, l_extendedprice@1, l_discount@2]
12)----------------------CsvExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/lineitem.tbl:0..18561749], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/lineitem.tbl:18561749..37123498], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/lineitem.tbl:37123498..55685247], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/lineitem.tbl:55685247..74246996]]}, projection=[l_partkey, l_extendedprice, l_discount, l_shipdate], has_header=false
13)--------------CoalesceBatchesExec: target_batch_size=8192
14)----------------RepartitionExec: partitioning=Hash([p_partkey@0], 4), input_partitions=4
15)------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
16)--------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/part.tbl]]}, projection=[p_partkey, p_type], has_header=false

jayzhan211 avatar Nov 01 '24 06:11 jayzhan211

Hm maybe some improvement landed on main not yet on this branch? Maybe rebase and profile again?

Dandandan avatar Nov 01 '24 07:11 Dandandan

My results after merging main show no regressions (tpc-h basically has no change) 🚀

Comparing main and vectorized_aggregate
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ vectorized_aggregate ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.33ms │               0.39ms │  1.17x slower │
│ QQuery 1     │    35.24ms │              35.03ms │     no change │
│ QQuery 2     │    82.54ms │              83.07ms │     no change │
│ QQuery 3     │    76.90ms │              65.62ms │ +1.17x faster │
│ QQuery 4     │   521.48ms │             491.44ms │ +1.06x faster │
│ QQuery 5     │   809.76ms │             768.98ms │ +1.05x faster │
│ QQuery 6     │    33.27ms │              32.65ms │     no change │
│ QQuery 7     │    39.26ms │              39.52ms │     no change │
│ QQuery 8     │   766.34ms │             568.72ms │ +1.35x faster │
│ QQuery 9     │   821.86ms │             823.16ms │     no change │
│ QQuery 10    │   217.90ms │             207.96ms │     no change │
│ QQuery 11    │   247.04ms │             230.55ms │ +1.07x faster │
│ QQuery 12    │   839.06ms │             814.24ms │     no change │
│ QQuery 13    │  1123.51ms │            1102.40ms │     no change │
│ QQuery 14    │   874.87ms │             771.94ms │ +1.13x faster │
│ QQuery 15    │   607.72ms │             615.06ms │     no change │
│ QQuery 16    │  1573.81ms │            1332.64ms │ +1.18x faster │
│ QQuery 17    │  1454.94ms │            1220.57ms │ +1.19x faster │
│ QQuery 18    │  3348.40ms │            2936.64ms │ +1.14x faster │
│ QQuery 19    │    59.96ms │              59.11ms │     no change │
│ QQuery 20    │   943.24ms │             964.36ms │     no change │
│ QQuery 21    │  1291.59ms │            1299.99ms │     no change │
│ QQuery 22    │  3798.86ms │            3812.87ms │     no change │
│ QQuery 23    │  8083.11ms │            7952.95ms │     no change │
│ QQuery 24    │   553.23ms │             554.28ms │     no change │
│ QQuery 25    │   530.46ms │             534.44ms │     no change │
│ QQuery 26    │   610.97ms │             606.59ms │     no change │
│ QQuery 27    │  1558.28ms │            1584.92ms │     no change │
│ QQuery 28    │ 12837.43ms │           12852.93ms │     no change │
│ QQuery 29    │   448.12ms │             446.98ms │     no change │
│ QQuery 30    │   824.09ms │             756.85ms │ +1.09x faster │
│ QQuery 31    │   796.62ms │             752.91ms │ +1.06x faster │
│ QQuery 32    │  3502.80ms │            2829.34ms │ +1.24x faster │
│ QQuery 33    │  2775.26ms │            2729.67ms │     no change │
│ QQuery 34    │  2748.59ms │            2738.52ms │     no change │
│ QQuery 35    │  1194.59ms │             840.66ms │ +1.42x faster │
│ QQuery 36    │   150.59ms │             146.20ms │     no change │
│ QQuery 37    │   102.54ms │             104.18ms │     no change │
│ QQuery 38    │   105.60ms │             104.55ms │     no change │
│ QQuery 39    │   279.20ms │             264.85ms │ +1.05x faster │
│ QQuery 40    │    34.25ms │              32.64ms │     no change │
│ QQuery 41    │    31.88ms │              31.48ms │     no change │
│ QQuery 42    │    40.98ms │              41.15ms │     no change │
└──────────────┴────────────┴──────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                   ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                   │ 56776.48ms │
│ Total Time (vectorized_aggregate)   │ 54183.02ms │
│ Average Time (main)                 │  1320.38ms │
│ Average Time (vectorized_aggregate) │  1260.07ms │
│ Queries Faster                      │         14 │
│ Queries Slower                      │          1 │
│ Queries with No Change              │         28 │
└─────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ vectorized_aggregate ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 122.91ms │             120.39ms │     no change │
│ QQuery 2     │  55.64ms │              55.18ms │     no change │
│ QQuery 3     │  56.02ms │              54.98ms │     no change │
│ QQuery 4     │  39.93ms │              39.85ms │     no change │
│ QQuery 5     │  75.72ms │              75.51ms │     no change │
│ QQuery 6     │  20.83ms │              20.81ms │     no change │
│ QQuery 7     │  87.12ms │              90.26ms │     no change │
│ QQuery 8     │  75.97ms │              75.86ms │     no change │
│ QQuery 9     │ 111.84ms │             111.82ms │     no change │
│ QQuery 10    │ 101.88ms │             101.16ms │     no change │
│ QQuery 11    │  41.21ms │              40.45ms │     no change │
│ QQuery 12    │  53.79ms │              53.63ms │     no change │
│ QQuery 13    │ 120.02ms │             123.73ms │     no change │
│ QQuery 14    │  37.49ms │              38.08ms │     no change │
│ QQuery 15    │  45.55ms │              46.17ms │     no change │
│ QQuery 16    │  36.87ms │              33.76ms │ +1.09x faster │
│ QQuery 17    │  99.41ms │             100.12ms │     no change │
│ QQuery 18    │ 146.85ms │             149.60ms │     no change │
│ QQuery 19    │  65.43ms │              65.90ms │     no change │
│ QQuery 20    │  61.97ms │              58.57ms │ +1.06x faster │
│ QQuery 21    │ 121.33ms │             122.14ms │     no change │
│ QQuery 22    │  34.58ms │              34.67ms │     no change │
└──────────────┴──────────┴──────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                   ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                   │ 1612.37ms │
│ Total Time (vectorized_aggregate)   │ 1612.63ms │
│ Average Time (main)                 │   73.29ms │
│ Average Time (vectorized_aggregate) │   73.30ms │
│ Queries Faster                      │         2 │
│ Queries Slower                      │         0 │
│ Queries with No Change              │        20 │
└─────────────────────────────────────┴───────────┘

Dandandan avatar Nov 01 '24 10:11 Dandandan

Is it possible to somehow unify GroupValuesColumn and VectorizedGroupValuesColumn ?

🤔I think It can unify simply, VectorizedGroupValuesColumn::scalarized_intern is similar as GroupValuesColumn::intern. But its logic is much more complex, I am afraid performance regression of streaming aggregation.

The alternative is that we support a dedicated intern in VectorizedGroupValuesColumn which is totally same as GroupValuesColumn::intern. It will not so hard to do it, because GroupValuesColumn::intern can be seen as a simpler version of VectorizedGroupValuesColumn::scalarized_intern.

🤔 I personally prefer the second one? What do you think about it @alamb ?

Rachelint avatar Nov 02 '24 13:11 Rachelint

🤔 I personally prefer the second one? What do you think about it @alamb ?

I think this makes sense -- thank you

alamb avatar Nov 03 '24 12:11 alamb

BTW I think this code is fairly well covered by the aggregate fuzz tester (also added by @Rachelint :))

Also, @LeslieKid is adding additional data type coverage which is great: https://github.com/apache/datafusion/pull/13226

cargo test --test fuzz -- aggregate

alamb avatar Nov 03 '24 12:11 alamb

🤔 I personally prefer the second one? What do you think about it @alamb ?

I think this makes sense -- thank you

Have unified the VectorizedGroupValuesColumn and GroupValuesColumn through the way mentioned in https://github.com/apache/datafusion/pull/12996#issuecomment-2452991852

Rachelint avatar Nov 04 '24 15:11 Rachelint

This is top of my list to review tomorrow morning

alamb avatar Nov 04 '24 21:11 alamb

This is top of my list to review tomorrow morning

I am sorry -- I am just finding other PRs like https://github.com/apache/datafusion/pull/12978 and https://github.com/apache/datafusion/pull/13133 very subtle and take a long time to review (aka write tests for / help make sure they are still correct)

alamb avatar Nov 04 '24 22:11 alamb

I am giving this a final review now

alamb avatar Nov 05 '24 18:11 alamb

Performance results:

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.30ms │                 2.33ms │     no change │
│ QQuery 1     │    40.32ms │                40.38ms │     no change │
│ QQuery 2     │    96.98ms │                97.06ms │     no change │
│ QQuery 3     │   106.77ms │               108.36ms │     no change │
│ QQuery 4     │   912.91ms │               923.46ms │     no change │
│ QQuery 5     │   957.04ms │               943.27ms │     no change │
│ QQuery 6     │    36.29ms │                35.65ms │     no change │
│ QQuery 7     │    44.35ms │                44.05ms │     no change │
│ QQuery 8     │  1374.70ms │              1026.74ms │ +1.34x faster │
│ QQuery 9     │  1349.37ms │              1354.92ms │     no change │
│ QQuery 10    │   308.40ms │               287.66ms │ +1.07x faster │
│ QQuery 11    │   358.62ms │               321.64ms │ +1.11x faster │
│ QQuery 12    │  1003.91ms │               981.15ms │     no change │
│ QQuery 13    │  1542.79ms │              1470.92ms │     no change │
│ QQuery 14    │  1076.66ms │               913.23ms │ +1.18x faster │
│ QQuery 15    │  1080.68ms │              1107.04ms │     no change │
│ QQuery 16    │  2434.58ms │              1986.02ms │ +1.23x faster │
│ QQuery 17    │  2243.82ms │              1854.36ms │ +1.21x faster │
│ QQuery 18    │  5145.29ms │              4294.07ms │ +1.20x faster │
│ QQuery 19    │    98.01ms │               100.58ms │     no change │
│ QQuery 20    │  1259.01ms │              1273.34ms │     no change │
│ QQuery 21    │  1524.57ms │              1495.15ms │     no change │
│ QQuery 22    │  2711.65ms │              2661.01ms │     no change │
│ QQuery 23    │  8991.12ms │              8565.66ms │     no change │
│ QQuery 24    │   521.71ms │               515.62ms │     no change │
│ QQuery 25    │   434.70ms │               423.71ms │     no change │
│ QQuery 26    │   594.60ms │               584.15ms │     no change │
│ QQuery 27    │  1884.39ms │              1857.91ms │     no change │
│ QQuery 28    │ 12978.56ms │             13103.89ms │     no change │
│ QQuery 29    │   530.69ms │               538.63ms │     no change │
│ QQuery 30    │  1023.13ms │               897.27ms │ +1.14x faster │
│ QQuery 31    │  1044.11ms │               956.21ms │ +1.09x faster │
│ QQuery 32    │  4300.17ms │              4064.21ms │ +1.06x faster │
│ QQuery 33    │  4063.10ms │              4043.20ms │     no change │
│ QQuery 34    │  4084.94ms │              4073.62ms │     no change │
│ QQuery 35    │  1926.27ms │              1355.58ms │ +1.42x faster │
│ QQuery 36    │   239.44ms │               231.08ms │     no change │
│ QQuery 37    │    96.47ms │                97.42ms │     no change │
│ QQuery 38    │   140.95ms │               142.52ms │     no change │
│ QQuery 39    │   513.28ms │               443.86ms │ +1.16x faster │
│ QQuery 40    │    57.47ms │                55.78ms │     no change │
│ QQuery 41    │    48.74ms │                50.94ms │     no change │
│ QQuery 42    │    62.29ms │                63.98ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                │ 69245.17ms │
│ Total Time (vectorize-append-value)   │ 65387.65ms │
│ Average Time (main_base)              │  1610.35ms │
│ Average Time (vectorize-append-value) │  1520.64ms │
│ Queries Faster                        │         12 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │         31 │
└───────────────────────────────────────┴────────────┘

--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2783.88ms │              2821.00ms │     no change │
│ QQuery 1     │   690.58ms │               679.33ms │     no change │
│ QQuery 2     │  1435.85ms │              1364.87ms │     no change │
│ QQuery 3     │   781.53ms │               708.00ms │ +1.10x faster │
│ QQuery 4     │ 12395.12ms │             12441.79ms │     no change │
│ QQuery 5     │ 19443.67ms │             19077.87ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                │ 37530.62ms │
│ Total Time (vectorize-append-value)   │ 37092.86ms │
│ Average Time (main_base)              │  6255.10ms │
│ Average Time (vectorize-append-value) │  6182.14ms │
│ Queries Faster                        │          1 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          5 │
└───────────────────────────────────────┴────────────┘

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ vectorize-append-value ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  227.03ms │               223.21ms │     no change │
│ QQuery 2     │  117.37ms │               118.32ms │     no change │
│ QQuery 3     │  131.23ms │               112.98ms │ +1.16x faster │
│ QQuery 4     │   80.11ms │                82.46ms │     no change │
│ QQuery 5     │  161.69ms │               157.51ms │     no change │
│ QQuery 6     │   43.57ms │                43.28ms │     no change │
│ QQuery 7     │  208.38ms │               195.13ms │ +1.07x faster │
│ QQuery 8     │  168.85ms │               163.41ms │     no change │
│ QQuery 9     │  246.40ms │               245.05ms │     no change │
│ QQuery 10    │  203.72ms │               205.29ms │     no change │
│ QQuery 11    │   94.72ms │                92.35ms │     no change │
│ QQuery 12    │  100.17ms │               115.46ms │  1.15x slower │
│ QQuery 13    │  212.30ms │               208.76ms │     no change │
│ QQuery 14    │   83.73ms │                70.51ms │ +1.19x faster │
│ QQuery 15    │  104.72ms │               112.26ms │  1.07x slower │
│ QQuery 16    │   72.68ms │                69.40ms │     no change │
│ QQuery 17    │  202.88ms │               208.88ms │     no change │
│ QQuery 18    │  309.89ms │               322.50ms │     no change │
│ QQuery 19    │  121.13ms │               118.12ms │     no change │
│ QQuery 20    │  139.03ms │               122.02ms │ +1.14x faster │
│ QQuery 21    │  260.01ms │               253.21ms │     no change │
│ QQuery 22    │   67.71ms │                67.18ms │     no change │
└──────────────┴───────────┴────────────────────────┴───────────────┘

🚀

alamb avatar Nov 05 '24 19:11 alamb

As my admittedly sparse help for this PR I have filed some additional tickets for follow on work after this PR is merged:

  • https://github.com/apache/datafusion/issues/13262
  • https://github.com/apache/datafusion/issues/13263

alamb avatar Nov 05 '24 19:11 alamb

I don't think we need to wait on this PR anymore, let's merge it in and keep moving forward. Thank you everyone again!

alamb avatar Nov 06 '24 16:11 alamb

Update here is that this is looking like it results in some sweet clickbench improvements:

  • https://github.com/apache/datafusion/issues/13983#issuecomment-2632053433

alamb avatar Feb 04 '25 11:02 alamb