datafusion
datafusion copied to clipboard
POC: Vectorized hashtable for aggregation
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
The latest benchmark numbers:
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ vectorize-append-value ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.67ms │ 0.68ms │ no change │
│ QQuery 1 │ 67.01ms │ 65.25ms │ no change │
│ QQuery 2 │ 165.14ms │ 157.75ms │ no change │
│ QQuery 3 │ 181.43ms │ 181.83ms │ no change │
│ QQuery 4 │ 1566.65ms │ 1574.95ms │ no change │
│ QQuery 5 │ 1539.79ms │ 1532.81ms │ no change │
│ QQuery 6 │ 61.01ms │ 57.01ms │ +1.07x faster │
│ QQuery 7 │ 77.09ms │ 73.02ms │ +1.06x faster │
│ QQuery 8 │ 1971.64ms │ 1762.88ms │ +1.12x faster │
│ QQuery 9 │ 1921.59ms │ 1903.47ms │ no change │
│ QQuery 10 │ 516.35ms │ 499.35ms │ no change │
│ QQuery 11 │ 590.99ms │ 556.80ms │ +1.06x faster │
│ QQuery 12 │ 1814.14ms │ 1816.26ms │ no change │
│ QQuery 13 │ 2956.07ms │ 2954.48ms │ no change │
│ QQuery 14 │ 2054.42ms │ 1940.82ms │ +1.06x faster │
│ QQuery 15 │ 1899.87ms │ 1873.73ms │ no change │
│ QQuery 16 │ 4066.16ms │ 3744.25ms │ +1.09x faster │
│ QQuery 17 │ 3629.16ms │ 3428.06ms │ +1.06x faster │
│ QQuery 18 │ 8282.13ms │ 7646.27ms │ +1.08x faster │
│ QQuery 19 │ 144.20ms │ 146.30ms │ no change │
│ QQuery 20 │ 3222.65ms │ 3224.85ms │ no change │
│ QQuery 21 │ 3924.86ms │ 3913.65ms │ no change │
│ QQuery 22 │ 9144.86ms │ 9022.44ms │ no change │
│ QQuery 23 │ 23875.41ms │ 23664.41ms │ no change │
│ QQuery 24 │ 1123.53ms │ 1132.05ms │ no change │
│ QQuery 25 │ 1011.03ms │ 1002.87ms │ no change │
│ QQuery 26 │ 1326.71ms │ 1319.49ms │ no change │
│ QQuery 27 │ 4666.49ms │ 4662.07ms │ no change │
│ QQuery 28 │ 24069.75ms │ 24145.85ms │ no change │
│ QQuery 29 │ 902.07ms │ 890.73ms │ no change │
│ QQuery 30 │ 1813.79ms │ 1722.40ms │ +1.05x faster │
│ QQuery 31 │ 2008.03ms │ 1977.28ms │ no change │
│ QQuery 32 │ 7369.56ms │ 7601.38ms │ no change │
│ QQuery 33 │ 9752.79ms │ 9742.50ms │ no change │
│ QQuery 34 │ 9716.57ms │ 9696.95ms │ no change │
│ QQuery 35 │ 2760.71ms │ 2244.23ms │ +1.23x faster │
│ QQuery 36 │ 255.12ms │ 241.01ms │ +1.06x faster │
│ QQuery 37 │ 158.70ms │ 154.80ms │ no change │
│ QQuery 38 │ 155.15ms │ 153.09ms │ no change │
│ QQuery 39 │ 595.64ms │ 587.48ms │ no change │
│ QQuery 40 │ 57.09ms │ 60.69ms │ 1.06x slower │
│ QQuery 41 │ 53.32ms │ 52.81ms │ no change │
│ QQuery 42 │ 65.53ms │ 65.13ms │ no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
The logic is a bit complex, I plan to finish and do benchmark for it today.
It is happy to see that the vectorized approach is promising!
The rest work is to find why it get q28 slower and fix tests.
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ vectorize-append-value ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.78ms │ 0.66ms │ +1.18x faster │
│ QQuery 1 │ 68.16ms │ 67.25ms │ no change │
│ QQuery 2 │ 161.83ms │ 161.80ms │ no change │
│ QQuery 3 │ 186.20ms │ 179.59ms │ no change │
│ QQuery 4 │ 1581.81ms │ 1601.15ms │ no change │
│ QQuery 5 │ 1556.95ms │ 1504.38ms │ no change │
│ QQuery 6 │ 59.36ms │ 58.69ms │ no change │
│ QQuery 7 │ 79.29ms │ 75.72ms │ no change │
│ QQuery 8 │ 1945.74ms │ 1730.29ms │ +1.12x faster │
│ QQuery 9 │ 1958.91ms │ 1912.77ms │ no change │
│ QQuery 10 │ 519.94ms │ 496.55ms │ no change │
│ QQuery 11 │ 592.06ms │ 573.94ms │ no change │
│ QQuery 12 │ 1834.06ms │ 1818.58ms │ no change │
│ QQuery 13 │ 2963.09ms │ 2937.11ms │ no change │
│ QQuery 14 │ 2053.23ms │ 1959.00ms │ no change │
│ QQuery 15 │ 1902.87ms │ 1873.43ms │ no change │
│ QQuery 16 │ 4042.73ms │ 3721.92ms │ +1.09x faster │
│ QQuery 17 │ 3649.27ms │ 3413.73ms │ +1.07x faster │
│ QQuery 18 │ 8286.60ms │ 7555.43ms │ +1.10x faster │
│ QQuery 19 │ 148.06ms │ 143.84ms │ no change │
│ QQuery 20 │ 3241.10ms │ 3225.91ms │ no change │
│ QQuery 21 │ 3879.00ms │ 3893.55ms │ no change │
│ QQuery 22 │ 9109.29ms │ 9306.23ms │ no change │
│ QQuery 23 │ 23670.14ms │ 23590.65ms │ no change │
│ QQuery 24 │ 1166.75ms │ 1126.80ms │ no change │
│ QQuery 25 │ 998.82ms │ 1006.37ms │ no change │
│ QQuery 26 │ 1320.64ms │ 1307.47ms │ no change │
│ QQuery 27 │ 4772.51ms │ 4675.67ms │ no change │
│ QQuery 28 │ 22374.44ms │ 24834.06ms │ 1.11x slower │
│ QQuery 29 │ 902.94ms │ 896.11ms │ no change │
│ QQuery 30 │ 1829.63ms │ 1739.30ms │ no change │
│ QQuery 31 │ 2059.05ms │ 1951.76ms │ +1.05x faster │
│ QQuery 32 │ 7460.76ms │ 7609.67ms │ no change │
│ QQuery 33 │ 9834.20ms │ 9681.51ms │ no change │
│ QQuery 34 │ 9759.46ms │ 9556.88ms │ no change │
│ QQuery 35 │ 2786.28ms │ 2258.62ms │ +1.23x faster │
│ QQuery 36 │ 257.69ms │ 250.87ms │ no change │
│ QQuery 37 │ 155.01ms │ 151.16ms │ no change │
│ QQuery 38 │ 154.50ms │ 147.32ms │ no change │
│ QQuery 39 │ 635.06ms │ 596.76ms │ +1.06x faster │
│ QQuery 40 │ 60.37ms │ 59.58ms │ no change │
│ QQuery 41 │ 53.82ms │ 53.27ms │ no change │
│ QQuery 42 │ 66.93ms │ 63.49ms │ +1.05x faster │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main) │ 140139.34ms │
│ Total Time (vectorize-append-value) │ 139768.84ms │
│ Average Time (main) │ 3259.05ms │
│ Average Time (vectorize-append-value) │ 3250.44ms │
│ Queries Faster │ 9 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 33 │
└───────────────────────────────────────┴─────────────┘
I profile to get flamegraph of q28 and found it actually not go the path in this pr.
The slower may be due to some enviorment reason in my local, I am re-running the bench on main now.
The last thing is to add some unit tests to cover some conercases found during debugging.
I think this pr is ready now!
Nice -- I am running some benchmarks on this PR
🤩
Comparing main_base and vectorize-append-value
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ vectorize-append-value ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.69ms │ 0.67ms │ no change │
│ QQuery 1 │ 69.53ms │ 69.81ms │ no change │
│ QQuery 2 │ 125.56ms │ 125.06ms │ no change │
│ QQuery 3 │ 134.49ms │ 133.53ms │ no change │
│ QQuery 4 │ 986.06ms │ 1002.09ms │ no change │
│ QQuery 5 │ 1092.56ms │ 1111.32ms │ no change │
│ QQuery 6 │ 67.90ms │ 66.92ms │ no change │
│ QQuery 7 │ 82.12ms │ 80.79ms │ no change │
│ QQuery 8 │ 1381.07ms │ 1055.63ms │ +1.31x faster │
│ QQuery 9 │ 1376.13ms │ 1371.51ms │ no change │
│ QQuery 10 │ 466.36ms │ 435.31ms │ +1.07x faster │
│ QQuery 11 │ 508.85ms │ 474.68ms │ +1.07x faster │
│ QQuery 12 │ 1245.08ms │ 1216.74ms │ no change │
│ QQuery 13 │ 1919.50ms │ 1746.68ms │ +1.10x faster │
│ QQuery 14 │ 1371.00ms │ 1238.46ms │ +1.11x faster │
│ QQuery 15 │ 1156.13ms │ 1123.78ms │ no change │
│ QQuery 16 │ 2593.20ms │ 2187.34ms │ +1.19x faster │
│ QQuery 17 │ 2374.51ms │ 2008.88ms │ +1.18x faster │
│ QQuery 18 │ 5049.66ms │ 4169.48ms │ +1.21x faster │
│ QQuery 19 │ 125.17ms │ 119.76ms │ no change │
│ QQuery 20 │ 1650.50ms │ 1642.21ms │ no change │
│ QQuery 21 │ 2100.16ms │ 2082.53ms │ no change │
│ QQuery 22 │ 5039.82ms │ 4996.73ms │ no change │
│ QQuery 23 │ 11902.50ms │ 11835.25ms │ no change │
│ QQuery 24 │ 810.90ms │ 797.78ms │ no change │
│ QQuery 25 │ 728.72ms │ 699.22ms │ no change │
│ QQuery 26 │ 868.88ms │ 891.65ms │ no change │
│ QQuery 27 │ 2627.37ms │ 2600.82ms │ no change │
│ QQuery 28 │ 15172.21ms │ 15297.59ms │ no change │
│ QQuery 29 │ 572.83ms │ 571.91ms │ no change │
│ QQuery 30 │ 1239.14ms │ 1141.57ms │ +1.09x faster │
│ QQuery 31 │ 1277.14ms │ 1206.06ms │ +1.06x faster │
│ QQuery 32 │ 4198.97ms │ 3978.68ms │ +1.06x faster │
│ QQuery 33 │ 5184.58ms │ 5281.99ms │ no change │
│ QQuery 34 │ 5334.25ms │ 5234.23ms │ no change │
│ QQuery 35 │ 1799.37ms │ 1351.66ms │ +1.33x faster │
│ QQuery 36 │ 319.76ms │ 312.16ms │ no change │
│ QQuery 37 │ 220.55ms │ 220.76ms │ no change │
│ QQuery 38 │ 197.20ms │ 195.72ms │ no change │
│ QQuery 39 │ 810.69ms │ 732.04ms │ +1.11x faster │
│ QQuery 40 │ 91.81ms │ 84.87ms │ +1.08x faster │
│ QQuery 41 │ 81.72ms │ 77.45ms │ +1.06x faster │
│ QQuery 42 │ 94.61ms │ 94.34ms │ no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base) │ 84449.27ms │
│ Total Time (vectorize-append-value) │ 81065.67ms │
│ Average Time (main_base) │ 1963.94ms │
│ Average Time (vectorize-append-value) │ 1885.25ms │
│ Queries Faster │ 15 │
│ Queries Slower │ 0 │
│ Queries with No Change │ 28 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ vectorize-append-value ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 2736.57ms │ 2712.16ms │ no change │
│ QQuery 1 │ 772.15ms │ 769.86ms │ no change │
│ QQuery 2 │ 1624.12ms │ 1639.77ms │ no change │
│ QQuery 3 │ 827.73ms │ 694.66ms │ +1.19x faster │
│ QQuery 4 │ 12716.33ms │ 12364.86ms │ no change │
│ QQuery 5 │ 19095.23ms │ 19058.86ms │ no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base) │ 37772.14ms │
│ Total Time (vectorize-append-value) │ 37240.16ms │
│ Average Time (main_base) │ 6295.36ms │
│ Average Time (vectorize-append-value) │ 6206.69ms │
│ Queries Faster │ 1 │
│ Queries Slower │ 0 │
│ Queries with No Change │ 5 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ vectorize-append-value ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 2.25ms │ 2.27ms │ no change │
│ QQuery 1 │ 40.78ms │ 38.83ms │ no change │
│ QQuery 2 │ 98.26ms │ 95.76ms │ no change │
│ QQuery 3 │ 108.58ms │ 106.87ms │ no change │
│ QQuery 4 │ 922.34ms │ 917.62ms │ no change │
│ QQuery 5 │ 980.37ms │ 938.35ms │ no change │
│ QQuery 6 │ 38.42ms │ 39.18ms │ no change │
│ QQuery 7 │ 43.00ms │ 43.24ms │ no change │
│ QQuery 8 │ 1359.05ms │ 1009.35ms │ +1.35x faster │
│ QQuery 9 │ 1348.60ms │ 1323.55ms │ no change │
│ QQuery 10 │ 424.82ms │ 406.74ms │ no change │
│ QQuery 11 │ 475.00ms │ 451.86ms │ no change │
│ QQuery 12 │ 1090.45ms │ 1106.96ms │ no change │
│ QQuery 13 │ 1577.54ms │ 1508.58ms │ no change │
│ QQuery 14 │ 1259.77ms │ 1086.88ms │ +1.16x faster │
│ QQuery 15 │ 1086.47ms │ 1081.46ms │ no change │
│ QQuery 16 │ 2449.02ms │ 2030.13ms │ +1.21x faster │
│ QQuery 17 │ 2329.89ms │ 1911.21ms │ +1.22x faster │
│ QQuery 18 │ 5004.89ms │ 3989.09ms │ +1.25x faster │
│ QQuery 19 │ 97.70ms │ 98.64ms │ no change │
│ QQuery 20 │ 1474.31ms │ 1478.79ms │ no change │
│ QQuery 21 │ 1808.23ms │ 1810.91ms │ no change │
│ QQuery 22 │ 3132.66ms │ 3128.20ms │ no change │
│ QQuery 23 │ 10113.15ms │ 10129.78ms │ no change │
│ QQuery 24 │ 656.18ms │ 651.62ms │ no change │
│ QQuery 25 │ 518.58ms │ 524.60ms │ no change │
│ QQuery 26 │ 719.15ms │ 725.11ms │ no change │
│ QQuery 27 │ 2266.93ms │ 2264.89ms │ no change │
│ QQuery 28 │ 14304.45ms │ 14598.41ms │ no change │
│ QQuery 29 │ 542.95ms │ 540.96ms │ no change │
│ QQuery 30 │ 1091.63ms │ 1000.47ms │ +1.09x faster │
│ QQuery 31 │ 1140.61ms │ 1070.64ms │ +1.07x faster │
│ QQuery 32 │ 4198.09ms │ 3928.13ms │ +1.07x faster │
│ QQuery 33 │ 5178.24ms │ 5025.69ms │ no change │
│ QQuery 34 │ 5065.45ms │ 5110.93ms │ no change │
│ QQuery 35 │ 1869.61ms │ 1300.99ms │ +1.44x faster │
│ QQuery 36 │ 274.68ms │ 270.33ms │ no change │
│ QQuery 37 │ 125.61ms │ 125.59ms │ no change │
│ QQuery 38 │ 150.35ms │ 144.38ms │ no change │
│ QQuery 39 │ 732.00ms │ 689.60ms │ +1.06x faster │
│ QQuery 40 │ 57.04ms │ 60.17ms │ 1.05x slower │
│ QQuery 41 │ 47.52ms │ 47.56ms │ no change │
│ QQuery 42 │ 65.98ms │ 64.11ms │ no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base) │ 76270.60ms │
│ Total Time (vectorize-append-value) │ 72878.40ms │
│ Average Time (main_base) │ 1773.73ms │
│ Average Time (vectorize-append-value) │ 1694.85ms │
│ Queries Faster │ 10 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 32 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ vectorize-append-value ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1 │ 203.49ms │ 201.33ms │ no change │
│ QQuery 2 │ 115.50ms │ 120.92ms │ no change │
│ QQuery 3 │ 129.08ms │ 119.50ms │ +1.08x faster │
│ QQuery 4 │ 87.14ms │ 83.75ms │ no change │
│ QQuery 5 │ 161.47ms │ 159.55ms │ no change │
│ QQuery 6 │ 48.73ms │ 42.26ms │ +1.15x faster │
│ QQuery 7 │ 208.57ms │ 200.37ms │ no change │
│ QQuery 8 │ 157.16ms │ 153.09ms │ no change │
│ QQuery 9 │ 244.82ms │ 247.43ms │ no change │
│ QQuery 10 │ 226.83ms │ 238.14ms │ no change │
│ QQuery 11 │ 86.91ms │ 91.82ms │ 1.06x slower │
│ QQuery 12 │ 128.79ms │ 128.85ms │ no change │
│ QQuery 13 │ 225.47ms │ 219.07ms │ no change │
│ QQuery 14 │ 72.15ms │ 89.80ms │ 1.24x slower │
│ QQuery 15 │ 115.99ms │ 107.62ms │ +1.08x faster │
│ QQuery 16 │ 80.43ms │ 74.99ms │ +1.07x faster │
│ QQuery 17 │ 218.58ms │ 218.94ms │ no change │
│ QQuery 18 │ 329.54ms │ 321.20ms │ no change │
│ QQuery 19 │ 131.86ms │ 129.84ms │ no change │
│ QQuery 20 │ 136.33ms │ 121.60ms │ +1.12x faster │
│ QQuery 21 │ 272.83ms │ 283.72ms │ no change │
│ QQuery 22 │ 66.11ms │ 63.99ms │ no change │
└──────────────┴───────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base) │ 3447.78ms │
│ Total Time (vectorize-append-value) │ 3417.79ms │
│ Average Time (main_base) │ 156.72ms │
│ Average Time (vectorize-append-value) │ 155.35ms │
│ Queries Faster │ 5 │
│ Queries Slower │ 2 │
│ Queries with No Change │ 15 │
└───────────────────────────────────────┴───────────┘
Wow very nice @Rachelint -- I will plan on checking this PR out in more detail tomorrow
Amazing!!!
1.24x slower
Worth profiling / checking this regression before merging I think
Worth profiling / checking this regression before merging I think
For my own future self, it is this tpch query:
│ QQuery 14 │ 72.15ms │ 89.80ms │ 1.24x slower │
I think just profiling what is going on would be helpful. I can maybe help over the next few days if no one else has a chance . Here is an example of the kind of ticket we could potentially file: https://github.com/apache/datafusion/issues/13188
Parquet read dominates 🤔
Amazing!!!
1.24x slower
Worth profiling / checking this regression before merging I think
🤔 I plan to try to check it today or tomorrow, I suspect if it is the low cardinality case that many repeated rows exist in the input cols?
Tpch q14 doesn't seem to run through the change of this PR -- groupBy is empty in AggregateExec , I also doesn't see any print out in VectorizedGroupValuesColumn. I think this change is not the reason of slowdown 🤔
query TT
explain select
100.00 * sum(case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate >= date '1995-09-01'
and l_shipdate < date '1995-10-01';
----
logical_plan
01)Projection: Float64(100) * CAST(sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END) AS Float64) / CAST(sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount) AS Float64) AS promo_revenue
02)--Aggregate: groupBy=[[]], aggr=[[sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN __common_expr_1 ELSE Decimal128(Some(0),38,4) END) AS sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END), sum(__common_expr_1) AS sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)]]
03)----Projection: lineitem.l_extendedprice * (Decimal128(Some(1),20,0) - lineitem.l_discount) AS __common_expr_1, part.p_type
04)------Inner Join: lineitem.l_partkey = part.p_partkey
05)--------Projection: lineitem.l_partkey, lineitem.l_extendedprice, lineitem.l_discount
06)----------Filter: lineitem.l_shipdate >= Date32("1995-09-01") AND lineitem.l_shipdate < Date32("1995-10-01")
07)------------TableScan: lineitem projection=[l_partkey, l_extendedprice, l_discount, l_shipdate], partial_filters=[lineitem.l_shipdate >= Date32("1995-09-01"), lineitem.l_shipdate < Date32("1995-10-01")]
08)--------TableScan: part projection=[p_partkey, p_type]
physical_plan
01)ProjectionExec: expr=[100 * CAST(sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END)@0 AS Float64) / CAST(sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@1 AS Float64) as promo_revenue]
02)--AggregateExec: mode=Final, gby=[], aggr=[sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)]
05)--------ProjectionExec: expr=[l_extendedprice@0 * (Some(1),20,0 - l_discount@1) as __common_expr_1, p_type@2 as p_type]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------HashJoinExec: mode=Partitioned, join_type=Inner, on=[(l_partkey@0, p_partkey@0)], projection=[l_extendedprice@1, l_discount@2, p_type@4]
08)--------------CoalesceBatchesExec: target_batch_size=8192
09)----------------RepartitionExec: partitioning=Hash([l_partkey@0], 4), input_partitions=4
10)------------------CoalesceBatchesExec: target_batch_size=8192
11)--------------------FilterExec: l_shipdate@3 >= 1995-09-01 AND l_shipdate@3 < 1995-10-01, projection=[l_partkey@0, l_extendedprice@1, l_discount@2]
12)----------------------CsvExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/lineitem.tbl:0..18561749], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/lineitem.tbl:18561749..37123498], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/lineitem.tbl:37123498..55685247], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/lineitem.tbl:55685247..74246996]]}, projection=[l_partkey, l_extendedprice, l_discount, l_shipdate], has_header=false
13)--------------CoalesceBatchesExec: target_batch_size=8192
14)----------------RepartitionExec: partitioning=Hash([p_partkey@0], 4), input_partitions=4
15)------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
16)--------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/part.tbl]]}, projection=[p_partkey, p_type], has_header=false
Hm maybe some improvement landed on main not yet on this branch? Maybe rebase and profile again?
My results after merging main show no regressions (tpc-h basically has no change) 🚀
Comparing main and vectorized_aggregate
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ vectorized_aggregate ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.33ms │ 0.39ms │ 1.17x slower │
│ QQuery 1 │ 35.24ms │ 35.03ms │ no change │
│ QQuery 2 │ 82.54ms │ 83.07ms │ no change │
│ QQuery 3 │ 76.90ms │ 65.62ms │ +1.17x faster │
│ QQuery 4 │ 521.48ms │ 491.44ms │ +1.06x faster │
│ QQuery 5 │ 809.76ms │ 768.98ms │ +1.05x faster │
│ QQuery 6 │ 33.27ms │ 32.65ms │ no change │
│ QQuery 7 │ 39.26ms │ 39.52ms │ no change │
│ QQuery 8 │ 766.34ms │ 568.72ms │ +1.35x faster │
│ QQuery 9 │ 821.86ms │ 823.16ms │ no change │
│ QQuery 10 │ 217.90ms │ 207.96ms │ no change │
│ QQuery 11 │ 247.04ms │ 230.55ms │ +1.07x faster │
│ QQuery 12 │ 839.06ms │ 814.24ms │ no change │
│ QQuery 13 │ 1123.51ms │ 1102.40ms │ no change │
│ QQuery 14 │ 874.87ms │ 771.94ms │ +1.13x faster │
│ QQuery 15 │ 607.72ms │ 615.06ms │ no change │
│ QQuery 16 │ 1573.81ms │ 1332.64ms │ +1.18x faster │
│ QQuery 17 │ 1454.94ms │ 1220.57ms │ +1.19x faster │
│ QQuery 18 │ 3348.40ms │ 2936.64ms │ +1.14x faster │
│ QQuery 19 │ 59.96ms │ 59.11ms │ no change │
│ QQuery 20 │ 943.24ms │ 964.36ms │ no change │
│ QQuery 21 │ 1291.59ms │ 1299.99ms │ no change │
│ QQuery 22 │ 3798.86ms │ 3812.87ms │ no change │
│ QQuery 23 │ 8083.11ms │ 7952.95ms │ no change │
│ QQuery 24 │ 553.23ms │ 554.28ms │ no change │
│ QQuery 25 │ 530.46ms │ 534.44ms │ no change │
│ QQuery 26 │ 610.97ms │ 606.59ms │ no change │
│ QQuery 27 │ 1558.28ms │ 1584.92ms │ no change │
│ QQuery 28 │ 12837.43ms │ 12852.93ms │ no change │
│ QQuery 29 │ 448.12ms │ 446.98ms │ no change │
│ QQuery 30 │ 824.09ms │ 756.85ms │ +1.09x faster │
│ QQuery 31 │ 796.62ms │ 752.91ms │ +1.06x faster │
│ QQuery 32 │ 3502.80ms │ 2829.34ms │ +1.24x faster │
│ QQuery 33 │ 2775.26ms │ 2729.67ms │ no change │
│ QQuery 34 │ 2748.59ms │ 2738.52ms │ no change │
│ QQuery 35 │ 1194.59ms │ 840.66ms │ +1.42x faster │
│ QQuery 36 │ 150.59ms │ 146.20ms │ no change │
│ QQuery 37 │ 102.54ms │ 104.18ms │ no change │
│ QQuery 38 │ 105.60ms │ 104.55ms │ no change │
│ QQuery 39 │ 279.20ms │ 264.85ms │ +1.05x faster │
│ QQuery 40 │ 34.25ms │ 32.64ms │ no change │
│ QQuery 41 │ 31.88ms │ 31.48ms │ no change │
│ QQuery 42 │ 40.98ms │ 41.15ms │ no change │
└──────────────┴────────────┴──────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 56776.48ms │
│ Total Time (vectorized_aggregate) │ 54183.02ms │
│ Average Time (main) │ 1320.38ms │
│ Average Time (vectorized_aggregate) │ 1260.07ms │
│ Queries Faster │ 14 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 28 │
└─────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ vectorized_aggregate ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1 │ 122.91ms │ 120.39ms │ no change │
│ QQuery 2 │ 55.64ms │ 55.18ms │ no change │
│ QQuery 3 │ 56.02ms │ 54.98ms │ no change │
│ QQuery 4 │ 39.93ms │ 39.85ms │ no change │
│ QQuery 5 │ 75.72ms │ 75.51ms │ no change │
│ QQuery 6 │ 20.83ms │ 20.81ms │ no change │
│ QQuery 7 │ 87.12ms │ 90.26ms │ no change │
│ QQuery 8 │ 75.97ms │ 75.86ms │ no change │
│ QQuery 9 │ 111.84ms │ 111.82ms │ no change │
│ QQuery 10 │ 101.88ms │ 101.16ms │ no change │
│ QQuery 11 │ 41.21ms │ 40.45ms │ no change │
│ QQuery 12 │ 53.79ms │ 53.63ms │ no change │
│ QQuery 13 │ 120.02ms │ 123.73ms │ no change │
│ QQuery 14 │ 37.49ms │ 38.08ms │ no change │
│ QQuery 15 │ 45.55ms │ 46.17ms │ no change │
│ QQuery 16 │ 36.87ms │ 33.76ms │ +1.09x faster │
│ QQuery 17 │ 99.41ms │ 100.12ms │ no change │
│ QQuery 18 │ 146.85ms │ 149.60ms │ no change │
│ QQuery 19 │ 65.43ms │ 65.90ms │ no change │
│ QQuery 20 │ 61.97ms │ 58.57ms │ +1.06x faster │
│ QQuery 21 │ 121.33ms │ 122.14ms │ no change │
│ QQuery 22 │ 34.58ms │ 34.67ms │ no change │
└──────────────┴──────────┴──────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main) │ 1612.37ms │
│ Total Time (vectorized_aggregate) │ 1612.63ms │
│ Average Time (main) │ 73.29ms │
│ Average Time (vectorized_aggregate) │ 73.30ms │
│ Queries Faster │ 2 │
│ Queries Slower │ 0 │
│ Queries with No Change │ 20 │
└─────────────────────────────────────┴───────────┘
Is it possible to somehow unify GroupValuesColumn and VectorizedGroupValuesColumn ?
🤔I think It can unify simply, VectorizedGroupValuesColumn::scalarized_intern is similar as GroupValuesColumn::intern.
But its logic is much more complex, I am afraid performance regression of streaming aggregation.
The alternative is that we support a dedicated intern in VectorizedGroupValuesColumn which is totally same as GroupValuesColumn::intern.
It will not so hard to do it, because GroupValuesColumn::intern can be seen as a simpler version of VectorizedGroupValuesColumn::scalarized_intern.
🤔 I personally prefer the second one? What do you think about it @alamb ?
🤔 I personally prefer the second one? What do you think about it @alamb ?
I think this makes sense -- thank you
BTW I think this code is fairly well covered by the aggregate fuzz tester (also added by @Rachelint :))
Also, @LeslieKid is adding additional data type coverage which is great: https://github.com/apache/datafusion/pull/13226
cargo test --test fuzz -- aggregate
🤔 I personally prefer the second one? What do you think about it @alamb ?
I think this makes sense -- thank you
Have unified the VectorizedGroupValuesColumn and GroupValuesColumn through the way mentioned in https://github.com/apache/datafusion/pull/12996#issuecomment-2452991852
This is top of my list to review tomorrow morning
This is top of my list to review tomorrow morning
I am sorry -- I am just finding other PRs like https://github.com/apache/datafusion/pull/12978 and https://github.com/apache/datafusion/pull/13133 very subtle and take a long time to review (aka write tests for / help make sure they are still correct)
I am giving this a final review now
Performance results:
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ vectorize-append-value ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 2.30ms │ 2.33ms │ no change │
│ QQuery 1 │ 40.32ms │ 40.38ms │ no change │
│ QQuery 2 │ 96.98ms │ 97.06ms │ no change │
│ QQuery 3 │ 106.77ms │ 108.36ms │ no change │
│ QQuery 4 │ 912.91ms │ 923.46ms │ no change │
│ QQuery 5 │ 957.04ms │ 943.27ms │ no change │
│ QQuery 6 │ 36.29ms │ 35.65ms │ no change │
│ QQuery 7 │ 44.35ms │ 44.05ms │ no change │
│ QQuery 8 │ 1374.70ms │ 1026.74ms │ +1.34x faster │
│ QQuery 9 │ 1349.37ms │ 1354.92ms │ no change │
│ QQuery 10 │ 308.40ms │ 287.66ms │ +1.07x faster │
│ QQuery 11 │ 358.62ms │ 321.64ms │ +1.11x faster │
│ QQuery 12 │ 1003.91ms │ 981.15ms │ no change │
│ QQuery 13 │ 1542.79ms │ 1470.92ms │ no change │
│ QQuery 14 │ 1076.66ms │ 913.23ms │ +1.18x faster │
│ QQuery 15 │ 1080.68ms │ 1107.04ms │ no change │
│ QQuery 16 │ 2434.58ms │ 1986.02ms │ +1.23x faster │
│ QQuery 17 │ 2243.82ms │ 1854.36ms │ +1.21x faster │
│ QQuery 18 │ 5145.29ms │ 4294.07ms │ +1.20x faster │
│ QQuery 19 │ 98.01ms │ 100.58ms │ no change │
│ QQuery 20 │ 1259.01ms │ 1273.34ms │ no change │
│ QQuery 21 │ 1524.57ms │ 1495.15ms │ no change │
│ QQuery 22 │ 2711.65ms │ 2661.01ms │ no change │
│ QQuery 23 │ 8991.12ms │ 8565.66ms │ no change │
│ QQuery 24 │ 521.71ms │ 515.62ms │ no change │
│ QQuery 25 │ 434.70ms │ 423.71ms │ no change │
│ QQuery 26 │ 594.60ms │ 584.15ms │ no change │
│ QQuery 27 │ 1884.39ms │ 1857.91ms │ no change │
│ QQuery 28 │ 12978.56ms │ 13103.89ms │ no change │
│ QQuery 29 │ 530.69ms │ 538.63ms │ no change │
│ QQuery 30 │ 1023.13ms │ 897.27ms │ +1.14x faster │
│ QQuery 31 │ 1044.11ms │ 956.21ms │ +1.09x faster │
│ QQuery 32 │ 4300.17ms │ 4064.21ms │ +1.06x faster │
│ QQuery 33 │ 4063.10ms │ 4043.20ms │ no change │
│ QQuery 34 │ 4084.94ms │ 4073.62ms │ no change │
│ QQuery 35 │ 1926.27ms │ 1355.58ms │ +1.42x faster │
│ QQuery 36 │ 239.44ms │ 231.08ms │ no change │
│ QQuery 37 │ 96.47ms │ 97.42ms │ no change │
│ QQuery 38 │ 140.95ms │ 142.52ms │ no change │
│ QQuery 39 │ 513.28ms │ 443.86ms │ +1.16x faster │
│ QQuery 40 │ 57.47ms │ 55.78ms │ no change │
│ QQuery 41 │ 48.74ms │ 50.94ms │ no change │
│ QQuery 42 │ 62.29ms │ 63.98ms │ no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base) │ 69245.17ms │
│ Total Time (vectorize-append-value) │ 65387.65ms │
│ Average Time (main_base) │ 1610.35ms │
│ Average Time (vectorize-append-value) │ 1520.64ms │
│ Queries Faster │ 12 │
│ Queries Slower │ 0 │
│ Queries with No Change │ 31 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ vectorize-append-value ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 2783.88ms │ 2821.00ms │ no change │
│ QQuery 1 │ 690.58ms │ 679.33ms │ no change │
│ QQuery 2 │ 1435.85ms │ 1364.87ms │ no change │
│ QQuery 3 │ 781.53ms │ 708.00ms │ +1.10x faster │
│ QQuery 4 │ 12395.12ms │ 12441.79ms │ no change │
│ QQuery 5 │ 19443.67ms │ 19077.87ms │ no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base) │ 37530.62ms │
│ Total Time (vectorize-append-value) │ 37092.86ms │
│ Average Time (main_base) │ 6255.10ms │
│ Average Time (vectorize-append-value) │ 6182.14ms │
│ Queries Faster │ 1 │
│ Queries Slower │ 0 │
│ Queries with No Change │ 5 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ vectorize-append-value ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1 │ 227.03ms │ 223.21ms │ no change │
│ QQuery 2 │ 117.37ms │ 118.32ms │ no change │
│ QQuery 3 │ 131.23ms │ 112.98ms │ +1.16x faster │
│ QQuery 4 │ 80.11ms │ 82.46ms │ no change │
│ QQuery 5 │ 161.69ms │ 157.51ms │ no change │
│ QQuery 6 │ 43.57ms │ 43.28ms │ no change │
│ QQuery 7 │ 208.38ms │ 195.13ms │ +1.07x faster │
│ QQuery 8 │ 168.85ms │ 163.41ms │ no change │
│ QQuery 9 │ 246.40ms │ 245.05ms │ no change │
│ QQuery 10 │ 203.72ms │ 205.29ms │ no change │
│ QQuery 11 │ 94.72ms │ 92.35ms │ no change │
│ QQuery 12 │ 100.17ms │ 115.46ms │ 1.15x slower │
│ QQuery 13 │ 212.30ms │ 208.76ms │ no change │
│ QQuery 14 │ 83.73ms │ 70.51ms │ +1.19x faster │
│ QQuery 15 │ 104.72ms │ 112.26ms │ 1.07x slower │
│ QQuery 16 │ 72.68ms │ 69.40ms │ no change │
│ QQuery 17 │ 202.88ms │ 208.88ms │ no change │
│ QQuery 18 │ 309.89ms │ 322.50ms │ no change │
│ QQuery 19 │ 121.13ms │ 118.12ms │ no change │
│ QQuery 20 │ 139.03ms │ 122.02ms │ +1.14x faster │
│ QQuery 21 │ 260.01ms │ 253.21ms │ no change │
│ QQuery 22 │ 67.71ms │ 67.18ms │ no change │
└──────────────┴───────────┴────────────────────────┴───────────────┘
🚀
As my admittedly sparse help for this PR I have filed some additional tickets for follow on work after this PR is merged:
- https://github.com/apache/datafusion/issues/13262
- https://github.com/apache/datafusion/issues/13263
I don't think we need to wait on this PR anymore, let's merge it in and keep moving forward. Thank you everyone again!
Update here is that this is looking like it results in some sweet clickbench improvements:
- https://github.com/apache/datafusion/issues/13983#issuecomment-2632053433