datafusion
Draft: Use take-in kernel in repartitioning
Combined with https://github.com/apache/arrow-rs/pull/7325, this PR tries to use the take_in kernel in repartitioning. The goal is to elide the coalesce step after repartitioning.
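To illustrate the idea only: the sketch below is not the code in this PR and does not use the arrow-rs `take_in` API, whose signature is not shown here. It assumes plain `Vec<i64>` values as a stand-in for Arrow arrays, a modulo as a stand-in for the hash function, and the hypothetical names `PartitionBuffers`, `push_batch`, and `finish`. It shows how appending rows directly into per-partition buffers can yield full-size batches without a separate coalesce pass over many small batches.

```rust
// Illustrative only: `PartitionBuffers`, `push_batch`, and `finish` are
// hypothetical names, not DataFusion or arrow-rs APIs. Plain Vec<i64>
// stands in for Arrow arrays.

/// Per-partition output buffers that rows are appended to in place.
struct PartitionBuffers {
    target_batch_size: usize,
    buffers: Vec<Vec<i64>>,
}

impl PartitionBuffers {
    fn new(num_partitions: usize, target_batch_size: usize) -> Self {
        Self {
            target_batch_size,
            buffers: (0..num_partitions)
                .map(|_| Vec::with_capacity(target_batch_size))
                .collect(),
        }
    }

    /// Route each row of `batch` to a partition and append it directly to
    /// that partition's buffer. Returns `(partition, batch)` pairs for every
    /// buffer that reached the target size during this call.
    fn push_batch(&mut self, batch: &[i64]) -> Vec<(usize, Vec<i64>)> {
        let n = self.buffers.len();
        let mut ready = Vec::new();
        for &value in batch {
            // Assumption: modulo stands in for a real hash of the row.
            let p = value.rem_euclid(n as i64) as usize;
            self.buffers[p].push(value);
            if self.buffers[p].len() == self.target_batch_size {
                // Hand out the full buffer and start a fresh one.
                let full = std::mem::replace(
                    &mut self.buffers[p],
                    Vec::with_capacity(self.target_batch_size),
                );
                ready.push((p, full));
            }
        }
        ready
    }

    /// Flush the remaining partial buffers at the end of the input.
    fn finish(self) -> Vec<(usize, Vec<i64>)> {
        self.buffers
            .into_iter()
            .enumerate()
            .filter(|(_, buf)| !buf.is_empty())
            .collect()
    }
}
```

In this sketch, output batches are only emitted once they are full (or at end of input), so a downstream step that merges small batches would have nothing left to do. Whether the real kernel behaves this way for every array type is a separate question.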
Hi @ctsk -- is this PR ready for running some benchmarks?
@alamb This PR should be able to run benchmarks now. I've added overrides to use the modified version of arrow in the PR and a lockfile to avoid chrono issues. At least it can run tpch :)
I am firing up the benchmarks
I tried to run the clickbench queries using bench.sh and I got an error like this:
Q1: SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage") FROM hits;
Query 1 iteration 0 took 760.6 ms and returned 1 rows
Query 1 iteration 1 took 785.4 ms and returned 1 rows
Query 1 iteration 2 took 787.9 ms and returned 1 rows
Query 1 iteration 3 took 775.5 ms and returned 1 rows
Query 1 iteration 4 took 786.6 ms and returned 1 rows
Q2: SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction") FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
thread 'tokio-runtime-worker' panicked at /home/alamb/.cargo/git/checkouts/arrow-rs-583cca34693b79b8/368c1e6/arrow-array/src/builder/mod.rs:509:35:
not yet implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: Context("Join Error", External(JoinError::Panic(Id(2790), "not yet implemented", ...)))
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.