Automatically download tpcds benchmark data to the right place
## Which issue does this PR close?
- Closes https://github.com/apache/datafusion/issues/19243
## Rationale for this change
I want to be able to run the TPC-DS benchmarks added by @comphead as part of my benchmark
automation scripts. To do so I need to be able to run `bench.sh data tpcds` and have
it automatically generate the data if it is not present.
Right now the data generation step is manual:
```shell
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./benchmarks/bench.sh data tpcds
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: tpcds
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
For TPC-DS data generation, please clone the datafusion-benchmarks repository:
git clone https://github.com/apache/datafusion-benchmarks
```
And I think it takes some more post-processing steps (which is what @mbutrovich hit).
## What changes are included in this PR?
- Update the data setup portion of `bench.sh` to automatically download the data files from GitHub and extract them to the correct location
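The idea above can be sketched roughly as follows. This is a minimal illustration only, assuming a "download each table file if it is missing" approach; the `DATA_DIR` layout, the table list, and the download mechanics are placeholders, not the PR's actual values (the real download is stubbed out here with `touch`):

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical layout: where the TPC-DS parquet files should end up
DATA_DIR="${DATA_DIR:-./benchmarks/data}"
TPCDS_DIR="${DATA_DIR}/tpcds_sf1"
mkdir -p "${TPCDS_DIR}"

# Fetch any table file that is missing, so deleting e.g. web_*.parquet
# and re-running the script repairs the data set.
for table in web_site web_returns; do   # abbreviated table list
    file="${TPCDS_DIR}/${table}.parquet"
    if [ ! -f "${file}" ]; then
        echo "downloading ${table}.parquet"
        # A real script would fetch from GitHub, e.g.:
        # curl --fail -L -o "${file}" "${BASE_URL}/${table}.parquet"
        touch "${file}"  # stand-in for the network download in this sketch
    fi
done
```

Running the sketch twice is idempotent: the second run finds the files present and downloads nothing.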
## Are these changes tested?
I tested this manually on my Mac laptop by deleting the data directory and running the script again, and by deleting the `web_*.parquet` files to ensure they are re-downloaded correctly:

```shell
./benchmarks/bench.sh data tpcds
./benchmarks/bench.sh run tpcds
```

I also tested on my benchmark machine (Linux).
## Are there any user-facing changes?
@mbutrovich any chance you can test this via the following?

```shell
./benchmarks/bench.sh data tpcds
./benchmarks/bench.sh run tpcds
```
I also regenerated the TPC-DS data, and the number of queries returning 0 rows went down from 36 to 18: https://github.com/apache/datafusion-benchmarks/pull/25
Considering the independent nature of generating data and queries, I think the queries are sometimes out of sync, but it might also still be okay to have 0 rows, as the computation still happens.
Anyway, I'm planning to modify the DF TPC-DS queries a little bit to see if I can tweak the filters that are exploited during query generation and make them return non-zero data.
Thanks @alamb I think it should work.
I didn't do automatic cloning for benchmarks because:
- the user probably has their own set of data
- depending on the location where the user starts the benchmarking, the download check can give a false positive, and the user will end up with multiple clones of the benchmarks repo on their local machine.
Yeah, this is definitely a real potential issue. One thing we could do is document how to link to an existing checkout, something like this maybe:

```shell
mkdir -p data
ln -s -f $EXISTING_CHECKOUT/data/tpcds/sf1 data/tpcds_sf
```
🤔
Thanks @martin-g and @alamb. We can probably hardcode the path: in most cases people clone the benchmarks repo inside the DF repo, so we can keep that path hardcoded using ${SCRIPTS_DIR} as an anchor; if that path is empty we can fall back to DATA_DIR, and if that fails too, throw an error. I'll try to play with it today.
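A minimal sketch of that lookup order, for discussion only. The candidate directory names below are assumptions about the layout, not the script's real paths; only the `SCRIPTS_DIR` and `DATA_DIR` variable names come from `bench.sh`:

```shell
# Hypothetical fallback: prefer a datafusion-benchmarks clone anchored at
# SCRIPTS_DIR, then DATA_DIR, then fail with an error.
SCRIPTS_DIR="${SCRIPTS_DIR:-$(pwd)}"
DATA_DIR="${DATA_DIR:-$(pwd)/benchmarks/data}"

find_tpcds_data() {
    local candidates=(
        "${SCRIPTS_DIR}/datafusion-benchmarks/data/tpcds/sf1"  # clone inside the DF repo
        "${DATA_DIR}/tpcds_sf1"                                # fallback: the data dir
    )
    local dir
    for dir in "${candidates[@]}"; do
        # A candidate only counts if it exists and is non-empty
        if [ -d "${dir}" ] && [ -n "$(ls -A "${dir}" 2>/dev/null)" ]; then
            echo "${dir}"
            return 0
        fi
    done
    echo "Error: TPC-DS data not found; run bench.sh data tpcds first" >&2
    return 1
}
```

The non-empty check matters: an empty directory left over from a failed download should still trigger the fallback rather than be treated as valid data.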
Thank you for the reviews @comphead and @martin-g
Thanks @martin-g