
Automatically download tpcds benchmark data to the right place

Open alamb opened this issue 1 month ago • 4 comments

Which issue does this PR close?

  • Closes https://github.com/apache/datafusion/issues/19243

Rationale for this change

I want to be able to run the TPC-DS benchmarks added by @comphead as part of my benchmark automation scripts. To do so, I need to be able to run bench.sh data tpcds and have it automatically generate the data if it is not present.

Right now the data generation step is manual.

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./benchmarks/bench.sh data tpcds
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: tpcds
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

For TPC-DS data generation, please clone the datafusion-benchmarks repository:
  git clone https://github.com/apache/datafusion-benchmarks

I think it also takes some more post-processing steps (which is what @mbutrovich hit).

What changes are included in this PR?

  1. Update the data setup portion to automatically download the contents from GitHub and extract them in the correct location
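As a rough illustration of the "download only if not present" behavior described above, here is a minimal sketch. It is not the code from this PR: the function name ensure_tpcds_data, the fetch-command argument, and the per-table file naming (TABLE.parquet) are all assumptions made for the example. Checking each expected file, rather than only the directory, is what lets a partially deleted data set (for example, missing web_*.parquet files) trigger a re-download.

```shell
# Sketch only: ensure_tpcds_data and its arguments are hypothetical names,
# not functions defined in bench.sh.
#
# Usage: ensure_tpcds_data TARGET_DIR FETCH_CMD TABLE...
# Runs FETCH_CMD TARGET_DIR unless every TABLE.parquet already exists.
ensure_tpcds_data() {
    local target_dir
    local fetch_cmd
    local table
    target_dir="$1"
    fetch_cmd="$2"
    shift 2

    mkdir -p "${target_dir}"

    # Look for the first expected file that is missing
    for table in "$@"; do
        if [ ! -e "${target_dir}/${table}.parquet" ]; then
            echo "Missing ${table}.parquet in ${target_dir}, fetching data"
            "${fetch_cmd}" "${target_dir}"
            return $?
        fi
    done

    echo "TPC-DS data already present in ${target_dir}, nothing to do"
}
```

The fetch command is passed in by the caller so that the check itself stays independent of where the data comes from (a download, a local copy, or a generator).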

Are these changes tested?

I tested this manually on my Mac laptop by deleting the data directory and running the script again, and by deleting the web_*.parquet files to ensure they are re-downloaded correctly.

./benchmarks/bench.sh data tpcds
./benchmarks/bench.sh run tpcds

I also tested on my benchmark machine (Linux).

Are there any user-facing changes?

alamb avatar Dec 09 '25 18:12 alamb

@mbutrovich any chance you can test this via:

./benchmarks/bench.sh data tpcds
./benchmarks/bench.sh run tpcds

?

alamb avatar Dec 09 '25 18:12 alamb

I also regenerated the TPC-DS data, and the number of queries returning 0 rows went down from 36 to 18: https://github.com/apache/datafusion-benchmarks/pull/25

Considering that the data and the queries are generated independently, I think the queries are sometimes out of sync with the data, but it might still be okay to have 0 rows, as the computation still happens.

Anyway, I'm planning to modify the DF TPC-DS queries a little bit to see if I can tweak the filters that were used during query generation and make the queries return non-zero data.

comphead avatar Dec 09 '25 18:12 comphead

Thanks @alamb I think it should work.

I didn't do automatic cloning for benchmarks because:

  • the user probably has their own set of data
  • depending on the location where the user starts the benchmarking, there can be a false positive on download, and the user will end up with multiple clones of the benchmarks repo on their local machine.

Yeah, this is definitely a real potential issue. One thing we could do is document how to link to an existing checkout

something like this maybe:

mkdir -p data
ln -s -f $EXISTING_CHECKOUT/data/tpcds/sf1 data/tpcds_sf

🤔
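A slightly more defensive version of that linking idea could look like the sketch below. The function name link_existing_checkout is hypothetical, and the link path is left as an argument rather than assumed; only the data/tpcds/sf1 layout inside the checkout is taken from the command above. The extra -n flag (accepted by both GNU and BSD/macOS ln) keeps a second run from creating a nested link inside the directory the existing link points to.

```shell
# Sketch only: link_existing_checkout is a hypothetical helper.
#
# Usage: link_existing_checkout EXISTING_CHECKOUT LINK_PATH
# Points LINK_PATH at EXISTING_CHECKOUT/data/tpcds/sf1 via a symlink.
link_existing_checkout() {
    local source_dir
    local link_path
    source_dir="$1/data/tpcds/sf1"
    link_path="$2"

    # Refuse to create a dangling link
    if [ ! -d "${source_dir}" ]; then
        echo "No TPC-DS data found at ${source_dir}" >&2
        return 1
    fi

    mkdir -p "$(dirname "${link_path}")"
    # -f replaces an existing link; -n treats an existing link to a
    # directory as a file instead of descending into it
    ln -s -f -n "${source_dir}" "${link_path}"
}
```

This way a user with their own data keeps a single copy on disk, and the benchmark script only ever sees the link.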

alamb avatar Dec 09 '25 20:12 alamb

Thanks @martin-g and @alamb. We can probably hardcode the path: in most cases people have cloned the benchmarks repo inside the DF repo, so we can keep this path hardcoded with ${SCRIPTS_DIR} as the anchor. If this path is empty, we can fall back to DATA_DIR, and if that fails too, throw an error. I'll try to play with it today.
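The fallback order described here could be sketched roughly as follows. This is an assumption about how it might look, not the eventual implementation: first_usable_dir is a hypothetical helper, and the candidate paths are supplied by the caller (for example, a path under ${SCRIPTS_DIR} first, then one under DATA_DIR).

```shell
# Sketch only: first_usable_dir is a hypothetical helper.
#
# Usage: first_usable_dir CANDIDATE_DIR...
# Prints the first candidate that exists and is non-empty;
# returns 1 with an error message if none qualifies.
first_usable_dir() {
    local candidate
    for candidate in "$@"; do
        # "ls -A" prints nothing for an empty directory
        if [ -d "${candidate}" ] && [ -n "$(ls -A "${candidate}")" ]; then
            echo "${candidate}"
            return 0
        fi
    done
    echo "No TPC-DS data directory found; tried: $*" >&2
    return 1
}
```

A caller would capture the printed path and stop on failure, so the error case surfaces before any benchmark query runs.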

comphead avatar Dec 10 '25 18:12 comphead

Thank you for the reviews @comphead and @martin-g

alamb avatar Dec 11 '25 22:12 alamb

Thanks @martin-g

alamb avatar Dec 11 '25 22:12 alamb