db-benchmark

Make datasets more accessible

Open MrPowers opened this issue 3 years ago • 4 comments

Thanks for the excellent work on this project.

I'd like to experiment with the datasets and would rather not have to generate them myself. I've never used R and don't really want to learn it at the moment. I'm more interested in questions like whether using broadcast joins would materially impact the Spark benchmarks.

Can you provide downloadable data files? Or can you make the files accessible on S3? I'm making important data files accessible to the community in an S3 bucket, so I'd also be happy to upload them there if that'd help.

Thanks again for building / maintaining this project. Hope I'll be able to contribute!

MrPowers avatar Dec 29 '21 11:12 MrPowers

Hi there, checking in here. Is there any update on making the data files available in an S3 bucket? I'd really appreciate it, especially for the 1e9 case, which seems to be problematic to generate; see https://github.com/h2oai/db-benchmark/issues/110

Thank you cc: @jangorecki

ncclementi avatar Mar 23 '22 23:03 ncclementi

We could make the 50 GB dataset accessible in S3 as multiple gzipped files that users could download and reassemble on their local machines. That'd let users download the files in parallel from S3 and limit the massive file problem. Thoughts @jangorecki / @ncclementi?

MrPowers avatar Mar 24 '22 09:03 MrPowers
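
The split-and-reassemble idea in the comment above can be sketched roughly as follows. This is only an illustration and not something posted in the thread: the helper names (`split_and_gzip`, `fetch_part`, `reassemble`) are hypothetical, and the "download" is simulated with in-memory bytes rather than real S3 requests, since no bucket or file layout was specified.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def split_and_gzip(data: bytes, n_parts: int) -> list[bytes]:
    """Split data into at most n_parts contiguous chunks and gzip each one separately."""
    size = max(1, -(-len(data) // n_parts))  # ceiling division, at least 1 byte per chunk
    return [gzip.compress(data[i:i + size]) for i in range(0, len(data), size)]


def fetch_part(part: bytes) -> bytes:
    """Hypothetical stand-in for downloading one part; here it only decompresses it."""
    return gzip.decompress(part)


def reassemble(parts: list[bytes], workers: int = 4) -> bytes:
    """Fetch all parts concurrently and join them in their original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(fetch_part, parts))  # map() preserves input order
    return b"".join(chunks)


if __name__ == "__main__":
    original = b"id1,id2,v1\n" * 1000  # made-up sample rows, not the benchmark data
    parts = split_and_gzip(original, n_parts=4)
    assert reassemble(parts) == original
```

Because the gzip format allows several members in one stream, the compressed parts can also simply be concatenated in order and decompressed in a single pass, so a user would not need a custom tool to reassemble them.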

Hi, you need to contact h2o support. I am no longer the maintainer of the project.

jangorecki avatar Mar 24 '22 12:03 jangorecki

ok @jangorecki, will do. Thanks for your great contributions on this project.

MrPowers avatar Mar 24 '22 13:03 MrPowers