cudf
Add a libcudf/thrust-based TPC-H derived datagen
Description
This PR adds a TPC-H-inspired data generator (based on spec 3.0.1), written using libcudf and thrust.
Implementation Status
- [x] lineitem
- [x] orders
- [x] region
- [x] nation
- [x] supplier
- [x] customer
- [x] part
- [x] partsupp
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
TODO:
- [ ] Complete the `lineitem` and `orders` tables
- [ ] The `stream` and `mr` params need to be passed to the column generation functions throughout the code
- [ ] Need to support scale factors <1
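To illustrate what supporting scale factors below 1 could involve, here is a minimal sketch that derives per-table row counts from a fractional scale factor. The base cardinalities are those the TPC-H specification gives for scale factor 1; the names `tpch_row_counts`, `scale_rows`, and `compute_row_counts` are hypothetical and not taken from this PR, and the rounding and one-row minimum are assumptions rather than the PR's behavior.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Row counts for the tables whose size is a fixed function of the scale factor.
// lineitem is omitted: each order carries a random number of line items (1 to 7),
// so its row count follows from the generated orders rather than a fixed formula.
struct tpch_row_counts {
  std::int64_t supplier;
  std::int64_t part;
  std::int64_t partsupp;
  std::int64_t customer;
  std::int64_t orders;
  std::int64_t nation;
  std::int64_t region;
};

// Hypothetical helper: scale a base cardinality by a (possibly fractional)
// scale factor. Rounding to nearest and clamping to at least one row are
// assumptions made for this sketch.
inline std::int64_t scale_rows(std::int64_t base, double scale_factor)
{
  assert(scale_factor > 0.0);
  auto const scaled = static_cast<std::int64_t>(std::llround(static_cast<double>(base) * scale_factor));
  return scaled < 1 ? 1 : scaled;
}

inline tpch_row_counts compute_row_counts(double scale_factor)
{
  auto const part = scale_rows(200000, scale_factor);
  tpch_row_counts counts{};
  counts.supplier = scale_rows(10000, scale_factor);
  counts.part     = part;
  counts.partsupp = 4 * part;  // four partsupp rows per part row
  counts.customer = scale_rows(150000, scale_factor);
  counts.orders   = scale_rows(1500000, scale_factor);
  counts.nation   = 25;        // fixed size, independent of scale factor
  counts.region   = 5;         // fixed size, independent of scale factor
  return counts;
}
```

Calling `compute_row_counts(0.5)` would halve every scaled table while leaving `nation` and `region` untouched. Taking the scale factor as a `double` rather than an integer is the main design point; how very small tables should be handled is left open here.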
An example of how to use the library has also been added.
I'm not following the purpose of the datagen.cpp file in the common directory and it containing a main.
I was thinking the common files would be a datagen library of sorts for tpch queries in general and the main() executable would be in specific benchmarks for specific queries.
@karthikeyann would you please take another look at this PR?
> I'm not following the purpose of the datagen.cpp file in the common directory and it containing a main. I was thinking the common files would be a datagen library of sorts for tpch queries in general and the main() executable would be in specific benchmarks for specific queries.
Removed this for now; I will add it back inside benchmarks/tpch once #16663 gets merged.
Can we use std::span?
I tried using cudf::host_span, but the perform_left_join code uses the column indices in the left_on and right_on params to select columns from a table via table_view.select, and table_view.select accepts only a vector. So we can't use a span here without first copying the indices into a vector.
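The constraint described in this reply can be sketched in plain C++ without libcudf. Everything below is a hypothetical stand-in: `mock_table` replaces a cudf table, `index_span` replaces `cudf::host_span`, and `select` only mimics a selection function that, as the reply states, takes its indices as a `std::vector`. It is not the libcudf API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using size_type = int;  // stand-in for cudf::size_type

// Hypothetical stand-in for a table: each inner vector is one column.
using mock_table = std::vector<std::vector<int>>;

// Hypothetical non-owning view over a contiguous run of indices,
// standing in for a span type such as cudf::host_span.
struct index_span {
  size_type const* data;
  std::size_t size;
  size_type const* begin() const { return data; }
  size_type const* end() const { return data + size; }
};

// Mimics a selection function that only accepts a std::vector of indices.
inline mock_table select(mock_table const& table, std::vector<size_type> const& column_indices)
{
  mock_table result;
  result.reserve(column_indices.size());
  for (auto const idx : column_indices) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < table.size());
    result.push_back(table[static_cast<std::size_t>(idx)]);
  }
  return result;
}

// A caller holding a span must materialize a vector before it can call select,
// which costs an extra host-side copy of the indices.
inline mock_table select_from_span(mock_table const& table, index_span column_indices)
{
  std::vector<size_type> const indices(column_indices.begin(), column_indices.end());
  return select(table, indices);
}
```

Under these assumptions, a span-based interface would only move the vector construction into the callee, which is presumably why the vector parameters were kept.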
Reviewing just the CMake.
How do you anticipate the generator being used? Will it just be used by other benchmarks in our repository? If so, the CMake looks fine as-is. If you intend broader usage (I'd caution against that unless we have a very good reason), then some additional machinery is needed.
Thanks for taking a look @vyasr. For now, we only want our internal benchmark code to use the datagen by just linking to the static library libtpch_data_generator.a.
/merge