cudf
Add TPC-H benchmarks for Libcudf
- [x] Query 1
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
/ok to test
@JayjeetAtGithub would you please split the fixed_point literals change into a separate PR?
Sure @GregoryKimball
Thanks for working on this @JayjeetAtGithub! I will review it promptly. Meanwhile, could you please update the description (it should mostly be copy-paste from the README, plus a few other details) and briefly explain what this PR is about? Thanks!
/ok to test
@JayjeetAtGithub These TPC-H-inspired examples look excellent. I would also like to post your performance results so far to show off your great work.
Just to help clarify for future reference, this is SF10 running on NVIDIA T4 GPU, I believe.
Hi all, I have ported the queries to work on TPC-H parquet files in which the decimal64 columns are cast to float64. That means I am no longer using `cudf::fixed_point_scalar<numeric::decimal64>` and am instead using just `cudf::numeric_scalar<double>`. I have checked the query results, and they are consistent.
@JayjeetAtGithub Can you help me understand why we aren’t using decimal types? I think that using the same data types is essential for parity of benchmarks with other implementations.
@bdice I found out that the other benchmarks within the team are using the TPC-H datasets in the /datasets/NOOOM- directory on the RDS servers, which have the decimal64 columns cast to float64. So I changed to using float64. But I can always change my code back to decimal64 if that would work better.
/ok to test
/ok to test
The PR title says these are benchmarks, but they are in `cpp/examples` rather than `cpp/benchmarks`. If they are indeed benchmarks, I think they should be moved. Also, they should use one of the benchmarking frameworks we use, either google benchmark or nvbench. I think we are generally preferring the latter, but I often run into "noise" problems with it because it is intended as a microbenchmark framework for device-side code rather than combined application-level benchmarking like this. So gbench may work better.
I think Mark’s proposal is a good idea. It should allow us to integrate with the existing microbenchmarking measurement frameworks we have, and therefore allow us to track performance over time. Do you have thoughts on this @JayjeetAtGithub @GregoryKimball @mhaseeb123? I guess creating/finding data files is the hardest part.
I think this is okay under examples for now - actually it is a good example and depicts good use of libcudf as building block. I agree with the proposal that this example may evolve to a benchmark in the future with preferably data generation from within the C++ code rather than using static data files.
Thanks a lot for your thoughts on this @harrism @bdice @mhaseeb123. Along the same lines as @mhaseeb123 mentioned, I would prefer to keep the code under examples in this PR and would like to evolve this work as follows (probably each of these would be a separate PR):

- Add small (SF=0.01) datasets to a `cpp/examples/tpch/datasets` directory and use them to run the examples in CI
- Add a way to test the correctness of these queries: add parquet files containing the results of the queries (at smaller SFs such as 0.01) generated from some other source such as DuckDB / Velox, and compare them with the results generated by the `libcudf` implementations in this PR
- Convert the code into NVBENCH benchmarks and move it to a `cpp/benchmarks/tpch` sub-directory, as mentioned by @harrism and @bdice
- Hopefully, by that time, cudf-gen will be mature enough that, just as @mhaseeb123 mentioned, we can remove the static data files and use the generator to produce Parquet data directly in host buffers within the benchmark code
- Keep adding more queries on the side, asynchronously, to complete the suite

Please let me know your thoughts! Thanks
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
/merge
/merge