cudf
Add libcudf example with large strings
Description
Creating an example that shows reading large string columns. This uses the 1 billion row challenge input data and provides three examples of loading this data:

- `brc` uses the CSV reader to load the input file in one call and aggregates the results using `groupby`
- `brc_chunks` uses the CSV reader to load the input file in chunks, aggregates each chunk, and computes the results
- `brc_pipeline` same as `brc_chunks` but input chunks are processed in separate threads/streams.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Thank you @davidwendt, this is really great! I had to add a few things during my testing. Would you please consider adding them to the PR?
- add a CLI parameter for chunk count to the `load_text_chunks` and `load_csv_chunks` examples
- add console-printed timing to all of the examples, e.g. `std::cout << "processing time: " << elapsed.count() << " seconds" << std::endl;`
- add `build_example 1billion` to `examples/build.sh`
For console timing I just run the example with `time`, like the following:

```
time ./load_text_pipeline 1b.txt 10 2
```
The output will look like:

```
input: 1b.txt
chunks: 10
threads: 2
number of keys = 413

real 0m5.143s
user 0m6.098s
sys  0m3.022s
```
These examples are really nice, @davidwendt! I'm coming around to the idea that the `read_text` variants should be omitted. `read_csv` is the more common and standardized reader, and it ends up with the best performance anyway.
My high-level feedback is that this example should come with a README. It should explain:
- the premise of the problem and its origins (external blog post link)
- expectations for input data and how to generate it
- the different execution methods (normal, chunked, pipelined) and what differs in each (byte ranges, threads)
- how the choice of memory resource plays a role, if relevant (both `"pool"` and `"cuda"` are supported in this code)
This might cover some topics that overlap with the blog post content that is being planned for this, but it's important to provide some explanation in the codebase next to the example itself. It would be fine to have a README that links out to a tech blog on this topic. One key difference is that the tech blog should focus more on performance analysis and how to generalize these ideas to user code, which I do not think are essential for an example README. The example README just needs to orient the reader so they know what they are looking at in the source.
On GH200, the `brc_pipeline` example seems to be calling `cudaHostRegister` on the entire file for each chunk.
And it ends up much slower. I'm looking to see if anything in the data source handling could be changed.
> On GH200, the `brc_pipeline` example seems to be calling `cudaHostRegister` on the entire file for each chunk.

That's unexpected. I'll check the code and update here.
Nit:

- IMO the folder name `1billion` is vague/not clean. Typically I would avoid naming anything starting with a number.
- I'm kind of OCD with formatting. For printing information, I would prefer the printed sentences to have the first letter of each sentence capitalized instead of everything in lower-case.
/merge