cudf
Add libcudf example with large strings
Description
Creating an example that shows reading large string columns. This uses the 1 billion row challenge input data and provides three examples of loading this data:

- `brc` uses the CSV reader to load the input file in one call and aggregates the results using `groupby`
- `brc_chunks` uses the CSV reader to load the input file in chunks, aggregates each chunk, and computes the results
- `brc_pipeline` same as `brc_chunks` but input chunks are processed in separate threads/streams.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Thank you @davidwendt, this is really great! I had to add a few things during my testing. Would you please consider adding them to the PR?
- add a CLI parameter for chunk count to the `load_text_chunks` and `load_csv_chunks` examples
- add console-printed timing to all of the examples, e.g. `std::cout << "processing time: " << elapsed.count() << " seconds" << std::endl;`
- add `build_example 1billion` to `examples/build.sh`
For console timing I just run the example with `time`, like the following:

```
time ./load_text_pipeline 1b.txt 10 2
```
The output will look like:

```
input: 1b.txt
chunks: 10
threads: 2
number of keys = 413

real 0m5.143s
user 0m6.098s
sys  0m3.022s
```
These examples are really nice, @davidwendt! I'm coming around to the idea that the `read_text` variants should be omitted. `read_csv` is the more common and standardized reader, and it ends up with the best performance anyway.
My high-level feedback is that this example should come with a README. It should explain:
- the premise of the problem and its origins (external blog post link)
- expectations for input data and how to generate it
- the different execution methods (normal, chunked, pipelined) and what differs in each (byte ranges, threads)
- how the choice of memory resource plays a role, if relevant (both `"pool"` and `"cuda"` are supported in this code)
This might cover some topics that overlap with the blog post content that is being planned for this, but it's important to provide some explanation in the codebase next to the example itself. It would be fine to have a README that links out to a tech blog on this topic. One key difference is that the tech blog should focus more on performance analysis and how to generalize these ideas to user code, which I do not think are essential for an example README. The example README just needs to orient the reader so they know what they are looking at in the source.
On GH200, the `brc_pipeline` example seems to be calling `cudaHostRegister` on the entire file for each chunk.
And it ends up much slower. I'm looking to see if anything in the data source handling could be changed.
> On GH200, the `brc_pipeline` example seems to be calling `cudaHostRegister` on the entire file for each chunk.

That's unexpected. I'll check the code and update here.
Nit:

- IMO the folder name `1billion` is vague/not clean. Typically I would avoid naming anything starting with a number.
- I'm kind of OCD with formatting. For printing information, I would prefer the printed sentences to have the first letter of each sentence capitalized instead of everything in lower-case.
/merge