Single-pass `multibyte_split` for non-overlapping delimiters
Description
This adds a new `multibyte_split` implementation for "nice" non-overlapping delimiters that needs to scan the input only once and takes full advantage of `byte_range`.
To accomplish this, I introduce a new data structure for `output_chunks` (naming bikeshedding welcome 😄) for pre-allocating output of unknown, but bounded, size.
The structure contains a vector of exponentially growing `device_uvector`s, such that either the last vector has size 0 and the second-to-last vector has `size() < capacity()`, or all vectors but the last are full (`size() == capacity()`). It provides the following operations:
- `next_output(worst_case_size)` returns a `split_device_span` of at least `worst_case_size`, pointing to the next free entries from the last two vectors.
- `advance_output(actual_size)` marks the first `actual_size` entries of the previously returned `split_device_span` as filled. `split_device_span` takes care of writing to the smaller `device_uvector` first, and the larger `device_uvector` second, if the first one is full.
- `collect` copies all elements that were previously written into a single `device_uvector` of the correct size.
This data structure should provide a good balance between allocation overheads and memory usage.
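The chunked structure described above can be sketched on the host. This is a hypothetical simplification using `std::vector<int>` as a stand-in for `rmm::device_uvector`: the member names mirror the PR's description, but unlike the real structure (which hands out a `split_device_span` over the last *two* vectors), this sketch always allocates a chunk large enough to return one contiguous span.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Host-side sketch of the chunked output builder. Chunks grow
// exponentially (2x), so the number of allocations stays logarithmic
// in the output size while memory overhead is bounded.
class output_builder {
 public:
  explicit output_builder(std::size_t initial_capacity)
  {
    chunks_.emplace_back();
    chunks_.back().reserve(initial_capacity);
  }

  // Returns at least `worst_case_size` free entries; starts a new,
  // exponentially larger chunk when the current one cannot fit them.
  int* next_output(std::size_t worst_case_size)
  {
    auto& last = chunks_.back();
    if (last.capacity() - last.size() < worst_case_size) {
      std::vector<int> next;
      next.reserve(std::max(last.capacity() * 2, worst_case_size));
      chunks_.push_back(std::move(next));
    }
    auto& chunk  = chunks_.back();
    last_offset_ = chunk.size();
    chunk.resize(chunk.size() + worst_case_size);  // provisional; trimmed below
    return chunk.data() + last_offset_;
  }

  // Marks the first `actual_size` entries of the last returned span as filled.
  void advance_output(std::size_t actual_size)
  {
    chunks_.back().resize(last_offset_ + actual_size);
  }

  // Copies everything written so far into one correctly sized vector.
  std::vector<int> collect() const
  {
    std::vector<int> out;
    for (auto const& chunk : chunks_) out.insert(out.end(), chunk.begin(), chunk.end());
    return out;
  }

 private:
  std::vector<std::vector<int>> chunks_;
  std::size_t last_offset_ = 0;
};
```

A typical call pattern would be `next_output` before a kernel launch with the worst-case output count, then `advance_output` with the count actually produced, and a single `collect` at the end.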
I only modified the actual `multibyte_split` kernel slightly to stop writing offsets once it passes the end of the `byte_range`. This way, we can determine all required offsets from a single scan, regardless of whether we need to provide a range or not.
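As a hypothetical host-side illustration of that single-scan idea (the real logic lives in the CUDA kernel, and the function name and exact range semantics here are simplifications): the scan visits every byte, but only records delimiter offsets that fall inside the requested range, and stops writing once it passes the range's end.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch: record delimiter offsets inside [range_begin, range_end) in a
// single pass, stopping output once past the end of the range.
std::vector<std::size_t> offsets_in_range(std::string const& input,
                                          char delimiter,
                                          std::size_t range_begin,
                                          std::size_t range_end)
{
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (i >= range_end) break;  // stop writing offsets past the range end
    if (input[i] == delimiter && i >= range_begin) offsets.push_back(i);
  }
  return offsets;
}
```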
TODO
- [ ] Handle `byte_range` edge cases
- [ ] Extend to overlapping delimiters by providing `previous_chunk` support for `data_chunk_source`
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Closes #11197
This still needs some work, since it duplicates the existing kernel and could use a few more tests, but I think it's already good for a first review.
Codecov Report
:exclamation: No coverage uploaded for pull request base (`branch-22.10@ccd72f2`). Click here to learn what that means. Patch has no changes to coverable lines.
:exclamation: Current head 4a89fe5 differs from pull request most recent head 4488965. Consider uploading reports for the commit 4488965 to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.10 #11500 +/- ##
===============================================
Coverage ? 86.41%
===============================================
Files ? 145
Lines ? 22993
Branches ? 0
===============================================
Hits ? 19869
Misses ? 3124
Partials ? 0
According to the recent discussion, if this needs more work, then add the DO NOT MERGE label until it is ready to merge.
I realized today that the behavior I wanted to provide at the high level is not supported by the existing kernel. `read_text` just can't deal with overlapping delimiters for now (they would just create a bunch of empty/delimiter-only rows), so the entire backtracking effort is not necessary.
~There's still some clean up pending before review, is that right?~ Disregard, I see the clean up commits pushed.
@cwharris on the increased memory usage: with the exponential growth of the chunks, at worst we overestimate the required memory by the growth factor (2x in this case). A smaller growth factor might also make sense. Alternatively, we could set a size limit for the chunks, because at some point the allocation overhead, amortized over all kernel launches until the chunk is full, becomes negligible.
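To make the "2x at worst" claim concrete, here is a tiny host-side calculation. The helper is purely illustrative and not from the PR: with chunk capacities c, g·c, g²·c, …, the worst case is that every chunk but the last is full and the last (largest) chunk was just allocated; the reserved/filled ratio then approaches the growth factor g.

```cpp
#include <cstddef>

// Worst-case reserved-to-filled ratio for exponentially growing chunks.
// Illustrative helper: chunk i has capacity initial_capacity * growth^i.
double worst_case_overhead(double initial_capacity, int num_chunks, double growth)
{
  double reserved = 0.0;
  double cap      = initial_capacity;
  for (int i = 0; i < num_chunks; ++i) {
    reserved += cap;
    cap *= growth;
  }
  double last_cap = cap / growth;
  double filled   = reserved - last_cap;  // all chunks but the last are full
  return reserved / filled;
}
```

For example, with initial capacity 4, four chunks, and growth 2, the reserved capacities are 4 + 8 + 16 + 32 = 60 while only 28 entries are filled in the worst case, a ratio of about 2.14, i.e. close to the growth factor.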
rerun tests
@gpucibot merge