cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Single-pass `multibyte_split` for non-overlapping delimiters

Open upsj opened this issue 3 years ago • 0 comments

Description

This adds a new multibyte_split implementation for "nice" non-overlapping delimiters that needs to scan the input only once, and takes full advantage of byte_range.

To accomplish this, I introduce a new data structure for output_chunks (naming bikeshedding welcome 😄 ) for pre-allocation for unknown, but bounded outputs. The structure contains a vector of exponentially growing device_uvectors, such that either the last vector has size 0 and the second-to-last vector has size() < capacity(), or all vectors but the last are full (size() == capacity()). It provides the operations

  • next_output(worst_case_size) returns a split_device_span of at least the worst_cast_size pointing to the next free entries from the last two vectors.
  • advance_output(actual_size) marks the first actual_size entries of the previously returned split_device_span as filled. split_device_span takes care of writing to the smaller device_uvector first, and the larger device_uvector second, if the first one is full.
  • collect copies all elements that were previously written into a single device_uvector of the correct size.

This data structure should provide a good balance between allocation overheads and memory usage.

I only modified the actual multibyte_split kernel slightly to stop writing offsets once it passes the end of the byte_range. This way, we can determine all required offsets from a single scan, regardless of whether we need provide a range or not.

TODO

  • [ ] Handle byte_range edge cases
  • [ ] Extend to overlapping delimiters by providing previous_chunk support for data_chunk_source

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

Closes #11197

upsj avatar Aug 09 '22 17:08 upsj

This still needs some work, since it duplicates the existing kernel and could use a few more tests, but I think it's already good for a first review.

upsj avatar Aug 11 '22 11:08 upsj

Codecov Report

:exclamation: No coverage uploaded for pull request base (branch-22.10@ccd72f2). Click here to learn what that means. Patch has no changes to coverable lines.

:exclamation: Current head 4a89fe5 differs from pull request most recent head 4488965. Consider uploading reports for the commit 4488965 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11500   +/-   ##
===============================================
  Coverage                ?   86.41%           
===============================================
  Files                   ?      145           
  Lines                   ?    22993           
  Branches                ?        0           
===============================================
  Hits                    ?    19869           
  Misses                  ?     3124           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Aug 15 '22 21:08 codecov[bot]

According to the recent discussion, if this more work then add DO NOT MERGE label until it is ready to merge.

ttnghia avatar Aug 16 '22 17:08 ttnghia

I realized today that the behavior I wanted to provide on the high-level is not supported by the existing kernel, so read_text just can't deal with overlapping delimiters for now (which just creates a bunch of empty/delimiter-only rows), so the entire backtracking effort is not necessary.

upsj avatar Aug 16 '22 18:08 upsj

I realized today that the behavior I wanted to provide on the high-level is not supported by the existing kernel, so read_text just can't deal with overlapping delimiters for now (which just creates a bunch of empty/delimiter-only rows), so the entire backtracking effort is not necessary for now.

~There's still some clean up pending before review, is that right?~ Disregard, I see the clean up commits pushed.

vuule avatar Aug 16 '22 22:08 vuule

@cwharris on the increased memory usage: With the exponential growth of the chunks, at worst we overestimate the amount of memory by the growth factor (2x in this case). A smaller growth factor might also make sense. What we could do alternatively is set a size limit for the chunks, because at some point the allocation overhead amortized over all kernel launches until the chunk is full is negligible.

upsj avatar Aug 22 '22 09:08 upsj

rerun tests

upsj avatar Aug 26 '22 16:08 upsj

@gpucibot merge

upsj avatar Aug 30 '22 18:08 upsj