Optimizing multi-source byte range reading in JSON reader

Open shrshi opened this issue 1 year ago • 21 comments

Description

This PR has two goals: (i) reduce repeated reads of byte range chunks in the JSON reader, and (ii) enable multi-source byte range reading for chunks that span sources.

  • We expand on the idea outlined in #15185 to avoid re-reading follow-on chunks while searching for the end of the last row in the requested chunk. The data following the requested chunk is divided into subchunks, which are read one at a time until the delimiter character is found.
  • We estimate the buffer size needed for the entire byte range and compute each source's offset into that buffer.
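The two ideas above can be illustrated with a minimal Python sketch. This is not the cuDF C++ implementation: the names `source_offsets`, `read_range`, `read_byte_range`, and `subchunk_size` are hypothetical, sources are plain in-memory `bytes` objects, and the buffer is grown on demand rather than sized from an estimate. The sketch also assumes a single-byte delimiter and the convention that a chunk owns every row whose preceding delimiter falls inside the chunk (plus the first row when the chunk starts at offset 0).

```python
from bisect import bisect_right
from itertools import accumulate


def source_offsets(sizes):
    """Prefix sums: offsets[i] is where source i starts in the concatenated input."""
    return [0, *accumulate(sizes)]


def read_range(sources, offsets, start, end):
    """Return bytes [start, end) of the logical concatenation of all sources."""
    total = offsets[-1]
    start, end = min(start, total), min(end, total)
    out = bytearray()
    if start >= end:
        return bytes(out)
    i = bisect_right(offsets, start) - 1  # source that contains byte `start`
    while start < end:
        lo = start - offsets[i]
        hi = min(end, offsets[i + 1]) - offsets[i]
        out += sources[i][lo:hi]  # copy only the slice this source contributes
        start = offsets[i] + hi
        i += 1
    return bytes(out)


def read_byte_range(sources, start, size, subchunk_size=8, delimiter=b"\n"):
    """Return the complete rows owned by the byte range [start, start + size)."""
    offsets = source_offsets([len(s) for s in sources])
    total = offsets[-1]
    end = min(start + size, total)

    # Read the requested chunk exactly once, even if it spans several sources.
    buf = bytearray(read_range(sources, offsets, start, end))

    # Drop the partial first row; it belongs to an earlier chunk.
    if start > 0:
        first = buf.find(delimiter)
        if first == -1:
            return b""  # no row begins in this chunk
        del buf[: first + 1]

    # Complete the last row by reading small subchunks past the requested
    # range, stopping at the first delimiter (assumed to be a single byte).
    pos = end
    while pos < total:
        sub = read_range(sources, offsets, pos, pos + subchunk_size)
        hit = sub.find(delimiter)
        if hit != -1:
            buf += sub[: hit + 1]
            break
        buf += sub
        pos += len(sub)
    return bytes(buf)
```

Under these assumptions, reading consecutive byte ranges reproduces the input exactly once, and each follow-on read touches only `subchunk_size` bytes at a time instead of a whole extra chunk.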

Visualization of the performance improvement with this optimization

Checklist

  • [X] I am familiar with the Contributing Guidelines.
  • [X] New or existing tests cover these changes.
  • [X] The documentation is up to date with these changes.

shrshi avatar Mar 26 '24 22:03 shrshi

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Apr 05 '24 00:04 copy-pr-bot[bot]

/ok to test

shrshi avatar Apr 05 '24 00:04 shrshi

/ok to test

shrshi avatar Apr 05 '24 10:04 shrshi

/ok to test

shrshi avatar Apr 05 '24 10:04 shrshi

/ok to test

shrshi avatar Apr 05 '24 10:04 shrshi

/ok to test

shrshi avatar Apr 05 '24 16:04 shrshi

/ok to test

shrshi avatar Apr 05 '24 22:04 shrshi

Visualization of the performance improvement achieved with this PR, as measured by the JSON reader options benchmark with normalize_single_quotes=NO, normalize_whitespace=NO, mixed_types_as_string=NO, recovery_mode=FAIL:

byte-range-optimization

shrshi avatar Apr 09 '24 06:04 shrshi

/ok to test

shrshi avatar Apr 10 '24 18:04 shrshi

/ok to test

shrshi avatar Apr 10 '24 21:04 shrshi

/ok to test

shrshi avatar Apr 15 '24 19:04 shrshi

/ok to test

shrshi avatar Apr 16 '24 00:04 shrshi

/ok to test

shrshi avatar Apr 16 '24 05:04 shrshi

/ok to test

shrshi avatar Apr 17 '24 23:04 shrshi

/ok to test

shrshi avatar Apr 17 '24 23:04 shrshi

/ok to test

shrshi avatar Apr 18 '24 22:04 shrshi

/ok to test

shrshi avatar Apr 23 '24 23:04 shrshi

/ok to test

shrshi avatar Apr 24 '24 17:04 shrshi

/ok to test

shrshi avatar Apr 25 '24 21:04 shrshi

/ok to test

shrshi avatar Apr 29 '24 21:04 shrshi

/ok to test

shrshi avatar Apr 29 '24 21:04 shrshi

/ok to test

shrshi avatar Apr 30 '24 17:04 shrshi

/merge

shrshi avatar Apr 30 '24 18:04 shrshi