cudf
cudf copied to clipboard
Add BGZIP multibyte_split benchmark
Description
This refactors #11652 to extract the BGZIP IO and adds another source_type to the multibyte_split benchmark, creating a compressed file using zlib.
For reviews, only changes starting from the "bgzip refactoring and benchmark" commit are relevant, everything else is already part of the previous PR.
A quick benchmark shows performance results around 2.5x slower than reading from a device buffer at around 1:5 compression ratio
[0] Tesla T4
| source_type | delim_size | delim_percent | size_approx | byte_range_percent | Time | Peak Memory Usage | Encoded file size |
|---|---|---|---|---|---|---|---|
| bgzip | 1 | 1 | 2^30 = 1073741824 | 100 | 507.479 ms | 4.022 GiB | 1006.638 MiB |
| file | 1 | 1 | 2^30 = 1073741824 | 100 | 339.860 ms | 3.947 GiB | 1006.638 MiB |
| device | 1 | 1 | 2^30 = 1073741824 | 100 | 201.556 ms | 3.947 GiB | 1006.638 MiB |
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
@upsj Is this PR depending on #11652?
@PointKernel yes, should we wait with the review until #11652 is merged? Many of the suggestions also apply to the refactored code.
should we wait with the review until #11652 is merged?
Yes please, this will avoid unnecessary back-and-forth effort.
This seems to be larger than just a benchmark.
I think 22.10 needs to be merged so https://github.com/rapidsai/cudf/pull/11652 changes are excluded. @upsj do you want to keep this targeted for 22.10?
@vuule @ttnghia apologies for the confusion, the PR was still not up-to-date. I am used to a rebase-then-merge workflow rather than a merge-then-squash, so I hope me rebasing the branch doesn't mess up your review workflow. It should be ready now. It's not important to get this into 22.10
rerun tests
rerun tests
rerun tests
Codecov Report
:exclamation: No coverage uploaded for pull request base (
branch-22.12@a270ae6). Click here to learn what that means. Patch has no changes to coverable lines.
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.12 #11723 +/- ##
===============================================
Coverage ? 87.48%
===============================================
Files ? 133
Lines ? 21866
Branches ? 0
===============================================
Hits ? 19130
Misses ? 2736
Partials ? 0
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
| source_type |
|---|
| bgzip |
| file |
| device |
@upsj How did you print out (display) the source type without using NVBENCH_DECLARE_ENUM_TYPE_STRINGS?
@PointKernel I didn't 😉 I edited the output manually for clarity, but thanks for reminding me of the macro, I'll use it instead
@gpucibot merge