cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Add BGZIP multibyte_split benchmark

Open upsj opened this issue 3 years ago • 2 comments

Description

This refactors #11652 to extract the BGZIP IO and adds another source_type to the multibyte_split benchmark, creating a compressed file using zlib.

For reviews, only changes starting from the "bgzip refactoring and benchmark" commit are relevant, everything else is already part of the previous PR.

A quick benchmark shows performance results around 2.5x slower than reading from a device buffer at around 1:5 compression ratio

[0] Tesla T4

source_type delim_size delim_percent size_approx byte_range_percent Time Peak Memory Usage Encoded file size
bgzip 1 1 2^30 = 1073741824 100 507.479 ms 4.022 GiB 1006.638 MiB
file 1 1 2^30 = 1073741824 100 339.860 ms 3.947 GiB 1006.638 MiB
device 1 1 2^30 = 1073741824 100 201.556 ms 3.947 GiB 1006.638 MiB

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

upsj avatar Sep 20 '22 19:09 upsj

@upsj Is this PR depending on #11652?

PointKernel avatar Sep 20 '22 23:09 PointKernel

@PointKernel yes, should we wait with the review until #11652 is merged? Many of the suggestions also apply to the refactored code.

upsj avatar Sep 21 '22 05:09 upsj

should we wait with the review until #11652 is merged?

Yes please, this will avoid unnecessary back-and-forth effort.

PointKernel avatar Sep 21 '22 17:09 PointKernel

This seems to be larger than just a benchmark.

I think 22.10 needs to be merged so https://github.com/rapidsai/cudf/pull/11652 changes are excluded. @upsj do you want to keep this targeted for 22.10?

vuule avatar Sep 27 '22 19:09 vuule

@vuule @ttnghia apologies for the confusion, the PR was still not up-to-date. I am used to a rebase-then-merge workflow rather than a merge-then-squash, so I hope me rebasing the branch doesn't mess up your review workflow. It should be ready now. It's not important to get this into 22.10

upsj avatar Sep 28 '22 09:09 upsj

rerun tests

upsj avatar Sep 28 '22 21:09 upsj

rerun tests

upsj avatar Sep 29 '22 18:09 upsj

rerun tests

upsj avatar Oct 03 '22 20:10 upsj

Codecov Report

:exclamation: No coverage uploaded for pull request base (branch-22.12@a270ae6). Click here to learn what that means. Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.12   #11723   +/-   ##
===============================================
  Coverage                ?   87.48%           
===============================================
  Files                   ?      133           
  Lines                   ?    21866           
  Branches                ?        0           
===============================================
  Hits                    ?    19130           
  Misses                  ?     2736           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Oct 03 '22 22:10 codecov[bot]

source_type
bgzip
file
device

@upsj How did you print out (display) the source type without using NVBENCH_DECLARE_ENUM_TYPE_STRINGS?

PointKernel avatar Oct 03 '22 23:10 PointKernel

@PointKernel I didn't 😉 I edited the output manually for clarity, but thanks for reminding me of the macro, I'll use it instead

upsj avatar Oct 04 '22 05:10 upsj

@gpucibot merge

upsj avatar Oct 10 '22 07:10 upsj