Add BGZIP `data_chunk_reader`
Description
This adds a BGZIP `data_chunk_reader` usable with `multibyte_split`. The BGZIP format is a modified GZIP format consisting of multiple blocks of at most 65536 bytes of compressed data, each describing at most 65536 bytes of uncompressed data. The data can be accessed via record offsets provided by Tabix index files, which contain so-called virtual offsets (unsigned 64-bit integers) of the following form:
63                     16      0
+----------------------+-------+
|     block offset     | local |
+----------------------+-------+
The lower 16 bits describe the offset inside the uncompressed data belonging to a single compressed block; the upper 48 bits describe the offset of the compressed block inside the BGZIP file. The interface allows two modes: reading a full compressed file, or reading between the locations described by two Tabix virtual offsets.
For a description of the BGZIP format, see section 4 of the SAM specification.
Closes #10466
TODO
- [x] Use events to avoid clobbering data that is still in use
- [x] stricter handling of local_begin (currently it may overflow into subsequent blocks)
- [x] add tests where local_begin and local_end are in the same chunk or even block
- [x] ~~add cudf deflate fallback if nvComp doesn't support it~~ this should not be necessary, since we only test with compatible nvcomp versions
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Codecov Report
:exclamation: No coverage uploaded for pull request base (branch-22.10@972708a). Click here to learn what that means. Patch has no changes to coverable lines.
:exclamation: Current head 7a6e8a1 differs from pull request most recent head 5576bcc. Consider uploading reports for the commit 5576bcc to get more accurate results.
Additional details and impacted files
@@           Coverage Diff            @@
##        branch-22.10   #11652   +/- ##
========================================
  Coverage           ?   87.52%
========================================
  Files              ?      133
  Lines              ?    21794
  Branches           ?        0
========================================
  Hits               ?    19075
  Misses             ?     2719
  Partials           ?        0
:umbrella: View full report at Codecov.
rerun tests
rerun tests
High-level design question (I realized this after seeing the benchmark code): can we separate the compression type from the data source (file, host, device)? With the current implementation we can only read BGZIP files from disk. Maybe the BGZIP decompressor should be another layer, rather than one of the `data_chunk_reader` implementations.
@vuule This would involve adding another abstraction that can read both small inputs on the host side (header, footer) and large inputs into a pinned buffer or onto the device (deflate stream). Since this is a pretty domain-specific format (mostly used in genomics), I am not sure it warrants the complexity of stacking it on top of another abstraction for host/device IO. You can already serve data from host memory by passing a different `std::istream`, which I think is a quite useful abstraction for this case.
Tests are currently failing, might be related to https://github.com/rapidsai/rapids-cmake/pull/272
rerun tests
@gpucibot merge