Add BGZIP `data_chunk_reader`
Description
This adds a BGZIP `data_chunk_reader` usable with `multibyte_split`. The BGZIP format is a modified GZIP format consisting of multiple blocks of at most 65536 bytes of compressed data, each describing at most 65536 bytes of uncompressed data. The data can be accessed via record offsets provided by Tabix index files, which contain so-called virtual offsets (unsigned 64-bit integers) of the following form:
63                     16      0
+----------------------+-------+
|     block offset     | local |
+----------------------+-------+
The lower 16 bits describe the offset inside the uncompressed data belonging to a single compressed block; the upper 48 bits describe the offset of the compressed block inside the BGZIP file. The interface allows two modes: reading a full compressed file, or reading between the locations described by two Tabix virtual offsets.
For a description of the BGZIP format, see section 4 of the SAM specification.
Closes #10466
TODO
- [x] Use events to avoid clobbering data that is still in use
- [x] stricter handling of local_begin (currently it may overflow into subsequent blocks)
- [x] add tests where local_begin and local_end are in the same chunk or even block
- [x] ~~add cudf deflate fallback if nvComp doesn't support it~~ this should not be necessary, since we only test with compatible nvcomp versions
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Codecov Report
:exclamation: No coverage uploaded for pull request base (branch-22.10@972708a). Click here to learn what that means. Patch has no changes to coverable lines.
:exclamation: Current head 7a6e8a1 differs from pull request most recent head 5576bcc. Consider uploading reports for the commit 5576bcc to get more accurate results.
Additional details and impacted files
@@           Coverage Diff            @@
##        branch-22.10   #11652   +/- ##
========================================
  Coverage           ?   87.52%
========================================
  Files              ?      133
  Lines              ?    21794
  Branches           ?        0
========================================
  Hits               ?    19075
  Misses             ?     2719
  Partials           ?        0
:umbrella: View full report at Codecov.
rerun tests
rerun tests
High-level design question (I realized this after seeing the benchmark code): can we separate the compression type from the data source (file, host, device)? With the current implementation we can only read BGZIP files from disk. Maybe the BGZIP decompressor should be another layer, rather than one of the `data_chunk_reader` implementations.
@vuule This would involve adding another abstraction that can read both small inputs on the host side (header, footer) and large inputs into a pinned buffer or onto the device (deflate stream). Since this is a pretty domain-specific format (mostly used in genomics), I am not sure it warrants the complexity of stacking it on top of another abstraction for host/device IO. You can already serve data from host memory by passing a different `std::istream`, which I think is a quite useful abstraction for this case.
Tests are currently failing, might be related to https://github.com/rapidsai/rapids-cmake/pull/272
rerun tests
@gpucibot merge