
Bring htsjdk's BCF2Codec up-to-date against the latest spec, and add tests

Open droazen opened this issue 9 years ago • 13 comments

BCF2Codec has not been well-maintained over the years, and does not fully support the latest BCF 2.2 spec (see the BCF section in http://samtools.github.io/hts-specs/VCFv4.3.pdf). We now have at least one htsjdk client (Intel) that wants to use the htsjdk BCF codec for performance reasons to ingest htslib output (which does support BCF 2.2). Even if we didn't, it's worth bringing the codec up-to-date rather than continuing to distribute htsjdk with out-of-date BCF support.

droazen avatar Jun 02 '16 18:06 droazen

For @cmnbroad

droazen avatar Jun 02 '16 18:06 droazen

Also, when this is finished we should undo https://github.com/samtools/htsjdk/pull/591.

cmnbroad avatar Jun 13 '16 21:06 cmnbroad

@droazen to be clear - we only need to be able to read BCF2.2 records created by htslib. I don't think we need to be able to write BCF2.2 for our use case. Is that right?

akiezun avatar Jun 16 '16 19:06 akiezun

@akiezun I believe so, yes (though we should confirm with the TileDB guys). In any event, the BCF2Codec is only capable of reading, so writing is not covered by this ticket.

droazen avatar Jun 17 '16 17:06 droazen

For what it is worth, our use case is to read BCF2.2 records created by htslib with htsjdk through Hadoop-BAM. Thanks for looking into this!

heuermh avatar Jun 27 '16 20:06 heuermh

Is there any sense of when this work might be completed? We have a similar requirement.

chriswhelix avatar Oct 13 '16 16:10 chriswhelix

We really hope to be able to assign an engineer to work on this during the current quarter, but can't make any firm promises at this time. The work has been started (see https://github.com/samtools/htsjdk/pull/694 and https://github.com/cmnbroad/htsjdk/tree/cn_bcf2), but it has run into snags: we need to maintain backwards compatibility with older versions of the VCF/BCF specs, and the htsjdk parsing code is unfortunately not well decomposed by version. A significant refactoring is needed to properly isolate the parsers for different versions from each other (and do an equivalent task on the writing end).
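To illustrate what "isolating the parsers by version" could look like, here is a minimal sketch that reads the version from the first five bytes of a decompressed BCF stream (the magic `BCF` followed by a major and a minor version byte) and picks a parser per version. The class `BcfVersionSniffer`, its methods, and the parser labels are hypothetical names for illustration only; they are not htsjdk classes and do not reflect how the branch above is structured.

```java
// Hypothetical sketch: none of these names exist in htsjdk.
class BcfVersionSniffer {

    /** Major/minor version pair taken from the BCF magic bytes. */
    static final class BcfVersion {
        final int major;
        final int minor;

        BcfVersion(int major, int minor) {
            this.major = major;
            this.minor = minor;
        }
    }

    /**
     * Parses the first five bytes of an already-decompressed BCF stream:
     * 'B', 'C', 'F', then one byte each for the major and minor version.
     */
    static BcfVersion parseMagic(byte[] firstBytes) {
        if (firstBytes == null || firstBytes.length < 5) {
            throw new IllegalArgumentException("Need at least 5 bytes of header");
        }
        if (firstBytes[0] != 'B' || firstBytes[1] != 'C' || firstBytes[2] != 'F') {
            throw new IllegalArgumentException("Not a BCF stream: bad magic");
        }
        return new BcfVersion(firstBytes[3] & 0xFF, firstBytes[4] & 0xFF);
    }

    /** Chooses a (hypothetical) parser label, one per supported version. */
    static String parserFor(BcfVersion v) {
        if (v.major != 2) {
            return "unsupported";
        }
        switch (v.minor) {
            case 1:
                return "bcf-2.1";
            case 2:
                return "bcf-2.2";
            default:
                return "unsupported";
        }
    }

    public static void main(String[] args) {
        byte[] header = {'B', 'C', 'F', 2, 2};
        System.out.println(parserFor(parseMagic(header)));
    }
}
```

Dispatching once on the header keeps version checks out of the record-decoding code, which is the kind of separation the refactoring described above is aiming for.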

droazen avatar Oct 13 '16 16:10 droazen

@droazen thanks for the quick response! Is that branch functional for BCF2.2 support if we don't need compatibility with earlier formats?

chriswhelix avatar Oct 13 '16 16:10 chriswhelix

@chriswhelix That branch is a work in progress that definitely shouldn't be used for anything except testing purposes -- @cmnbroad can provide more details on its current status.

droazen avatar Oct 13 '16 16:10 droazen

It's been a while since I've looked at it, but my recollection is that support for reading was mostly there, with the exception of one remaining BCF2.2 feature (end-of-vector marker?). There is no write support at all. Anyway, it's not finished; it's pretty far behind master, and it's certainly not tested.
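For readers unfamiliar with the feature mentioned above, here is a minimal sketch of end-of-vector handling, assuming it refers to the end-of-vector values in the BCF section of the VCF 4.3 spec: when values are stored as fixed-length vectors, shorter vectors are padded with a reserved value (0x81 for int8, distinct from the missing value 0x80). The class `Int8VectorDecoder` is a hypothetical name, not part of htsjdk or of the branch discussed here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not an htsjdk class.
class Int8VectorDecoder {
    // Reserved int8 values, as given in the BCF section of the VCF 4.3 spec.
    static final byte MISSING = (byte) 0x80;
    static final byte END_OF_VECTOR = (byte) 0x81;

    /**
     * Decodes one fixed-length int8 vector. Decoding stops at the first
     * end-of-vector value; missing values are returned as null.
     */
    static List<Integer> decode(byte[] raw) {
        List<Integer> values = new ArrayList<>();
        for (byte b : raw) {
            if (b == END_OF_VECTOR) {
                break; // the rest of the vector is padding
            }
            values.add(b == MISSING ? null : Integer.valueOf(b));
        }
        return values;
    }

    public static void main(String[] args) {
        byte[] padded = {1, 2, END_OF_VECTOR, END_OF_VECTOR};
        System.out.println(decode(padded));
    }
}
```

A reader that does not recognize the end-of-vector value would treat the padding bytes as real data, which is why a codec needs this to ingest such records correctly.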

cmnbroad avatar Oct 13 '16 18:10 cmnbroad

Thanks @cmnbroad. Really appreciate the responsiveness on this.

After an only mildly hellish tour through JNAerator, Bridj, and undocumented C code, I managed to get bindings to htslib working as a short term solution. Would definitely prefer to use htsjdk once it's updated.

chriswhelix avatar Oct 14 '16 16:10 chriswhelix

Was additional development done to support BCF2.2?

agostof avatar Feb 10 '21 12:02 agostof

@agostof BCF2.2 is still not supported.

cmnbroad avatar Feb 16 '21 13:02 cmnbroad