We got a grant to improve Htsjdk!

Open lbergelson opened this issue 5 years ago • 5 comments

Some exciting news!

The Methods team at Broad recently applied for, and was subsequently awarded, a grant as part of the Chan Zuckerberg Initiative’s Essential Open Source Software Program to fund development work on htsjdk. Specifically, the grant will fund one part-time developer for one year to develop a plugin system that supports versioned file formats, along with a few other features (more details below).

The work is slated to start on August 1st of this year, and will involve development of some new interfaces, as well as exposing the currently existing SAM, BAM, CRAM, and VCF formats through the plugin system.

The new interfaces and components will reside in new Java package(s), and will initially live alongside the existing classes and interfaces in order to minimize disruption for consumers of the existing APIs. The expectation is that the existing APIs will eventually be deprecated.

The deliverables include:

CZI Deliverables

  1. A simple, easily-extensible plugin framework to facilitate support for new file formats, new data types, and new versions of existing formats.
  2. A published API and API versioning scheme, with declared version compatibility policies.
  3. A set of interfaces that allow for multiple, versioned, independent implementations of:
     • Alignment file formats
     • Reference file formats
     • Variant file formats
     • Index file formats
  4. Repackage the existing implementations of the currently-supported file formats into versioned plugin components:
     • FASTA reference format
     • SAM/BAM v1.6 format
     • CRAM v3.1 format
     • VCF v4.3 format
  5. Generalized support for location-independent data access via URI based identifiers, and layered streaming to enable eventual support for alternative access and encryption mechanisms such as those defined by the GA4GH specifications refget, htsget, and crypt4gh.
  6. Support for long reads produced by sequencing technologies such as those developed by Oxford Nanopore and Pacific Biosciences, with accompanying tests.
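
To give a rough feel for what "multiple, versioned, independent implementations" behind a plugin registry could look like, here is a minimal sketch. It is purely illustrative: `FormatCodec`, `VcfCodec`, and `Registry` are hypothetical names invented for this example, not htsjdk classes and not the planned design. The only real-world detail it relies on is that a VCF file declares its version in a first line of the form `##fileformat=VCFv4.3`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: none of these types exist in htsjdk.
final class FormatRegistrySketch {

    /** A versioned codec for one file format, e.g. VCF 4.3. */
    interface FormatCodec {
        String formatName();
        int majorVersion();
        int minorVersion();
        /** True if the first line of a file identifies this format and version. */
        boolean claims(String firstLine);
    }

    /** Codec that recognizes VCF by its "##fileformat=VCFvX.Y" header line. */
    static final class VcfCodec implements FormatCodec {
        private final int major;
        private final int minor;

        VcfCodec(int major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        @Override public String formatName() { return "VCF"; }
        @Override public int majorVersion() { return major; }
        @Override public int minorVersion() { return minor; }

        @Override public boolean claims(String firstLine) {
            return firstLine.trim().equals("##fileformat=VCFv" + major + "." + minor);
        }
    }

    /** Holds independently registered codecs and picks one per input. */
    static final class Registry {
        private final List<FormatCodec> codecs = new ArrayList<>();

        void register(FormatCodec codec) {
            codecs.add(codec);
        }

        /** Finds the codec whose version matches the file's declared header. */
        Optional<FormatCodec> forHeader(String firstLine) {
            return codecs.stream().filter(c -> c.claims(firstLine)).findFirst();
        }

        /** Finds the newest registered version of a format, e.g. as a default for writing. */
        Optional<FormatCodec> newest(String formatName) {
            return codecs.stream()
                    .filter(c -> c.formatName().equals(formatName))
                    .max(Comparator.comparingInt(FormatCodec::majorVersion)
                            .thenComparingInt(FormatCodec::minorVersion));
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register(new VcfCodec(4, 2));
        registry.register(new VcfCodec(4, 3));

        // Reading: dispatch on the version the file itself declares.
        registry.forHeader("##fileformat=VCFv4.2").ifPresent(c ->
                System.out.println("reading with " + c.formatName() + " "
                        + c.majorVersion() + "." + c.minorVersion()));

        // Writing: fall back to the newest registered version.
        registry.newest("VCF").ifPresent(c ->
                System.out.println("writing with " + c.formatName() + " "
                        + c.majorVersion() + "." + c.minorVersion()));
    }
}
```

In a sketch like this, supporting a new format version would mean registering one more codec rather than touching the existing ones, which is the property the deliverables above are aiming for. How plugins are actually discovered and versioned is left open here.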

lbergelson avatar May 27 '20 17:05 lbergelson

Great news. Any chance of including VCFv4.4 in scope near the end? There are quite a few SV changes in 4.4, so if you're designing a better VCF API, it would be prudent to incorporate the improved SV support into the htsjdk API.

There's also work underway on defining a 'strict' version of SAM/VCF (see https://github.com/samtools/hts-specs/pull/283 for an initial SAM draft) that requires the SAM/VCF to actually make sense (e.g. mate/SA records actually exist, NM matches edit distance, and so on). Do you consider this, and any matching SAM/VCF validators, to be in the scope of htsjdk or picard?

d-cameron avatar May 28 '20 01:05 d-cameron

Also, is there a plan for community engagement when drafting the APIs? I'd definitely like to contribute to any htsjdk API extensions that treat VCF symbolic alleles as first-class citizens.

d-cameron avatar May 28 '20 01:05 d-cameron

Yay! Super excited for you, congrats!

ohofmann avatar May 28 '20 03:05 ohofmann

Bravo! :)

brainstorm avatar May 28 '20 03:05 brainstorm

@d-cameron VCF 4.4 itself was not included in the deliverables for the grant, but it's exactly the kind of thing we hope the grant-funded work will more readily enable. The same goes for new interfaces for things like VariantContext and SAMRecord. It may be possible to begin VCF 4.4 in parallel, though, once we have the base framework in place (htsjdk still doesn't have VCF 4.3 write support). As for community engagement, the PRs for this work will be available for public comment as usual.

cmnbroad avatar Jun 01 '20 21:06 cmnbroad