htsjdk icon indicating copy to clipboard operation
htsjdk copied to clipboard

Tribble writing support

Open magicDGS opened this issue 7 years ago • 10 comments

Description

As described in #701, it will be nice to have a simple mechanism to write features in tribble not only for VariantContext. This is a port from one of my project that I think that a lot of API users will benefit. It includes the following:

  • FeatureWriter interface to encapsulate adding features and writing headers. There is no support for comments yet, but it could be something added in the future.
  • FeatureEncoder interface to encapsulate feature-specific output formats. Based on a very nice suggestion from @lindenb in one of my previous PRs, the purpose of this interface is to format the feature/header and add it to an Appendable. This allow to write directly the features without storing the String format, and give more flexibility. It also have a method to encapsulate the tabix format, to allow indexing on the fly.
  • Example of FeatureEncoder to convert any Feature into a simple BED format (tab-delimited chromosome, start and end). It will be used also in tests.
  • Package protected implementation of FeatureWriter similar to the VCF writer, which uses an encoder and an output stream to create the feature file. This class can be extended for other purposes as noted bellow.
  • Package protected sub-class of FeatureWriterImpl to index on the fly. This one will require eventually #810 for use the java.nio.Path index method, but it is very useful for writing feature files already indexed.
  • Wrapper for FeatureWriter to use asynchronous writing. This is a very simple wrapper extending the AbstractAsyncWriter. The only flaw is that the header could not be written asynchronously with the current implementation (and thus, it's not safe). I do not really need that, and if it is a problem that it's not safe, this support could be removed.
  • Finally, for construction of feature writers, I decided to use a factory pattern instead of the static methods used in the tribble reader classes. This factory have support: 1) index on the fly, 2) block-gzip or gzip, 3) generate MD5 diggest files, 4) asynchronous wriitng and 5) sequence dictionary addition.

All the writing support use java.nio.Path to allow other filesystems. There are no tests for the writers yet (only for the simple encoder), because I want to know your opinion about this addition beforehand. I know that this is a very big PR, but I think that it may be useful for others and will be nice to included in HTSJDK for easier development of new bioinformatic tools.

Checklist

  • [x] Code compiles correctly
  • [x] New tests covering changes and new functionality
  • [x] All tests passing
  • [x] Extended the README / documentation, if necessary
  • [ ] Is not backward compatible (breaks binary or source compatibility)

magicDGS avatar Mar 10 '17 12:03 magicDGS

Because you were interested in the open issue, can you have a look to this one, @lbergelson? It still needs #810 for simplify some code paths, but I think that this could be cool for others here. I'm already using this code in my own tools, and it works amazingly!

magicDGS avatar Mar 10 '17 12:03 magicDGS

Codecov Report

Merging #822 into master will decrease coverage by 0.096%. The diff coverage is 49.565%.

@@               Coverage Diff               @@
##              master      #822       +/-   ##
===============================================
- Coverage     66.198%   66.101%   -0.096%     
+ Complexity      7640      7631        -9     
===============================================
  Files            535       539        +4     
  Lines          32406     32491       +85     
  Branches        5514      5514               
===============================================
+ Hits           21452     21477       +25     
- Misses          8790      8847       +57     
- Partials        2164      2167        +3
Impacted Files Coverage Δ Complexity Δ
...ain/java/htsjdk/tribble/writer/FeatureEncoder.java 0% <0%> (ø) 0 <0> (?)
...va/htsjdk/tribble/writer/FeatureWriterFactory.java 0% <0%> (ø) 0 <0> (?)
.../java/htsjdk/tribble/writer/FeatureWriterImpl.java 100% <100%> (ø) 6 <6> (?)
...java/htsjdk/tribble/writer/AsyncFeatureWriter.java 80% <80%> (ø) 7 <7> (?)
...main/java/htsjdk/tribble/bed/BEDSimpleEncoder.java 85.714% <85.714%> (ø) 3 <3> (?)
...a/htsjdk/tribble/writer/IndexingFeatureWriter.java 87.5% <87.5%> (ø) 5 <5> (?)
.../samtools/seekablestream/SeekableMemoryStream.java 95% <0%> (-5%) 11% <0%> (-1%)
.../htsjdk/tribble/index/tabix/TabixIndexCreator.java 84.848% <0%> (-4.545%) 14% <0%> (-1%)
...dk/samtools/seekablestream/SeekableHTTPStream.java 54.545% <0%> (-3.03%) 9% <0%> (-4%)
...ntcontext/writer/IndexingVariantContextWriter.java 83.333% <0%> (-2.381%) 15% <0%> (-1%)
... and 25 more

codecov-io avatar Mar 10 '17 12:03 codecov-io

@magicDGS Sorry I haven't gotten back to you on this. I think we're very interested in this but I haven't had time to take a proper look. It's a big update and there's an upcoming deadline for some of my own work that's taking priority.

lbergelson avatar Mar 13 '17 22:03 lbergelson

No worries @lbergelson. I'm already using this code in another project, but I though that it would be nice to have support for this in HTSJDK. Whenever you have time I'm up for working in your suggestions...

magicDGS avatar Mar 14 '17 04:03 magicDGS

Friendly ping here @lbergelson!

magicDGS avatar May 08 '17 15:05 magicDGS

@jonn-smith has volunteered to review this old branch, as he is finding the lack of tribble writing support quite awkward in his current project.

droazen avatar Jan 25 '18 20:01 droazen

@magicDGS Can you rebase this and I'll start taking a look?

jonn-smith avatar Jan 25 '18 20:01 jonn-smith

@jonn-smith - I rebased, and I hope that it works with the current implementation of bgzip, which changed a lot after I had the last look to this branch.

Thanks for looking into this, will be really, really helpful!

magicDGS avatar Jan 25 '18 21:01 magicDGS

@jonn-smith are you going to get around to reviewing this?

yfarjoun avatar Feb 04 '19 19:02 yfarjoun

Another friendly ping @jonn-smith. This was rebased as you asked a year ago....

yfarjoun avatar Mar 25 '19 18:03 yfarjoun