htsjdk
htsjdk copied to clipboard
Tribble writing support
Description
As described in #701, it will be nice to have a simple mechanism to write features in tribble not only for VariantContext
. This is a port from one of my project that I think that a lot of API users will benefit. It includes the following:
-
FeatureWriter
interface to encapsulate adding features and writing headers. There is no support for comments yet, but it could be something added in the future. -
FeatureEncoder
interface to encapsulate feature-specific output formats. Based on a very nice suggestion from @lindenb in one of my previous PRs, the purpose of this interface is to format the feature/header and add it to anAppendable
. This allow to write directly the features without storing theString
format, and give more flexibility. It also have a method to encapsulate the tabix format, to allow indexing on the fly. - Example of
FeatureEncoder
to convert anyFeature
into a simple BED format (tab-delimited chromosome, start and end). It will be used also in tests. - Package protected implementation of
FeatureWriter
similar to the VCF writer, which uses an encoder and an output stream to create the feature file. This class can be extended for other purposes as noted bellow. - Package protected sub-class of
FeatureWriterImpl
to index on the fly. This one will require eventually #810 for use thejava.nio.Path
index method, but it is very useful for writing feature files already indexed. - Wrapper for
FeatureWriter
to use asynchronous writing. This is a very simple wrapper extending theAbstractAsyncWriter
. The only flaw is that the header could not be written asynchronously with the current implementation (and thus, it's not safe). I do not really need that, and if it is a problem that it's not safe, this support could be removed. - Finally, for construction of feature writers, I decided to use a factory pattern instead of the static methods used in the tribble reader classes. This factory have support: 1) index on the fly, 2) block-gzip or gzip, 3) generate MD5 diggest files, 4) asynchronous wriitng and 5) sequence dictionary addition.
All the writing support use java.nio.Path
to allow other filesystems. There are no tests for the writers yet (only for the simple encoder), because I want to know your opinion about this addition beforehand. I know that this is a very big PR, but I think that it may be useful for others and will be nice to included in HTSJDK for easier development of new bioinformatic tools.
Checklist
- [x] Code compiles correctly
- [x] New tests covering changes and new functionality
- [x] All tests passing
- [x] Extended the README / documentation, if necessary
- [ ] Is not backward compatible (breaks binary or source compatibility)
Because you were interested in the open issue, can you have a look to this one, @lbergelson? It still needs #810 for simplify some code paths, but I think that this could be cool for others here. I'm already using this code in my own tools, and it works amazingly!
Codecov Report
Merging #822 into master will decrease coverage by
0.096%
. The diff coverage is49.565%
.
@@ Coverage Diff @@
## master #822 +/- ##
===============================================
- Coverage 66.198% 66.101% -0.096%
+ Complexity 7640 7631 -9
===============================================
Files 535 539 +4
Lines 32406 32491 +85
Branches 5514 5514
===============================================
+ Hits 21452 21477 +25
- Misses 8790 8847 +57
- Partials 2164 2167 +3
Impacted Files | Coverage Δ | Complexity Δ | |
---|---|---|---|
...ain/java/htsjdk/tribble/writer/FeatureEncoder.java | 0% <0%> (ø) |
0 <0> (?) |
|
...va/htsjdk/tribble/writer/FeatureWriterFactory.java | 0% <0%> (ø) |
0 <0> (?) |
|
.../java/htsjdk/tribble/writer/FeatureWriterImpl.java | 100% <100%> (ø) |
6 <6> (?) |
|
...java/htsjdk/tribble/writer/AsyncFeatureWriter.java | 80% <80%> (ø) |
7 <7> (?) |
|
...main/java/htsjdk/tribble/bed/BEDSimpleEncoder.java | 85.714% <85.714%> (ø) |
3 <3> (?) |
|
...a/htsjdk/tribble/writer/IndexingFeatureWriter.java | 87.5% <87.5%> (ø) |
5 <5> (?) |
|
.../samtools/seekablestream/SeekableMemoryStream.java | 95% <0%> (-5%) |
11% <0%> (-1%) |
|
.../htsjdk/tribble/index/tabix/TabixIndexCreator.java | 84.848% <0%> (-4.545%) |
14% <0%> (-1%) |
|
...dk/samtools/seekablestream/SeekableHTTPStream.java | 54.545% <0%> (-3.03%) |
9% <0%> (-4%) |
|
...ntcontext/writer/IndexingVariantContextWriter.java | 83.333% <0%> (-2.381%) |
15% <0%> (-1%) |
|
... and 25 more |
@magicDGS Sorry I haven't gotten back to you on this. I think we're very interested in this but I haven't had time to take a proper look. It's a big update and there's an upcoming deadline for some of my own work that's taking priority.
No worries @lbergelson. I'm already using this code in another project, but I though that it would be nice to have support for this in HTSJDK. Whenever you have time I'm up for working in your suggestions...
Friendly ping here @lbergelson!
@jonn-smith has volunteered to review this old branch, as he is finding the lack of tribble writing support quite awkward in his current project.
@magicDGS Can you rebase this and I'll start taking a look?
@jonn-smith - I rebased, and I hope that it works with the current implementation of bgzip, which changed a lot after I had the last look to this branch.
Thanks for looking into this, will be really, really helpful!
@jonn-smith are you going to get around to reviewing this?
Another friendly ping @jonn-smith. This was rebased as you asked a year ago....