htsjdk
htsjdk copied to clipboard
Implementation of a SAMRecord comparator that matches samtools' queryname sort order.
@nh13, @lbergelson I put this together mostly for fun - it's a comparator that will sort read names the same way samtools does. I'm not really sure what the best way to integrate this is in HTSJDK though. Thoughts?
Codecov Report
Merging #1600 (2260167) into master (a38c78d) will increase coverage by
0.014%
. The diff coverage is94.118%
.
@@ Coverage Diff @@
## master #1600 +/- ##
===============================================
+ Coverage 69.838% 69.852% +0.014%
- Complexity 9653 9664 +11
===============================================
Files 703 704 +1
Lines 37647 37681 +34
Branches 6114 6121 +7
===============================================
+ Hits 26292 26321 +29
- Misses 8904 8906 +2
- Partials 2451 2454 +3
Impacted Files | Coverage Δ | |
---|---|---|
.../samtools/SAMRecordQueryNameNaturalComparator.java | 94.118% <94.118%> (ø) |
|
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java | 80.952% <0.000%> (-9.524%) |
:arrow_down: |
src/main/java/htsjdk/io/AsyncWriterPool.java | 72.222% <0.000%> (-1.389%) |
:arrow_down: |
Always best to be I second place: https://github.com/samtools/htsjdk/pull/1601
Do you need to fall back on the first and second of pair flags as samtools does? https://github.com/samtools/samtools/blob/bdc5bb81c11f2ab25ea97351213cb87b33857c4d/bam_sort.c#L1798
@nh13 I guess it depends on whether you want to match samtools 100% or implement the "natural" sort order which doesn't prescribe how ties are broken. I don't think I have a strong opinion so long as e.g. first of pair sorts before second of pair, and secondary/supplementary records sort after primary
I think the ideal way to make this work, but which would be a PITA to retrofit would be something akin to fgbio's SamOrder, which defines more than the SO
tag, and has capabilities for setting SO
, SS
etc. and for detecting the ordering based on more than the SO
tag. But we had the benefit of writing that many years later, and after the sub-sort stuff landed in the spec.
Ah, the eternal problem https://github.com/samtools/htsjdk/issues/766. I was assuming that awkwardly working around this was the impetus for the last pr?
I think we have a few options none of which are trivial.
I like what you describe. The subsort field was invented to fix this problem so it makes sense that we would use it. Do you know if samtools is outputting the appropriate SS field in new versions? We never started doing so, but maybe they're more responsible... That would solve the problem going forward, and we could allow users to pick which variant order they want to use when writing new files. It wouldn't solve the problem of reading older sam files.
An alternative would be to try to make some sort of meta queryname sorter which tries to figure out the subsort when it's reading a file. That shouldn't be hard I would think, just try both sort orders and if one passes it's valid. I can't remember if there's any issue with stateful comparators but I think it wouldn't be a problem to have one that figures out which it is and then remembers it.
We should write the SS field in either case...
@lbergelson I don't think samtools writes the SS
field either for queryname sorting. Though we might be able to get @jmarshall or @nh13 to help with that.
I think the easiest first step would be to have HTSJDK accept either as being queryname sorted - HTSJDK already has that SAMSortOrderChecker
which is stateful. I don't think it would be hard to extend that to allow for either/or checking for queryname sorting. And that way we could at least get to the point where HTSJDK (and picard etc.) would accept either samtools or HTSJDK queryname sorting (or, "natural" and "lexicographic") as valid. It's possible that would just fix ValidateSamFile
in Picard, or it might require a little more tweaking.
Then the question becomes ... if it can accept either as valid, is there any reason not to switch to the samtools style as the default output for queryname
?
The main reason not to switch the default immediately is that a lot of people used mixed pipeline where some of the tools are on older versions. Switching would break them until they update all parts of the pipeline.
Does samtools have trouble reading htsjdk queryname bams? I'm not sure the problem is symmetrical.
I don't believe there are issues with samtools reading picard/htsjdk queryname sorted files, but it's not something I do regularly. I think samtools tends more towards trusting/assuming the user provided acceptable input and not validating.