htsjdk
htsjdk copied to clipboard
Added validation warning if the mapped length of the read is greater than the predicted insert size for FR pairs on the same chromosome.
Description
I'm running into situations where I'm being handed BAM files that have reads with mapped lengths longer than their insert sizes. E.g. a 100bp read is part of an FR
pair, has the insert size on it recorded as 95
but has a cigar of 100M
. This seems wrong to me - and this kind of situation is remedied by tools like MergeBamAlignment
in Picard. I'd like to be able to point users to ValidateSamFile
and have it catalog these errors too.
Since this can be somewhat pervasive in pipelines that don't trim back alignments that extend past the end of the insert, I've made it a warning, not an error.
I've also adjusted the way validation errors are handled so that even in STRICT
mode, exceptions are not thrown for warnings, but only for errors.
Checklist
- [x] Code compiles correctly
- [x] New tests covering changes and new functionality
- [x] All tests passing
- [ ] Extended the README / documentation, if necessary
- [ ] Is not backward compatible (breaks binary or source compatibility)
@yfarjoun & @ktibbett Could I request that you guys take a look at this PR please? Particularly it would be good, prior to merging, to ensure that this change won't cause you problems internally. My expectation is that any BAM produced by MergeBamAlignment with CLIP_OVERLAPPING_READS=true
(the default) should be free from these kinds of errors.
failing tests...and needs a rebase now...
In terms of affecting the pipeline this seems fine. would still like to have someone review the code itself. @jacarey could you please do the honors?
@tfenne going over old PRs...can you rebase this? we'd like to put it in...
@yfarjoun I've rebased and fixed conflicts, but am still getting test failures which I don't have time to debug right now.
CramFileReader constructor with file, stream and refsource doesn't set a validation stringency and so it was null which caused processValidationErrors in SamUtils to ignore the error since previously it was it was only checking if it was if stringency == STRICT. I set them to SILENT since they have always been there just previously ignored and I'm not sure if they are relevant to the tests in any case. I could take a deeper look if necessary.
Also as a note I'm a fair amount of these warning spit out on test run. We have a lot of test data that fails this check (usually adapter clipped reads as far as I can tell)
@jacarey I'm not sure why adapter-clipped reads should cause this to fail...could you point me to a bam that has this problem?
So, there seems to be a problem with index_test.bam which is used throughout the test-suite. it causes lots of validation warnings to be emitted and fills up the log file, causing the Test to fail.
Not sure how to resolve this...should we fix index_test and replace it everywhere?
@tfenne, @jacarey, @lbergelson ??
@tfenne regarding the change around warnings, I think that it would be best to add a VALIDATION_STRINGENCY called ALLOW_WARNINGS which would be between STRICT and LENIENT.
I retract my statement about deletions...as you are looking at the length on the reference not the CIGAR.
is this still interesting to the OP? @tfenne
@jacarey could you point me to the failing data?
bump @tfenne @jacarey ....