bamUtil icon indicating copy to clipboard operation
bamUtil copied to clipboard

CIGAR does not evaluate to the same length as SEQ

Open h-2 opened this issue 8 years ago • 3 comments

This warning is produced by validate even if the SEQ is set to *. The SAM spec explicitly states that it is ok for the SEQ to be * and that the lengths don't have to match:

If not a ‘*’, the length of the sequence must equal the sum of lengths of M/I/S/=/X operations in CIGAR

It even states that SEQ should be * for secondary alignments:

SEQ and QUAL of secondary alignments should be set to ‘*’ to reduce the file size.

Or am I misunderstanding something?

Thanks!

h-2 avatar Nov 24 '15 15:11 h-2

Thanks for pointing this out and finding the relevant parts of the documentation. I believe that you are correct. bamUtil won't warn if the CIGAR is *, but doesn't check SEQ's contents.

That validation is done in libStatGen, and I'm afraid that some programs using that library might make assumptions without knowing about *. I'll test things and report back.

pjvandehaar avatar Nov 27 '15 05:11 pjvandehaar

I had not considered secondary reads so when I put in the validation, I didn't see a situation where the cigar would have value, but there would be no sequence. You are correct, that validation does not seem to be valid, so I will try to fix that validation to only validate if SEQ is not '*'. I should be able to do this sometime this week. I'll let you know when complete.

mktrost avatar Dec 01 '15 18:12 mktrost

I updated libStatGen so that the cigar/seq length validation should only occur if the seq is not '*'. Please use the most recent libStatGen and let me know if this fixes the problem or if you continue to have issues.

mktrost avatar Dec 02 '15 01:12 mktrost