htsjdk
htsjdk copied to clipboard
CG tag has the incorrect type.
Description of the issue:
Using picard SamFormatConverter
to turn a long-read CRAM to BAM with more than 65536 cigar ops generates a CG:B:i,
tag. The SAM specification dictates this must be CG:B:I,
. This may be overly picky, but samtools validates against and thus rejects this data. Obviously we can make Samtools more lenient, but it would also be good to correct the issue at source as it is generating malformed data that will cause issues for older copies of samtools.
Your environment:
- version of htsjdk
- version of java
- which OS
Whatever version of htsjdk is bundled in Picard 2.25.7. (I'm assuming this issue is part of htsjdk and not picard.) openjdk 11.0.11 Ubuntu 18.04.5 LTS (bionic)
Steps to reproduce
See the file reported in https://github.com/samtools/samtools/issues/1477
If you convert this to SAM or CRAM (not using samtools so the CIGAR is correct) and then convert to BAM using picard then we see the uncompressed BAM binary contains CGBi
.
PS. It would also be good to get PG
headers added. I only tried Picard as a punt, as there's nothing in the headers to indicate that is infact what did the format conversion. Please provide data provenance for format conversions.
@jkbonfield Thanks for reporting. Looks like an oversight where we forgot to set the "isUnsigned" bit when setting the tag value.
Adding provenance to format conversions is a good idea, but I'll have to think about where that could live. Currently the writer in htsjdk doesn't know the type that the reader read since it just gets the same SAMRecord object no matter what the input was. Is that @cmnbroad still the case in the new code? I assume it is for BAM/CRAM since we don't have different header versions that would be propagated.
Does htslib automatically add PG lines when performing conversions or is it something special in samtools view?