
Tag handling with --keep-tags creates invalid SAM output

Open adamjorr opened this issue 4 years ago • 1 comments

Hi,

I was attempting to re-align reads that had already been aligned with another aligner. I was using the --keep-tags option, primarily because my reads carry RG and OQ tags that I care about. However, --keep-tags also copies every other tag, including MD, NM, MC, and AS. Since NGM sets these tags itself, it appends its own copies to the end of the record, so each of these tags appears twice in the same read. This violates the SAM specification, which allows each tag only once per alignment line, and consequently causes SAMtools to crash when it tries to parse the read.

As an example, the malformed read looks like this: 2 151M = 129799794 343 CCCTTGCTGCATGAGCCAGTAGCTGGGTGGGCATGGTAGCCTCTTGTCTTCCTAGCTTGCCCCTCCAGACATGGAACCTCCACACTGTGAGCGACTTGGTGTGGGGCAATCCAGGCAGATGTGCTCAGTCTGCCACACCTAGGATGGGGCT :862939:9=:=<<=9===<>4=>==<,;054=6;':=>8;/1/5;==?-<>??;<>>>9<<9?=&><7;;>28=.<<0:9-7>>@97<+<'+;3?>3)<:>[email protected]=@2:1-)>><?4?A).=??<)3=.;@>?A,*4@A5;#### MD:Z:10C48A91 PG:Z:MarkDuplicates.1E.5J RG:Z:HK2WY.5 NM:C:2 OQ:Z:####A7AA7,,FFFAA,A7,7FFA,,AF7AA<<,,7A7KF<,FFKKKFFAA<,7FAA<7,F,F7AKKFA,AA,FF,FA7FAF7FF(FKAFFAKKFFFKKKF,KKFF7,7,FAKF<,F7F<<<F,FKKKF<KAKKKAKKFKAFA<A<<,<<< UQ:C:22 AS:C:141 MQ:i:60 MC:Z:151M AS:i:1460 NM:i:2 NH:i:0 XI:f:0.9868 X0:i:0 XE:i:39 XR:i:151 MD:Z:91T48G10
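To check a file for this problem, a small script can list the tags that occur more than once in a record. The following is only a minimal sketch: the function name `duplicate_tags` and the short record are made up for illustration (it is not the read above), and it assumes a tab-separated alignment line with the 11 mandatory SAM columns in front of the optional tags.

```python
from collections import Counter


def duplicate_tags(sam_line):
    """Return the sorted tag names that occur more than once in one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    optional = fields[11:]  # the first 11 columns are mandatory in SAM
    # Each optional field looks like TAG:TYPE:VALUE; count by TAG only,
    # so NM:C:2 and NM:i:2 are treated as the same tag.
    counts = Counter(field.split(":", 1)[0] for field in optional)
    return sorted(tag for tag, n in counts.items() if n > 1)


# Hypothetical record: MD and NM were carried over and then written again.
dup_example = "\t".join([
    "read1", "0", "chr1", "100", "60", "4M", "*", "0", "0", "ACGT", "FFFF",
    "MD:Z:4", "RG:Z:grp1", "NM:C:0", "AS:i:40", "NM:i:0", "MD:Z:4",
])
print(duplicate_tags(dup_example))
```

For the made-up record this reports MD and NM; in the read above, AS would be flagged as well.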

For now, I'll get around this by getting the reads and aligning them as FASTQ, but if NGM is still being developed I think a good option would be to allow the user to specify which tags to keep when using --keep-tags, have NGM overwrite tags it outputs, or allow more user control over which tags are output by NGM.
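Another possible workaround, sketched here under assumptions, is to strip every tag except the ones worth keeping before handing the reads to NGM. The helper `keep_only_tags` and the sample record are hypothetical, the input is assumed to be tab-separated SAM text, and I have not verified that NGM with --keep-tags then produces valid output in every case.

```python
def keep_only_tags(sam_line, keep=("RG", "OQ")):
    """Drop all optional tags from a SAM line except those named in `keep`."""
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line  # header lines pass through unchanged
    fields = line.split("\t")
    mandatory, optional = fields[:11], fields[11:]
    # Optional fields look like TAG:TYPE:VALUE; filter on the TAG part.
    kept = [field for field in optional if field.split(":", 1)[0] in keep]
    return "\t".join(mandatory + kept)


# Hypothetical record carrying tags from a previous aligner.
filter_example = "\t".join([
    "read1", "0", "chr1", "100", "60", "4M", "*", "0", "0", "ACGT", "FFFF",
    "MD:Z:4", "RG:Z:grp1", "NM:C:0", "OQ:Z:AAAA", "AS:C:40",
])
print(keep_only_tags(filter_example))
```

Applied line by line to the input SAM, this leaves only RG and OQ on each read, so the aligner's own MD, NM, and AS tags would have nothing to collide with.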

adamjorr avatar Sep 15 '20 02:09 adamjorr

Thanks. Yes, it's been a while since we made any changes to the code. Cheers, Fritz

fritzsedlazeck avatar Sep 15 '20 02:09 fritzsedlazeck