
CIGAR strings differ between samout and other outputs

Open frederic-mahe opened this issue 6 years ago • 1 comments

vsearch \
    --usearch_global <(printf '>query\nAAGGGGGGGGGCCC\n') \
    --db <(printf '>target\nAAGGGGAAAAGGGGCC\n') \
    --minseqlength 1 \
    --quiet \
    --id 0.1 \
    --userfields caln \
    --userout - \
    --samout - \
    --alnout -
Qry  1 + AAgggg---gggggCC 13
         ||||||    ||||||
Tgt  1 + AAGGGGAAAAGGGGCC 16

SAM:  6M3D7M1I
caln: 6M3I7MD

SAM's CIGAR strings require a number before each letter ("7M1I" instead of "7MD"), but the main difference is in the point of view.

SAM's CIGAR strings encode the target modifications needed to equal the query, whereas CIGAR strings in other output formats encode the query modifications needed to equal the target.

If this is confirmed, that should be indicated in the documentation.

frederic-mahe, Aug 04 '17 12:08

This can be confirmed.

Here is the SAM format specification:

https://samtools.github.io/hts-specs/SAMv1.pdf

Other sources of information:

https://doi.org/10.1093/bioinformatics/btp352

http://www.drive5.com/usearch/manual/cigar.html

The spec indicates that the CIGAR string in the SAM format is in the direction from the reference (target) to the query. Deleted symbols are only found in the reference (target), while inserted symbols are only found in the query. The point of view may be different in other formats. I have made the output similar to USEARCH.

The spec does not say, but in SAM files there always seems to be a '1' in front of operations that occur once, whereas in other contexts this '1' is sometimes omitted.
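
For readers who need to compare the two outputs, the relationship described above can be sketched in a few lines of Python. This is only an illustration, not code from vsearch: the function names `parse_cigar` and `caln_to_sam` are made up for this example, and it assumes the strings contain only the M, I and D operations seen in this issue.

```python
import re

# One CIGAR operation: an optional run length followed by an operation letter.
# Assumption: only M, I and D occur, as in the strings shown in this issue.
_OP = re.compile(r"(\d*)([MID])")


def parse_cigar(cigar):
    """Split a CIGAR string into (count, op) pairs; a missing count means 1."""
    ops = _OP.findall(cigar)
    # Reject input the pattern did not fully consume (unknown letters, stray digits).
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return [(int(n) if n else 1, op) for n, op in ops]


def caln_to_sam(caln):
    """Rewrite a caln-style string in SAM style.

    Swaps I and D (change of point of view) and writes every count
    explicitly, including the '1' that caln leaves out.
    """
    swap = {"M": "M", "I": "D", "D": "I"}
    return "".join(f"{n}{swap[op]}" for n, op in parse_cigar(caln))


print(caln_to_sam("6M3I7MD"))  # 6M3D7M1I
```

Applied to the `caln` value from the example in this issue, the sketch reproduces the CIGAR string reported in the SAM output.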

torognes, Aug 22 '17 14:08