usearch_global search aligning to Ns with 100% identity

Open lmolokin opened this issue 5 years ago • 4 comments

I am seeing false full-length alignments that show 100% identity to stretches of Ns.

vsearch v2.14.1_linux_x86_64

vsearch --usearch_global nano_reclust.fa \
--db blastoNCBI_120919.udb \
--userout nano_reclust.vsearch \
--userfields query+id+alnlen+qcov+target \
--output_no_hits \
--id 0.9 \
--query_cov 0.5 \
--maxhits 10 \
--maxaccepts 0 \
--maxrejects 0 \
--alnout nano_reclust.aln

image

alignment.txt

lmolokin, Jan 02 '20 22:01

Thanks for reporting this. I have seen similar behaviour as well. This is related to issue #354.

Matches between or to ambiguous residues are currently counted as matches, so the output is as expected given the current behaviour.

Matches to long stretches of N's like this are usually unwanted.
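One possible workaround, sketched here under assumptions, is to drop N-rich sequences from the FASTA file used to build the database before searching. The helper names `read_fasta`, `n_fraction` and `drop_n_rich`, and the 10% threshold, are hypothetical and not part of vsearch.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n_fraction(seq):
    """Fraction of the sequence made up of N symbols (0.0 for an empty sequence)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def drop_n_rich(records, max_n_fraction=0.1):
    """Keep only records whose N fraction does not exceed the threshold.

    The default threshold is an arbitrary value chosen for illustration.
    """
    return [(h, s) for h, s in records if n_fraction(s) <= max_n_fraction]


if __name__ == "__main__":
    demo = [">a", "ACGT", "NNNN", ">b", "ACGTACGT"]
    for header, seq in drop_n_rich(read_fasta(demo)):
        print(header, seq)
```

This only removes the offending targets; it does not change how the remaining ambiguous symbols are counted.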

torognes, Jan 03 '20 09:01

Any updates on this? We are facing the same issue, and it is skewing our results. Is there a way to see the match score w.r.t. alignment length?

ragavishanmugam, Nov 09 '21 04:11

No, there is currently no way to see the match score. The score for matching a nucleotide vs an N is zero.

I am not sure how to handle this.

Alignments can have a negative score and still be shown, both in vsearch and usearch. The alignment score is just used to align a pair of sequences in the best possible way. Note that terminal gaps (and gap penalties) are usually not counted.

These kinds of matches with a lot of Ns can also be produced by usearch, but perhaps not exactly this one with only Ns, due to some heuristics.

To eliminate this kind of match, I think we need to add an option where ambiguous matches (those involving symbols other than ACGTU) are not counted as matches. Currently, matches between compatible symbols (e.g. A vs R, but not A vs Y) are counted as matches when computing the identity percentage.
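To make the difference concrete, here is a minimal Python sketch of the compatibility rule and of the two ways of counting. It is not vsearch's code: the functions `compatible` and `identity` and the `count_ambiguous` switch are hypothetical, and identity is simplified to matching columns divided by the full alignment length.

```python
# IUPAC nucleotide codes mapped to the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
UNAMBIGUOUS = set("ACGTU")


def compatible(a, b):
    """True if the two symbols share at least one possible base."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))


def identity(query, target, count_ambiguous=True):
    """Percent identity of two gapped alignment rows of equal length.

    count_ambiguous=True mimics the behaviour described in this thread.
    count_ambiguous=False is the hypothetical stricter option, where a
    column counts only if both symbols are plain A, C, G, T or U.
    """
    if len(query) != len(target):
        raise ValueError("alignment rows must have the same length")
    matches = 0
    for q, t in zip(query.upper(), target.upper()):
        if q == "-" or t == "-":
            continue  # gap columns never count as matches
        if not compatible(q, t):
            continue
        if count_ambiguous or (q in UNAMBIGUOUS and t in UNAMBIGUOUS):
            matches += 1
    return 100.0 * matches / len(query) if query else 0.0


if __name__ == "__main__":
    print(compatible("A", "R"), compatible("A", "Y"))
    print(identity("ACGTACGT", "NNNNNNNN"))
    print(identity("ACGTACGT", "NNNNNNNN", count_ambiguous=False))
```

With the current counting, a query aligned against a run of Ns reaches 100% identity; with the stricter counting the same alignment drops to 0%.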

We could also add an option to set a (negative) score for ambiguous matches.
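As a rough illustration of that idea, the sketch below scores an alignment column by column with a configurable score for ambiguous columns. The function `alignment_score` and its `ambiguous` parameter are hypothetical, the match and mismatch values are picked for illustration, and gap penalties are left out entirely.

```python
def alignment_score(query, target, match=2, mismatch=-4, ambiguous=0):
    """Sum per-column scores for two gapped alignment rows.

    Any column where either symbol is not A, C, G or T (U is treated as T)
    receives the `ambiguous` score. ambiguous=0 corresponds to the behaviour
    described above, where a nucleotide vs N scores zero.
    """
    unambiguous = set("ACGT")
    q_row = query.upper().replace("U", "T")
    t_row = target.upper().replace("U", "T")
    score = 0
    for q, t in zip(q_row, t_row):
        if q == "-" or t == "-":
            continue  # gap penalties are deliberately not modelled here
        if q not in unambiguous or t not in unambiguous:
            score += ambiguous
        elif q == t:
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    print(alignment_score("ACGT", "NNNN"))
    print(alignment_score("ACGT", "NNNN", ambiguous=-1))
```

With a negative value, an alignment against a long run of Ns accumulates a penalty instead of staying at zero, which would make such alignments less attractive than real matches.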

torognes, Nov 10 '21 15:11

Thank you for replying.

My suggestion would be to differentiate mixed bases (like A vs R) from more generic bases (like A vs N). Even being able to treat just the Ns differently would be useful. Mixed bases can also indicate mixed populations in some cases, and how to interpret them is very subjective.

I think the practical way to implement this would be to leave the choice to users. If users could specify which combinations count as a match, and what weight each combination carries in the match score, it would be useful for all cases.
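A small sketch of what such a user-supplied table could look like, purely as an illustration of the suggestion. The function `weighted_identity` and the `pair_weights` format are hypothetical and do not correspond to any existing vsearch option.

```python
def weighted_identity(query, target, pair_weights):
    """Percent identity where each symbol pair contributes a user-chosen weight.

    Identical unambiguous symbols always contribute 1.0. Any other pair
    contributes the weight found in `pair_weights`, keyed by a frozenset of
    the two symbols so that order does not matter, and 0.0 if it is absent.
    Gap columns are skipped and do not count towards the length.
    """
    total = 0.0
    columns = 0
    for q, t in zip(query.upper(), target.upper()):
        if q == "-" or t == "-":
            continue
        columns += 1
        if q == t and q in "ACGTU":
            total += 1.0
        else:
            total += pair_weights.get(frozenset((q, t)), 0.0)
    return 100.0 * total / columns if columns else 0.0


if __name__ == "__main__":
    # Example table: A vs R counts as half a match, A vs N counts as nothing.
    weights = {frozenset("AR"): 0.5, frozenset("AN"): 0.0}
    print(weighted_identity("AAAA", "ARAN", weights))
```

Here the mixed base R still contributes partially, while N contributes nothing, which separates the two cases as described above.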

Regards, Ragavi.

ragavishanmugam, Nov 10 '21 16:11