
Implement the search_global command

Open torognes opened this issue 8 years ago • 7 comments

Should be simple to add. Together with the other issues that suggest further usearch 7 commands to implement, this will make the set of usearch 7 commands for nucleotide sequences complete, as far as I can see, except for the cluster_otus command.

torognes avatar Oct 09 '15 09:10 torognes

Do you have any plans to implement the cluster_otus command as well? This is the only command I still use from the usearch pipeline. All other commands are already replaced by vsearch.

kunstner avatar Mar 16 '16 09:03 kunstner

Thanks for the suggestion. We do not have any plans to implement cluster_otus. I'll look into it and see how much it will take.

torognes avatar Mar 17 '16 16:03 torognes

:+1:

Of course, specifics are scarce. My understanding is that it performs like --cluster_smallmem, except that when a read does not match an existing centroid, it is passed through --uparse_ref before it is allowed to become a new centroid. The --uparse_ref algorithm attempts to explain how a new read could derive from existing reads in a database, in a way that sounds a lot like --uchime_denovo. Implementing @frederic-mahe's uchime suggestions as an internal step of OTU picking could come close to parity with --cluster_otus. https://github.com/torognes/vsearch/issues/118#issuecomment-193178967

http://www.drive5.com/usearch/manual/uparseotu_algo.html http://www.drive5.com/usearch/manual/cmd_cluster_otus.html http://www.drive5.com/usearch/manual/uparseref_algo.html
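To make that understanding concrete, here is a minimal Python sketch of greedy clustering with a chimera check before a read may become a centroid. It is not the usearch or vsearch implementation. The function names, the position-wise `identity` (a stand-in for a real alignment), and the exact-match two-parent test in `looks_chimeric` are all hypothetical simplifications, and reads are assumed to arrive sorted by decreasing abundance.

```python
def identity(a, b):
    """Toy identity: matching positions / longer length (no real alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest


def looks_chimeric(read, centroids):
    """Toy two-parent model: an exact prefix of one centroid joined to the
    exact suffix of a different centroid. Real tools score alignments."""
    for i, left in enumerate(centroids):
        for j, right in enumerate(centroids):
            if i == j:
                continue
            for bp in range(1, len(read)):
                if read[:bp] == left[:bp] and read[bp:] == right[bp:len(read)]:
                    return True
    return False


def cluster_with_chimera_filter(reads, threshold=0.97):
    """Greedy clustering; unmatched reads must pass the chimera check
    before they are allowed to become new centroids."""
    centroids, members, discarded = [], {}, []
    for read in reads:  # assumed sorted by decreasing abundance
        best = max(centroids, key=lambda c: identity(read, c), default=None)
        if best is not None and identity(read, best) >= threshold:
            members[best].append(read)      # joins an existing OTU
        elif looks_chimeric(read, centroids):
            discarded.append(read)          # explained by two existing centroids
        else:
            centroids.append(read)          # becomes a new centroid
            members[read] = [read]
    return centroids, members, discarded


reads = ["AAAAAAAAAA", "CCCCCCCCCC", "AAAAACCCCC", "AAAAAAAAAT"]
centroids, members, discarded = cluster_with_chimera_filter(reads, threshold=0.9)
```

In this toy run the third read is half of each earlier centroid, so it is discarded instead of founding a spurious OTU, which is the inflation-limiting behaviour described above.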

@kunstner, if I may ask, why choose uparse over another clustering algorithm? What qualities would you hope for in a vsearch implementation?

colinbrislawn avatar Mar 17 '16 18:03 colinbrislawn

@colinbrislawn: I use it for microbiome data. Actually, I have a fairly smooth-running pipeline using vsearch/usearch for preprocessing and mothur for classification and OTU binning. Unfortunately, the running time of this approach is quite long and it is very demanding with respect to either RAM or disk space. I was looking for an alternative approach and came across the cluster_otus command. The results look quite similar to the results I obtained with mothur (which isn't the case if I use the other cluster commands implemented in usearch or vsearch). My second aim is to use a pipeline based entirely on open-source software that scales nicely with huge data sets. Mothur is not a good option in this case if I have to test different parameters.

kunstner avatar Mar 18 '16 07:03 kunstner

Hi @kunstner, Thanks for telling me a little more about your pipeline. I used the uparse pipeline right when it came out, but like you I value open-source science, so I switched to vsearch in 2015. VSEARCH definitely scales well and swarm scales even better, although its definition of an OTU is a little esoteric and has not yet garnered the popularity it deserves.

I have not used mothur for clustering. Does it mitigate OTU inflation really well like uparse does? Colin

colinbrislawn avatar Mar 18 '16 16:03 colinbrislawn

Hi @colinbrislawn,

my personal experience is that mothur usually mitigates OTU inflation well. But I have some data sets (with lots of samples sequenced) that contain quite a lot of very rare OTUs which I did not get using uparse. For most of the data I don't see this problem. Unfortunately, mothur has another problem. For larger data sets (MiSeq data, >400 samples), it is difficult to obtain the representative sequence for each OTU if a distance-based method is applied.

Axel

kunstner avatar Mar 21 '16 07:03 kunstner

For future reference, usearch_global stands for fast database search, and search_global stands for slow database search (no heuristics, can detect arbitrarily low pairwise identities).
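The difference can be illustrated with a small Python sketch. It assumes, hypothetically, that the fast search pre-selects candidates by shared k-mers and that the slow search compares the query against every database sequence. The function names, the position-wise `identity`, and the parameters `k` and `max_candidates` are illustrative only and are not the options or internals of either command.

```python
def kmers(seq, k=4):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def identity(a, b):
    """Toy identity: matching positions / longer length (no real alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest


def exhaustive_search(query, db, min_id):
    """Slow search: compare the query against every target, no pre-filter."""
    return [t for t in db if identity(query, t) >= min_id]


def heuristic_search(query, db, min_id, k=4, max_candidates=2):
    """Fast search: only targets sharing k-mers with the query are compared."""
    q = kmers(query, k)
    ranked = sorted(db, key=lambda t: len(q & kmers(t, k)), reverse=True)
    candidates = [t for t in ranked[:max_candidates] if q & kmers(t, k)]
    return [t for t in candidates if identity(query, t) >= min_id]


query = "ACGTACGTACGT"
db = ["ACGTACGTACGA",   # close match, shares k-mers with the query
      "AAGAAAGAAAGA"]   # 50% identical, but shares no 4-mer with the query

slow_hits = exhaustive_search(query, db, min_id=0.5)
fast_hits = heuristic_search(query, db, min_id=0.5)
```

Here the second target is 50% identical to the query but shares no 4-mer with it, so the k-mer pre-filter never considers it, while the exhaustive comparison still reports it.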

frederic-mahe avatar Dec 30 '18 10:12 frederic-mahe