Add support for amino acid sequences

Open audy opened this issue 9 years ago • 14 comments

The README says that this "may be added in the future." Any updates on those plans? Would adding AA support be a large undertaking or something as simple as adding a new alphabet and scoring matrix?

audy avatar Dec 01 '14 18:12 audy

Thanks for the suggestion. In addition to what you have already noted, I believe some code is needed to detect the type of file (nucleotides/amino acids) as well. Also, there might be some parts of the code that assume there are only 4 symbols (e.g. database indexing using kmers); this code would need to be updated. Which commands are most relevant for amino acids?
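
To make the kmer point concrete, here is a minimal sketch in Python. It is not vsearch's actual code; the function names and the 20-letter alphabet string are assumptions for illustration. It encodes each kmer as a number in base len(alphabet), which shows how a directly indexed table grows when moving from 4 to 20 symbols.

```python
NUCLEOTIDES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)


def kmer_codes(seq, k, alphabet):
    """Yield an integer code for every k-mer in seq.

    K-mers containing a symbol outside the alphabet (e.g. N or X) are skipped.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    base = len(alphabet)
    for start in range(len(seq) - k + 1):
        code = 0
        for symbol in seq[start:start + k].upper():
            value = index.get(symbol)
            if value is None:
                break  # ambiguous symbol: skip this k-mer
            code = code * base + value
        else:
            yield code


def index_size(k, alphabet):
    """Number of slots a direct lookup table needs for all possible k-mers."""
    return len(alphabet) ** k
```

With k = 8, a table indexed this way has 4^8 = 65,536 slots for nucleotides but 20^8 = 25.6 billion for amino acids, so a protein index would likely need a shorter word length, a reduced alphabet, or hashing.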

torognes avatar Dec 01 '14 21:12 torognes

Right now the most relevant commands are for clustering (--usearch_local and --usearch_global).

Nucleotide/AA detection is IMO a terrible idea (shouldn't the bioinformatician know what they are clustering?), but this feature would be necessary to make vsearch a "drop-in" replacement for usearch.
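
For reference, detection of this kind usually comes down to a composition heuristic. The following is a minimal sketch in Python with a hypothetical function name and an arbitrary 0.9 threshold; it is not how usearch or vsearch does it. It also shows why the approach is fragile.

```python
NUCLEOTIDE_SYMBOLS = set("ACGTUN")


def guess_sequence_type(seq, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from the fraction of A/C/G/T/U/N letters.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    fraction = sum(c in NUCLEOTIDE_SYMBOLS for c in letters) / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"
```

A call such as guess_sequence_type("MKVLAAGIVGL") returns "protein", but a short peptide made only of Gly, Ala, Thr and Cys residues, such as "GATTACA", is reported as "nucleotide". An explicit option set by the user avoids that ambiguity.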

audy avatar Dec 01 '14 21:12 audy

If you want fast amino acid searches perhaps Diamond, RAPsearch2, or GHOSTZ would be appropriate? For clustering amino acid sequences I'm not sure.

torognes avatar Dec 09 '14 22:12 torognes

I am also in need of a faster BLASTP. I tried usearch, which worked well in preliminary testing, but then I ran into the 4 GB database size limit. I have tried Diamond, Rapsearch, AB-blast, Lastal, and a host of others. Nothing has worked. So far the most promising is vsearch.

AB-blast is only about twice as fast as blast, and I need at least 10x. Diamond needs >100 GB of RAM to run, as does Rapsearch, I believe. I will check out GHOSTZ though.

deprekate avatar Jan 01 '15 02:01 deprekate

Ok, we'll try to add support for amino acid sequences in a future version.

torognes avatar Jan 05 '15 12:01 torognes

Awesome. Let me know if I can help test.

audy avatar Jan 05 '15 12:01 audy

I totally second that - i.e. AA support would be very useful.

bwawrik avatar Feb 16 '15 04:02 bwawrik

I third that!

ODiogoSilva avatar Feb 19 '15 15:02 ODiogoSilva

May I say that Diamond does not require >100 GB of RAM. I ran it successfully on machines with 4 GB of RAM using the full NR database. Feel free to contact me for more info.

bbuchfink avatar Mar 24 '15 12:03 bbuchfink

We have obtained funding to adapt vsearch to work with amino acid sequences. The work on this has just been started by @RonnySoak.

torognes avatar Jul 22 '15 13:07 torognes

I started the branch protein_adaption, on which I will work while adapting vsearch to amino acid sequences. At the same time I will start a suite of unit tests, especially for the parts that I will change during my work on vsearch.

Feel free to leave comments :-)

RonnySoak avatar Jul 23 '15 10:07 RonnySoak

That's great news! @torognes

audy avatar Jul 23 '15 14:07 audy

The work on this issue has been suspended due to lack of funding. Hopefully the work can be resumed later. We will update the issue when we have more information.

torognes avatar Sep 22 '15 15:09 torognes

I'm sorry there have been funding issues :-(

I may have mentioned this before, but the parasail library (www.github.com/jeffdaily/parasail) offers pairwise alignment of amino acids with optimizations for the SSE2, SSE4.1, AVX2, and KNC instruction sets. It also implements the striped algorithm from Farrar, 2007 and the blocked algorithm from your 2000 paper.

Given the availability of this high quality library for pairwise protein alignment, is this problem more accessible?
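
For anyone unfamiliar with what such a library computes, here is a minimal, unoptimized sketch of a local (Smith-Waterman style) alignment score in plain Python. It is not parasail's API and uses no SIMD; the function names, the linear gap penalty, and the toy match/mismatch scores are assumptions for illustration. A real protein aligner would plug in a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def toy_score(x, y):
    """Toy substitution score (an assumption, not a real matrix like BLOSUM62)."""
    return 2 if x == y else -1


def local_alignment_score(a, b, score=toy_score, gap=-4):
    """Best local alignment score of a and b with a linear gap penalty.

    Only the previous row of the dynamic programming matrix is kept.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            cell = max(
                0,                          # start a new local alignment
                prev[j - 1] + score(x, y),  # align x with y
                prev[j] + gap,              # gap in b
                curr[j - 1] + gap,          # gap in a
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Because the substitution score is passed in as a function, switching from nucleotides to amino acids in this sketch only means supplying a different scoring function. The vectorized kernels are the hard part, and that is what the library already provides.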

colinbrislawn avatar Sep 25 '15 21:09 colinbrislawn