Eukaryote only sequence database

Open BrennicaMarlow opened this issue 4 years ago • 1 comments

I want to use hhblits to make a multiple sequence alignment using only eukaryote sequences. Is there a way to get only the eukaryote sequences from the uniclust database?

BrennicaMarlow, Sep 13 '19 16:09

Hi BrennicaMarlow and all who are reading this,

Actually, I want to do the same with proteins from dsDNA viruses. The best (partial) answer I have so far is to use the ffindex_get utility (it ships with hhsuite-3.2.0) to pull entries out of UniRef30_2020_02_a3m.ffdata by their indices and retrieve the alignments that correspond to a specific organism. Something like this:

$ ffindex_get UniRef30_2020_02_a3m.ffdata UniRef30_2020_02_a3m.ffindex 110848668 110849024 110850663 110850770 11085238

Then recalculate the HMMs and context states for these sub-alignments with hhmake and cstranslate, respectively, and otherwise follow the guidelines for building customized databases from MSAs.

The problem, however, is that there is no correspondence between the database index in the ffindex file (110848668 110849024 110850663 110850770 11085238 in the command above) and the taxonomic group. I imagine it's possible to write a script that establishes this correspondence, because if you check the sequence headers in UniRef30_2020_02_a3m.ffdata, you will notice that they contain TaxID="NCBI Taxonomy ID". But maybe the developers can suggest a better way to solve this problem.
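A minimal sketch of such a script, in Python. It assumes that the ffindex file is tab-separated (entry name, byte offset, length), that each record in the ffdata file ends with a null byte, and that headers carry a `TaxID=<number>` field. The function names are my own, not part of HH-suite, and the set of wanted TaxIDs (e.g. all eukaryotes or all dsDNA viruses) has to come from elsewhere, such as the NCBI taxonomy dump.

```python
import re

# Matches the numeric NCBI Taxonomy ID in a header such as "... TaxID=12345 ..."
TAXID_RE = re.compile(r"TaxID=(\d+)")


def read_ffindex(index_path):
    """Yield (name, offset, length) for each line of an ffindex file.

    Assumption: columns are tab-separated as name, offset, length.
    """
    with open(index_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            name, offset, length = line.rstrip("\n").split("\t")
            yield name, int(offset), int(length)


def taxids_per_entry(data_path, index_path):
    """Map each ffindex entry name to the set of TaxIDs found in its headers."""
    result = {}
    with open(data_path, "rb") as data:
        for name, offset, length in read_ffindex(index_path):
            data.seek(offset)
            # Assumption: the record is terminated by a null byte.
            record = data.read(length).rstrip(b"\0").decode("ascii", "replace")
            taxids = set()
            for line in record.splitlines():
                if line.startswith(">"):
                    match = TAXID_RE.search(line)
                    if match:
                        taxids.add(int(match.group(1)))
            result[name] = taxids
    return result


def entries_matching(entry_taxids, wanted_taxids):
    """Return names of entries with at least one sequence from a wanted TaxID."""
    return [name for name, taxids in entry_taxids.items() if taxids & wanted_taxids]
```

For example, `entries_matching(taxids_per_entry("UniRef30_2020_02_a3m.ffdata", "UniRef30_2020_02_a3m.ffindex"), wanted)` would list the entries that contain at least one sequence from the TaxIDs in `wanted`. Note that this selects whole alignments; an alignment that passes the filter can still contain sequences from other taxa.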

Best regards, Danyil

danyilgrybchuk, May 22 '20 15:05