Multiple rounds of metaeuk

Open xiekunwhy opened this issue 2 years ago • 1 comments

Hi,

I am annotating some large animal and plant genomes. For homology-based annotation, I want to use the proteins in OrthoDB as homologs, but there are too many protein sequences (5,000,000+ for vertebrates) and metaeuk is slow.

Could I split the whole protein database into tens or hundreds of pieces, run metaeuk on each piece separately, combine all the target sequences that appear in the metaeuk results, and then run metaeuk again using this combined set of target sequences to get the final results?

Best, Kun

xiekunwhy, Apr 28 '22 01:04

Hi,

I am very sorry for the late reply. This issue somehow escaped me.

What you suggest sounds reasonable. Basically, it is a way to pre-filter the target database and retain only the sequences that have the potential to contribute something at a later stage. However, if the idea is too involved to implement, here are other things you could try:

  1. Divide your contigs into several input files and run each against the large target database
  2. Cluster your target database and use only the representative sequences as a slimmer version of the target (or construct profiles from each cluster)
  3. Choose a different, smaller target database. You can find some options using the databases command
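
For the splitting itself (either the contigs in option 1 or the protein database in the original two-round idea), a minimal Python sketch could look like the following. It is only an illustration, not part of metaeuk: the function names and file names are hypothetical, and `target_ids_from_metaeuk` assumes the predicted-protein headers start with the target accession as the first `|`-separated field, which should be checked against your own output before relying on it.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header_line, sequence_lines) for each record in a FASTA file."""
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line, []
            elif header is not None:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(path, n_pieces, out_dir):
    """Distribute records round-robin over n_pieces files and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"piece_{i:03d}.fasta" for i in range(n_pieces)]
    handles = [open(p, "w") for p in paths]
    try:
        for i, (header, lines) in enumerate(read_fasta(path)):
            out = handles[i % n_pieces]
            out.write(header)
            out.writelines(lines)
    finally:
        for handle in handles:
            handle.close()
    return paths


def target_ids_from_metaeuk(fas_path):
    """Collect target accessions from prediction headers.

    Assumption: each header begins with the target accession, followed by '|'.
    """
    return {header[1:].split("|")[0].strip() for header, _ in read_fasta(fas_path)}


def filter_fasta(path, keep_ids, out_path):
    """Write only records whose ID (first word of the header) is in keep_ids."""
    kept = 0
    with open(out_path, "w") as out:
        for header, lines in read_fasta(path):
            words = header[1:].split()
            if words and words[0] in keep_ids:
                out.write(header)
                out.writelines(lines)
                kept += 1
    return kept
```

Each piece would then be searched in its own metaeuk run. For the two-round idea, the IDs collected from all first-round results can be merged into one set and passed to `filter_fasta` to build the reduced target database for the final run.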

elileka avatar Jul 11 '22 10:07 elileka