
How to build cs219 databases faster?

Open sunzhig opened this issue 3 years ago • 0 comments

:exclamation: Make sure to check out our User Guide.

Dear hh-suite developers,

When I try to build my own databases, the step that generates cs219.ff{data,index} with cstranslate is very slow. For a 1 GB input FASTA file, it takes about 8 hours. I wonder how the BFD databases were built, and whether there is any way to speed up this step. Or are there mistakes in my commands?

Steps to Reproduce (for bugs)

mmseqs createdb mgy_peptides_all.fasta mgy_db
mmseqs linclust mgy_db mgy_clu tmp
mmseqs result2msa mgy_db mgy_db mgy_clu mgy_cluMsa --msa-format-mode 1
mpirun -np 16 cstranslate_mpi -f -i mgy_cluMsa -o mgy_cs219 -x 0.3 -c 4 -I ca3m -b
mpirun -np 16 ffindex_apply_mpi mgy_cluMsa_ca3m.ff{data,index} -i mgy_hhm.ffindex -d mgy_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
rm mgy_cluMsa_ca3m.ff{data,index}
sort -k3 -n -r mgy_cs219.ffindex | cut -f1 > sorting.dat
    
ffindex_order sorting.dat mgy_hhm.ff{data,index} mgy_hhm_ordered.ff{data,index}
mv mgy_hhm_ordered.ffindex mgy_hhm.ffindex
mv mgy_hhm_ordered.ffdata mgy_hhm.ffdata
    
ffindex_order sorting.dat mgy_cluMsa_ca3m.ff{data,index} mgy_cluMsa_ca3m_ordered.ff{data,index}
rm mgy_cluMsa_ca3m.ff{data,index}
mv mgy_cluMsa_ca3m_ordered.ffindex mgy_ca3m.ffindex
mv mgy_cluMsa_ca3m_ordered.ffdata mgy_ca3m.ffdata
rm mgy_clu*
rm mgy_cs219.log.*
rm mgy_db* 

When the input FASTA file is very small, these commands work well. But when the input file is large, cstranslate takes too much time :(
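
For reference, the sorting step in the commands above (`sort -k3 -n -r mgy_cs219.ffindex | cut -f1`) orders entries by the third column of the ffindex file, which holds the entry length, so the longest entries come first. A minimal sketch on a toy index follows; the file name `toy_cs219.ffindex`, the entry names, and the sizes are made up for illustration and are not taken from the real database:

```shell
# Toy ffindex with hypothetical entries; columns are name, offset, length (tab-separated).
workdir=$(mktemp -d)
printf 'clusterA\t0\t120\nclusterB\t120\t950\nclusterC\t1070\t40\n' > "$workdir/toy_cs219.ffindex"

# Same sort as in the pipeline: numeric, descending, keyed on column 3 (entry length).
sort -k3 -n -r "$workdir/toy_cs219.ffindex" | cut -f1 > "$workdir/sorting.dat"

# Show the resulting order of entry names.
cat "$workdir/sorting.dat"
```

On this toy input the resulting order is clusterB, clusterA, clusterC, i.e. the largest entry first. ffindex_order then rewrites the hhm and ca3m databases in that same order.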

Your Environment

  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
    My computer supports both AVX2 and SSE and has 500 GB of memory in total.
  • Operating system and version: Linux version 4.15.0-133-generic (buildd@lgw01-amd64-024) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12))

sunzhig • Aug 10 '21 09:08