hh-suite
Custom database creation error on cstranslate step
Expected Behavior
Custom database created for dbCAN v8.
Current Behavior
Error during the cstranslate step.
Steps to Reproduce (for bugs)
# creating custom dbCAN hhsuite database
## download MSA from http://bcb.unl.edu/dbCAN2/download/ (and uncompress)
wget http://bcb.unl.edu/dbCAN2/download/dbCAN-fam-aln-V8.tar.gz
tar -pzxvf dbCAN-fam-aln-V8.tar.gz
## build from MSAs
cd dbCAN-fam-aln-V8
ffindex_build -s ../dbCAN-fam-aln-V8.ff{data,index} .
cd ../
## consensus
ffindex_apply dbCAN-fam-aln-V8.ffdata dbCAN-fam-aln-V8.ffindex -i dbCAN-fam-aln-V8_a3m.ffindex -d dbCAN-fam-aln-V8_a3m.ffdata -- hhconsensus -M 50 -maxres 65535 -i stdin -oa3m stdout -v 0
## hmm
ffindex_apply dbCAN-fam-aln-V8_a3m.ff{data,index} -i dbCAN-fam-aln-V8_hhm.ffindex -d dbCAN-fam-aln-V8_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
## context states
cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m -o dbCAN-fam-aln-V8_cs219
HH-suite Output (for bugs)
If using cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m -o dbCAN-fam-aln-V8_cs219:
Reading context library for pseudocounts from internal ...
Reading abstract state alphabet from internal ...
ERROR: Unable to read input file 'dbCAN-fam-aln-V8_a3m'!
If using cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m.ffdata -o dbCAN-fam-aln-V8_cs219:
Reading context library for pseudocounts from internal ...
Reading abstract state alphabet from internal ...
ERROR: Sequence 468 has 181 match columns but should have 613!
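A plausible reading of the second error: an .ffdata file holds many alignments back to back, so reading it as one single alignment mixes sequences whose match-column counts differ between families. The sketch below illustrates this with a toy A3M file. The file name and sequences are made up, and it assumes each sequence sits on a single line.

```shell
# Toy A3M content (made-up sequences): two families concatenated, roughly
# as they would appear back to back inside an .ffdata file.
cat > toy_concat.a3m <<'EOF'
>famA_consensus
MKV-LA
>famA_seq1
MKVaaLLA
>famB_consensus
MK
>famB_seq1
M-
EOF

# In A3M, uppercase letters and '-' are match columns; lowercase letters
# are insertions and do not count. Strip everything else and measure.
match_cols=$(LC_ALL=C awk '
  /^>/ { next }                                   # skip header lines
  { s = $0; gsub(/[^A-Z-]/, "", s); print length(s) }
' toy_concat.a3m | sort -n | uniq | paste -s -d ' ' -)

echo "distinct match-column counts: $match_cols"
```

More than one distinct count in a single input is the kind of inconsistency the "has N match columns but should have M" message points at.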
Your Environment
Ubuntu 18.04.4
# conda env
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 0_gnu conda-forge
bzip2 1.0.8 h516909a_2 conda-forge
ca-certificates 2020.6.20 hecda079_0 conda-forge
certifi 2020.6.20 py37hc8dfbb8_0 conda-forge
curl 7.69.1 h33f0ec9_0 conda-forge
fqtools 2.0 hc0aa232_5 bioconda
hhsuite 3.2.0 py37pl526h3340039_1 bioconda
htslib 1.9 h4da6232_3 bioconda
krb5 1.17.1 h2fd8d38_0 conda-forge
ld_impl_linux-64 2.34 h53a641e_5 conda-forge
libcurl 7.69.1 hf7181ac_0 conda-forge
libdeflate 1.6 h516909a_0 conda-forge
libedit 3.1.20191231 h46ee950_0 conda-forge
libffi 3.2.1 he1b5a44_1007 conda-forge
libgcc-ng 9.2.0 h24d8f2e_2 conda-forge
libgomp 9.2.0 h24d8f2e_2 conda-forge
libssh2 1.9.0 hab1572f_2 conda-forge
libstdcxx-ng 9.2.0 hdf63c60_2 conda-forge
llvm-openmp 8.0.1 hc9558a2_0 conda-forge
ncurses 6.1 hf484d3e_1002 conda-forge
openmp 8.0.1 0 conda-forge
openssl 1.1.1g h516909a_0 conda-forge
perl 5.26.2 h516909a_1006 conda-forge
pip 20.1.1 py_1 conda-forge
python 3.7.6 cpython_h8356626_6 conda-forge
python_abi 3.7 1_cp37m conda-forge
readline 8.0 hf8c457e_0 conda-forge
seqkit 0.12.1 0 bioconda
setuptools 47.3.1 py37hc8dfbb8_0 conda-forge
sqlite 3.30.1 hcee41ef_0 conda-forge
taxonkit 0.5.0 0 bioconda
tk 8.6.10 hed695b0_0 conda-forge
wheel 0.34.2 py_1 conda-forge
xz 5.2.5 h516909a_0 conda-forge
zlib 1.2.11 h516909a_1006 conda-forge
Ah, I've been meaning to build a database from dbCAN for a while, thanks for the reminder.
I tried to reproduce building the database, and it works correctly with the *_mpi binaries.
Something like this works for me:
DB=dbCAN-fam-V8
wget http://bcb.unl.edu/dbCAN2/download/dbCAN-fam-aln-V8.tar.gz
tar xzvf dbCAN-fam-aln-V8.tar.gz
cd dbCAN-fam-aln;
ffindex_build -s ../${DB}_msa.ff{data,index} .
cd ..
sed 's|\.aln||g' ${DB}_msa.ffindex > ${DB}_msa_renamed.ffindex
mv ${DB}_msa_renamed.ffindex ${DB}_msa.ffindex
mpirun -np 16 ffindex_apply_mpi ${DB}_msa.ffdata ${DB}_msa.ffindex -i ${DB}_a3m.ffindex -d ${DB}_a3m.ffdata -- hhconsensus -M 50 -maxres 65535 -i stdin -oa3m stdout -v 0
mpirun -np 16 ffindex_apply_mpi ${DB}_a3m.ff{data,index} -i ${DB}_hhm.ffindex -d ${DB}_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
mpirun -np 16 cstranslate_mpi -x 0.3 -c 4 -I a3m -i ${DB}_a3m -o ${DB}_cs219
# reorder according to cs219 for better access patterns
sort -k 3 -n ${DB}_cs219.ffindex | cut -f1 > ${DB}.list
for type in a3m hhm; do
    ffindex_order ${DB}.list ${DB}_${type}.ffdata ${DB}_${type}.ffindex ${DB}_${type}_opt.ffdata ${DB}_${type}_opt.ffindex
    mv -f ${DB}_${type}_opt.ffdata ${DB}_${type}.ffdata
    mv -f ${DB}_${type}_opt.ffindex ${DB}_${type}.ffindex
done
md5deep ${DB}_{a3m,hhm,cs219}.ff{data,index} > ${DB}.md5sum
tar czvf ${DB}.tar.gz ${DB}_{a3m,hhm,cs219}.ff{data,index} ${DB}.md5sum
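The reordering step in the script relies on the ffindex layout: each index line holds an entry name, a byte offset, and an entry length, separated by tabs. Sorting on the third column therefore orders entries by size. A minimal sketch with a toy index (the family names are real dbCAN families, but the offsets and lengths are made up):

```shell
# Toy cs219 index: name <TAB> offset <TAB> length (values are invented).
printf 'GH5\t0\t320\nAA1\t320\t95\nCBM20\t415\t150\n' > toy_cs219.ffindex

# Same pipeline as in the script: numeric sort on the length column,
# then keep only the entry names.
order=$(sort -k 3 -n toy_cs219.ffindex | cut -f1 | paste -s -d ' ' -)

echo "entry order by cs219 length: $order"
```

The resulting name list is what ffindex_order uses to rewrite the a3m and hhm files in the same order.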
I took the liberty of building this database and put it on our file server: http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V8.tar.gz
I would recommend searching it with HHsearch instead of HHblits, though. Due to its small size, HHsearch can still easily handle it and will be more sensitive.
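A search against the extracted database might look like the sketch below. The query file query.a3m and the output name query.hhr are hypothetical, and the database files are assumed to sit in the current directory. The command only runs if hhsearch and the query are present; otherwise it is just printed.

```shell
DB=dbCAN-fam-V8        # database prefix, as extracted from the tarball
QUERY=query.a3m        # hypothetical query alignment
hh_cmd="hhsearch -i ${QUERY} -d ${DB} -o query.hhr"

if command -v hhsearch >/dev/null 2>&1 && [ -f "$QUERY" ]; then
  # Unquoted on purpose so the string splits into arguments
  # (safe here because no argument contains spaces).
  $hh_cmd
else
  echo "would run: $hh_cmd"
fi
```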
Hello, how did you get the *_mpi binaries? The documentation doesn't describe how to install hh-suite with MPI support. Could you please tell me how to do it? Thanks! I also ran into this problem:
Reading context library for pseudocounts from context_data.lib ...
Reading abstract state alphabet from cs219.lib ...
ERROR: Sequence 1 has 764 match columns but should have 2021!
I added a section to the wiki: https://github.com/soedinglab/hh-suite/wiki#mpi-support
I think you were missing the -f or --ffindex flag of cstranslate, which switches from single-file mode to database input.
That might be what was causing the error message.
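For the non-MPI binary, the call would then look roughly like the sketch below. It reuses the DB prefix from the build script above and assumes ${DB}_a3m.ffdata and ${DB}_a3m.ffindex are in the current directory. The command only runs if cstranslate and the input are present; otherwise it is just printed.

```shell
DB=dbCAN-fam-V8   # database prefix used in the build script
# -f makes -i and -o refer to ffindex databases (prefix without extension)
cs_cmd="cstranslate -f -x 0.3 -c 4 -I a3m -i ${DB}_a3m -o ${DB}_cs219"

if command -v cstranslate >/dev/null 2>&1 && [ -f "${DB}_a3m.ffdata" ]; then
  # Unquoted on purpose so the string splits into arguments
  # (safe here because no argument contains spaces).
  $cs_cmd
else
  echo "would run: $cs_cmd"
fi
```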
I made a new DB for V9: http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V9.tar.gz
The dbCAN team thankfully provided the raw alignments for the new release.