
custom database creation error on cstranslate step

Open nick-youngblut opened this issue 4 years ago • 4 comments

Expected Behavior

Custom database created for dbCAN v8.

Current Behavior

Error during the cstranslate step.

Steps to Reproduce (for bugs)

# creating custom dbCAN hhsuite database
## download MSA from http://bcb.unl.edu/dbCAN2/download/ (and uncompress)
wget http://bcb.unl.edu/dbCAN2/download/dbCAN-fam-aln-V8.tar.gz
tar -pzxvf dbCAN-fam-aln-V8.tar.gz
## build from MSAs
cd dbCAN-fam-aln-V8
ffindex_build -s ../dbCAN-fam-aln-V8.ff{data,index} .
cd ../
## consensus
ffindex_apply dbCAN-fam-aln-V8.ffdata dbCAN-fam-aln-V8.ffindex -i dbCAN-fam-aln-V8_a3m.ffindex -d dbCAN-fam-aln-V8_a3m.ffdata -- hhconsensus -M 50 -maxres 65535 -i stdin -oa3m stdout -v 0
## hmm 
ffindex_apply dbCAN-fam-aln-V8_a3m.ff{data,index} -i dbCAN-fam-aln-V8_hhm.ffindex -d dbCAN-fam-aln-V8_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
## context states
cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m -o dbCAN-fam-aln-V8_cs219 

HH-suite Output (for bugs)

If using cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m -o dbCAN-fam-aln-V8_cs219:

Reading context library for pseudocounts from internal ...
Reading abstract state alphabet from internal ...

ERROR: Unable to read input file 'dbCAN-fam-aln-V8_a3m'!

If using cstranslate -x 0.3 -c 4 -I a3m -i dbCAN-fam-aln-V8_a3m.ffdata -o dbCAN-fam-aln-V8_cs219:

Reading context library for pseudocounts from internal ...
Reading abstract state alphabet from internal ...

ERROR: Sequence 468 has 181 match columns but should have 613!

Your Environment

Ubuntu 18.04.4

# conda env
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       0_gnu    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py37hc8dfbb8_0    conda-forge
curl                      7.69.1               h33f0ec9_0    conda-forge
fqtools                   2.0                  hc0aa232_5    bioconda
hhsuite                   3.2.0           py37pl526h3340039_1    bioconda
htslib                    1.9                  h4da6232_3    bioconda
krb5                      1.17.1               h2fd8d38_0    conda-forge
ld_impl_linux-64          2.34                 h53a641e_5    conda-forge
libcurl                   7.69.1               hf7181ac_0    conda-forge
libdeflate                1.6                  h516909a_0    conda-forge
libedit                   3.1.20191231         h46ee950_0    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgomp                   9.2.0                h24d8f2e_2    conda-forge
libssh2                   1.9.0                hab1572f_2    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
llvm-openmp               8.0.1                hc9558a2_0    conda-forge
ncurses                   6.1               hf484d3e_1002    conda-forge
openmp                    8.0.1                         0    conda-forge
openssl                   1.1.1g               h516909a_0    conda-forge
perl                      5.26.2            h516909a_1006    conda-forge
pip                       20.1.1                     py_1    conda-forge
python                    3.7.6           cpython_h8356626_6    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
seqkit                    0.12.1                        0    bioconda
setuptools                47.3.1           py37hc8dfbb8_0    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
taxonkit                  0.5.0                         0    bioconda
tk                        8.6.10               hed695b0_0    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.5                h516909a_0    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

nick-youngblut avatar Jun 24 '20 15:06 nick-youngblut

Ah, I've been meaning to build a database from dbCAN for a while, thanks for the reminder.

I tried to reproduce building the database and it works correctly with the *_mpi binaries.

Something like this works for me:

DB=dbCAN-fam-V8
wget http://bcb.unl.edu/dbCAN2/download/dbCAN-fam-aln-V8.tar.gz
tar xzvf dbCAN-fam-aln-V8.tar.gz
cd dbCAN-fam-aln;
ffindex_build -s ../${DB}_msa.ff{data,index} .
cd ..
sed 's|\.aln||g' ${DB}_msa.ffindex > ${DB}_msa_renamed.ffindex
mv ${DB}_msa_renamed.ffindex ${DB}_msa.ffindex
mpirun -np 16 ffindex_apply_mpi ${DB}_msa.ffdata ${DB}_msa.ffindex -i ${DB}_a3m.ffindex -d ${DB}_a3m.ffdata -- hhconsensus -M 50 -maxres 65535 -i stdin -oa3m stdout -v 0
mpirun -np 16 ffindex_apply_mpi ${DB}_a3m.ff{data,index} -i ${DB}_hhm.ffindex -d ${DB}_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
mpirun -np 16 cstranslate_mpi -x 0.3 -c 4 -I a3m -i ${DB}_a3m -o ${DB}_cs219
# reorder according to cs219 for better access patterns
sort -k 3 -n ${DB}_cs219.ffindex | cut -f1 > ${DB}.list
for type in a3m hhm; do
    ffindex_order ${DB}.list ${DB}_${type}.ffdata ${DB}_${type}.ffindex ${DB}_${type}_opt.ffdata ${DB}_${type}_opt.ffindex
    mv -f ${DB}_${type}_opt.ffdata ${DB}_${type}.ffdata
    mv -f ${DB}_${type}_opt.ffindex ${DB}_${type}.ffindex
done
md5deep ${DB}_{a3m,hhm,cs219}.ff{data,index} > ${DB}.md5sum
tar czvf ${DB}.tar.gz ${DB}_{a3m,hhm,cs219}.ff{data,index} ${DB}.md5sum
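
The reordering step depends on the layout of the ffindex index file. As a minimal sketch, assuming each index line is tab-separated as name, offset, length, the snippet below runs the same sort and cut on a mock index to show how the list file is produced. The entry names and numbers are made up for illustration and are not taken from the real database.

```shell
# Mock cs219 index (hypothetical entries): name <TAB> offset <TAB> length
tmp=$(mktemp -d)
printf 'GH5\t0\t300\nAA1\t300\t120\nCBM20\t420\t45\n' > "$tmp/mock_cs219.ffindex"

# Same pipeline as above: sort numerically on the third column, keep the names
sort -k 3 -n "$tmp/mock_cs219.ffindex" | cut -f1 > "$tmp/mock.list"

cat "$tmp/mock.list"   # prints CBM20, AA1, GH5 (one per line)
```

The list therefore starts with the entry that has the smallest third column, and ffindex_order rewrites the a3m and hhm files in that order.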

I took the liberty to build this database and put it on our file server: http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V8.tar.gz

I would recommend searching it with HHsearch instead of HHblits, though. Because of its small size, HHsearch can still easily handle it and will be more sensitive.
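
A search against the unpacked database might look like the sketch below. The query file name is hypothetical, and the snippet assumes that hhsearch is on the PATH and that the dbCAN-fam-V8_{a3m,hhm,cs219}.ff{data,index} files are in the current directory. The -d option takes the database prefix without the _a3m/_hhm/_cs219 suffix.

```shell
DB=dbCAN-fam-V8      # database prefix from the tarball above
QUERY=query.a3m      # hypothetical query file (an alignment or a single sequence)

# Only run the search if the tool and the inputs are actually present
if command -v hhsearch >/dev/null 2>&1 && [ -f "$QUERY" ] && [ -f "${DB}_hhm.ffdata" ]; then
    hhsearch -i "$QUERY" -d "$DB" -o "${QUERY%.*}.hhr"
else
    echo "hhsearch, $QUERY, or ${DB}_hhm.ffdata not found; nothing to run" >&2
fi
```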

milot-mirdita avatar Jun 28 '20 15:06 milot-mirdita

Hello, how did you get the *_mpi binaries? The documentation doesn't describe how to install hh-suite with MPI support. Could you please tell me how to do it? Thanks! I also ran into this problem:

Reading context library for pseudocounts from context_data.lib ...
Reading abstract state alphabet from cs219.lib ...

ERROR: Sequence 1 has 764 match columns but should have 2021!

gancao avatar Aug 08 '20 02:08 gancao

I added a section to the wiki: https://github.com/soedinglab/hh-suite/wiki#mpi-support

I think you were missing cstranslate's -f (--ffindex) flag, which switches from single-file mode to reading an ffindex database. That might be what caused the error message.
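
As a rough sketch of that suggestion, the non-MPI call from the original report would gain the -f flag and keep the database prefix (without .ffdata) as input. The DB value and the guard around the call are assumptions for illustration, and this has not been verified against the reporter's setup.

```shell
DB=dbCAN-fam-aln-V8   # prefix used in the original report

# -f tells cstranslate to read ${DB}_a3m.ff{data,index} and write ${DB}_cs219.ff{data,index}
if command -v cstranslate >/dev/null 2>&1 && [ -f "${DB}_a3m.ffdata" ] && [ -f "${DB}_a3m.ffindex" ]; then
    cstranslate -f -x 0.3 -c 4 -I a3m -i "${DB}_a3m" -o "${DB}_cs219"
else
    echo "cstranslate or ${DB}_a3m.ff{data,index} not found; nothing to run" >&2
fi
```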

milot-mirdita avatar Aug 16 '20 23:08 milot-mirdita

I made a new DB for V9: http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/dbCAN-fam-V9.tar.gz

The dbCAN team thankfully provided the raw alignments for the new release.

milot-mirdita avatar Apr 17 '21 17:04 milot-mirdita