hh-suite
hh-suite copied to clipboard
hhblits crashes if a sequence id is longer than 31 characters
I tried to search a fasta file containing ids more than 31 characters using hhblits, which resulted in an opaque error message when hhblits reached that sequence.
Expected Behavior
hhblits works with sequence ids longer than 31 characters
Current Behavior
hhblits should either work with sequence ids longer than 31 characters, or inform the user (ideally in the beginning) the long ids are not supported
Steps to Reproduce (for bugs)
Minimized example cath_5.fasta:
>cath-4_2_0-1a15A00-1-67
KPVSLSYRCPCRFFESHVARANVKHLKILNTPACALQIVARLKNNNRQVCIDPKLKWIQEYLEKALN
>cath-4_2_0-1a1rA02-121-205
TPCTCGSSDLYLVTRHADVIPVRRRGDSRGSLLSPRPISYLKGSSGGPLLCPTGHAVGLFRAAVCTRGVAKAVDFIPVENLETTM
>cath-4_2_0-1a1vA02-325-430_452-483
PPGSVTVPHPNIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMTGFTGDFDSVIDCNTPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPS
>cath-4_2_0-1a1vA03-432-451_484-624
VTQTVDFSLDPTFTIETTTLGMFDSSVLCECYDAGXAWYELTPAETTVRLRAYMNTPGLPVCQDHLEFWEGVFTGLTHIDAHFLSQTKQSGENFPYLVAYQATVCARAQAPPPSWDQMWKCLIRLKPTLHGPTPLLYRLGAVQNEVTLTHPITKYIMTCMS
>cath-4_2_0-1a1zA00-1-83
MDPFLVLLHSVSSSLSSSELTELKGLCLGRVGKRKLERVQSGLDLFSMLLEQNDLEPGHTELLRELLASLRRHDLLRRVDDFELEHHHHHH
Try to build a custom database:
hhsuite/bin/ffindex_from_fasta -s cath_5_fas.ff{data,index} cath_5.fasta
hhsuite/bin/hhblits_omp -i cath_5_fas -oa3m query.a3m -d /path/to/UniRef30_2020_06 -cpu 5
This gives the following output:
- 14:58:10.139 INFO: Searching 25985124 column state sequences.
- 14:58:10.240 INFO: Thread 3 cath-4_2_0-1a1rA02-121-205
- 14:58:10.240 INFO: Thread 1 h-4_2_0-1a15A00-1-67
- 14:58:10.242 INFO: Thread 0 cath-4_2_0-1a1vA02-325-430_452-4|
- 14:58:10.243 INFO: Thread 2 cath-4_2_0-1a1vA03-432-451_484-6A
- 14:58:10.247 INFO: Thread 4 cath-4_2_0-1a1zA00-1-83
- 14:58:10.325 ERROR: Error in /big/martin/hh-suite/src/hhfunc.cpp:83: ReadQueryFile:
- 14:58:10.325 ERROR: unrecognized input file format in 'cath-4_2_0-1a1vA02-325-430_452-4|'
- 14:58:10.325 ERROR: line = PPGSVTVPHPNIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMTGFTGDFDSVIDCNTPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPS
Looking into cath_5_fas.ffindex (notice the non-ascii characters):
cath-4_2_0-1a15A00-1-67 0 94
cath-4_2_0-1a1rA02-121-205 94 115
cath-4_2_0-1a1vA02-325-430_452-4¹o 209 176
cath-4_2_0-1a1vA03-432-451_484-6¹o 385 199
cath-4_2_0-1a1zA00-1-83 584 118
The non-ASCII character get rendered differently depending on the editor:

To check let's use only ids with length <32:
th-4_2_0-1a15A00-1-67
KPVSLSYRCPCRFFESHVARANVKHLKILNTPACALQIVARLKNNNRQVCIDPKLKWIQEYLEKALN
>cath-4_2_0-1a1rA02-121-205
TPCTCGSSDLYLVTRHADVIPVRRRGDSRGSLLSPRPISYLKGSSGGPLLCPTGHAVGLFRAAVCTRGVAKAVDFIPVENLETTM
>cath-4_2_0-1a1vA02-325-430_452-
PPGSVTVPHPNIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMTGFTGDFDSVIDCNTPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPS
>cath-4_2_0-1a1vA03-432-451_484-
VTQTVDFSLDPTFTIETTTLGMFDSSVLCECYDAGXAWYELTPAETTVRLRAYMNTPGLPVCQDHLEFWEGVFTGLTHIDAHFLSQTKQSGENFPYLVAYQATVCARAQAPPPSWDQMWKCLIRLKPTLHGPTPLLYRLGAVQNEVTLTHPITKYIMTCMS
>cath-4_2_0-1a1zA00-1-83
MDPFLVLLHSVSSSLSSSELTELKGLCLGRVGKRKLERVQSGLDLFSMLLEQNDLEPGHTELLRELLASLRRHDLLRRVDDFELEHHHHHH
This file now passes (with v3.1.0; with v3.3.0 it crashes due to #260)
Context
Providing context helps us come up with a solution and improve our documentation for the future.
Your Environment
- Version/Git commit used: 3.3.0, avx2
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory): Intel Xeon Gold 6248 which has avx2, 376GB RAM, 239G free
- Operating system and version: Ubuntu 18.04
In ffindex.h you can increase the following define.
#define FFINDEX_MAX_ENTRY_NAME_LENTH 32
We didn't want to change that as we are not 100% sure it won't break anything downstream.
You can build a database with MMseqs2 (mmseqs createdb) and symlink the required file names for HH-suite as described in the other thread (https://github.com/soedinglab/hh-suite/issues/262#issuecomment-831439909).
I changed the source and did some quick tests. Found no problems so far (this doesn't mean that I encourage people to do this).
@milot-mirdita This is sort of related to this thread: I was trying to create a ffdata and ffindex files from A3M alignments using ffindex_build and I got the character limit error. Could I use mmseqs createdb here? As far as I know there's no way of creating a MSA DB from a set of A3M files using MMSeqs2 (convertmsa only allows Stockholm).
mmseqs tar2db might be one of the easiest ways to create a database like that.
Sorry, my createdb suggestion doesn't actually work, for the original problem you'd need something with mmseqs createseqfiledb.