hh-suite icon indicating copy to clipboard operation
hh-suite copied to clipboard

hhblits crashes if a sequence id is longer than 31 characters

Open konstin opened this issue 4 years ago • 3 comments

I tried to search a fasta file containing ids more than 31 characters using hhblits, which resulted in an opaque error message when hhblits reached that sequence.

Expected Behavior

hhblits works with sequence ids longer than 31 characters

Current Behavior

hhblits should either work with sequence ids longer than 31 characters, or inform the user (ideally in the beginning) the long ids are not supported

Steps to Reproduce (for bugs)

Minimized example cath_5.fasta:

>cath-4_2_0-1a15A00-1-67
KPVSLSYRCPCRFFESHVARANVKHLKILNTPACALQIVARLKNNNRQVCIDPKLKWIQEYLEKALN
>cath-4_2_0-1a1rA02-121-205
TPCTCGSSDLYLVTRHADVIPVRRRGDSRGSLLSPRPISYLKGSSGGPLLCPTGHAVGLFRAAVCTRGVAKAVDFIPVENLETTM
>cath-4_2_0-1a1vA02-325-430_452-483
PPGSVTVPHPNIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMTGFTGDFDSVIDCNTPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPS
>cath-4_2_0-1a1vA03-432-451_484-624
VTQTVDFSLDPTFTIETTTLGMFDSSVLCECYDAGXAWYELTPAETTVRLRAYMNTPGLPVCQDHLEFWEGVFTGLTHIDAHFLSQTKQSGENFPYLVAYQATVCARAQAPPPSWDQMWKCLIRLKPTLHGPTPLLYRLGAVQNEVTLTHPITKYIMTCMS
>cath-4_2_0-1a1zA00-1-83
MDPFLVLLHSVSSSLSSSELTELKGLCLGRVGKRKLERVQSGLDLFSMLLEQNDLEPGHTELLRELLASLRRHDLLRRVDDFELEHHHHHH

Try to build a custom database:

hhsuite/bin/ffindex_from_fasta -s cath_5_fas.ff{data,index} cath_5.fasta
hhsuite/bin/hhblits_omp -i cath_5_fas -oa3m query.a3m -d /path/to/UniRef30_2020_06 -cpu 5

This gives the following output:

- 14:58:10.139 INFO: Searching 25985124 column state sequences.
- 14:58:10.240 INFO: Thread 3   cath-4_2_0-1a1rA02-121-205
- 14:58:10.240 INFO: Thread 1   h-4_2_0-1a15A00-1-67
- 14:58:10.242 INFO: Thread 0   cath-4_2_0-1a1vA02-325-430_452-4|
- 14:58:10.243 INFO: Thread 2   cath-4_2_0-1a1vA03-432-451_484-6A
- 14:58:10.247 INFO: Thread 4   cath-4_2_0-1a1zA00-1-83
- 14:58:10.325 ERROR: Error in /big/martin/hh-suite/src/hhfunc.cpp:83: ReadQueryFile:
- 14:58:10.325 ERROR:   unrecognized input file format in 'cath-4_2_0-1a1vA02-325-430_452-4|'
- 14:58:10.325 ERROR:   line = PPGSVTVPHPNIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMTGFTGDFDSVIDCNTPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPS

Looking into cath_5_fas.ffindex (notice the non-ascii characters):

cath-4_2_0-1a15A00-1-67	0	94
cath-4_2_0-1a1rA02-121-205	94	115
cath-4_2_0-1a1vA02-325-430_452-4¹‚‘o•	209	176
cath-4_2_0-1a1vA03-432-451_484-6¹‚‘o•	385	199
cath-4_2_0-1a1zA00-1-83	584	118

The non-ASCII character get rendered differently depending on the editor:

grafik

To check let's use only ids with length <32:

th-4_2_0-1a15A00-1-67
KPVSLSYRCPCRFFESHVARANVKHLKILNTPACALQIVARLKNNNRQVCIDPKLKWIQEYLEKALN
>cath-4_2_0-1a1rA02-121-205
TPCTCGSSDLYLVTRHADVIPVRRRGDSRGSLLSPRPISYLKGSSGGPLLCPTGHAVGLFRAAVCTRGVAKAVDFIPVENLETTM
>cath-4_2_0-1a1vA02-325-430_452-
PPGSVTVPHPNIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMTGFTGDFDSVIDCNTPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPS
>cath-4_2_0-1a1vA03-432-451_484-
VTQTVDFSLDPTFTIETTTLGMFDSSVLCECYDAGXAWYELTPAETTVRLRAYMNTPGLPVCQDHLEFWEGVFTGLTHIDAHFLSQTKQSGENFPYLVAYQATVCARAQAPPPSWDQMWKCLIRLKPTLHGPTPLLYRLGAVQNEVTLTHPITKYIMTCMS
>cath-4_2_0-1a1zA00-1-83
MDPFLVLLHSVSSSLSSSELTELKGLCLGRVGKRKLERVQSGLDLFSMLLEQNDLEPGHTELLRELLASLRRHDLLRRVDDFELEHHHHHH

This file now passes (with v3.1.0; with v3.3.0 it crashes due to #260)

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

  • Version/Git commit used: 3.3.0, avx2
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory): Intel Xeon Gold 6248 which has avx2, 376GB RAM, 239G free
  • Operating system and version: Ubuntu 18.04

konstin avatar May 02 '21 13:05 konstin

In ffindex.h you can increase the following define.

#define FFINDEX_MAX_ENTRY_NAME_LENTH 32

We didn't want to change that as we are not 100% sure it won't break anything downstream. You can build a database with MMseqs2 (mmseqs createdb) and symlink the required file names for HH-suite as described in the other thread (https://github.com/soedinglab/hh-suite/issues/262#issuecomment-831439909).

milot-mirdita avatar May 03 '21 18:05 milot-mirdita

I changed the source and did some quick tests. Found no problems so far (this doesn't mean that I encourage people to do this).

@milot-mirdita This is sort of related to this thread: I was trying to create a ffdata and ffindex files from A3M alignments using ffindex_build and I got the character limit error. Could I use mmseqs createdb here? As far as I know there's no way of creating a MSA DB from a set of A3M files using MMSeqs2 (convertmsa only allows Stockholm).

apcamargo avatar May 25 '21 16:05 apcamargo

mmseqs tar2db might be one of the easiest ways to create a database like that.

Sorry, my createdb suggestion doesn't actually work, for the original problem you'd need something with mmseqs createseqfiledb.

milot-mirdita avatar May 26 '21 15:05 milot-mirdita