Compress BFD database

Open heeqee opened this issue 2 years ago • 4 comments

Expected Behavior

Hi, the BFD database is so large that I am looking for a way to compress it. Could you provide a compressed version of the BFD database, that is, a ca3m database? Alternatively, could you show me how to convert an uncompressed a3m database into a compressed ca3m database? Thank you very much!

Current Behavior

The BFD database is provided in uncompressed a3m format.

heeqee, Nov 24 '21 11:11

ca3m only works for redundant databases (where sequences occur multiple times in different clusters).

Running hhfilter on each a3m would likely massively reduce the a3m size, but this would have to be carefully benchmarked.

Another possibility would be to port the zstd compressed databases from MMseqs2 to HH-suite, but we don’t have any development resources free for HH-suite.

The ColabFold databases are also much smaller, so they might also be a solution for you.
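As a rough illustration of the hhfilter suggestion above, here is a minimal Python sketch that runs hhfilter over a directory of a3m files and records the size before and after, so the reduction can be measured. It assumes the alignments have already been unpacked into individual `.a3m` files and that `hhfilter` is on the `PATH`. The helper names (`build_hhfilter_cmd`, `filter_dir`) and the `-id 90` threshold are hypothetical choices for illustration, not recommended settings; as noted above, any filtering would have to be benchmarked before use.

```python
import shutil
import subprocess
from pathlib import Path


def build_hhfilter_cmd(src, dst, max_id=90):
    """Return the hhfilter argument list for one alignment.

    max_id is passed to hhfilter's -id option (maximum pairwise
    sequence identity in percent). The default of 90 is an
    illustrative assumption, not a benchmarked value.
    """
    return ["hhfilter", "-i", str(src), "-o", str(dst), "-id", str(max_id)]


def filter_dir(in_dir, out_dir, max_id=90):
    """Run hhfilter on every .a3m file in in_dir.

    Returns a list of (file name, bytes before, bytes after) tuples
    so that the size reduction can be inspected.
    """
    if shutil.which("hhfilter") is None:
        raise RuntimeError("hhfilter not found on PATH")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    report = []
    for src in sorted(Path(in_dir).glob("*.a3m")):
        dst = out / src.name
        subprocess.run(build_hhfilter_cmd(src, dst, max_id), check=True)
        report.append((src.name, src.stat().st_size, dst.stat().st_size))
    return report
```

Calling `filter_dir("a3m_in", "a3m_filtered")` on a small sample (both directory names are placeholders) would give a first impression of the achievable reduction. Whether search sensitivity is preserved is a separate question that this sketch does not address.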

milot-mirdita, Nov 24 '21 12:11

Thank you for your reply. May I ask whether I can pass a compressed a3m.gz file to HHblits via zcat?

heeqee, Nov 24 '21 13:11

Or can HHblits read a compressed a3m.gz file directly?

heeqee, Nov 24 '21 13:11

No, that does not work.

milot-mirdita, Nov 24 '21 14:11