
uniboost entries look very strange

Open yangkky opened this issue 4 years ago • 2 comments

I downloaded uniboost10_2016_09 and then extracted the database using a3m_database_extract:

$  wget http://gwdu111.gwdg.de/~compbiol/uniclust/2016_09/uniboost10_2016_09.tar.gz
$ tar -xzf  uniboost10_2016_09.tar.gz
$ cd uniboost10_2016_09
$ a3m_database_extract -i uniboost10_2016_09_ca3m -o uniboost_a3m -d uniboost10_2016_09_sequence -q uniboost10_2016_09_header

After that, if I examine the first entry in uniboost10_2016_09_ca3m.ffdata, it doesn't match the first entry in the extracted database:

$ head -n 2 uniboost10_2016_09_ca3m.ffdata
>consensus_sp|Q197F8|002R_IIV3 Uncharacterized protein 002R OS=Invertebrate iridescent virus 3 GN=IIV3-002R PE=4 SV=1 
MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNREQALAERYPELQTSEPSEDYSGPVESLELLPLEIKLDIMQYLSWEQISWCKHPWLWTRWYKDNVVRVSAITFEDFQREYAFPEKIQEIHFTDTRAEEIKAILETTPNVTRLVIRRIDDMNYNTHGDLGLDDLEFLTHLMVEDACGFTDFWAPSLTHLTIKNLDMHPRWFGPVMDGIKSMQSTLKYLYIFETYGVNKPFVQWCTDNIETFYCTNSYRYENVPRPIYVWVLFQEDEWHGYRVEDNKFHRRYMYSTILHKRDTDWVENNPLKTPAQVEMYKFLLRISQLNRDGTGYESDSDPENEHFDDESFSSGEEDSSDEDDPTWAPDSDDSDWETETEEEPSVAARILEKGKLTITNLMKSLGFKPKPKKIQSIDRYFCSLDSNYNSEDEDFEYDSDSEDDDSDSEDDC
$ head -n 2 uniboost_a3m.ffdata
>consensus_tr|A0A151C731|A0A151C731_BIFBI Uncharacterized protein OS=Bifidobacterium bifidum GN=APS66_02335 PE=4 SV=1 
MSVIVPLHKWRSADPAILIGRRCIARTDQDVIIDGRLELIRLPDGTAVLRFQGIGNDIIDHDPNTCSNSMSDGIRSLAIYGKE

Is this expected?
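
One way to check whether the two databases differ only in the order in which entries are stored is to look entries up by name through the `.ffindex` files instead of comparing the top of the `.ffdata` files. The following is a minimal sketch, assuming the usual ffindex layout (one `name<TAB>offset<TAB>length` record per line, with each data entry terminated by a NUL byte that is counted in `length`). The function names and the `build_db` helper are hypothetical and only build toy in-memory data for the demonstration; they are not part of HH-suite.

```python
def read_ffindex(index_text):
    """Parse ffindex text: one 'name<TAB>offset<TAB>length' record per line."""
    entries = {}
    for line in index_text.splitlines():
        if not line.strip():
            continue
        name, offset, length = line.split("\t")
        entries[name] = (int(offset), int(length))
    return entries


def get_entry(data, entries, name):
    """Return the raw bytes of one entry, without the trailing NUL byte."""
    offset, length = entries[name]
    return data[offset:offset + length].rstrip(b"\0")


def first_in_data(entries):
    """Name of the entry stored at the lowest offset (what `head` would show)."""
    return min(entries, key=lambda name: entries[name][0])


def build_db(records):
    """Hypothetical helper: build a tiny in-memory (data, index) pair for the demo."""
    data, lines = b"", []
    for name, payload in records:
        blob = payload + b"\0"
        lines.append(f"{name}\t{len(data)}\t{len(blob)}")
        data += blob
    return data, "\n".join(lines) + "\n"


# Two toy databases holding the same entries in a different physical order.
rec_a = ("A", b">a\nMAS\n")
rec_b = ("B", b">b\nMSV\n")
data_1, index_1 = build_db([rec_a, rec_b])
data_2, index_2 = build_db([rec_b, rec_a])
entries_1, entries_2 = read_ffindex(index_1), read_ffindex(index_2)

# The first entry on disk differs, but a lookup by name returns the same record.
print(first_in_data(entries_1), first_in_data(entries_2))
print(get_entry(data_1, entries_1, "A") == get_entry(data_2, entries_2, "A"))
```

For the real files, the index text would come from the `.ffindex` file and the data from the `.ffdata` file opened in binary mode (seeking to the offset rather than loading the whole file). Entries in the compressed ca3m database are not expected to match the extracted a3m byte for byte, so only the first header line of each entry is worth comparing there.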

In addition, when I use ffindex_get to inspect entries from the extracted a3m database, some header lines look like fragments of sequences, and they are followed by blank lines:

$ ffindex_get uniboost_a3m.ffdata uniboost_a3m.ffindex -n 10 | head -n 8

>consensus_tr|E2BA87|E2BA87_HARSA Putative uncharacterized protein OS=Harpegnathos saltator GN=EAI_17578 PE=4 SV=1 
MAKYYTPWVILTLAAVCVVLVCSSARELGQLEGDLEHAASHHHHHGHHHEEGGGHEHHHHHHHEHGEKGEKGHKGHHHHHKGEHGHHGKHHHEGHHHEHGGHKKHHHDEHDHHGHHHEHEHGHKGGKHGHKKGHKKGHKTHGHHHKHHKDEYHKEHKFYDEHHDGGHHEKHGDHHEHHEHKEGHHKKGGHHHSGHHEDHHGKKGHHDKGHHDHDHKGHHGDHGHDEHHHHHEDHGKKGGHHHGKKHGYHHGHHGHHHHH
>MVERLGIAVEDRSPKLRKQAIRERFVLFKKNTERVEKYEYYAIRGQSIYINGRLSKLQSERYPKMIILLDIFCQPNPRNLFLRFKERIDGKSEWENNFTYAGNNIGCTKEMESDMIRIFNELDDEKRDV

MAKYVGPWLLLGLAVVCTVVACSSARELGQLEGDLEVAASHHHHHGHHHEEGGGHEHHAHHHHEHGEKGEKGHKGHHHHHKGEHGHHGKHHHEGHHHEHGGHKKGHHDEHDEHGHHHEHEHGHKGGKFGHKKGHKKGEKTHGYHHKAHKDEYHKEHKFYDDYHKGGHHEKHGDHHGHHEKKEGHHKKGGHHHSGHHEDHHGKKGHHDKGHHDEDHKGHHGKHGHEEHHHHHEDHGKKGGHHGGKKHGYHHG
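
The pattern above (a `>` line whose text looks like residues, then a blank line) can be flagged automatically across many entries. The following is a minimal heuristic sketch, not HH-suite's own validation: `check_a3m` and the 20-character threshold are assumptions chosen for illustration, and the sample text is a shortened stand-in for the output shown above.

```python
def check_a3m(text, min_len=20):
    """Return (line_number, problem) pairs for an A3M-like entry.

    Heuristic only: a header is suspicious if everything after '>' is a
    long run of uppercase letters, i.e. it looks like a sequence.
    """
    problems = []
    after_header = False
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(">"):
            body = line[1:]
            if len(body) >= min_len and body.isalpha() and body.isupper():
                problems.append((number, "header looks like a sequence"))
            after_header = True
        elif not line.strip():
            if after_header:
                problems.append((number, "blank line after header"))
            after_header = False
        else:
            if not after_header:
                problems.append((number, "sequence without header"))
            after_header = False
    return problems


# Shortened stand-in for the ffindex_get output above.
sample = (
    ">consensus_tr|E2BA87|E2BA87_HARSA Putative uncharacterized protein\n"
    "MAKYYTPWVILTLAAVCVVLV\n"
    ">MVERLGIAVEDRSPKLRKQAIRERFVLFKK\n"
    "\n"
    "MAKYVGPWLLLGLAVVCTVVA\n"
)

for number, problem in check_a3m(sample):
    print(number, problem)
```

Running a check like this over each entry returned by ffindex_get would show whether the malformed headers affect every entry or only some of them.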

## Your Environment
* Version/Git commit used: 3.3.0 installed via `conda install -c conda-forge -c bioconda hhsuite`
* Operating system and version: Ubuntu 16.04.6

yangkky, Sep 04 '20 19:09

Did you find a solution or workaround for this? I ran into the same thing. I wrote to Martin Steinegger, and he said they probably aren't going to look into or fix this, but they're working on a completely new method of generating MSAs that they might use to put up a new uniboost-like database.

It looks like the uniboost section of the uniclust pipeline isn't too complicated and should be possible to get running if you have enough compute: https://github.com/soedinglab/uniclust-pipeline/blob/master/uniclust_workflow.sh. I was going to try this next.

jueseph, May 07 '21 05:05

Yeah, we are working on a new approach. If you just need a very diverse database to learn from, you could try the BFD instead of bothering with the uniboost (https://bfd.mmseqs.com). The average Uniboost Neff values were also not massively higher than the Uniclust ones, so the benefit is not huge.

The uniclust workflow would need a few changes, since many of the MMseqs2 parameters changed, and it's also quite inefficient due to the direct 90->30 clustering.

milot-mirdita, May 18 '21 12:05