hh-suite
uniboost entries look very strange
I downloaded uniboost10_2016_09 and then extracted the database using a3m_database_extract:
$ wget http://gwdu111.gwdg.de/~compbiol/uniclust/2016_09/uniboost10_2016_09.tar.gz
$ tar -xzf uniboost10_2016_09.tar.gz
$ cd uniboost10_2016_09
$ a3m_database_extract -i uniboost10_2016_09_ca3m -o uniboost_a3m -d uniboost10_2016_09_sequence -q uniboost10_2016_09_header
After that, if I examine the first entry in uniboost10_2016_09_ca3m.ffdata, it doesn't match the first entry in the extracted database:
$ head -n 2 uniboost10_2016_09_ca3m.ffdata
>consensus_sp|Q197F8|002R_IIV3 Uncharacterized protein 002R OS=Invertebrate iridescent virus 3 GN=IIV3-002R PE=4 SV=1
MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNREQALAERYPELQTSEPSEDYSGPVESLELLPLEIKLDIMQYLSWEQISWCKHPWLWTRWYKDNVVRVSAITFEDFQREYAFPEKIQEIHFTDTRAEEIKAILETTPNVTRLVIRRIDDMNYNTHGDLGLDDLEFLTHLMVEDACGFTDFWAPSLTHLTIKNLDMHPRWFGPVMDGIKSMQSTLKYLYIFETYGVNKPFVQWCTDNIETFYCTNSYRYENVPRPIYVWVLFQEDEWHGYRVEDNKFHRRYMYSTILHKRDTDWVENNPLKTPAQVEMYKFLLRISQLNRDGTGYESDSDPENEHFDDESFSSGEEDSSDEDDPTWAPDSDDSDWETETEEEPSVAARILEKGKLTITNLMKSLGFKPKPKKIQSIDRYFCSLDSNYNSEDEDFEYDSDSEDDDSDSEDDC
$ head -n 2 uniboost_a3m.ffdata
>consensus_tr|A0A151C731|A0A151C731_BIFBI Uncharacterized protein OS=Bifidobacterium bifidum GN=APS66_02335 PE=4 SV=1
MSVIVPLHKWRSADPAILIGRRCIARTDQDVIIDGRLELIRLPDGTAVLRFQGIGNDIIDHDPNTCSNSMSDGIRSLAIYGKE
Is this expected?
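Comparing `head` output only compares whatever happens to be stored first in each `.ffdata` file, and that physical order may differ between two databases. A rough way to compare the same entry in both is to look it up by key through the `.ffindex` file, which lists `name`, `offset` and `length` per line (tab-separated, with the length including the trailing null byte). The following is a minimal sketch under those assumptions. It also assumes the same key exists in both indices, which I have not verified for these particular databases. `write_toy_db` and `demo` are hypothetical helpers that only build a tiny stand-in database for illustration.

```python
import os
import tempfile


def load_ffindex(index_path):
    """Parse an .ffindex file into {name: (offset, length)}."""
    entries = {}
    with open(index_path, "r", encoding="ascii") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue
            name, offset, length = line.split("\t")
            entries[name] = (int(offset), int(length))
    return entries


def read_entry(data_path, entries, name):
    """Read one entry from .ffdata by key and strip the trailing null byte."""
    offset, length = entries[name]
    with open(data_path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(length)
    return raw.rstrip(b"\0").decode("ascii", errors="replace")


def first_line(data_path):
    """What `head -n 1` would show: the first line stored in the file."""
    with open(data_path, "rb") as fh:
        return fh.readline().decode("ascii", errors="replace").rstrip("\n")


def write_toy_db(prefix, records):
    """Hypothetical demo helper: store records in the given physical order."""
    rows, offset = [], 0
    with open(prefix + ".ffdata", "wb") as data:
        for name, text in records:
            blob = text.encode("ascii") + b"\0"
            data.write(blob)
            rows.append((name, offset, len(blob)))
            offset += len(blob)
    with open(prefix + ".ffindex", "w", encoding="ascii") as index:
        for name, off, length in sorted(rows):
            index.write(f"{name}\t{off}\t{length}\n")


def demo():
    """Two toy databases with identical entries stored in different order."""
    records = [("k1", ">seq1\nAAAA\n"), ("k2", ">seq2\nCCCC\n")]
    with tempfile.TemporaryDirectory() as tmp:
        a, b = os.path.join(tmp, "a"), os.path.join(tmp, "b")
        write_toy_db(a, records)
        write_toy_db(b, list(reversed(records)))
        index_a = load_ffindex(a + ".ffindex")
        index_b = load_ffindex(b + ".ffindex")
        heads_match = first_line(a + ".ffdata") == first_line(b + ".ffdata")
        by_key = {
            name: read_entry(a + ".ffdata", index_a, name)
            == read_entry(b + ".ffdata", index_b, name)
            for name in index_a
        }
    return heads_match, by_key
```

In the toy case the first lines of the two data files differ even though every entry matches when looked up by key, so a `head` mismatch alone does not show that the extraction went wrong.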
In addition, if I use ffindex_get to inspect entries from the a3m, the header lines look a lot like parts of sequences, and they're followed by blank lines:
$ ffindex_get uniboost_a3m.ffdata uniboost_a3m.ffindex -n 10 | head -n 8
>consensus_tr|E2BA87|E2BA87_HARSA Putative uncharacterized protein OS=Harpegnathos saltator GN=EAI_17578 PE=4 SV=1
MAKYYTPWVILTLAAVCVVLVCSSARELGQLEGDLEHAASHHHHHGHHHEEGGGHEHHHHHHHEHGEKGEKGHKGHHHHHKGEHGHHGKHHHEGHHHEHGGHKKHHHDEHDHHGHHHEHEHGHKGGKHGHKKGHKKGHKTHGHHHKHHKDEYHKEHKFYDEHHDGGHHEKHGDHHEHHEHKEGHHKKGGHHHSGHHEDHHGKKGHHDKGHHDHDHKGHHGDHGHDEHHHHHEDHGKKGGHHHGKKHGYHHGHHGHHHHH
>MVERLGIAVEDRSPKLRKQAIRERFVLFKKNTERVEKYEYYAIRGQSIYINGRLSKLQSERYPKMIILLDIFCQPNPRNLFLRFKERIDGKSEWENNFTYAGNNIGCTKEMESDMIRIFNELDDEKRDV
MAKYVGPWLLLGLAVVCTVVACSSARELGQLEGDLEVAASHHHHHGHHHEEGGGHEHHAHHHHEHGEKGEKGHKGHHHHHKGEHGHHGKHHHEGHHHEHGGHKKGHHDEHDEHGHHHEHEHGHKGGKFGHKKGHKKGEKTHGYHHKAHKDEYHKEHKFYDDYHKGGHHEKHGDHHGHHEKKEGHHKKGGHHHSGHHEDHHGKKGHHDKGHHDEDHKGHHGKHGHEEHHHHHEDHGKKGGHHGGKKHGYHHG
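To get a feel for how many entries show this symptom, a small check could flag header lines whose text consists only of residue letters. This is a sketch, not part of hh-suite: the function name, the residue alphabet and the 20-character threshold are my own assumptions, and the sample below is shortened from the output above.

```python
# Assumed alphabet: amino-acid letters (upper case for match states,
# lower case for A3M inserts) plus gap characters.
RESIDUE_CHARS = set("ACDEFGHIKLMNPQRSTVWYBXZUO")
ALLOWED = RESIDUE_CHARS | {c.lower() for c in RESIDUE_CHARS} | {"-", "."}


def suspicious_headers(a3m_text, min_len=20):
    """Return (line_number, header_text) for headers that look like sequences."""
    hits = []
    for line_no, line in enumerate(a3m_text.splitlines(), start=1):
        if not line.startswith(">"):
            continue
        body = line[1:].strip()
        # Real headers normally contain '|', '_', spaces or digits,
        # so a long run of pure residue letters is a red flag.
        if len(body) >= min_len and all(ch in ALLOWED for ch in body):
            hits.append((line_no, body))
    return hits


sample = (
    ">consensus_tr|E2BA87|E2BA87_HARSA Putative uncharacterized protein\n"
    "MAKYYTPWVILTLAAVCVVLVCSS\n"
    ">MVERLGIAVEDRSPKLRKQAIRERFVLFKK\n"
    "MAKYVGPWLLLGLAVVCTVVACSS\n"
)
flagged = suspicious_headers(sample)
```

Feeding it the text returned by `ffindex_get` for each entry would give a count of affected headers per alignment.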
## Your Environment
Include as many relevant details about the environment you experienced the issue in.
* Version/Git commit used: 3.3.0 installed via `conda install -c conda-forge -c bioconda hhsuite`
* Operating system and version: Ubuntu 16.04.6
Did you find a solution or workaround for this? I ran into the same thing. I wrote to Martin Steinegger, and he said they probably aren't going to look into or fix this, but they're working on a completely new method of generating MSAs that they might use to put up a new uniboost-like database.
It looks like the uniboost section of the uniclust pipeline (https://github.com/soedinglab/uniclust-pipeline/blob/master/uniclust_workflow.sh) isn't too complicated and should be possible to get running if you have enough compute. I was going to try this next.
Yeah, we are working on a new approach. If you just need a very diverse database to learn from, you could try the BFD (https://bfd.mmseqs.com) instead of bothering with uniboost. The average Uniboost Neff values were also not massively higher than the Uniclust ones, so the benefit is not huge.
The uniclust workflow would need a few changes, since many of the MMseqs2 parameters changed, and it's also quite inefficient due to the direct 90->30 clustering.