
`highquality_clust30` - fragmented sequences split on undetermined amino acid

Open valentynbez opened this issue 3 months ago • 0 comments

Hello! I tried using highquality_clust30 as a reference and ran into the following issue. The database has around 200k repeated entries, which appear to be fragments of proteins that were split at each X (undetermined) amino acid. (I removed the additional information from the headers; my FASTAs store only the unique MG IDs, for indexing with samtools faidx.)

Example 1

> grep "MGYP003384474486" highquality_clust30.lookup                                                                                                                        
32543322        MGYP003384474486        0
32543327        MGYP003384474486        0
32543390        MGYP003384474486        0
32543528        MGYP003384474486        0
32543587        MGYP003384474486        0
> zgrep -A 1 "MGYP003384474486" highquality_clust30.fasta.gz                                                                                                                         
>MGYP003384474486
MFSSKCNLCR
--
>MGYP003384474486
IDQER
--
>MGYP003384474486
KYNEVKIY
--
>MGYP003384474486
ETIIGIYDF
--
>MGYP003384474486
FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK

When I query the ESM API for the same ID, I get the full sequence with X residues at the fragment boundaries:

{"sequence": "MFSSKCNLCRXIDQERXKYNEVKIYXETIIGIYDFXFLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK"}
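The fragments listed under this ID are consistent with the API sequence having been split at every X. Here is a minimal Python check. It assumes the split happens on the literal character X and that empty pieces are dropped; both are guesses about how the database was built, not confirmed behavior. The helper name `split_on_x` is made up for illustration.

```python
# Sequence returned by the ESM API for MGYP003384474486 (copied from above).
api_sequence = (
    "MFSSKCNLCRXIDQERXKYNEVKIYXETIIGIYDFX"
    "FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK"
)

# Fragments found under the same ID in highquality_clust30.fasta.gz.
fasta_fragments = [
    "MFSSKCNLCR",
    "IDQER",
    "KYNEVKIY",
    "ETIIGIYDF",
    "FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK",
]


def split_on_x(sequence: str) -> list[str]:
    """Split on 'X' and drop empty pieces (assumed behavior, not confirmed)."""
    return [piece for piece in sequence.split("X") if piece]


# True when the FASTA fragments are exactly the X-delimited pieces.
fragments_match = split_on_x(api_sequence) == fasta_fragments
print(fragments_match)
```

If the comparison holds, the fragments could in principle be rejoined with `"X".join(...)`, although a trailing X in the original sequence would not be recoverable that way.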

Example 2

> grep "MGYP003343806611" highquality_clust30.lookup                                                                                                                        
31381065        MGYP003343806611        0
31381071        MGYP003343806611        0
>zgrep -A 1 "MGYP003343806611" highquality_clust30.fasta.gz
>MGYP003343806611
MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWA
--
>MGYP003343806611
YY

The ESM API returns:

{"sequence": "MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWAXYYX"}
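To gauge how many IDs are affected, the lookup file can be grouped by its second column. This is a rough Python sketch. It assumes the whitespace-separated three-column layout seen in the grep outputs above, and the helper name `repeated_ids` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def repeated_ids(lookup_lines: Iterable[str]) -> dict[str, int]:
    """Return {accession: count} for accessions that occur more than once."""
    counts = Counter(
        line.split()[1] for line in lookup_lines if line.strip()
    )
    return {acc: n for acc, n in counts.items() if n > 1}


# Lines copied from the two grep outputs above.
sample = [
    "32543322        MGYP003384474486        0",
    "32543327        MGYP003384474486        0",
    "32543390        MGYP003384474486        0",
    "32543528        MGYP003384474486        0",
    "32543587        MGYP003384474486        0",
    "31381065        MGYP003343806611        0",
    "31381071        MGYP003343806611        0",
]

duplicates = repeated_ids(sample)
print(duplicates)
```

On the real file, the same function can be fed an open file handle, e.g. `with open("highquality_clust30.lookup") as fh: repeated_ids(fh)`, since it only iterates over lines.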

valentynbez, Mar 27 '24 15:03