foldcomp
`highquality_clust30` - fragmented sequences split on undetermined amino acid
Hello!
I've tried using `highquality_clust30` as a reference and ran into the following issue.
The database has around 200k repeated entries. They appear to be fragments of proteins that were split at each undetermined amino acid (X).
(The additional header information was removed; my FASTA files store only the unique MG IDs, for indexing with `samtools faidx`.)
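For anyone who wants to check the number of repeated entries themselves, here is a minimal sketch. It assumes the `.lookup` file is whitespace-separated with the entry ID in the second column, as in the examples below. The function name `repeated_ids` is my own, and the third sample line is a made-up ID added only for contrast.

```python
from collections import Counter


def repeated_ids(lookup_lines):
    """Return {id: count} for IDs that appear on more than one lookup line.

    Assumes each line looks like "<index> <id> <set>", whitespace-separated.
    """
    counts = Counter()
    for line in lookup_lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            counts[fields[1]] += 1
    return {entry_id: n for entry_id, n in counts.items() if n > 1}


# Two lines copied from the examples in this issue, plus one
# hypothetical unique entry (the last line is NOT from the database).
sample = [
    "31381065 MGYP003343806611 0",
    "31381071 MGYP003343806611 0",
    "99999999 MGYP000000000001 0",
]
print(repeated_ids(sample))
```

On the real file, the same function can be fed an open file handle, e.g. `repeated_ids(open("highquality_clust30.lookup"))`, since it only iterates over lines.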
Example 1
> grep "MGYP003384474486" highquality_clust30.lookup
32543322 MGYP003384474486 0
32543327 MGYP003384474486 0
32543390 MGYP003384474486 0
32543528 MGYP003384474486 0
32543587 MGYP003384474486 0
> zgrep -A 1 "MGYP003384474486" highquality_clust30.fasta.gz
>MGYP003384474486
MFSSKCNLCR
--
>MGYP003384474486
IDQER
--
>MGYP003384474486
KYNEVKIY
--
>MGYP003384474486
ETIIGIYDF
--
>MGYP003384474486
FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK
When I query the ESM API, I get:
{"sequence": "MFSSKCNLCRXIDQERXKYNEVKIYXETIIGIYDFXFLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK"}
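To illustrate why I think the entries were split on X: joining the five fragments above with a single `X` between them reproduces the ESM API sequence exactly. This is only a sketch of that check; `join_fragments` is a helper name I made up, not part of foldcomp.

```python
def join_fragments(fragments, unknown="X"):
    """Rejoin fragments, placing one unknown residue between neighbours."""
    return unknown.join(fragments)


# Fragments as they appear in highquality_clust30.fasta.gz for MGYP003384474486
fragments = [
    "MFSSKCNLCR",
    "IDQER",
    "KYNEVKIY",
    "ETIIGIYDF",
    "FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK",
]

# Sequence returned by the ESM API for the same ID
esm_sequence = (
    "MFSSKCNLCRXIDQERXKYNEVKIYXETIIGIYDFX"
    "FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK"
)

print(join_fragments(fragments) == esm_sequence)
```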
Example 2
> grep "MGYP003343806611" highquality_clust30.lookup
31381065 MGYP003343806611 0
31381071 MGYP003343806611 0
> zgrep -A 1 "MGYP003343806611" highquality_clust30.fasta.gz
>MGYP003343806611
MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWA
--
>MGYP003343806611
YY
The ESM API returns:
{"sequence": "MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWAXYYX"}
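Note that in Example 2 the ESM sequence ends with an `X`, so a plain join of the fragments does not match. A more general check is to split the ESM sequence on `X`, drop empty pieces, and compare against the FASTA records grouped by ID. The sketch below does that on the `zgrep` output above; both helper names are hypothetical, and the parser assumes the `--` lines are only `grep` group separators.

```python
from collections import defaultdict


def group_fasta_by_id(fasta_text):
    """Collect the sequences of all records that share a header ID."""
    groups = defaultdict(list)
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line == "--":  # blank line or grep separator
            continue
        if line.startswith(">"):
            header = line[1:].split(None, 1)
            current = header[0] if header else ""
            groups[current].append("")  # start a new record
        elif current is not None:
            groups[current][-1] += line  # sequence may span several lines
    return dict(groups)


def fragments_of(sequence, unknown="X"):
    """Split a sequence on the unknown residue and drop empty pieces."""
    return [part for part in sequence.split(unknown) if part]


zgrep_output = """\
>MGYP003343806611
MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWA
--
>MGYP003343806611
YY
"""
esm_sequence = (
    "MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWAXYYX"
)

records = group_fasta_by_id(zgrep_output)
print(records["MGYP003343806611"] == fragments_of(esm_sequence))
```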