Token for unaligned amino acids
Hi there,
I would like to generate an embedding for a protein sequence that comes from a multiple sequence alignment (e.g. "AAA--AAA--A").
What kind of token should be used to represent the unaligned amino acids?
My initial guess is that it could be one of these:
- .
- -
- <pad>
- <mask>
But which one?
Thanks for your help
Why not embed the unaligned sequence?
To be able to use the information contained in the MSA.
ESMC was not trained on aligned sequences. I would embed the unaligned sequences and do a post-hoc alignment of the embeddings.
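For anyone landing here, a minimal sketch of what that post-hoc alignment could look like, using only NumPy. It assumes you already have one embedding vector per residue of the ungapped sequence, with any special tokens such as BOS/EOS stripped. The helper names `ungap` and `align_embeddings` are made up for illustration and are not part of the esm package, and the random array is only a stand-in for real model output.

```python
import numpy as np

GAP_CHARS = "-."  # assumption: gaps in the MSA are written as '-' or '.'


def ungap(aligned_seq, gap_chars=GAP_CHARS):
    """Remove gap characters so the raw sequence can be given to the model."""
    return "".join(c for c in aligned_seq if c not in gap_chars)


def align_embeddings(aligned_seq, residue_embeddings,
                     gap_chars=GAP_CHARS, fill_value=np.nan):
    """Place per-residue embeddings into alignment columns.

    Returns an array of shape (alignment_length, embedding_dim) in which
    gap columns are filled with `fill_value`.
    """
    residue_embeddings = np.asarray(residue_embeddings, dtype=float)
    # Boolean mask: True where the alignment column holds a residue.
    is_residue = np.array([c not in gap_chars for c in aligned_seq])
    if int(is_residue.sum()) != residue_embeddings.shape[0]:
        raise ValueError("number of residues does not match number of embeddings")
    out = np.full((len(aligned_seq), residue_embeddings.shape[1]),
                  fill_value, dtype=float)
    out[is_residue] = residue_embeddings  # scatter rows into residue columns
    return out


aligned = "AAA--AAA--A"
raw = ungap(aligned)  # this is the sequence you would embed

# Stand-in for real model output: one 4-dimensional vector per residue.
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(raw), 4))

aligned_emb = align_embeddings(aligned, emb)
print(aligned_emb.shape)
```

Using NaN for the gap columns keeps them easy to mask out later, for example with `np.nanmean` when averaging a column across the sequences of the MSA. Zeros would work too, but they are harder to tell apart from real values.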