
Token for unaligned amino acids

Open VGPReys opened this issue 11 months ago • 3 comments

Hi there,

I would like to generate an embedding for a protein sequence that comes from a multiple sequence alignment (e.g. "AAA--AAA--A").

What kind of token should be used to describe the unaligned amino acids?

My initial guess is that it could be one of these:

  • .
  • -
  • <pad>
  • <mask>

But which one?

Thanks for your help

VGPReys avatar Feb 12 '25 11:02 VGPReys

Why not embed the unaligned sequence?

thomas-a-neil avatar Feb 26 '25 02:02 thomas-a-neil

To be able to use the information contained in the MSA.

VGPReys avatar Mar 13 '25 15:03 VGPReys

ESMC was not trained for aligned sequences. I would embed the unaligned sequences and do a post-hoc alignment of the embeddings.

ebetica avatar Sep 19 '25 21:09 ebetica
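
One possible reading of the "post-hoc alignment" suggestion above, as a minimal sketch rather than anything confirmed in this thread: strip the gap characters, embed the ungapped sequence, then place each per-residue vector back into its MSA column and leave gap columns empty. The helper names (`ungap`, `realign_embeddings`) are hypothetical, and `toy_embed` is only a stand-in for a real model call; it is not part of the esm API. If the model adds special start/end tokens, those rows would need to be dropped before realigning.

```python
from typing import List, Optional, Sequence

# Assumption: "-" and "." are the gap characters used in the alignment.
GAP_CHARS = {"-", "."}

Vector = List[float]


def ungap(aligned: str) -> str:
    """Remove gap characters so the sequence can be embedded as-is."""
    return "".join(ch for ch in aligned if ch not in GAP_CHARS)


def realign_embeddings(
    aligned: str, residue_embeddings: Sequence[Vector]
) -> List[Optional[Vector]]:
    """Return one entry per alignment column: a vector for residues, None for gaps."""
    n_residues = sum(ch not in GAP_CHARS for ch in aligned)
    if n_residues != len(residue_embeddings):
        raise ValueError(
            f"expected {n_residues} residue embeddings, got {len(residue_embeddings)}"
        )
    vectors = iter(residue_embeddings)
    return [None if ch in GAP_CHARS else next(vectors) for ch in aligned]


def toy_embed(sequence: str) -> List[Vector]:
    """Hypothetical placeholder for a model: one small vector per residue."""
    return [[float(i), float(ord(ch))] for i, ch in enumerate(sequence)]


aligned = "AAA--AAA--A"
unaligned = ungap(aligned)
columns = realign_embeddings(aligned, toy_embed(unaligned))
```

With this layout, embeddings from different sequences in the same MSA share column indices, so they can be compared column by column while gap positions are skipped or masked.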