How to deal with Amino Acids that are not in the vocabulary

Open · HannesStark opened this issue 3 years ago · 1 comment

Thanks for ESM!

I am trying to generate language model embeddings with ESM2. However, some of the protein sequences that I have contain AAs that are not in the vocabulary of the language model.

Currently, I am replacing them with '-'. After reading this response in another issue (https://github.com/facebookresearch/esm/issues/300#issuecomment-1262447466), I think my approach might be a bad idea.

What would be the best approach to deal with uncommon amino acids that are not in the vocabulary, such as selenomethionine (MSE)?

Thanks!

HannesStark, Oct 01 '22 21:10

You could replace them with a mask token, or map them to the closest natural amino acid. FYI, there were some occurrences of ambiguous amino acids (X, B, Z) in the UniRef training data.
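
A minimal sketch of both options, assuming the residues come as three-letter PDB-style names. The `PARENT_RESIDUE` table and the `residues_to_esm_sequence` helper are hypothetical names for illustration, not part of ESM, and the table only lists a few well-known modified residues. Anything that cannot be mapped falls back to the literal string `<mask>`.

```python
# Standard three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Illustrative, non-exhaustive map from modified residues to their
# closest natural parent (assumption: extend this for your own data).
PARENT_RESIDUE = {
    "MSE": "MET",  # selenomethionine
    "SEP": "SER",  # phosphoserine
    "TPO": "THR",  # phosphothreonine
    "PTR": "TYR",  # phosphotyrosine
}

# Assumption: the tokenizer in use recognises this literal as its mask token.
MASK_TOKEN = "<mask>"


def residues_to_esm_sequence(residue_names, fallback=MASK_TOKEN):
    """Convert three-letter residue names to a sequence string.

    Known modified residues are first mapped to their natural parent;
    anything still unknown is replaced by `fallback`.
    """
    tokens = []
    for name in residue_names:
        name = name.upper()
        name = PARENT_RESIDUE.get(name, name)
        tokens.append(THREE_TO_ONE.get(name, fallback))
    return "".join(tokens)


if __name__ == "__main__":
    # "XYZ" stands in for a residue with no obvious natural parent.
    print(residues_to_esm_sequence(["MSE", "ALA", "SEP", "XYZ", "GLY"]))
```

Passing `fallback="X"` instead would use the ambiguous-residue letter mentioned above rather than a mask token; which choice gives better embeddings is not settled in this thread.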

tomsercu, Oct 14 '22 22:10