How to deal with Amino Acids that are not in the vocabulary
Thanks for ESM!
I am trying to generate language model embeddings with ESM2. However, some of my protein sequences contain amino acids that are not in the model's vocabulary.
Currently, I am replacing them with '-'. After reading this response in another issue (https://github.com/facebookresearch/esm/issues/300#issuecomment-1262447466), I think my approach might be a bad idea.
What would be the best approach to deal with the uncommon amino acids that are not in the vocabulary such as 'MSE'/SELENOMETHIONINE?
Thanks!
You could replace them with a mask token, or map them to the closest natural amino acid. FYI, there were some occurrences of ambiguous amino acids (X, B, Z) in the UniRef training data.
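Both options can be sketched in a few lines of plain Python. This is only a minimal sketch, assuming the input is a list of PDB-style three-letter residue names. The helper name `residues_to_sequence` and the `MODIFIED_TO_PARENT` table are hypothetical and made up for illustration, and the table is deliberately incomplete. The `<mask>` string is assumed to match the mask token of the ESM alphabet, so it is worth checking against the alphabet object that is loaded with the model before relying on it.

```python
# Three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Hypothetical, incomplete table: modified residue -> closest natural parent.
MODIFIED_TO_PARENT = {
    "MSE": "M",  # selenomethionine -> methionine
    "SEC": "C",  # selenocysteine -> cysteine
    "PYL": "K",  # pyrrolysine -> lysine
    "SEP": "S",  # phosphoserine -> serine
    "TPO": "T",  # phosphothreonine -> threonine
    "PTR": "Y",  # phosphotyrosine -> tyrosine
}

# Assumption: this is the mask token string used by the ESM alphabet.
MASK_TOKEN = "<mask>"


def residues_to_sequence(residues, fallback=MASK_TOKEN):
    """Build a sequence string from three-letter residue names.

    Canonical residues use their one-letter code, known modified residues
    are mapped to their parent, and anything else becomes `fallback`.
    """
    out = []
    for name in residues:
        name = name.upper()
        if name in THREE_TO_ONE:
            out.append(THREE_TO_ONE[name])
        elif name in MODIFIED_TO_PARENT:
            out.append(MODIFIED_TO_PARENT[name])
        else:
            out.append(fallback)
    return "".join(out)


# MSE is mapped to M; the unknown residue "XYZ" falls back to the mask token.
print(residues_to_sequence(["MET", "MSE", "ALA", "XYZ"]))
# Alternatively, fall back to the ambiguous residue X instead of masking.
print(residues_to_sequence(["MET", "MSE", "ALA", "XYZ"], fallback="X"))
```

Mapping to the parent residue keeps the position informative for the model, while the fallback covers residues with no obvious natural counterpart. Since X, B and Z did appear in the training data, passing `fallback="X"` may be a reasonable alternative to masking.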