How to Generate Embeddings of protein sequences using ESM C
Hi,
I would like to know how to generate embeddings of protein sequences using ESM C. Is it similar to ESM-2? Is it also possible to generate embeddings from a 3D structure or PDB file?
Further, I have the following query: does ESM-3, ESM-2, or ESM C have a decode option? Meaning, if I get the embedding for a sequence "HMJIYT", can I convert the embedding back into the sequence "HMJIYT" with a decode function? In other words, "HMJIYT" ---> embedding, then embedding ---> "HMJIYT", using an ESM model?
My group wrote a simple wrapper for ESMC if you'd like to interface with it like ESM2 Hugging Face models. There's also a built-in embedding function, so it's easy to embed entire datasets. https://huggingface.co/Synthyra/ESMplusplus_small
So output.last_hidden_state will give the embedding of a protein sequence, just like ESM-2?
Yes, the last hidden state is typically the preferred residue-wise protein embedding.
Thanks. I want to know whether there is a way to convert an embedding back to the corresponding sequence. Let us say [0.9, -34. ...] is the embedding of a sequence "JKLL". Now we update the embedding to [8, 78, 0...]. Can we decode [8, 78, 0...] to get the corresponding protein sequence?
Yep, the sequence head does this. The sequence head returns logits of shape (batch_size, sequence_len, vocab_size), on which you call .argmax(dim=-1) to get (batch_size, sequence_len) predictions of the tokens (amino acids). However, ESMC seems to do a poor job at this if none of the amino acids are masked, see here.
I'm not sure this is a real issue outside of things like an unmasked mutagenesis study.
Hi, is it possible to share a code example of how to convert an embedding to the corresponding sequence? Thanks.
As shown here, you can get the logits like this for ESM++. The official ESMC has examples in the README of this repo.
from transformers import AutoModelForMaskedLM #AutoModel also works
model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True)
tokenizer = model.tokenizer
sequences = ['MPRTEIN', 'MSEQWENCE']
tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
# tokenized['labels'] = tokenized['input_ids'].clone() # correctly mask input_ids and set unmasked instances of labels to -100 for MLM training
output = model(**tokenized) # get all hidden states with output_hidden_states=True
print(output.logits.shape) # language modeling logits, (batch_size, seq_len, vocab_size), (2, 11, 64)
print(output.last_hidden_state.shape) # last hidden state of the model, (batch_size, seq_len, hidden_size), (2, 11, 960)
print(output.loss) # language modeling loss if you passed labels
#print(output.hidden_states) # all hidden states if you passed output_hidden_states=True (in tuple)
logits = output.logits # (batch_size, seq_len, vocab_size)
You can decode back to amino acid letters for either like this:
amino_acid_seq = tokenizer.decode(logits.argmax(dim=-1).cpu().flatten().tolist()).replace(' ', '')
If you had a hidden state and wanted to manually see what the sequence head maps it to, you could do something like this:
hidden_state = ... # (batch_size, seq_len, hidden_size)
logits = model.sequence_head(hidden_state) # (batch_size, seq_len, vocab_size)
Thank you so much. Really appreciate.
Hi,
It is interesting that the embeddings can be converted back to the corresponding amino acid sequence. Is there a way to convert the embeddings into PDB files (like with the sequence) to get the 3D structure of a sequence?
I'm sure this can be done by calling the components of ESM3 in the right order; however, I have not messed with that model a lot. You may want to tag a member of Evolutionary Scale to get some more insight.
Thanks. One more query: apart from speed and lower memory usage, what are the other advantages of ESM++ (ESM C) over ESM-2? One thing I noticed is that for a given sequence, ESM-2 generates an embedding of length 320 whereas ESM++ generates an embedding of length 960. Does ESM++ generate a more informative embedding? If so, how can I capture the unique information that only ESM++ can produce?
There are various versions of ESM2; you can look at the model and embedding sizes in a table here. In general, ESM++ (and ESMC) have more informative embeddings, although ESM2-650 is still an excellent model. We have a graph that showcases this on our model page, direct link here.
Evolutionary Scale has some stats showing other tasks where ESMC greatly outperforms ESM2.
The original ESM2 and Hugging Face implementations are much slower than more modern versions. So unless you are going to use something like FAESM or my FastESM2, I would personally recommend ESM++ small for the vast majority of use cases. For any mask-filling objectives, you may want to consider ESM2-650.
Thanks. I am working to develop an XAI tool to infer protein-to-protein relations. These ESM models generate embedding values for each amino acid, and I see positive and negative values in the embeddings. Is there a way to determine the most important embedding values, i.e. the ones that carry the most pivotal information about the amino acid?
Pivotal or important is a loaded term for embeddings; the information is very abstract, and different portions will be important for some tasks and not others. Ranking the features of embeddings, specifically from pLMs, is an active area of research. See dictionary learning on NLP models from Anthropic, or more recent academic projects doing dictionary learning on pLMs. The Gleghorn Lab is also developing some tools for XAI in pLMs; if you would like to collaborate, please reach out to me here: [email protected]
Thanks for sharing; the pLMs paper seems interesting.
Hi @lhallee, I was generating an embedding for the sequence "MLKG". My understanding is that the ESM model will generate an embedding for each amino acid separately (for M, L, K, G). I see the ESM model generates 6 embedding vectors, and I can understand that the 2 extra vectors are for the CLS and SEP special separator characters. However, which vectors correspond to the special characters? Is it the first vector and the last vector?
Yep, the ESM tokenizer will add CLS and EOS, always at the start and end unless there is padding for batching.
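If you want to check this yourself, here is a minimal sketch using the ESM++ wrapper from earlier in the thread (the exact special-token strings in the comment are an assumption; inspect your tokenizer's actual output):
from transformers import AutoModel

model = AutoModel.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True)
tokenizer = model.tokenizer

tokenized = tokenizer(['MLKG'], return_tensors='pt')
ids = tokenized['input_ids'][0].tolist()
print(tokenizer.convert_ids_to_tokens(ids))  # expect something like ['<cls>', 'M', 'L', 'K', 'G', '<eos>']

output = model(**tokenized)
per_residue = output.last_hidden_state[:, 1:-1, :]  # drop CLS (first) and EOS (last)
print(per_residue.shape)  # (1, 4, hidden_size): one vector per amino acid of MLKG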
Hi @lhallee, is it possible to predict protein-to-protein relations using embeddings from ESM? Let us say there are embeddings for protein-1 and protein-2, respectively e1 and e2. Protein-1 and protein-2 have a relation and are labeled as 1. A model is trained with the embeddings of protein-1 and protein-2 and the corresponding labels. Now, I want to predict the relation between protein-3 and protein-4 using this trained model.
Typically this would require some additional supervised fine-tuning or contrastive learning. However, because similar proteins often produce similar embeddings, you can pool the last hidden state and use a vector similarity metric like cosine similarity to get an idea for shared properties. Additional training is much more reliable though.
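For illustration, here is a minimal sketch of that training-free route with the ESM++ wrapper above (the two sequences are placeholders, and cosine similarity here is only a rough proxy for shared properties, not a trained PPI score):
import torch
import torch.nn.functional as F
from transformers import AutoModel

model = AutoModel.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True)
tokenizer = model.tokenizer

def embed(sequence):
    # One sequence at a time, so there is no padding to mask out here
    tokenized = tokenizer([sequence], return_tensors='pt')
    with torch.no_grad():
        output = model(**tokenized)
    return output.last_hidden_state.mean(dim=1).squeeze(0)  # (hidden_size,)

e1 = embed('MPRTEIN')    # placeholder for protein-1
e2 = embed('MSEQWENCE')  # placeholder for protein-2
print(F.cosine_similarity(e1, e2, dim=0).item())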
@lhallee Thanks for the method! I have a small question: when the input is sequences = ['MSEQWENCE'], the output of amino_acid_seq is 'XMSLKELLLEK'.
Hi @antecede,
This is a known quirk of ESMC - it does not appear to decode meaningfully without mask tokens in the input. See this issue.
Thank you for your prompt reply and the detailed answer. The process you developed is very useful, thank you very much!
I am looking to utilise ESM C embeddings as input for a downstream task. I've got a few questions:
- Does your advice (using the last hidden state & averaging over tokens) also hold for the 6b model?
- The example notebook specifies layer 55 (not layer 80), is there a reason for this?
- Should embeddings first be normalised?
- Should tokens (e.g. EOS) be removed before averaging over the sequence length?
The scope of our study unfortunately does not allow for lots of experimentation on this front.
Please keep in mind I do not work for ESM or speak for them. They may have different advice.
Does your advice (using the last hidden state & averaging over tokens) also hold for the 6b model? The example notebook specifies layer 55 (not layer 80), is there a reason for this?
This should hold. For instance, this example from them:
forge_client = ESM3ForgeInferenceClient(model="esmc-6b-2024-12", url="https://forge.evolutionaryscale.ai", token="<your forge token>")
protein_tensor = forge_client.encode(protein)
logits_output = forge_client.logits(
protein_tensor, LogitsConfig(sequence=True, return_embeddings=True)
)
print(logits_output.logits, logits_output.embeddings)
Returns the last hidden state. I think they showcase layer 55 in that notebook as an example that you can return whichever layer's embeddings you would like.
Importantly, the last hidden state may not be the best embedding for your task. It's good on average, but each layer certainly contains different information. If you do not have the resources or know-how to explore that yourself, I would personally recommend the last hidden state as a default. This paper explores this quirk in depth.
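With the HF-style ESM++ wrapper, a sketch of grabbing an intermediate layer instead of the last one might look like this (the layer index is purely illustrative, and I'm assuming hidden_states comes back as the usual tuple of per-layer tensors):
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('Synthyra/ESMplusplus_small', trust_remote_code=True)
tokenizer = model.tokenizer

tokenized = tokenizer(['MPRTEIN'], return_tensors='pt')
with torch.no_grad():
    output = model(**tokenized, output_hidden_states=True)

print(len(output.hidden_states))                               # one entry per layer
middle = output.hidden_states[len(output.hidden_states) // 2]  # illustrative middle-layer choice
last = output.hidden_states[-1]                                # should match output.last_hidden_state
print(middle.shape, last.shape)                                # (1, seq_len, hidden_size)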
Should embeddings first be normalised?
This also depends on the task. It usually doesn't hurt. I like to add an nn.LayerNorm layer as the first layer for small neural networks trained on embeddings. I find this helps in particular for larger models.
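For example, a tiny downstream probe in that spirit might look like this (the sizes are placeholders; 960 matches the ESM++ small hidden size from the earlier example):
import torch.nn as nn

hidden_size = 960  # ESM++ small embedding size from the earlier example
num_classes = 2    # placeholder for your downstream task

# LayerNorm first, then a small MLP trained on pooled embeddings
probe = nn.Sequential(
    nn.LayerNorm(hidden_size),
    nn.Linear(hidden_size, 256),
    nn.GELU(),
    nn.Linear(256, num_classes),
)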
Should tokens (e.g. EOS) be removed before averaging over the sequence length?
I see most people keep the CLS and EOS tokens in the average, although removing the EOS should not remove a ton of information. Most importantly, the pad tokens should not be included in the average. The CLS is also important to keep; it usually contains a lot of good info because it acts like an attention sink during training (even though it isn't referenced or used in the loss function). The CLS token by itself is a much better pooling method than the EOS token. In principle they should function the same way, but the CLS is always in the same position, which is our current running hypothesis as to why it acts more like an attention sink and holds more info than the EOS token.
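For batched inputs, a pooling sketch along those lines (assuming the tokenizer's attention_mask marks real tokens, including CLS/EOS, with 1 and padding with 0):
import torch

def mean_pool(hidden, attention_mask):
    # hidden: (batch, seq_len, hidden_size), e.g. output.last_hidden_state
    # attention_mask: (batch, seq_len); pad positions are excluded from the average
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

def cls_pool(hidden):
    # CLS is always the first position, so this works even with right-padding
    return hidden[:, 0, :]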
Hi @lhallee, can I use the ESM3 encoder to embed protein structures without sequences? (I want to use just the structure track of the encoder.) Do you think that is possible? Thanks! :)
Hey @123Barry, I believe ESM3 accepts full structures (after tokenization with the ESM3 VQ-VAE) or direct secondary structure tokens. Perhaps this example notebook is helpful. You may also find projects like SAProt useful, which take protein structure inputs in the form of Foldseek tokens. Protokens takes a similar approach to the ESM3 VQ-VAE and may also be helpful. Foldseek uses 20 structure tokens, Protokens is around 400 if I remember correctly, and ESM3 is 4096. I haven't seen them compared, but I suspect the Protokens approach may encode structures the best due to some mutual information tricks. Best, Logan
Thanks a lot for the info and pointers—really helpful! I’ll definitely check out the notebook and look into Protokens and SAProt. Appreciate it! Best, Barry
The embeddings have different shapes if the proteins have different sequence lengths. How do you calculate the similarity/distance using these embeddings?
Often folks will average the embedding across the sequence dimension (mean pooling), which gives a fixed-size vector per protein regardless of length; you can then compare proteins with a metric like cosine similarity.