
Why does the embedding generated by ESMC have two more tokens than the sequence length?


When using ESMC to generate embeddings, two additional tokens are added. If the input sequence has length seq_len, the embedding generated by ESMC has shape (1, seq_len + 2, 960) for esmc-300m-2024-12 and (1, seq_len + 2, 1152) for esmc-600m-2024-12. Based on my attempts, I couldn't find an explanation for these two extra tokens in the official documentation.

My current approach: to get per-residue embeddings, I remove the first and last token embeddings so that the length matches the number of residues (the sequence length). This gives me one embedding per residue (to build protein graph node features). Is there any official explanation for the presence of these two extra embeddings?
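For context, here is a minimal sketch of how I generate the embeddings; the checkpoint alias, device, and example sequence are illustrative rather than my exact setup:

```python
# Minimal sketch (illustrative checkpoint alias and sequence; adjust to your setup).
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
protein = ESMProtein(sequence=sequence)

client = ESMC.from_pretrained("esmc_300m").to("cpu")  # 960-dim hidden size
protein_tensor = client.encode(protein)
output = client.logits(
    protein_tensor, LogitsConfig(sequence=True, return_embeddings=True)
)

# Observed shape: (1, len(sequence) + 2, 960) -- two more positions than residues.
print(output.embeddings.shape)
```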

595693085 avatar May 15 '25 08:05 595693085

Hi 595,

I believe the reason lies in the fact that ESM C was trained using a BERT-like transformer architecture. In BERT-like models, a beginning-of-sequence token ([cls]) is prepended and an end-of-sequence token ([eos]) is appended to each input sequence before it is passed through the model. The [eos] token tells the model where the input ends, and the [cls] token was originally used for classification and downstream fine-tuning tasks. Check out the original BERT paper to learn more about these tokens, why they're there, and how they can be used [Ref. Devlin J., et al., 2019].
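As a quick illustration, assuming the EsmSequenceTokenizer in this repo exposes the usual Hugging Face tokenizer interface, you can see the two special tokens being added at the tokenizer level:

```python
# Sketch: inspect the special tokens added around a sequence
# (assumes EsmSequenceTokenizer behaves like a standard Hugging Face tokenizer).
from esm.tokenization.sequence_tokenizer import EsmSequenceTokenizer

tokenizer = EsmSequenceTokenizer()
ids = tokenizer("MKTAYIAK")["input_ids"]

print(len(ids))  # 8 residues + 2 special tokens = 10
print(tokenizer.convert_ids_to_tokens(ids))
# Expected (assumption): ['<cls>', 'M', 'K', 'T', 'A', 'Y', 'I', 'A', 'K', '<eos>']
```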

Josh-Almonte avatar May 21 '25 21:05 Josh-Almonte

Thank you for your reply. However, without official documentation from the developers, I'm still somewhat unsure whether this is the correct way to use it.

595693085 avatar Aug 04 '25 09:08 595693085

The two extra embeddings do indeed come from the [cls] and [eos] tokens (source: https://github.com/evolutionaryscale/esm/blob/e103c9c1c4047c38e8c7f1215c91f8481268e366/esm/tokenization/sequence_tokenizer.py#L48)

To get per-residue embeddings for the original sequence, you can remove the first and last token embeddings from the output.
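For example, a minimal sketch reusing the hypothetical `output` and `sequence` from the snippet earlier in this thread:

```python
# Drop position 0 (<cls>) and the last position (<eos>) to keep one embedding per residue.
per_residue = output.embeddings[0, 1:-1, :]  # shape: (len(sequence), hidden_dim)
assert per_residue.shape[0] == len(sequence)
```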

imathur1 avatar Oct 29 '25 20:10 imathur1