Why does the embedding generated by ESMC have two more tokens than the sequence length?
When using ESMC to generate embeddings, two additional tokens are added. If the input sequence has length seq_len, the embedding produced by ESMC has shape (1, seq_len + 2, 960) for esmc-300m-2024-12 and (1, seq_len + 2, 1152) for esmc-600m-2024-12. I couldn't find an explanation for these two extra tokens in the official documentation.
My current workaround: when I want per-residue embeddings, I remove the first and last token embeddings so that the length matches the number of residues (the sequence length). This gives me one embedding per residue, which I use as node features for a protein graph. Is there an official explanation for these two extra embeddings?
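Concretely, this is what I do now (a minimal sketch with a stand-in tensor in place of the real ESMC output; the shapes match what I observe):

```python
import torch

seq_len, hidden_dim = 120, 960          # e.g. esmc-300m-2024-12
embeddings = torch.randn(1, seq_len + 2, hidden_dim)  # stand-in for the ESMC output

per_residue = embeddings[:, 1:-1, :]    # drop the first and last token embeddings
assert per_residue.shape == (1, seq_len, hidden_dim)  # one embedding per residue
```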
Hi 595,
I believe the reason is that ESM C was trained with a BERT-like transformer architecture. In BERT-like models, a beginning-of-sequence token [cls] is prepended and an end-of-sequence token [eos] is appended to each input sequence before it is passed through the model. The [eos] token tells the model where the input ends, while the [cls] token was originally used for classification and downstream fine-tuning tasks. See the original BERT paper for more on these tokens, why they are there, and how they can be used [Devlin et al., 2019].
Thank you for your reply. However, without official documentation from the developers, I'm still somewhat unsure whether this is the correct way to use it.
The two extra embeddings do indeed come from the [cls] and [eos] tokens (source: https://github.com/evolutionaryscale/esm/blob/e103c9c1c4047c38e8c7f1215c91f8481268e366/esm/tokenization/sequence_tokenizer.py#L48)
To get per-residue embeddings for the original sequence, you can remove the first and last token embeddings from the output.
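A minimal sketch of the full flow, following the pattern in the evolutionaryscale/esm README. It assumes the open-weights checkpoint name "esmc_300m" and the `esm` Python package; exact model names and API details may differ for the Forge-hosted esmc-300m-2024-12 / esmc-600m-2024-12 endpoints.

```python
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
client = ESMC.from_pretrained("esmc_300m")  # add .to("cuda") if a GPU is available

protein = ESMProtein(sequence=sequence)
protein_tensor = client.encode(protein)     # tokenizer prepends [cls] and appends [eos]
output = client.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)

embeddings = output.embeddings              # shape: (1, len(sequence) + 2, hidden_dim)
per_residue = embeddings[:, 1:-1, :]        # drop [cls] and [eos] token embeddings
assert per_residue.shape[1] == len(sequence)
```

The slice `[:, 1:-1, :]` leaves exactly one embedding per residue, which is what you want for protein graph node features. If you instead want a single sequence-level representation, the [cls] embedding or a mean over the residue embeddings are common choices.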