How to generate embeddings for a very long protein using ESMC

Open soumitrakp opened this issue 10 months ago • 0 comments

Dear ESMC Developer,

I am trying to generate embeddings for a set of proteins that includes one particularly large protein of more than 34,000 residues: ENSMUSP00000097561.4 (titin, a mouse heart muscle gene). When I process this sequence, I get a CUDA out-of-memory error.

I am using the following code snippet:

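from esm.sdk.api import ESM3InferenceClient, ESMProtein, LogitsConfig, LogitsOutput

# Request sequence logits, final embeddings, and hidden states.
EMBEDDING_CONFIG = LogitsConfig(
    sequence=True, return_embeddings=True, return_hidden_states=True
)

def embed_sequence(model: ESM3InferenceClient, sequence: str) -> LogitsOutput:
    # The whole sequence is encoded and run in one forward pass,
    # which is where the memory error occurs for the >34,000-residue protein.
    protein = ESMProtein(sequence=sequence)
    protein_tensor = model.encode(protein)
    output = model.logits(protein_tensor, EMBEDDING_CONFIG)
    return output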

Could you please suggest an approach for handling very long proteins like this one?

Thank you for your time and assistance.

Best regards, Soumitra

soumitrakp, Feb 27 '25 14:02
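
One possible workaround, offered as a sketch and not as a confirmed ESM C recipe, is to split the sequence into overlapping windows, embed each window separately, and average the per-residue vectors where windows overlap. The names `window_spans`, `embed_long`, and `embed_fn`, and the default window and overlap sizes, are all hypothetical. `embed_fn` is assumed to wrap something like `embed_sequence` above and to return exactly one vector per residue, so any special-token positions in the model output would need to be stripped first. Note that each window only sees its own residues, so the result is an approximation of a full-length embedding, not an equivalent.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def window_spans(length: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans that together cover range(length)."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if length <= 0:
        return []
    step = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, length)
        spans.append((start, end))
        if end >= length:
            break
        start += step
    return spans


def embed_long(
    sequence: str,
    embed_fn: Callable[[str], List[Vector]],  # hypothetical: one vector per residue
    window: int = 2048,   # assumed value; tune to fit GPU memory
    overlap: int = 256,   # assumed value; gives residues near a cut some context
) -> List[Vector]:
    """Embed a long sequence window by window and average the overlaps."""
    sums: List[Vector] = []
    counts = [0] * len(sequence)
    for start, end in window_spans(len(sequence), window, overlap):
        chunk = embed_fn(sequence[start:end])
        if len(chunk) != end - start:
            raise ValueError("embed_fn must return one vector per residue")
        if not sums:
            # Allocate the accumulator once the embedding width is known.
            sums = [[0.0] * len(chunk[0]) for _ in sequence]
        for offset, vector in enumerate(chunk):
            row = sums[start + offset]
            for dim, value in enumerate(vector):
                row[dim] += value
            counts[start + offset] += 1
    # Every residue is covered by at least one window, so counts are >= 1.
    return [[v / c for v in row] for row, c in zip(sums, counts)]
```

In practice the accumulation would be done with tensors kept on the CPU, so that only one window at a time occupies GPU memory; plain lists are used here only to keep the sketch dependency-free.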