How to generate embeddings for a very long protein using ESMC

Open soumitrakp opened this issue 10 months ago • 0 comments

Dear ESMC Developer,

I am trying to generate embeddings for a set of proteins that includes one particularly large protein (>34,000 residues): ENSMUSP00000097561.4, mouse titin (a heart muscle protein). Processing this sequence fails with a CUDA memory limitation error.

I am using the following code snippet:

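from esm.sdk.api import ESM3InferenceClient, ESMProtein, LogitsConfig, LogitsOutput

# Request sequence logits, embeddings, and hidden states.
EMBEDDING_CONFIG = LogitsConfig(
    sequence=True, return_embeddings=True, return_hidden_states=True
)

def embed_sequence(model: ESM3InferenceClient, sequence: str) -> LogitsOutput:
    # Encode the full sequence, then run a single forward pass over it.
    protein = ESMProtein(sequence=sequence)
    protein_tensor = model.encode(protein)
    output = model.logits(protein_tensor, EMBEDDING_CONFIG)
    return output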

Could you please suggest an approach for handling such long proteins?
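
One workaround that might apply is to split the sequence into overlapping windows and embed each window separately. The sketch below only computes the window boundaries. The helper name `window_spans` and the default window and overlap sizes are hypothetical and not part of the ESM SDK. Embeddings computed per window lose context beyond the window, so they would not match full-length embeddings.

```python
from typing import Iterator, Tuple


def window_spans(
    length: int, window: int = 2048, overlap: int = 256
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) spans that cover range(length).

    Consecutive spans share `overlap` positions, except that the last
    span is truncated at `length`. Sizes are illustrative assumptions.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    if length <= 0:
        return
    step = window - overlap  # how far each new window advances
    start = 0
    while True:
        end = min(start + window, length)
        yield start, end
        if end >= length:
            break
        start += step
```

Each chunk `sequence[start:end]` could then be passed to `embed_sequence`. How to merge the per-residue embeddings in the overlapping regions (for example by averaging) is a separate design choice, and I am unsure whether this is the recommended approach.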

Thank you for your time and assistance.

Best regards, Soumitra

Feb 27 '25 14:02 soumitrakp