How is 3D structure captured in the ESM-3 model?

Open anonimoustt opened this issue 1 year ago • 6 comments

Hi,

I was comparing the ESM-3 structure embedding and the ESM-3 sequence embedding, and found that the distance between them is very small (0.0001). I am curious how the ESM-3 model is pre-trained on the 3D structures of protein sequences. Is there a paper or documentation on ESM-3 that explains how it captures 3D structure?

anonimoustt avatar Aug 28 '24 00:08 anonimoustt

Have you seen our biorxiv paper? https://www.biorxiv.org/content/10.1101/2024.07.01.600583v1

ebetica avatar Aug 28 '24 04:08 ebetica

Hi,

Thanks for your reply. I will check it. Could you say specifically whether ESM-3 was trained on the AlphaFold 3D (full-sequence) structures of human protein sequences?

anonimoustt avatar Aug 28 '24 06:08 anonimoustt

Yes, data from AlphaFoldDB was used to train ESM3, and that includes human proteins.

ebetica avatar Aug 29 '24 18:08 ebetica

Thanks for your reply. It is really interesting. I generated ESM-3 sequence embeddings and ESM-3 structure embeddings separately, and found that the clusters produced from the sequence embeddings differ from those produced from the structure embeddings. If ESM-3 captures both sequence and structure, why are the clusters different? I used agglomerative clustering.

To investigate further, I looked at two proteins, Q6P3R8 and Q9BYP7, that fall in the same cluster with ESM-3 sequence embeddings but in different clusters with ESM-3 structure embeddings. Their cosine similarity is 0.96962 with the structure embeddings and 0.9922 with the sequence embeddings. The two values are very close, yet the proteins cluster together only with the sequence embeddings. Shouldn't the clusters be similar for the two kinds of embedding? Why am I getting different clusters or trees?
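One possible reason for the discrepancy, offered as a hedged sketch and not as a statement about ESM-3 itself: agglomerative clustering merges points by their distances relative to all other points, so a pair with a high absolute cosine similarity can still be split if each member has an even closer neighbor. The toy vectors below are made up for illustration (they are not ESM-3 output), and `neighbor_rank` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbor_rank(embeddings, query, target):
    """Rank of `target` among the neighbors of `query` (1 = most similar)."""
    sims = {
        name: cosine_similarity(embeddings[query], vec)
        for name, vec in embeddings.items()
        if name != query
    }
    ordered = sorted(sims, key=sims.get, reverse=True)
    return ordered.index(target) + 1


# Toy 3-d vectors standing in for per-protein embeddings (illustrative only).
sequence_space = {
    "A": [1.00, 0.10, 0.00],
    "B": [1.00, 0.12, 0.00],
    "C": [1.00, 0.40, 0.00],
}
structure_space = {
    "A": [1.00, 0.10, 0.00],
    "B": [1.00, 0.30, 0.00],
    "C": [1.00, 0.15, 0.00],
}

# The A-B similarity stays high in both spaces, but B is A's nearest
# neighbor only in the first one; in the second, C is closer to A.
rank_in_sequence = neighbor_rank(sequence_space, "A", "B")
rank_in_structure = neighbor_rank(structure_space, "A", "B")
ab_structure_similarity = cosine_similarity(
    structure_space["A"], structure_space["B"]
)
```

Running the same rank check on the real Q6P3R8 and Q9BYP7 embeddings, against all other proteins in the dataset, would show whether either protein has closer neighbors in the structure space than in the sequence space.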

anonimoustt avatar Aug 29 '24 19:08 anonimoustt

Hi, is it possible to re-train the ESM-3 model with structure and sequence?

anonimoustt avatar Aug 31 '24 13:08 anonimoustt

The embeddings have different shapes when the sequences have different lengths. How did you compute the similarity?
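For readers with the same question, one common way to compare proteins of different lengths is to mean-pool the per-residue embeddings into a single fixed-size vector before computing cosine similarity. The thread does not say which method was actually used, so the sketch below is only an assumption; it takes each embedding to be an L x D list of per-residue vectors, uses toy numbers, and `mean_pool` is a hypothetical helper name.

```python
import math


def mean_pool(per_residue):
    """Average an L x D list of per-residue vectors into one D-dim vector."""
    length = len(per_residue)
    dim = len(per_residue[0])
    return [sum(row[j] for row in per_residue) / length for j in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy per-residue embeddings with different lengths (L=3 and L=2, D=2).
protein_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
protein_b = [[2.0, 2.0], [0.0, 0.0]]

# After pooling, both proteins are D-dimensional and directly comparable.
pooled_a = mean_pool(protein_a)
pooled_b = mean_pool(protein_b)
similarity = cosine_similarity(pooled_a, pooled_b)
```

Mean pooling discards residue order and tends to push all pooled vectors toward a shared average direction, which may be part of why the cosine similarities reported in this thread are all close to 1.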

shijiale0609 avatar Jul 06 '25 04:07 shijiale0609