How is 3D structure captured in the ESM-3 model?
Hi,
I was comparing the ESM-3 structure embedding and the ESM-3 sequence embedding, and found that the distance between the two embeddings is very small (0.0001). I am curious how the ESM-3 model is pre-trained on the 3D structures of protein sequences. Is there a paper or documentation on ESM-3 that explains how it captures 3D structure?
Have you seen our bioRxiv paper? https://www.biorxiv.org/content/10.1101/2024.07.01.600583v1
Hi,
Thanks for your reply. I will check it. Could you say specifically whether ESM-3 was trained on AlphaFold 3D structures (full-sequence structures) of human protein sequences?
Yes, data from the AlphaFoldDB was used to train ESM3, and that includes human proteins.
Thanks for your reply. It is really interesting. I generated embeddings from ESM-3 sequence input and from ESM-3 structure input separately, and found that the clusters obtained from the sequence embeddings differ from the clusters obtained from the structure embeddings. If ESM-3 captures both sequence and structure, why are the clusters different? I used Agglomerative Clustering. To investigate further, I looked at two proteins, Q6P3R8 and Q9BYP7, that appear together in the sequence-based clustering but not in the structure-based clustering. Using the ESM-3 structure embeddings, their cosine similarity is 0.96962. Using the ESM-3 sequence embeddings, their cosine similarity is 0.9922. The two values are very close, yet Q6P3R8 and Q9BYP7 fall in the same cluster with the sequence embeddings and in separate clusters with the structure embeddings. Shouldn't the clusters be similar for the sequence and structure embeddings? Why am I getting different clusters or trees?
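One way to look into this: a cosine similarity of 0.97 can still be comparatively low if most pairs in that embedding space score higher, so it may help to rank a pair's similarity against all other pairs rather than read the raw value. Below is a minimal sketch of that check. The random matrix stands in for pooled embeddings, and `pairwise_cosine` and `pair_percentile` are hypothetical helper names, not part of the ESM-3 API.

```python
import numpy as np

def pairwise_cosine(X: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of rows of X (n_proteins, dim)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def pair_percentile(sim: np.ndarray, i: int, j: int) -> float:
    """Fraction of distinct pairs whose similarity is <= that of pair (i, j)."""
    upper = np.triu_indices(sim.shape[0], k=1)  # each unordered pair once
    return float(np.mean(sim[upper] <= sim[i, j]))

# Placeholder data (assumption): 50 pooled embeddings that share a large
# common component, so every pairwise cosine similarity is close to 1.
rng = np.random.default_rng(0)
X = 1.0 + 0.1 * rng.normal(size=(50, 32))

sim = pairwise_cosine(X)
pct = pair_percentile(sim, 0, 1)
print(f"cosine(0, 1) = {sim[0, 1]:.4f}, percentile among all pairs = {pct:.2f}")
```

Running the same check on the sequence embeddings and on the structure embeddings would show whether a pair such as Q6P3R8 and Q9BYP7 ranks differently in the two spaces, which is what a hierarchical clustering responds to.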
Hi, is it possible to re-train the ESM-3 model with both structure and sequence?
The embeddings have different shapes if the sequences have different lengths. How do you do the similarity calculations?
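For reference, one common way to compare per-residue embeddings of different lengths is to mean-pool over the residue axis and then take the cosine similarity of the pooled vectors. The sketch below assumes each embedding is a `(length, dim)` array; the random arrays are placeholders for real model output, `mean_pool` and `cosine_similarity` are hypothetical helper names, and this is not necessarily the method used in the posts above.

```python
import numpy as np

def mean_pool(per_residue: np.ndarray) -> np.ndarray:
    """Average a (length, dim) per-residue embedding into one (dim,) vector."""
    return per_residue.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors of equal size."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings (assumption): two proteins of different lengths
# with the same embedding dimension.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(120, 16))  # protein A: 120 residues
emb_b = rng.normal(size=(345, 16))  # protein B: 345 residues

vec_a = mean_pool(emb_a)  # shape (16,)
vec_b = mean_pool(emb_b)  # shape (16,)
sim = cosine_similarity(vec_a, vec_b)
print(vec_a.shape, vec_b.shape, sim)
```

Mean pooling discards positional information, so two proteins with similar overall composition can look close even when they differ locally. Other pooling choices (for example a single special-token position, if the model provides one) may give different results.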