ProtTrans

Precomputed embeddings for all human proteins

Open SalvatoreRa opened this issue 2 years ago • 2 comments

Hi everyone,

I would like to know if you have precomputed embeddings for the human proteins (not for other sequences). I am only interested in the vectors for all human proteins (for example, TP53 and its corresponding vector). I want to use the embedding vectors as additional features in some classification models.

Thank you for your help.

SalvatoreRa avatar Sep 06 '22 14:09 SalvatoreRa

Human is here: https://zenodo.org/record/5047020

sacdallago avatar Oct 05 '22 08:10 sacdallago

Thank you very much. I have downloaded the reduced embeddings and the CSV. The key names in the embedding file (h5) are numbers (0,12...). In the CSV I can see the names of the proteins (there are 20395 vectors in the embedding file and 20395 rows in the CSV). So I suppose the proteins are in the same order in both files, and I can use the CSV to retrieve the protein name for each embedding. Is that right?

SalvatoreRa avatar Oct 12 '22 08:10 SalvatoreRa

Perfectly correct: the line indices in the CSV refer to the entries in the H5 file, so you can use the line index of an entry in the CSV to query the corresponding embedding. By now, ProtT5 embeddings for a small set of organisms (incl. human) are also available via UniProt :) https://www.uniprot.org/help/downloads

mheinzinger avatar Oct 18 '22 08:10 mheinzinger
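
For anyone landing here later, the CSV-to-H5 lookup described above could look roughly like the sketch below. It is not taken from the ProtTrans repo and rests on several assumptions that should be checked against the actual files: that the CSV has a header row, that the first data row corresponds to key "0" (0-based), and that the H5 keys are the decimal row indices stored as strings. The column name `gene` and both file names are hypothetical placeholders.

```python
import csv


def name_to_key(csv_path, name_column):
    """Map each protein name to its H5 key.

    Assumption: the H5 key is the 0-based index of the data row
    (header excluded), written as a decimal string such as "0" or "12".
    If a name occurs more than once, the last occurrence wins.
    """
    with open(csv_path, newline="") as fh:
        reader = csv.DictReader(fh)
        return {row[name_column]: str(i) for i, row in enumerate(reader)}


def load_embedding(h5_path, key):
    """Read one embedding from the H5 file as a NumPy array."""
    import h5py  # third-party dependency, imported only when needed

    with h5py.File(h5_path, "r") as f:
        return f[key][:]


# Hypothetical usage; file names and the "gene" column are placeholders:
# keys = name_to_key("human_proteins.csv", "gene")
# tp53_vector = load_embedding("reduced_embeddings.h5", keys["TP53"])
```

Before relying on the mapping, it is worth spot-checking one or two proteins, and confirming that the number of keys in the H5 file matches the number of data rows in the CSV (20395 in the files discussed above).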