ProtTrans
Precomputed embeddings for all human proteins
Hi everyone,
I would like to know whether you have precomputed embeddings for the human proteins only (not for other sequences). I am just interested in the vectors for all human proteins (for example, TP53 and its corresponding vector). I want to use the embedding vectors as additional features in some classification models.
Thank you for your help.
The human embeddings are here: https://zenodo.org/record/5047020
Thank you very much, I have downloaded the reduced embeddings and the CSV. The key names in the embedding file (h5) are numbers (0,12...). In the CSV I can see the names of the proteins (there are 20395 vectors in the embedding file and 20395 rows in the CSV). So I suppose the rows of the CSV are in the same order as the entries in the embedding file, and I can use the CSV to retrieve the protein name for each vector. Is that right?
Perfectly correct: the line indices in the CSV refer to the entries in the H5 file. So you can use the line index of an entry in the CSV to query the corresponding embedding. By now, ProtT5 embeddings for a small set of organisms (incl. Human) are also available via UniProt :) --> https://www.uniprot.org/help/downloads
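For anyone landing here later, the lookup described above could be sketched roughly as follows. This is only a minimal sketch under several assumptions that are not confirmed in this thread: that the CSV has a header row, that the protein name is in the first column, that the H5 keys are the 0-based row indices stored as strings, and that each embedding is a top-level dataset. The function names and parameters are made up for illustration, so please check them against the actual files.

```python
import csv
from typing import Any, Dict, Mapping


def map_names_to_embeddings(
    csv_path: str,
    embeddings: Mapping[str, Any],
    name_column: int = 0,     # assumption: protein name is in the first column
    has_header: bool = True,  # assumption: the CSV starts with a header row
) -> Dict[str, Any]:
    """Pair each CSV row with the embedding stored under its row index."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    if has_header:
        rows = rows[1:]

    # The thread reports equal counts (20395 rows and 20395 vectors),
    # so a mismatch most likely means one of the assumptions is wrong.
    if len(rows) != len(embeddings):
        raise ValueError(
            f"{len(rows)} CSV rows but {len(embeddings)} embeddings"
        )

    # Assumption: keys are 0-based row indices stored as strings ("0", "1", ...).
    return {
        row[name_column]: embeddings[str(index)]
        for index, row in enumerate(rows)
    }


def load_from_h5(h5_path: str, csv_path: str, **kwargs: Any) -> Dict[str, Any]:
    """Read all embeddings from an H5 file and key them by protein name."""
    # h5py is a third-party package; it is imported here so that the
    # mapping helper above can be used and tested without it.
    import h5py

    with h5py.File(h5_path, "r") as h5_file:
        # Copy every dataset into memory as a NumPy array before the file closes.
        in_memory = {key: h5_file[key][()] for key in h5_file.keys()}
    return map_names_to_embeddings(csv_path, in_memory, **kwargs)
```

Usage would then be something like `vectors = load_from_h5("embeddings.h5", "proteins.csv")` followed by `vectors["TP53"]`, where both file names are placeholders and the exact form of the name (gene name, accession, etc.) depends on what the CSV column actually contains.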