Search script and OMG database
Hi,
Is there a Python script for search and retrieval tasks that can be used without the server? How do I do residue-level and sequence-level retrieval as shown in the paper?
Also, I don't see the OMG database weights or faiss index files. I would be glad if you could comment on this.
Regards, Rakesh
Hi Rakesh,
> Is there a Python script for search and retrieval tasks that can be used without the server? How do I do residue-level and sequence-level retrieval as shown in the paper?
We don't directly provide a Python script for retrieval tasks. However, you can easily implement it yourself by following the steps below:
- Load the ProTrek model and calculate your query embedding. See here.
- Load the faiss index directly. If you have already downloaded the faiss index files from here, open the folder and you will find .index files and the corresponding ids.tsv files.
An example retrieval would be:
import faiss
# Load model and get query embedding
...
# Load faiss index
path = "/your/path/to/***.index"
index = faiss.read_index(path)
index.metric_type = faiss.METRIC_INNER_PRODUCT
# Start retrieval
topk = 10
scores, ranks = index.search(query_embedding, topk)
# Once you get ranks, you could load "ids.tsv" to get the information of retrieved candidates
...
For more details about using the faiss library, please refer to the official GitHub repo here. If you want to perform residue-level or sequence-level retrieval, you only need to load the corresponding .index file and follow the steps above.
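The steps above can be wrapped into a small helper. This is a minimal sketch, not an official ProTrek script: `prepare_query` and `search` are hypothetical names, and the query embedding is assumed to be something NumPy can convert (for a PyTorch tensor, call `.detach().cpu().numpy()` first). faiss expects a float32 array of shape (n_queries, dim), which is the main thing the helper enforces. Whether the query should be L2-normalized depends on how the index was built, which this thread does not state, so the sketch leaves the vector unchanged.

```python
import numpy as np


def prepare_query(embedding):
    """Return a C-contiguous float32 array of shape (n_queries, dim) for faiss."""
    query = np.asarray(embedding, dtype=np.float32)
    if query.ndim == 1:
        query = query[None, :]  # a single vector becomes a batch of one
    if query.ndim != 2:
        raise ValueError("expected a 1-D or 2-D embedding, got %d-D" % query.ndim)
    return np.ascontiguousarray(query)


def search(index_path, embedding, topk=10):
    """Load a .index file and return (scores, ranks) for the query."""
    import faiss  # imported lazily so prepare_query can be used without faiss

    index = faiss.read_index(index_path)
    index.metric_type = faiss.METRIC_INNER_PRODUCT  # mirrors the snippet above
    query = prepare_query(embedding)
    if query.shape[1] != index.d:
        raise ValueError(
            "query dim %d does not match index dim %d" % (query.shape[1], index.d)
        )
    # scores and ranks both have shape (n_queries, topk)
    return index.search(query, topk)
```

A dimension mismatch usually means the query was embedded with a different model or at a different level (residue vs. sequence) than the index, so the check fails early with a readable message.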
> I don't see the OMG database weights or faiss index files. I would be glad if you could comment on this.
Yes. We have not uploaded the index files for the other databases yet because of their large size (hundreds of GB). You can use our online server to access these databases. We are working hard on making this more convenient and efficient :)
Best regards, Jin
We will release all 3 billion protein embeddings (TB-level) once the paper is accepted. Not now, very sorry. In the meantime, you can use our shared model weights to generate embeddings for the proteins you are interested in.
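For a modest number of self-generated embeddings, a faiss index is not strictly required. The following is a minimal sketch of brute-force inner-product search in NumPy, swapped in here as a simpler stand-in for faiss. `topk_inner_product` is a hypothetical helper, and the database is assumed to fit in memory as a (n_proteins, dim) array.

```python
import numpy as np


def topk_inner_product(query, database, topk=10):
    """Return (scores, ranks) of the topk database rows by inner product."""
    query = np.asarray(query, dtype=np.float32).reshape(1, -1)
    database = np.asarray(database, dtype=np.float32)
    if database.ndim != 2 or database.shape[1] != query.shape[1]:
        raise ValueError("database must have shape (n, %d)" % query.shape[1])
    scores = (database @ query.T).ravel()  # one score per database row
    k = min(topk, scores.shape[0])
    order = np.argsort(-scores)[:k]  # highest score first
    return scores[order], order
```

The returned ranks are 0-based row numbers into `database`, the same convention faiss uses for its search results.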
You can also use the API if you do not want to use the interface. http://search-protrek.com/?view=api
Hi, I have a question: how do I map the ranks to the ids.tsv file you mentioned earlier in order to get the sequences?
Hi,
The sequence file and the id file are aligned line by line, i.e. the id of the 5th sequence is on the 5th line of the id file. A rank returned by the search therefore points to the same line in both files, so you can look up the sequence and its id by that rank.
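The lookup can be sketched as follows. This assumes the id file (and likewise the sequence file) holds one record per line, and it relies on faiss returning 0-based positions, so rank 4 is the 5th line. faiss also pads with -1 when it finds fewer than topk results, so those entries are skipped. `lookup_lines` and `ranks_to_records` are hypothetical helper names, not part of the ProTrek repo.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def lookup_lines(path: str, ranks: Iterable[int]) -> Dict[int, str]:
    """Return {rank: line} for the requested 0-based ranks, reading the file once."""
    wanted = {int(r) for r in ranks if int(r) >= 0}  # -1 means "no result"
    found: Dict[int, str] = {}
    if not wanted:
        return found
    last = max(wanted)
    with open(path, "r", encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):  # line_no is 0-based, like the ranks
            if line_no in wanted:
                found[line_no] = line.rstrip("\n")
            if line_no >= last:
                break  # nothing further is needed from this file
    return found


def ranks_to_records(
    id_path: str, ranks_row: Iterable[int], scores_row: Iterable[float]
) -> List[Tuple[int, float, Optional[str]]]:
    """Pair each retrieved rank with its score and its line from the id file."""
    ranks_list = [int(r) for r in ranks_row]
    lines = lookup_lines(id_path, ranks_list)
    return [
        (r, float(s), lines.get(r))
        for r, s in zip(ranks_list, scores_row)
        if r >= 0
    ]
```

For a single query, pass `ranks[0]` and `scores[0]` from the search result. Streaming through the file avoids loading every id into memory, which matters when the file has a very large number of lines.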
@LTEnjoy I noticed your paper has been accepted in Nature Biotech, congrats!
Would like to know when the download links for the different databases will be provided. Thanks.
@rakeshr10 Thx Rakesh!
We are preparing the final version of our manuscript, along with download links for all databases. I will let you know once everything is ready :)
@LTEnjoy is there a script to download all the embedding files of different databases at once.
Hi, due to the large size of the embeddings we cannot compress them into a single .zip file. You could write a shell script to download every file in a loop :)
@LTEnjoy It is very hard for me to figure out the download links for all the database files and write a script. Since you posted the files and know their URLs, it would be great if you could provide a shell or Python script in the repo to download all the files at once.
I also don't know the total size of the files. If it is large, it would be good to host them on a fast mirror so that all the files can be downloaded within a day.
Hi,
We now have a download button on our storage website. Click it to get the URLs of all files in a .txt file, then run a loop over that file to download everything at once!
The total size of all embeddings is large (on the order of terabytes), so you may want to download several files at the same time to speed up the process.
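Such a loop could look like the sketch below, using only the Python standard library. It assumes the exported .txt file contains one URL per line and that the last path component of each URL is a usable, unique file name. Neither is confirmed in this thread, and all function names are hypothetical. Files that already exist are skipped, so the script can be re-run after an interruption.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import unquote, urlparse


def read_urls(list_path):
    """Read one URL per line, ignoring blank lines and '#' comments."""
    urls = []
    with open(list_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls


def local_name(url):
    """Derive a local file name from the last path component of the URL."""
    return os.path.basename(unquote(urlparse(url).path))


def download_one(url, out_dir):
    """Download one file, skipping it if it is already present."""
    target = os.path.join(out_dir, local_name(url))
    if os.path.exists(target):
        return target
    partial = target + ".part"  # write to a temp name so partial files are not kept
    urllib.request.urlretrieve(url, partial)
    os.replace(partial, target)
    return target


def download_all(list_path, out_dir, workers=4):
    """Download every URL in the list, several files at a time."""
    os.makedirs(out_dir, exist_ok=True)
    urls = read_urls(list_path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: download_one(u, out_dir), urls))
```

Calling `download_all("urls.txt", "embeddings", workers=8)` would fetch eight files concurrently; both names are placeholders. With terabytes of data, check the free disk space before starting.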