
Search script and OMG database

Open rakeshr10 opened this issue 1 year ago • 10 comments

Hi,

Is there a Python script for search and retrieval tasks that could be used without the server? How can I do residue-level and sequence-level retrieval as shown in the paper?

Also, I don't see the OMG database-related weights or faiss index files. I would be glad if you could comment on this.

Regards Rakesh

rakeshr10 avatar Jan 09 '25 18:01 rakeshr10

Hi Rakesh,

Is there a Python script for search and retrieval tasks that could be used without the server? How can I do residue-level and sequence-level retrieval as shown in the paper?

We don't directly provide a Python script for retrieval tasks. However, you can easily implement it yourself by following these steps:

  1. Load the ProTrek model and calculate your query embedding. See here.
  2. Directly load the faiss index. If you have already downloaded the faiss index files from here, you can dive into the folder and see the .index files and their corresponding ids.tsv files. An example of retrieval would be:
import faiss

# Load model and compute your query embedding (a 2-D float32 array)
...

# Load faiss index
path  = "/your/path/to/***.index"
index = faiss.read_index(path)
index.metric_type = faiss.METRIC_INNER_PRODUCT

# Start retrieval
topk = 10
scores, ranks = index.search(query_embedding, topk)

# Once you get ranks, you could load "ids.tsv" to get the information of retrieved candidates
...

For more details about the usage of the faiss library, please refer to the official GitHub repo here. If you want to perform residue-level or sequence-level retrieval, you only need to load the corresponding .index file and follow the steps above.
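One detail the elided steps hide: index.search expects the query as a 2-D float32 array of shape (n_queries, d). A minimal numpy-only sketch of preparing such a query (the dimension 1024 is a placeholder, not ProTrek's actual embedding size):

```python
import numpy as np

# Placeholder for the embedding returned by the model (made-up dimension)
raw_embedding = np.random.rand(1024)

# faiss expects float32 and a contiguous 2-D array of shape (n_queries, d)
query_embedding = np.ascontiguousarray(
    raw_embedding.reshape(1, -1), dtype=np.float32
)
print(query_embedding.shape)  # (1, 1024)
print(query_embedding.dtype)  # float32
```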

I don't see the OMG database-related weights or faiss index files. I would be glad if you could comment on this.

Yes, currently we don't upload the other database index files due to their large size (hundreds of GB). You can use our online server to access these databases. We are working hard on making it more handy and efficient :)

Best regards, Jin

LTEnjoy avatar Jan 10 '25 06:01 LTEnjoy

We will release all 3 billion protein embeddings (TB-scale) once the paper is accepted. Not now, very sorry. But you can use our shared model weights to generate embeddings for the proteins you are interested in.

You can also use the API if you do not want to use the web interface: http://search-protrek.com/?view=api

fajieyuan avatar Mar 21 '25 14:03 fajieyuan

Hi, I had a question: how do I map the ranks to the ids.tsv file, as you mentioned previously, to get the sequences?

rakeshr10 avatar Apr 09 '25 22:04 rakeshr10

Hi,

The sequence file and the id file are aligned, i.e. the id of the 5th sequence is on the 5th line of the id file. So you can first record all sequences and their ranks from the sequence file, and then retrieve their ids by rank.
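A minimal sketch of that rank-to-id lookup (the ids and ranks below are invented for illustration; in practice you would read the real ids.tsv):

```python
# Toy stand-in for ids.tsv: one id per line, aligned with the sequence file
ids_tsv = "P00001\nP00002\nP00003\nP00004\nP00005\n"
id_list = ids_tsv.splitlines()

# ranks as returned by index.search: row 0 holds the top-k indices for query 0
ranks = [[4, 1, 2]]

top_ids = [id_list[r] for r in ranks[0]]
print(top_ids)  # ['P00005', 'P00002', 'P00003']
```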

LTEnjoy avatar Apr 10 '25 01:04 LTEnjoy

@LTEnjoy I noticed your paper has been accepted in Nature Biotech, congrats!

Would like to know when the download links for the different databases will be provided. Thanks.

rakeshr10 avatar Aug 18 '25 18:08 rakeshr10

@rakeshr10 Thx Rakesh!

We are preparing for the final version of our manuscript, along with all database links for researchers to download. I will keep you informed once we make everything ready :)

LTEnjoy avatar Aug 19 '25 03:08 LTEnjoy

@LTEnjoy is there a script to download all the embedding files of different databases at once.

rakeshr10 avatar Oct 03 '25 05:10 rakeshr10

@LTEnjoy is there a script to download all the embedding files of different databases at once.

Hi, due to the large size of the embeddings we cannot compress them into a single .zip file. You could write a shell script to download every file in a loop :)
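A hedged Python sketch of such a download loop (urls.txt and the embeddings directory are assumed names, not files we provide; each line of urls.txt is expected to hold one download URL):

```python
import os
import urllib.request

def download_all(url_file, out_dir):
    """Download every URL listed (one per line) in url_file into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        filename = os.path.join(out_dir, url.rsplit("/", 1)[-1])
        if os.path.exists(filename):
            continue  # resume-friendly: skip files already downloaded
        urllib.request.urlretrieve(url, filename)

# Usage (hypothetical file list): download_all("urls.txt", "embeddings")
```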

LTEnjoy avatar Oct 09 '25 03:10 LTEnjoy

@LTEnjoy It is very hard for me to figure out the download links for all the database files and write a script. Since you posted the files and know their URLs, it would be great if you could provide a shell or Python script in the repo to download all the files at once.

I also don't know the total size of the files; if it is large, it would be good to host them on a fast mirror so that all the files can be downloaded within a day.

rakeshr10 avatar Oct 09 '25 04:10 rakeshr10

Hi,

Now we have a download button on our storage website. You can click it to download all file URLs as a .txt file and then run a loop to download all the files at once!

[Screenshot: the download button on the storage website]

The total size of all embeddings is large (on the order of TBs), so you may want to download multiple files at the same time to speed up the process.
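One way to download multiple files at the same time is a thread pool; a sketch under the assumption that urls.txt holds one URL per line (the filenames here are illustrative, not ones we provide):

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url, out_dir):
    """Download one URL into out_dir, skipping files that already exist."""
    filename = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename

def fetch_all(url_file, out_dir, workers=4):
    """Download every URL in url_file concurrently with a thread pool."""
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch(u, out_dir), urls))

# Usage (hypothetical file list): fetch_all("urls.txt", "embeddings", workers=8)
```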

LTEnjoy avatar Oct 10 '25 08:10 LTEnjoy