pyserini
the 'prebuilt_index' is too big, out of memory
This project is really awesome!!! I want to use DKRR to retrieve Wikipedia articles and use the results in another experiment. I downloaded faiss-flat.wikipedia.dkrr-dpr-tqa-retriever.20220217.25ed1f.cc91b2.tar.gz and got the index file, which is almost 60 GB. I tried to use the `split` command to cut the file into smaller pieces, but I can't get results because the pieces are incomplete:
RuntimeError: Error in faiss::Index* faiss::read_index(faiss::IOReader*, int) at /project/faiss/faiss/impl/index_read.cpp:793: Index type 0x0c2000be ("\xbe\x00 \x0c") not recognized
This is the error I get. Please tell me what to do.
Hi @xixihawokao,
The prebuilt index is not a line-by-line file, so we can't simply use `split`
to cut it.
We usually need a machine with enough RAM (e.g. 128 GB) to handle an index of this size (60 GB).
If you don't have a machine with enough RAM, we can help you split the index on our machine and send you sub-indexes.
Could you let me know the RAM of your machine so that I can decide how many splits to create?
My machine's RAM is 32 GB. I would appreciate it if you could split the index.
To make this easier for more people, I recommend splitting the index into 8 GB chunks (a typical personal computer's RAM) and providing a script that can search over the split indexes.
Hi @xixihawokao,
Here are the sub-indexes: wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/tmp/pyserinifaiss-flat.wikipedia.dkrr-dpr-tqa-retriever.20220217.25ed1f.cc91b2.sub_indexes.tar
Thanks for your suggestion, but it is hard for us to maintain index splits at the current stage, for a few reasons:
- Many IR tasks (at least in our team) run on servers rather than personal computers, where we expect there to be enough RAM.
- With a sharded index, we have to search each shard separately and then merge the results, which is complicated to do by default.
- Maintaining a single index file is easier.
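The shard-then-merge flow in the second point can be sketched without faiss at all. Below is a minimal numpy illustration (all names and data are hypothetical): each shard holds a slice of the corpus embeddings, we run exact inner-product search per shard (what a flat IP index does), and then merge the per-shard top-k lists into a global top-k.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 5

# Hypothetical data: three embedding shards that together form one corpus.
shards = [rng.standard_normal((100, d)).astype("float32") for _ in range(3)]
query = rng.standard_normal(d).astype("float32")

def search_shard(emb, q, k, id_offset):
    """Exact inner-product search over one shard (what IndexFlatIP does)."""
    scores = emb @ q
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), id_offset + int(i)) for i in top]

# Search each shard independently, keeping global document ids via an offset,
# then merge the per-shard candidate lists into one global top-k.
hits, offset = [], 0
for emb in shards:
    hits.extend(search_shard(emb, query, k, offset))
    offset += emb.shape[0]
merged = sorted(hits, key=lambda t: -t[0])[:k]
```

Because each shard's exact top-k is a superset of that shard's contribution to the global top-k, this merge gives the same result as searching one monolithic index, at the cost of the extra bookkeeping the comment above complains about.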
We currently store the passage encodings as a faiss flat IP index. It requires a lot of memory to read because faiss has to load the entire index into memory. A potential solution for limited-memory situations is to dump the passage embeddings into a pickle file incrementally and read them back record by record (https://stackoverflow.com/questions/37954324/how-to-load-one-line-at-a-time-from-a-pickle-file). That way, we can load embeddings in batches sized to the available memory.
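The incremental pickle idea above relies on a standard pattern: successive `pickle.dump` calls on the same file append independent records, and repeated `pickle.load` calls read them back one at a time until EOF. A small stdlib-only sketch (the batch contents are made up for illustration):

```python
import io
import pickle

# Write phase: dump embedding batches one record at a time instead of
# pickling the whole collection as a single object.
buf = io.BytesIO()  # stands in for a real file opened with "wb"
for batch in ([[0.1, 0.2]], [[0.3, 0.4]], [[0.5, 0.6]]):
    pickle.dump(batch, buf)

# Read phase: load one record per pickle.load call, so peak memory is
# bounded by one batch, not the full embedding set.
buf.seek(0)
batches = []
while True:
    try:
        batches.append(pickle.load(buf))
    except EOFError:
        break
```

With a real file, the write phase would open it with `open(path, "wb")` and the read phase with `open(path, "rb")`; the loop structure is the same.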
We look forward to your PR if you are interested in implementing it.
Closing...