pyserini
the 'prebuilt_index' is too big, out of memory
This project is really awesome!!! I want to use DKRR to retrieve Wikipedia articles and use the results in another experiment. I downloaded faiss-flat.wikipedia.dkrr-dpr-tqa-retriever.20220217.25ed1f.cc91b2.tar.gz and got the index file, which is almost 60 GB. I tried to use the `split` command to cut the file into smaller pieces, but I can't get results because the pieces are incomplete:
RuntimeError: Error in faiss::Index* faiss::read_index(faiss::IOReader*, int) at /project/faiss/faiss/impl/index_read.cpp:793: Index type 0x0c2000be ("\xbe\x00 \x0c") not recognized
This is the error I get. Please tell me what to do.
Hi @xixihawokao,
The prebuilt index is not a line-by-line file, so we can't simply use `split`
to cut it.
We usually need a machine with enough RAM (e.g. 128 GB) to handle an index of this size (60 GB).
If you don't have a machine with enough RAM, we can help you split the index on our machine and send you sub-indexes.
Could you let me know the RAM of your machine so that I can decide how many splits to create?
My machine's RAM is 32 GB. I would appreciate it if you could split the index.
To make this easier for more people, I recommend splitting the index into 8 GB chunks (a typical personal computer's RAM) and providing a script that can search over the split indexes.
Hi @xixihawokao,
Here are the sub-indexes: wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/tmp/pyserinifaiss-flat.wikipedia.dkrr-dpr-tqa-retriever.20220217.25ed1f.cc91b2.sub_indexes.tar
Thanks for your suggestion, but it is hard for us to maintain index splits at the current stage, for a few reasons:
- Many IR tasks (at least in our team) run on servers rather than personal computers, where we expect there to be enough RAM.
- With a sharded index, we have to search each shard separately and then merge the results, which is complicated to do by default.
- Maintaining a single index file is easier.
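The shard-then-merge flow in the second point can be sketched without faiss at all. Below is a minimal numpy illustration (all names and data are hypothetical): each shard holds a slice of the corpus embeddings, we run exact inner-product search per shard (what a flat IP index does), and then merge the per-shard top-k lists into a global top-k.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 5

# Hypothetical data: three embedding shards that together form one corpus.
shards = [rng.standard_normal((100, d)).astype("float32") for _ in range(3)]
query = rng.standard_normal(d).astype("float32")

def search_shard(emb, q, k, id_offset):
    """Exact inner-product search over one shard (what IndexFlatIP does)."""
    scores = emb @ q
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), id_offset + int(i)) for i in top]

# Search each shard independently, keeping global document ids via an offset,
# then merge the per-shard candidate lists into one global top-k.
hits, offset = [], 0
for emb in shards:
    hits.extend(search_shard(emb, query, k, offset))
    offset += emb.shape[0]
merged = sorted(hits, key=lambda t: -t[0])[:k]
```

Because each shard's exact top-k is a superset of that shard's contribution to the global top-k, this merge gives the same result as searching one monolithic index, at the cost of the extra bookkeeping the comment above complains about.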
We currently store the passage encodings as a faiss flat IP index. It requires a lot of memory to read because faiss has to load the entire index into memory. A potential solution for limited-memory situations is to dump the passage embeddings into a pickle file incrementally and read them back record by record (https://stackoverflow.com/questions/37954324/how-to-load-one-line-at-a-time-from-a-pickle-file). That way, we can load embeddings in batches sized to the available memory.
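The incremental pickle idea above relies on a standard pattern: successive `pickle.dump` calls on the same file append independent records, and repeated `pickle.load` calls read them back one at a time until EOF. A small stdlib-only sketch (the batch contents are made up for illustration):

```python
import io
import pickle

# Write phase: dump embedding batches one record at a time instead of
# pickling the whole collection as a single object.
buf = io.BytesIO()  # stands in for a real file opened with "wb"
for batch in ([[0.1, 0.2]], [[0.3, 0.4]], [[0.5, 0.6]]):
    pickle.dump(batch, buf)

# Read phase: load one record per pickle.load call, so peak memory is
# bounded by one batch, not the full embedding set.
buf.seek(0)
batches = []
while True:
    try:
        batches.append(pickle.load(buf))
    except EOFError:
        break
```

With a real file, the write phase would open it with `open(path, "wb")` and the read phase with `open(path, "rb")`; the loop structure is the same.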
We look forward to your PR if you are interested in implementing it.
Closing...