
Can I feed 500K documents into rank_bm25?

Open ramsey-coding opened this issue 2 years ago • 9 comments

Thanks for this awesome library.

I am curious to know whether rank_bm25 can handle 500K documents. Each document has around 1000 words.

Looking forward to your feedback. I want to use the following functionality with rank_bm25:

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)


query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)
result = bm25.get_top_n(tokenized_query, corpus, n=1)

print(result)

ramsey-coding avatar Aug 25 '22 03:08 ramsey-coding

@Witiko can you please provide any insight?

ramsey-coding avatar Aug 26 '22 09:08 ramsey-coding

@ramsey-coding I don't see a reason why it shouldn't. Have you tried?

Witiko avatar Aug 26 '22 10:08 Witiko

@Witiko the problem is that each call to bm25.get_top_n is very slow :-(

It is taking ~5 seconds per call.

ramsey-coding avatar Aug 27 '22 07:08 ramsey-coding

@dorianbrown the library is slow at retrieving from ~350K samples. Can you please advise on what to do here?

ramsey-coding avatar Aug 27 '22 07:08 ramsey-coding
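
One possible workaround for the slow calls reported above, offered as a rough sketch and not as anything rank_bm25 provides: if the per-call cost comes from scoring every document for each query token (an assumption about where the time goes, worth confirming with a profiler), an inverted index lets each query touch only the documents that contain its terms. The class name InvertedBM25, its top_n method, and the k1/b defaults are hypothetical choices for illustration. The IDF used here is the always-non-negative variant, so scores will not match BM25Okapi exactly.

```python
import heapq
import math
from collections import Counter, defaultdict


class InvertedBM25:
    """BM25 scoring over an inverted index (illustrative sketch, not rank_bm25's API)."""

    def __init__(self, tokenized_corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.n_docs = len(tokenized_corpus)
        self.doc_len = [len(doc) for doc in tokenized_corpus]
        self.avgdl = sum(self.doc_len) / max(self.n_docs, 1)
        # term -> list of (doc_id, term frequency in that document)
        self.postings = defaultdict(list)
        for doc_id, doc in enumerate(tokenized_corpus):
            for term, tf in Counter(doc).items():
                self.postings[term].append((doc_id, tf))

    def _idf(self, term):
        # Document frequency is just the length of the posting list.
        df = len(self.postings.get(term, ()))
        # Non-negative IDF variant; differs from BM25Okapi's epsilon-floored IDF.
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))

    def top_n(self, tokenized_query, n=5):
        """Return up to n (doc_id, score) pairs, best first."""
        scores = defaultdict(float)
        for term in tokenized_query:
            plist = self.postings.get(term)
            if not plist:
                continue  # term not in the corpus, nothing to score
            idf = self._idf(term)
            for doc_id, tf in plist:
                # Length normalization relative to the average document length.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        # Partial selection of the best n instead of sorting all scores.
        return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
index = InvertedBM25([doc.split(" ") for doc in corpus])
hits = index.top_n("windy London".split(" "), n=1)
print([corpus[doc_id] for doc_id, _ in hits])
```

The index is built once up front; after that, a query costs roughly the total length of the posting lists for its terms, so rare terms are cheap and very common terms (stop words) dominate the time. Filtering stop words before indexing would likely help on a 500K-document corpus, though that has not been measured here.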

Hi @ramsey-coding,

I have just released a new Python-based search engine called retriv. It only takes ~40ms to query 8M documents on my machine. If you try it, please let me know whether it works for your use case.

AmenRa avatar Nov 17 '22 16:11 AmenRa

@AmenRa I am also interested in this feature. I will try out retriv.

nashid avatar Nov 17 '22 21:11 nashid

Better to use Elasticsearch. The Python version can be slow enough to drive you crazy.

nocoolsandwich avatar Apr 19 '23 02:04 nocoolsandwich

@nocoolsandwich

You should try my library retriv. It takes 10 ms to search 10 million documents with BM25.

AmenRa avatar Apr 19 '23 09:04 AmenRa