rank_bm25
Can I feed 500K documents into rank_bm25?
Thanks for this awesome library.
I am curious to know whether rank_bm25 can handle 500K documents. Each document has around 1000 words.
Looking forward to your feedback. I want to use the following functionality with rank_bm25:
from rank_bm25 import BM25Okapi
corpus = [
"Hello there good man!",
"It is quite windy in London",
"How is the weather today?"
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
query = "windy London"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)
result = bm25.get_top_n(tokenized_query, corpus, n=1)
print(result)
@Witiko can you please provide any insight?
@ramsey-coding I don't see a reason why it shouldn't. Have you tried?
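One way to try it is to time indexing and a single scoring pass on a synthetic corpus of the target size. The sketch below is only a scaling check: the document count, vocabulary, and random filler text are assumptions, and 500K documents of ~1000 tokens each need several gigabytes of RAM, so start with a smaller N_DOCS and extrapolate.

import random
import time

from rank_bm25 import BM25Okapi

# Assumed benchmark parameters; shrink N_DOCS for a first run.
N_DOCS = 500_000
DOC_LEN = 1000
vocab = [f"word{i}" for i in range(50_000)]

# Random filler tokens stand in for real documents.
tokenized_corpus = [
    [random.choice(vocab) for _ in range(DOC_LEN)] for _ in range(N_DOCS)
]

start = time.time()
bm25 = BM25Okapi(tokenized_corpus)
print(f"indexing took {time.time() - start:.1f}s")

tokenized_query = "word1 word42 word7".split(" ")
start = time.time()
scores = bm25.get_scores(tokenized_query)  # scores every document in the corpus
print(f"one query took {time.time() - start:.2f}s")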
@Witiko the problem is that the call to bm25.get_top_n is very slow :-( It takes ~5 seconds per call.
@dorianbrown the library is slow at retrieving from ~350K samples. Can you please advise what to do here?
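For context, get_top_n scores every document in the corpus and then sorts all the scores, so the per-query cost grows linearly with corpus size and the scoring pass itself dominates. If you only need the n best results, one small workaround (a sketch, not a feature of rank_bm25 itself; the helper name is made up here) is to call get_scores directly and select the top n with numpy's argpartition, which avoids the full sort:

import numpy as np

def top_n_with_argpartition(bm25, tokenized_query, documents, n=10):
    # Scoring every document is still the expensive part; this only
    # skips the full sort over all scores that get_top_n would do.
    scores = bm25.get_scores(tokenized_query)
    top_idx = np.argpartition(scores, -n)[-n:]
    top_idx = top_idx[np.argsort(scores[top_idx])[::-1]]  # order the n best
    return [(documents[i], float(scores[i])) for i in top_idx]

With hundreds of thousands of documents this mainly saves the sort; if many queries arrive at once, spreading them across processes is usually the bigger win.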
Hi @ramsey-coding,
I have just released a new Python-based search engine called retriv.
It only takes ~40ms to query 8M documents on my machine.
If you try it, please let me know whether it works for your use case.
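For anyone who wants to compare, here is a minimal sketch of building and querying a retriv index, assuming the SparseRetriever interface described in the project's README; the class name, collection format, and parameters are assumptions here and may differ between versions, so check the README for the version you install.

from retriv import SparseRetriever

# Assumed collection format: one dict per document with "id" and "text" keys.
collection = [
    {"id": "doc_1", "text": "Hello there good man!"},
    {"id": "doc_2", "text": "It is quite windy in London"},
    {"id": "doc_3", "text": "How is the weather today?"},
]

sr = SparseRetriever(index_name="demo-index", model="bm25")
sr = sr.index(collection)

results = sr.search(query="windy London", cutoff=1)
print(results)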
@AmenRa I am also interested in this feature and will try out retriv.
Better to use Elasticsearch. The Python version can be slow enough to drive you crazy.
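If you go that route, the flow with the official Python client looks roughly like the sketch below, assuming elasticsearch-py 8.x and a node running locally at http://localhost:9200; the "docs" index name is arbitrary. Elasticsearch uses BM25 scoring for match queries by default.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]

# Index each document; for 500K documents use elasticsearch.helpers.bulk
# instead of one call per document.
for i, text in enumerate(corpus):
    es.index(index="docs", id=str(i), document={"text": text})
es.indices.refresh(index="docs")  # make the documents searchable

resp = es.search(index="docs", query={"match": {"text": "windy London"}}, size=1)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])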
@nocoolsandwich
You should try my library retriv.
It takes 10 ms to search 10 million documents with BM25.