                        BM25 Retriever: get_relevant_documents() returns wrong results
Checked other resources
- [X] I added a very descriptive title to this issue.
- [X] I searched the LangChain documentation with the integrated search.
- [X] I used the GitHub search to find a similar question and didn't find it.
Example Code
```python
results = bm25_retriever.get_relevant_documents(search_key)
```
Description
I am using the BM25 retriever and noticed the following two unexpected behaviors:
- When I add only two documents, get_relevant_documents() does not work as expected: it always returns the second document (i.e. the last page), independent of the actual search key.
- get_relevant_documents() defaults to returning the last page when it finds no suitable match (e.g. when tested with arbitrary character combinations like 'jgosntih' or with an empty string as the search key). As a result, sensible results cannot be distinguished from non-sensible ones.
Both observations can probably be traced to the same default behavior of get_relevant_documents(), which leads to wrong and misleading results.
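For reference, here is a minimal sketch that reproduces the behavior for me; the two texts are placeholders, not my actual documents, and rank_bm25 must be installed:

```python
from langchain.retrievers import BM25Retriever

# Two placeholder documents standing in for the real pages.
texts = [
    "The first page talks about apples and orchards.",
    "The second page talks about sailing and harbors.",
]
bm25_retriever = BM25Retriever.from_texts(texts)
bm25_retriever.k = 1  # only want the single best match

# A query with no overlap with either text still returns a result,
# and it is always the second (i.e. the last) document.
results = bm25_retriever.get_relevant_documents("jgosntih")
print(results)
```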
System Info
python = "3.10"
langchain = "0.0.272"
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [ ] Document Loaders
- [X] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
🤖
Hey there @codingbutstillalive! Good to see you back in action on the LangChain battlefield. Hope you're ready for another exciting round of bug-bashing! 🎯
Thank you for bringing this to our attention. It seems like you've encountered an unexpected behavior with the get_relevant_documents() function in the BM25 retriever.
Based on the context provided, get_relevant_documents() works by preprocessing the input query and then calling the BM25Okapi vectorizer's get_top_n method, which uses the BM25 algorithm to rank the documents by their relevance to the processed query and returns the top k. The result is a list of Document objects representing the most relevant documents.
From your description, it seems like the function always returns the last document when it can't find a suitable match. This could be due to how the documents are ordered when no query term matches and every score ties at zero.
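To illustrate (this is a sketch against rank_bm25's public API, not the LangChain code itself), you can observe what happens on a no-match query:

```python
from rank_bm25 import BM25Okapi

corpus = ["first page about apples", "second page about sailing"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "jgosntih".split()       # overlaps with neither document
print(bm25.get_scores(query))    # every score is 0.0 -- a full tie

# get_top_n ranks by descending score; in practice, reversing the
# sorted order of an all-zero tie puts the last document first,
# matching the behavior you describe.
print(bm25.get_top_n(query, corpus, n=1))
```

Checking get_scores() for an all-zero result is one possible way to tell a no-match query apart from a genuine hit, though that is a workaround on my part rather than a documented guarantee.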
However, without seeing the full code and the documents you're using, it's hard to say for sure what's causing this issue. Could you please provide more details about the documents you're using and the full code where you're calling the get_relevant_documents() function? This will help us better understand the issue and provide a more accurate solution.
Also, please make sure that your documents are preprocessed correctly when the retriever is built. The preprocessing function should split the text into a list of words, as this is the input the BM25 algorithm expects.
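For reference, the default preprocessing in libs/langchain/langchain/retrievers/bm25.py is plain whitespace splitting, roughly:

```python
from typing import List

def default_preprocessing_func(text: str) -> List[str]:
    # BM25 scores lists of tokens, not raw strings; both the indexed
    # documents and the incoming query pass through this same split.
    return text.split()
```

If you need anything beyond whitespace splitting (lowercasing, stemming, and so on), you can supply your own function via the retriever's preprocess_func parameter, assuming your installed version exposes it.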
If you're still encountering the issue after checking the preprocessing, it might be a bug in the LangChain codebase. In that case, I recommend creating an issue in the LangChain repository detailing the problem and the steps to reproduce it. The LangChain team will then be able to investigate and fix the issue.
I hope this helps! If you have any other questions, feel free to ask.
Best regards, Dosu
Sources
- libs/langchain/langchain/retrievers/elastic_search_bm25.py
- libs/langchain/langchain/retrievers/bm25.py
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.