[Feature] Augment SelfQueryRetriever to support Vespa
As of 0.0.155, the `SelfQueryRetriever` class supports Pinecone only.

I wanted to extend it myself to support Vespa too but, after reviewing the current implementation, I discovered that `SelfQueryRetriever` "wraps around a vector store" instead of a retriever.

Currently, there is no `VectorStore` implementation for Vespa. Hence, I believe it is not possible to augment `SelfQueryRetriever` to support Vespa.

After thinking about it for a couple of days, I believe a valid solution to this problem would require rethinking the implementation of `SelfQueryRetriever` by making it wrap a retriever instead of a vector store. This sounds reasonable because each vector store can act as a retriever thanks to the `as_retriever` method.
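For reference, this is roughly how a retriever view is obtained from a vector store today; the `search_type` and `search_kwargs` values below are just illustrative, and `vectorstore` is assumed to be an existing `VectorStore` instance:

```python
# `vectorstore` is assumed to be any existing VectorStore instance.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})
docs = retriever.get_relevant_documents("What's a movie about toys")
```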
The `search` method added as part of #3607 is a wrapper around either the `similarity_search` or the `max_marginal_relevance_search` method, which are also wrapped by the `get_relevant_documents` method of a retriever. Hence, refactoring `SelfQueryRetriever` to wrap a retriever instead of a vector store seems reasonable to me.
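For context, the dispatch inside that `search` method looks roughly like this (paraphrased, not copied verbatim from the library):

```python
# Paraphrased sketch of VectorStore.search as introduced in #3607
# (Any/List come from typing, Document from langchain.schema).
def search(self, query: str, search_type: str, **kwargs: Any) -> List[Document]:
    if search_type == "similarity":
        return self.similarity_search(query, **kwargs)
    elif search_type == "mmr":
        return self.max_marginal_relevance_search(query, **kwargs)
    else:
        raise ValueError(f"search_type of {search_type} not allowed.")
```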
Obviously, mine is a fairly narrow point of view, given that I mainly looked at the code related to the implementation of the `SelfQueryRetriever` class, and I might be missing key implications of such a proposed change.

I would be happy to have a go at the `SelfQueryRetriever` refactoring, but first it would be great if someone from the core developer team could comment on this.

Tagging @dev2049 here because you were the author of #3607.
How were you imagining you'd use the self-querying retriever with another retriever? Would it be rephrasing the original query string?
I ask because the main method exposed by the Retriever abstraction is

```python
def get_relevant_documents(self, query: str) -> List[Document]:
```

This only takes a single string query as input. The motivation for the SelfQueryRetriever is to be able to create complex, structured queries against a data store. Since Retrievers don't expose such a method, it's not particularly interesting to wrap the self-querying functionality around a Retriever. VectorStores, on the other hand, support arbitrary kwargs:

```python
def search(self, query: str, search_type: str, **kwargs: Any) -> List[Document]:
```
Many vector stores for example support metadata filtering, and the self-querying retriever can auto-generate those filtering conditions.
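For illustration, current usage looks roughly like this (adapted from the self-query retriever docs; `llm` and the Pinecone `vectorstore` are assumed to already exist):

```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

# `llm` and `vectorstore` (a Pinecone vector store) are assumed to exist already.
metadata_field_info = [
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="genre", description="The genre of the movie", type="string"),
]
document_content_description = "Brief summary of a movie"

retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
# The metadata filter (e.g. year > 1990) is auto-generated from the natural-language query.
docs = retriever.get_relevant_documents("What's a movie about toys after 1990")
```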
My intuition is if we just want a retriever that takes a query, passes it to an LLM to rephrase it, and then applies the rephrased query to a base retriever it's wrapped around, it's probably cleaner/easier to just introduce a new Retriever class. What do you think?
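A minimal sketch of what such a new Retriever class might look like (hypothetical; `RephrasingRetriever` and its prompt handling are not part of LangChain):

```python
from typing import List

from langchain.llms.base import BaseLLM
from langchain.schema import BaseRetriever, Document


class RephrasingRetriever(BaseRetriever):
    """Hypothetical: rephrase the query with an LLM, then delegate to a base retriever."""

    def __init__(self, llm: BaseLLM, base_retriever: BaseRetriever, prompt_template: str):
        self.llm = llm
        self.base_retriever = base_retriever
        # e.g. "Rewrite the following question as a concise search query:\n{query}"
        self.prompt_template = prompt_template

    def get_relevant_documents(self, query: str) -> List[Document]:
        rephrased = self.llm(self.prompt_template.format(query=query)).strip()
        return self.base_retriever.get_relevant_documents(rephrased)

    async def aget_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError
```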
cc @hwchase17
To answer your first question: yes, that is what I thought. My idea would be to rephrase the original query string, which is then passed to a downstream retriever.

You raise all valid points, of course.

Here are some thoughts (in no specific order):
- Modifying `def get_relevant_documents(self, query: str)` to accept an additional `**kwargs` argument would most likely solve this issue.
- (This is a very opinionated point.) Referring to vector stores, I think that the various search (retrieval) capabilities should be exposed through their "retriever interface" only. This would work as some kind of "separation of concerns" mechanism. Everything which has to do with the creation/maintenance/storing of indices happens within the `VectorStore` subclass of the specific vector store. However, retrieving documents should happen using methods exposed by the `as_retriever` interface of the vector store. All of this is to support my idea that the `SelfQueryRetriever` class should only be aware of the retriever capabilities of an object.
- Ideally, one would create a `SelfQueryRetriever` object from an LLM and a retriever, `SelfQueryRetriever.from_llm(llm, retriever, ...)`, and not from a vector store, `SelfQueryRetriever.from_llm(llm, vectorstore, ...)`. Then the `get_relevant_documents` of `SelfQueryRetriever` would just call the `get_relevant_documents` of the specific retriever with `kwargs` (filters); this would obviously require the modification described in the first bullet point (see the sketch after this list).
- If the general idea for a retriever in LangChain is to have an interface to simply perform text-based queries, then yes, one might think about introducing a `RetrieverWithFilters` class (or something like that). This class could expose a `get_relevant_documents_with_filters` method, for example. However, I still think that adding `**kwargs` to `get_relevant_documents` might be a leaner solution (maybe even without breaking anything 🤞🏻) which does not require creating a new class.
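To make the third bullet point concrete, here is a minimal, purely hypothetical sketch of the proposed flow. It assumes `get_relevant_documents` has been extended to accept `**kwargs` (first bullet point) and that the query-construction chain and translator keep their current roles:

```python
from typing import Any, List

from langchain.schema import BaseRetriever, Document


class SelfQueryRetrieverOverRetriever(BaseRetriever):
    """Hypothetical variant of SelfQueryRetriever that wraps a retriever, not a vector store."""

    def __init__(self, llm_chain: Any, translator: Any, retriever: BaseRetriever):
        self.llm_chain = llm_chain    # chain that turns the user query into a structured query
        self.translator = translator  # turns the structured query into backend-specific kwargs
        self.retriever = retriever    # e.g. vectorstore.as_retriever() or a VespaRetriever

    def get_relevant_documents(self, query: str, **kwargs: Any) -> List[Document]:
        structured_query = self.llm_chain.predict_and_parse(query=query)
        new_query, new_kwargs = self.translator.visit_structured_query(structured_query)
        # Requires the wrapped retriever to accept **kwargs (first bullet point above).
        return self.retriever.get_relevant_documents(new_query, **{**kwargs, **new_kwargs})

    async def aget_relevant_documents(self, query: str, **kwargs: Any) -> List[Document]:
        raise NotImplementedError
```

With this shape, construction could become `SelfQueryRetriever.from_llm(llm, vectorstore.as_retriever(), ...)`, or take a `VespaRetriever` directly.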
P.S. These are comments from my 2 AM tired brain. Hopefully they still make some kind of sense. Happy to rephrase if needed.
making this work with vespa should be a priority
i think what makes sense is to add a method to vespa like "retrieve with metadata filter" or something (better named). this can be a new method that takes in a query + a filter
we can add that method in its own pr, that should be pr #1
after that, we can figure out the right abstraction to allow to use that in self-query
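A rough sketch of what such a "retrieve with metadata filter" method could look like (hypothetical, not merged anywhere; the attribute names `self.body`, `self.app`, and `self.content_field` only loosely mirror the current `VespaRetriever`, and the response handling is simplified):

```python
from typing import List

from langchain.schema import Document


# Hypothetical method on VespaRetriever; names and response handling are simplified.
def get_relevant_documents_with_filter(self, query: str, yql_filter: str) -> List[Document]:
    body = dict(self.body)
    # Append the generated filter to the user-configured YQL clause.
    body["yql"] = f'{body["yql"]} and {yql_filter}'
    body["query"] = query
    response = self.app.query(body=body)
    return [
        Document(page_content=hit["fields"][self.content_field])
        for hit in response.hits
    ]
```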
I am using redis and would love it if both redis hybrid search as well as SelfQueryRetriever would be available for redis.
More importantly, though, the SelfQueryRetriever could be made independent of the vector store which is used. I mean, the LLM which returns the filter and the value could be made a separate entity which can then be used with any vector store.
@srivatsanv1 the SelfQueryRetriever is generic and should work with any vector store that you build a translation layer for. the way the retriever works is the structured query is first created using an internal query language, and then that is translated to whatever format a specific vector store accepts. Pinecone is just the first (and currently only) built-in translation layer, but we should add more. But you could technically already build a translation module yourself and pass it in.
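To make the translation-layer idea concrete for Vespa, here is a hypothetical sketch of what such a module could look like. It assumes the internal query-language (IR) classes used by the existing Pinecone translator; `VespaTranslator` itself does not exist in LangChain:

```python
from typing import Any, Dict, Tuple

from langchain.chains.query_constructor.ir import (
    Comparator,
    Comparison,
    Operation,
    Operator,
    StructuredQuery,
    Visitor,
)


class VespaTranslator(Visitor):
    """Hypothetical: translate the internal structured query into a YQL filter clause."""

    allowed_operators = [Operator.AND, Operator.OR]
    allowed_comparators = [Comparator.EQ, Comparator.GT, Comparator.GTE, Comparator.LT, Comparator.LTE]

    _yql_ops = {
        Comparator.EQ: "=",
        Comparator.GT: ">",
        Comparator.GTE: ">=",
        Comparator.LT: "<",
        Comparator.LTE: "<=",
    }

    def visit_comparison(self, comparison: Comparison) -> str:
        value = comparison.value
        if isinstance(value, str):
            value = f'"{value}"'
        return f"{comparison.attribute} {self._yql_ops[comparison.comparator]} {value}"

    def visit_operation(self, operation: Operation) -> str:
        joined = f" {operation.operator.value.upper()} ".join(
            arg.accept(self) for arg in operation.arguments
        )
        return f"({joined})"

    def visit_structured_query(self, structured_query: StructuredQuery) -> Tuple[str, Dict[str, Any]]:
        kwargs: Dict[str, Any] = {}
        if structured_query.filter is not None:
            kwargs["filter"] = structured_query.filter.accept(self)
        return structured_query.query, kwargs
```

The returned filter string (e.g. `year > 1990`) could then be appended to the `userQuery()` clause of the YQL, which is exactly the kind of modification discussed below.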
More thoughts:
Unfortunately, adding a new method to the Vespa retriever might not do the trick. Looking at the implementation of the Vespa retriever, the query structure needs to be provided when creating the `VespaRetriever` object, as shown in the docs:
```python
from langchain.retrievers.vespa_retriever import VespaRetriever

vespa_query_body = {
    "yql": "select content from paragraph where userQuery()",  # <- THIS PART
    "hits": 5,
    "ranking": "documentation",
    "locale": "en-us"
}
vespa_content_field = "content"
retriever = VespaRetriever(vespa_app, vespa_query_body, vespa_content_field)
```
When using the `SelfQueryRetriever` upstream, this would require a modification of the value of the `yql` key. For example, with the query "What's a movie about toys after 1990", the `yql` value would become:
```python
vespa_query_body = {
    "yql": "select content from paragraph where userQuery() AND year > 1990",  # <- THIS PART
    "hits": 5,
    "ranking": "documentation",
    "locale": "en-us",
    "query": "What's a movie about toys"
}
```
It should be possible to implement clever logic to handle this modification, but it can quickly become a daunting task because the definition of the `yql` entry is entirely left to the user. I think it would be wise to add constraints on which kinds of queries one can build for Vespa (if its retriever must work with the `SelfQueryRetriever`).
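One possible way to impose such a constraint (purely an assumption on my part, not an existing LangChain or Vespa feature) would be to require a placeholder in the user-provided `yql` and fill it programmatically:

```python
from typing import Any, Dict, Optional


def build_vespa_body(
    query_body_template: Dict[str, Any],
    query: str,
    yql_filter: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: fill a constrained query-body template with query and filter."""
    body = dict(query_body_template)
    if "{filter}" not in body["yql"]:
        raise ValueError('The "yql" entry must contain a "{filter}" placeholder.')
    body["yql"] = body["yql"].replace("{filter}", f" AND {yql_filter}" if yql_filter else "")
    body["query"] = query
    return body


# Example usage:
# template = {"yql": "select content from paragraph where userQuery(){filter}", "hits": 5}
# build_vespa_body(template, "What's a movie about toys", "year > 1990")
```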
I believe it is necessary to review how a `VespaRetriever` object is created.
@dev2049 I think that the recently introduced `search` method could be removed (though only if it is used by the `SelfQueryRetriever` alone).
For example, in `retrievers/self_query/base.py`, one could replace this:
```python
search_kwargs = {**self.search_kwargs, **new_kwargs}
docs = self.vectorstore.search(query, self.search_type, **search_kwargs)
```
with this:
```python
kwargs = {
    "search_kwargs": {**self.search_kwargs, **new_kwargs},
    "search_type": self.search_type,
}
retriever = self.vectorstore.as_retriever(**kwargs)
docs = retriever.get_relevant_documents(query)
```
I will have a go at a draft PR to (try to) tackle the issues discussed here.
Hey, I'm with the Vespa team.
The Vespa query API request parameter accepts queries expressed with the Vespa simple query syntax, which is the typical search syntax you expect from web search engines (well, except when Google launched G+ and they changed the way to search for an exact term):

```
What's a +movie about +toys +year:>1990
```

So there is still some flexibility with that, but I do think that we should find a way to override the request body further, as there are a lot of parameters that could be used, for example, native Vespa embedders.
Hi @jobergum 👋 thanks for joining the discussion.
Question: with your comment "we should find a way to override the request body further", are you specifically referring to the `SelfQueryRetriever`? Or is it a more general comment for the `VespaRetriever`?
I'm talking about the `VespaRetriever`; it should at least support passing the query string to the embed functionality, or multiple embedders. Currently, it will only replace the `query` request parameter.
Hi @jobergum, would you consider this a meaningful query?
```json
{
    "yql": "select content from paragraph where userQuery() or {targetHits: 100}nearestNeighbor(embedding, first_query)",
    "query": "Where is Waldo?",
    "input.query(first_query)": "embed(bertModel, Where is Waldo?)",
    "type": "weakAnd",
    "hits": 5,
    "ranking": "documentation",
    "locale": "en-us"
}
```
Or would you push for allowing a variable number of `input.query()` entries?
> Hi @jobergum, would you consider this a meaningful query?
Yes, hybrid retrieval like this is meaningful, and the query is valid. Not that many cases have multiple query embedding models. What we hope to add soon is support for referencing native request parameters in the embed function.
```json
{
    "yql": "select content from paragraph where userQuery() or {targetHits: 100}nearestNeighbor(embedding, first_query)",
    "query": "Where is Waldo?",
    "last_messages": "....",
    "input.query(first_query)": "embed(@query)",
    "input.query(conversion_embedding)": "embed(@last_messages)"
}
```
In this case, both the question and the last_messages are encoded; the latter provides additional context, which could be used for ranking or retrieval (with another nearestNeighbor query operator).
> What we hope to add soon is support for referencing native request parameters in the embed function.
That would indeed be a nice improvement!
I am moving this conversation to the discussion of PR #4546. I like what Davis has done, but I will now suggest modifications to be able to leverage the `input.query`.
Hi, @filippo82! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue is about extending the `SelfQueryRetriever` class to support Vespa. There has been a discussion about the best approach, with suggestions to modify the `get_relevant_documents` method or introduce a new `RetrieverWithFilters` class. The Vespa team has also joined the discussion and suggested adding a method to Vespa for retrieving with metadata filters.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution!