
[Feature] Augment SelfQueryRetriever to support Vespa

Open filippo82 opened this issue 1 year ago • 12 comments

As of 0.0.155, the SelfQueryRetriever class supports Pinecone only.

I wanted to extend it myself to support Vespa too but, after reviewing the current implementation, I discovered that SelfQueryRetriever "wraps around a vector store" instead of a retriever.

Currently, there is no VectorStore implementation for Vespa. Hence, I believe it is not possible to augment SelfQueryRetriever to support Vespa.

After thinking about it for a couple of days, I think a valid solution to this problem would require rethinking the implementation of SelfQueryRetriever so that it wraps a retriever instead of a vector store. This sounds reasonable because each vector store can act as a retriever thanks to the as_retriever method.

The search method added as part of #3607 is a wrapper around either the similarity_search or max_marginal_relevance_search method, which are also wrapped by the get_relevant_documents method of a retriever. Hence, refactoring SelfQueryRetriever to wrap a retriever instead of a vector store seems reasonable to me.

Obviously, mine is a fairly narrow point of view, given that I mainly looked at the code related to the implementation of the SelfQueryRetriever class, and I might be missing key implications of such a proposed change.

I would be happy to have a go at the SelfQueryRetriever refactoring, but first it would be great if someone from the core developer team could comment on this.

Tagging here @dev2049 because you were the author of #3607

filippo82 avatar May 02 '23 22:05 filippo82

How were you imagining you'd use the self-querying retriever with another retriever? Would it be rephrasing the original query string?

I ask because the main method exposed by the Retriever abstraction is

def get_relevant_documents(self, query: str) -> List[Document]:

This only takes a single string query as input. The motivation for the SelfQueryRetriever is to be able to create complex, structured queries against a data store. Since Retrievers don't expose such a method, it's not particularly interesting to wrap the self-querying functionality around a Retriever. VectorStores, on the other hand, support arbitrary kwargs:

def search(self, query: str, search_type: str, **kwargs: Any) -> List[Document]:

Many vector stores for example support metadata filtering, and the self-querying retriever can auto-generate those filtering conditions.

My intuition is if we just want a retriever that takes a query, passes it to an LLM to rephrase it, and then applies the rephrased query to a base retriever it's wrapped around, it's probably cleaner/easier to just introduce a new Retriever class. What do you think?
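For concreteness, such a new Retriever class could look roughly like this (a minimal sketch; RephrasingRetriever and the rephrase callable are hypothetical names, and a real implementation would build on LangChain's BaseRetriever and an LLM chain):

```python
from typing import Callable, List


class Document:
    """Stand-in for langchain.schema.Document."""
    def __init__(self, page_content: str):
        self.page_content = page_content


class RephrasingRetriever:
    """Rephrases the incoming query with an LLM, then delegates to a base retriever."""

    def __init__(self, rephrase: Callable[[str], str], base_retriever):
        self.rephrase = rephrase              # e.g. a thin wrapper around an LLM call
        self.base_retriever = base_retriever  # anything exposing get_relevant_documents

    def get_relevant_documents(self, query: str) -> List[Document]:
        rephrased = self.rephrase(query)
        return self.base_retriever.get_relevant_documents(rephrased)
```

The point is that only a plain string ever crosses the Retriever boundary, which is exactly why the structured-query use case doesn't fit it.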

cc @hwchase17

dev2049 avatar May 02 '23 23:05 dev2049

To answer your first question: yes, that is what I thought. My idea would be to rephrase the original query string which is then passed to a downstream retriever.

You raise all valid points of course.

Here are some thoughts (in no particular order):

  • modifying def get_relevant_documents(self, query: str) to accept an additional **kwargs argument would most likely solve this issue.
  • (this is a very opinionated point) referring to vector stores, I think that the various search (retrieval) capabilities should be exposed through their "retriever interface" only. This would work as some kind of "separation of concerns" mechanism. Everything which has to do with creation/maintenance/storing of indices happens within the VectorStore subclass of the specific vector store. However, retrieving documents should happen using methods exposed by the as_retriever interface of the vector store. All of this is to support my idea that the SelfQueryRetriever class should only be aware of the retriever capabilities of an object.
  • Ideally, one would create a SelfQueryRetriever object from an LLM and a retriever, SelfQueryRetriever.from_llm(llm, retriever, ...), and not from a vector store, SelfQueryRetriever.from_llm(llm, vectorstore, ...). Then the get_relevant_documents of SelfQueryRetriever would just call the get_relevant_documents of the specific retriever with kwargs (filters) (this would require the modification described in the first bullet point, obviously).
  • If the general idea for a retriever in LangChain is to have an interface that simply performs text-based queries, then yes, one might think about introducing a RetrieverWithFilters class (or something like that). This class could expose a get_relevant_documents_with_filters method, for example. However, I still think that adding **kwargs to get_relevant_documents might be a leaner solution (maybe even without breaking anything 🤞🏻) which does not require creating a new class.
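To illustrate the first bullet point, here is a toy sketch of a retriever whose get_relevant_documents accepts extra keyword arguments (the FilterableRetriever class and the equality-based filter kwarg are hypothetical, just to show the shape of the change):

```python
from typing import Any, Dict, List, Optional


class Document:
    """Stand-in for langchain.schema.Document."""
    def __init__(self, page_content: str, metadata: Optional[Dict[str, Any]] = None):
        self.page_content = page_content
        self.metadata = metadata or {}


class FilterableRetriever:
    """Toy retriever whose get_relevant_documents accepts extra kwargs,
    here an equality-based metadata 'filter' dict."""

    def __init__(self, docs: List[Document]):
        self.docs = docs

    def get_relevant_documents(self, query: str, **kwargs: Any) -> List[Document]:
        flt = kwargs.get("filter", {})
        return [
            doc for doc in self.docs
            if query in doc.page_content
            and all(doc.metadata.get(k) == v for k, v in flt.items())
        ]
```

With a signature like this, SelfQueryRetriever could pass its auto-generated filter straight through without knowing anything about the underlying store.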

P.S. These are comments from my 2 AM tired brain. Hopefully they still make some kind of sense. Happy to rephrase if needed.

filippo82 avatar May 03 '23 00:05 filippo82

making this work with vespa should be a priority

i think what makes sense is to add a method to vespa like "retrieve with metadata filter" or something (better named). this can be a new method that takes in a query + a filter

we can add that method in its own pr, that should be pr #1

after that, we can figure out the right abstraction to allow to use that in self-query

hwchase17 avatar May 03 '23 04:05 hwchase17

I am using redis and would love it if both redis hybrid search as well as SelfQueryRetriever would be available for redis.

More importantly, SelfQueryRetriever could be made independent of the vector store being used. I mean, the LLM which returns the filter and the value could be made a separate component that can then be used with any vector store.

srivatsanv1 avatar May 03 '23 07:05 srivatsanv1

@srivatsanv1 the SelfQueryRetriever is generic and should work with any vector store that you build a translation layer for. the way the retriever works is the structured query is first created using an internal query language, and then that is translated to whatever format a specific vector store accepts. Pinecone is just the first (and currently only) built-in translation layer, but we should add more. But you could technically already build a translation module yourself and pass it in.
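To give a flavor of the translation-layer idea (the functions and dictionaries below are illustrative only, not LangChain's actual visitor classes): the same internal comparison, say year > 1990, gets mapped to each store's native filter syntax.

```python
# Illustrative sketch: one internal comparison (attribute, operator, value)
# translated into two different store-specific filter formats.
PINECONE_OPS = {"gt": "$gt", "lt": "$lt", "eq": "$eq"}
VESPA_OPS = {"gt": ">", "lt": "<", "eq": "="}


def to_pinecone_filter(attribute, op, value):
    """Translate one internal comparison into a Pinecone-style filter dict."""
    return {attribute: {PINECONE_OPS[op]: value}}


def to_vespa_condition(attribute, op, value):
    """Translate the same comparison into a YQL fragment for Vespa."""
    return f"{attribute} {VESPA_OPS[op]} {value}"
```

The self-query logic only ever produces the internal representation; adding Vespa support then amounts to writing the second function for real.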

dev2049 avatar May 03 '23 18:05 dev2049

More thoughts:


Unfortunately, adding a new method to the Vespa retriever might not do the trick. Looking at the implementation of the Vespa retriever, the query structure needs to be provided when creating the VespaRetriever object, as shown in the docs:

from langchain.retrievers.vespa_retriever import VespaRetriever

vespa_query_body = {
    "yql": "select content from paragraph where userQuery()",  # <- THIS PART
    "hits": 5,
    "ranking": "documentation",
    "locale": "en-us"
}
vespa_content_field = "content"
retriever = VespaRetriever(vespa_app, vespa_query_body, vespa_content_field)

When using the SelfQueryRetriever upstream, this would require modifying the value of the yql key. For example, with the query What's a movie about toys after 1990, the yql value would become:

vespa_query_body = {
    "yql": "select content from paragraph where userQuery() AND year > 1990",  # <- THIS PART
    "hits": 5,
    "ranking": "documentation",
    "locale": "en-us",
    "query": "What's a movie about toys"
}

It should be possible to implement clever logic to handle this modification, but it can quickly become a daunting task because the definition of the yql entry is entirely left to the user. I think it would be wise to add constraints on the kinds of queries one can build for Vespa (if its retriever must work with the SelfQueryRetriever).
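As a naive sketch of what that splicing logic could look like (assuming the user's yql ends with its where-expression; real YQL parsing would be needed to handle arbitrary user-defined queries):

```python
def add_condition_to_yql(yql: str, condition: str) -> str:
    """Naively append a generated filter condition to a user-provided YQL string.

    Assumes the query ends with its where-expression, i.e. no trailing
    clauses such as ORDER BY or LIMIT.
    """
    if " where " not in yql:
        raise ValueError("yql has no where clause to extend")
    return f"{yql} AND {condition}"
```

Even this trivial version already fails on queries with trailing clauses, which is exactly why constraining the allowed yql shape seems necessary.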

I believe it is necessary to review how a VespaRetriever object is created.


@dev2049 I think that the recently introduced search method could be removed (this is true if it is used by the SelfQueryRetriever only though).

For example, in the retrievers/self_query/base.py, one could replace this:

        search_kwargs = {**self.search_kwargs, **new_kwargs}
        docs = self.vectorstore.search(query, self.search_type, **search_kwargs)

with this:

        kwargs = {
            "search_kwargs": {**self.search_kwargs, **new_kwargs},
            "search_type": self.search_type,
        }
        retriever = self.vectorstore.as_retriever(**kwargs)
        docs = retriever.get_relevant_documents(query)

I will have a go at a draft PR to (try to) tackle the issues discussed here.

filippo82 avatar May 03 '23 23:05 filippo82

Hey, I'm with the Vespa team.

The Vespa query API request parameter accepts queries expressed with Vespa simple query syntax, which is a typical search syntax you expect from web search engines (Well, except when Google launched G+ and they changed the way to search for an exact term).

What's a +movie about +toys +year:>1990 

So there is still some flexibility with that, but I do think that we should find a way to override the request body further, as there are a lot of parameters that could be used, for example, native Vespa embedders.

jobergum avatar May 04 '23 12:05 jobergum

Hi @jobergum 👋 thanks for joining the discussion.

Question: with your comment "we should a find a way to override the request body further", are you specifically referring to the SelfQueryRetriever? Or is it a more general comment for the VespaRetriever?

filippo82 avatar May 04 '23 16:05 filippo82

I'm talking about the VespaRetriever; it should at least support passing the query string to the embed functionality, or to multiple embedders. Currently, it will only replace the query request parameter.

jobergum avatar May 04 '23 20:05 jobergum

Hi @jobergum, would you consider this a meaningful query?

        {
            "yql": "select content from paragraph where userQuery() or {targetHits: 100}nearestNeighbor(embedding, first_query)",
            "query": "Where is Waldo?",
            "input.query(first_query)": "embed(bertModel, Where is Waldo?)",
            "type": "weakAnd",
            "hits": 5,
            "ranking": "documentation",
            "locale": "en-us"
        }

Or would you push for allowing a variable number of input.query() entries?

filippo82 avatar May 10 '23 22:05 filippo82

Hi @jobergum, would you consider this a meaningful query?

Yes, hybrid retrieval like this is meaningful, and the query is valid. Not that many cases have multiple query embedding models. What we hope to add soon is support for referencing native request parameters in the embed function.

{
    "yql": "select content from paragraph where userQuery() or {targetHits: 100}nearestNeighbor(embedding, first_query)",
    "query": "Where is Waldo?",
    "last_messages": ".... ",
    "input.query(first_query)": "embed(@query)",
    "input.query(conversion_embedding)": "embed(@last_messages)"
}

In this case, both the question and the last_messages are encoded; the latter provides additional context that could be used for ranking or for retrieval (with another nearestNeighbor query operator).

jobergum avatar May 12 '23 07:05 jobergum

What we hope to add soon is support for referencing native request parameters in the embed function.

That would indeed be a nice improvement!

I am moving this conversation to the discussion on PR #4546. I like what Davis has done, but I will now suggest modifications to be able to leverage input.query.

filippo82 avatar May 12 '23 08:05 filippo82

Hi, @filippo82! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about extending the SelfQueryRetriever class to support Vespa. There has been a discussion about the best approach, with suggestions to modify the get_relevant_documents method or introduce a new RetrieverWithFilters class. The Vespa team has also joined the discussion and suggested adding a method to Vespa for retrieving with metadata filters.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution!

dosubot[bot] avatar Sep 15 '23 16:09 dosubot[bot]