
Filter on Metadata for Vector Search using LangChain `RetrievalQA.from_chain_type`

Open chanirban opened this issue 1 year ago • 3 comments

In the notebook https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-qa/question_answering_documents_langchain_matching_engine.ipynb, the code below attaches metadata (for example, document_name) to each chunk at embedding time.

texts = [doc.page_content for doc in doc_splits]
metadatas = [
    [
        {"namespace": "source", "allow_list": [doc.metadata["source"]]},
        {"namespace": "document_name", "allow_list": [doc.metadata["document_name"]]},
        {"namespace": "chunk", "allow_list": [str(doc.metadata["chunk"])]},
    ]
    for doc in doc_splits
]
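
For anyone reproducing this outside the notebook, here is a minimal, self-contained sketch of the same restricts structure. The `Doc` dataclass, the `to_restricts` helper, and the sample paths are hypothetical stand-ins introduced only for illustration; in the notebook, `doc_splits` holds LangChain documents.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_restricts(doc, keys=("source", "document_name", "chunk")):
    """Build one restrict entry per metadata key, storing the value as a single string token."""
    return [{"namespace": key, "allow_list": [str(doc.metadata[key])]} for key in keys]


# Sample chunks; the bucket path is made up for this example.
doc_splits = [
    Doc("first chunk", {"source": "gs://example-bucket/abc.pdf", "document_name": "abc.pdf", "chunk": 0}),
    Doc("second chunk", {"source": "gs://example-bucket/abc.pdf", "document_name": "abc.pdf", "chunk": 1}),
]

texts = [doc.page_content for doc in doc_splits]
metadatas = [to_restricts(doc) for doc in doc_splits]
```

Each entry in `metadatas` is a list of namespace/allow_list pairs, one list per chunk, which is the shape the notebook code produces.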

During retrieval with the code below, however, a filter on the metadata (document_name, say) has no visible effect: the retriever passed to RetrievalQA.from_chain_type does not seem to apply it.

retriever = me.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": NUMBER_OF_RESULTS,
        "search_distance": SEARCH_DISTANCE_THRESHOLD,
        "filter": [{"document_name": "abc.pdf"}],  # <------ does not work
    },
)

What is the supported way to specify a metadata filter on Vector Search in this notebook example?

chanirban, Jan 17 '24 18:01

Assigned to @RajeshThallam Author of the Notebook

holtskinner, Jan 18 '24 10:01

@chanirban The current implementation in the repo does not support filters. Please clone the repo and replace utils/matching_engine.py with this gist. I will submit a PR shortly.

Follow the Matching Engine filter specification described in the docs. In your case, it would look something like this:

filters = {"namespace": "document_name", "allow_list": "abc.pdf"}
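
To make the allow/deny idea concrete, here is a simplified local sketch of how a namespace filter selects datapoints. It is an illustration under stated assumptions, not the service's implementation: the `matches` function is hypothetical, and it assumes `allow_list` and `deny_list` are lists of string tokens, as in the metadata built at embedding time.

```python
def matches(restricts, filters):
    """Return True if a datapoint's restricts satisfy every query filter (simplified model)."""
    # Collect the datapoint's tokens per namespace.
    tokens = {r["namespace"]: set(r.get("allow_list", [])) for r in restricts}
    for f in filters:
        have = tokens.get(f["namespace"], set())
        allow = set(f.get("allow_list", []))
        deny = set(f.get("deny_list", []))
        # An allow list requires at least one shared token in that namespace.
        if allow and not (have & allow):
            return False
        # A deny list rejects any datapoint carrying a denied token.
        if have & deny:
            return False
    return True


# Restricts for one chunk, in the shape used when the embeddings were stored.
datapoint = [
    {"namespace": "document_name", "allow_list": ["abc.pdf"]},
    {"namespace": "chunk", "allow_list": ["0"]},
]

keep = matches(datapoint, [{"namespace": "document_name", "allow_list": ["abc.pdf"]}])
drop = matches(datapoint, [{"namespace": "document_name", "allow_list": ["xyz.pdf"]}])
denied = matches(datapoint, [{"namespace": "document_name", "deny_list": ["abc.pdf"]}])
```

The sketch writes `allow_list` as a list of strings. Whether the gist also accepts a bare string is not something this example can confirm.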

RajeshThallam, Jan 19 '24 02:01

Hi @RajeshThallam, even after replacing matching_engine.py and adding the filter as shown below, I cannot restrict the search to a single document, judging by the REFERENCE section of the ask() output. An example would be useful, please.

retriever = me.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": NUMBER_OF_RESULTS,
        "search_distance": SEARCH_DISTANCE_THRESHOLD,
    },
    filters={"namespace": "document_name", "allow_list": "abc.pdf"},
)
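
One thing that may be worth checking: in the call above, `filters` sits outside `search_kwargs`. In LangChain's generic retriever, only the contents of `search_kwargs` are forwarded to the vector store's search method. The toy model below illustrates that forwarding. `ToyStore` and `ToyRetriever` are hypothetical classes, and the assumption that the replaced matching_engine.py accepts a `filters` search argument is exactly that, an assumption.

```python
from dataclasses import dataclass, field


class ToyStore:
    """Hypothetical vector store whose search method accepts an optional filter."""

    def similarity_search(self, query, k=4, filters=None):
        # Echo the arguments so the forwarding is visible.
        return {"query": query, "k": k, "filters": filters}


@dataclass
class ToyRetriever:
    """Hypothetical retriever that forwards only search_kwargs to the store."""
    store: ToyStore
    search_kwargs: dict = field(default_factory=dict)

    def invoke(self, query):
        return self.store.similarity_search(query, **self.search_kwargs)


doc_filter = [{"namespace": "document_name", "allow_list": ["abc.pdf"]}]

# The filter reaches the search call only when it is part of search_kwargs.
with_filter = ToyRetriever(ToyStore(), {"k": 3, "filters": doc_filter}).invoke("question")
without_filter = ToyRetriever(ToyStore(), {"k": 3}).invoke("question")
```

If the real class follows the same convention, moving the filter inside `search_kwargs` would be the first thing to try.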

chanirban, Jan 19 '24 13:01

@holtskinner @polong-lin Submitted a fix #403. Please review and approve the PR.

RajeshThallam, Feb 18 '24 00:02