Filter on Metadata for Vector Search using LangChain `RetrievalQA.from_chain_type`
In the notebook https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-qa/question_answering_documents_langchain_matching_engine.ipynb, the code below attaches metadata (document_name, for example) to each chunk at embedding time:
texts = [doc.page_content for doc in doc_splits]
metadatas = [
    [
        {"namespace": "source", "allow_list": [doc.metadata["source"]]},
        {"namespace": "document_name", "allow_list": [doc.metadata["document_name"]]},
        {"namespace": "chunk", "allow_list": [str(doc.metadata["chunk"])]},
    ]
    for doc in doc_splits
]
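To make the shape of that metadata concrete, here is a minimal, self-contained sketch of the same construction. The `Doc` class is a hypothetical stand-in for a LangChain document, and the sample paths and file names are made up for illustration; only the namespace/allow_list layout comes from the notebook code above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_restricts(doc):
    """Build one namespace/allow_list entry per metadata field, as in the notebook."""
    return [
        {"namespace": "source", "allow_list": [doc.metadata["source"]]},
        {"namespace": "document_name", "allow_list": [doc.metadata["document_name"]]},
        # Tokens are strings, so the numeric chunk index is converted.
        {"namespace": "chunk", "allow_list": [str(doc.metadata["chunk"])]},
    ]


# Made-up sample chunks; real ones would come from the document splitter.
doc_splits = [
    Doc("first chunk", {"source": "gs://bucket/abc.pdf", "document_name": "abc.pdf", "chunk": 0}),
    Doc("second chunk", {"source": "gs://bucket/abc.pdf", "document_name": "abc.pdf", "chunk": 1}),
]

texts = [doc.page_content for doc in doc_splits]
metadatas = [to_restricts(doc) for doc in doc_splits]

print(metadatas[0][1])
```

Each chunk ends up with a list of three entries, so a later filter on the document_name namespace has a token to match against.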
But during retrieval with the code below, RetrievalQA.from_chain_type does not appear to apply the filter, even when one is specified on a metadata field such as document_name:
retriever = me.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": NUMBER_OF_RESULTS,
        "search_distance": SEARCH_DISTANCE_THRESHOLD,
        "filter": [{"document_name": "abc.pdf"}],  # <-- does not work
    },
)
I'm not sure what the supported way is to specify a metadata filter on Vector Search in the notebook example.
Assigned to @RajeshThallam, author of the notebook.
@chanirban The current implementation in the repo does not support filters. Please clone the repo and replace utils/matching_engine.py with this gist. I will submit a PR shortly.
Follow the Matching Engine filter specification described in the docs. In your case, it would look something like this:
filters = {"namespace": "document_name", "allow_list": "abc.pdf"}
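As a rough illustration of what such a filter is meant to do, the sketch below models allow-list matching in plain Python. It is a simplified assumption about the semantics (tokens within a namespace are OR-ed, namespaces are AND-ed, deny lists are left out), not the Matching Engine implementation, and the helper names `as_tokens` and `passes` are invented for this example.

```python
def as_tokens(value):
    """Accept either a bare string or a list of strings as the allow list."""
    return [value] if isinstance(value, str) else list(value)


def passes(restricts, filters):
    """Return True if a datapoint's restricts satisfy every filter namespace.

    Simplified model: a filter namespace is satisfied when the datapoint has
    at least one token in that namespace that also appears in the filter's
    allow list. Deny lists are not modelled here.
    """
    if isinstance(filters, dict):
        filters = [filters]

    # Collect the datapoint's tokens per namespace.
    tokens_by_namespace = {}
    for restrict in restricts:
        tokens = tokens_by_namespace.setdefault(restrict["namespace"], set())
        tokens.update(as_tokens(restrict["allow_list"]))

    return all(
        bool(tokens_by_namespace.get(f["namespace"], set()) & set(as_tokens(f["allow_list"])))
        for f in filters
    )


# Index-time metadata for two hypothetical chunks.
abc_chunk = [
    {"namespace": "document_name", "allow_list": ["abc.pdf"]},
    {"namespace": "chunk", "allow_list": ["0"]},
]
xyz_chunk = [{"namespace": "document_name", "allow_list": ["xyz.pdf"]}]

filters = {"namespace": "document_name", "allow_list": "abc.pdf"}

print(passes(abc_chunk, filters))  # True
print(passes(xyz_chunk, filters))  # False
```

Under this model, only chunks indexed with document_name "abc.pdf" would be eligible for retrieval, which is the behavior the question is after.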
Hi @RajeshThallam, even after replacing matching_engine.py and adding the filter as shown below, I cannot restrict the search to a single document, judging by the REFERENCE section of the ask() output. An example would be useful, please.
retriever = me.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": NUMBER_OF_RESULTS,
        "search_distance": SEARCH_DISTANCE_THRESHOLD,
    },
    filters={"namespace": "document_name", "allow_list": "abc.pdf"},
)
@holtskinner @polong-lin Submitted a fix #403. Please review and approve the PR.