langchain
How is metadata used during similarity search and query?
I have 3 PDF files in my directory. I loaded them as documents, added metadata, split them, embedded them, and stored them in Pinecone, like this:
import pinecone
from langchain.document_loaders import DirectoryLoader, UnstructuredPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
loader = DirectoryLoader('data/dir', glob="**/*.pdf", loader_cls=UnstructuredPDFLoader)
data = loader.load()
# Company names are added explicitly for now
data[0].metadata["company"] = "Apple"
data[1].metadata["company"] = "Microsoft"
data[2].metadata["company"] = "Tesla"
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
texts = text_splitter.split_documents(data)
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)
metadatas = []
for text in texts:
    metadatas.append({
        "company": text.metadata["company"]
    })
Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name, metadatas=metadatas)
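One caveat about tagging by position (`data[0]`, `data[1]`, ...): the order in which the loader returns files may not match the order you expect. A minimal sketch of deriving the company from each document's "source" path instead, assuming the loader stores the file path under `metadata["source"]`. The file names and the `COMPANY_BY_STEM` mapping are hypothetical, and plain dicts stand in for the loaded documents' metadata:

```python
from pathlib import Path

# Hypothetical mapping from a file-name stem to the company it describes.
COMPANY_BY_STEM = {
    "apple_10k": "Apple",
    "microsoft_10k": "Microsoft",
    "tesla_10k": "Tesla",
}

def company_for(source: str) -> str:
    """Look up the company from the file name in a document's "source" path."""
    stem = Path(source).stem.lower()
    return COMPANY_BY_STEM.get(stem, "unknown")

# Stand-ins for the metadata dicts of loaded documents.
docs = [{"source": "data/dir/Tesla_10K.pdf"}, {"source": "data/dir/apple_10k.pdf"}]
for meta in docs:
    meta["company"] = company_for(meta["source"])
```

With real documents the loop body would be roughly `doc.metadata["company"] = company_for(doc.metadata["source"])`.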
I want to build a Q&A system in which I mention a company name in my query and Pinecone searches only the documents that have that company in their metadata. Here is what I have:
import pinecone
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)
index_name = "index"
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
docsearch = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)
llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
query = "What is the total revenue of Apple?"
docs = docsearch.similarity_search(query, include_metadata=True)
res = chain.run(input_documents=docs, question=query)
print(res)
However, docs still contains chunks from non-Apple documents. What am I doing wrong, and how can I use the metadata both in the docsearch step and in the ChatGPT query (if possible)? Thanks.
You can pass a metadata filter as a dictionary:
docs = docsearch.similarity_search(query, filter={"company": "Apple"})
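For intuition on what such a filter does: only the chunk text is embedded, so the word "Apple" in the query merely nudges similarity scores, while a filter restricts which records are eligible at all. Below is a toy in-memory model of that behavior. It is not how Pinecone is implemented; `filtered_search`, the two-dimensional vectors, and the record layout are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filtered_search(query_vec, records, k=2, filter=None):
    """Keep records whose metadata matches every filter key, then rank by similarity."""
    filter = filter or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == val for key, val in filter.items())
    ]
    candidates.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return candidates[:k]

records = [
    {"id": "apple-1", "vector": [0.9, 0.1], "metadata": {"company": "Apple"}},
    {"id": "tesla-1", "vector": [1.0, 0.0], "metadata": {"company": "Tesla"}},
    {"id": "apple-2", "vector": [0.2, 0.8], "metadata": {"company": "Apple"}},
]
query_vec = [1.0, 0.0]

# Without a filter, the Tesla chunk can outrank Apple chunks on similarity alone.
unfiltered = [r["id"] for r in filtered_search(query_vec, records)]
# With a filter, only Apple chunks are candidates.
apple_only = [r["id"] for r in filtered_search(query_vec, records, filter={"company": "Apple"})]
```

In this toy data the unfiltered search returns a Tesla chunk first, which is the same symptom described in the question.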
@egils-mtx For the following code,
docs = docsearch.similarity_search(query, {"url": "langchain.readthedocs.io"})
I am getting this error:
TypeError: '>' not supported between instances of 'dict' and 'int'
I have already upgraded the Chroma library.
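A likely cause, as far as I can tell: the second positional parameter of `similarity_search` is `k` (the number of results), so a dictionary passed positionally is bound to `k` rather than to `filter`, and any numeric comparison on `k` then fails. The function below is a hypothetical stand-in that only mirrors that parameter order. Where the comparison happens inside the real library is an assumption.

```python
def similarity_search(query, k=4, filter=None):
    """Hypothetical stand-in with the same parameter order as the wrapper method."""
    if k > 100:  # any numeric check on k fails when k is actually a dict
        k = 100
    return {"query": query, "k": k, "filter": filter}

try:
    # The dict is bound to k because it is the second positional argument.
    similarity_search("docs?", {"url": "langchain.readthedocs.io"})
    error = None
except TypeError as exc:
    error = str(exc)

# Passing the dict by keyword binds it to filter and leaves k at its default.
ok = similarity_search("docs?", filter={"url": "langchain.readthedocs.io"})
```

So passing the dictionary as `filter={...}` should avoid the TypeError.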
@egils-mtx The thing is, I don't know in advance which company will be used in a query. I provided the code above just as an example. Moreover, I don't even know whether a query will contain a company name. I was expecting the query --> filter_by_metadata type of behavior to happen under the hood, without my intervention. It seems there is no such functionality so far.
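One simple way to approximate the query --> filter_by_metadata step yourself is to scan the query for known company names and build the filter from the first match, falling back to no filter. This is a naive keyword sketch, not a built-in feature. `KNOWN_COMPANIES` and `filter_from_query` are hypothetical names, and the list is assumed to mirror the values stored in the metadata.

```python
import re

# Assumed to mirror the "company" values stored in the index metadata.
KNOWN_COMPANIES = ["Apple", "Microsoft", "Tesla"]

def filter_from_query(query):
    """Return a metadata filter for the first known company named in the query, else None."""
    for company in KNOWN_COMPANIES:
        # Word boundaries avoid matching "Apple" inside words like "pineapple".
        if re.search(rf"\b{re.escape(company)}\b", query, flags=re.IGNORECASE):
            return {"company": company}
    return None

with_company = filter_from_query("What is the total revenue of Apple?")
without_company = filter_from_query("What is the total revenue?")
```

The result could then be passed along as `docsearch.similarity_search(query, filter=filter_from_query(query))`. This breaks down for aliases, misspellings, or queries that name several companies.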
Hello @mzhadigerov, can you share how you solved your problem?
I am also curious! I don't see any solution to this in the thread. Could you please explain how to deal with this?
+1
Check this: https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/
I have not tried it yet, but it looks like it solves the problem you are facing.
@egils-mtx What's the best way to pass multiple values for a metadata key?
For example, take this metadata key/value pair:
{"category": "game"}
Now what if I want to add two categories? With or without a comma?
{"category": "game, adult"}
or
{"category": "game adult"}
And can I query them later using "eq" or a similar filter to find metadata that contains BOTH game AND adult?
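A possible approach, with the caveat that support varies by vector store: as far as I know, Pinecone accepts a list of strings as a metadata value, and its filter language has operators such as `$in` and `$and`. A single joined string like "game, adult" would only match as one whole string. The toy matcher below imitates those semantics as I understand them, purely to show the difference. `matches` is a made-up helper and not part of any library.

```python
def matches(metadata, flt):
    """Toy evaluator for a small subset of Pinecone-style filters ($and, $in, equality)."""
    for key, cond in flt.items():
        if key == "$and":
            # Every sub-filter must match.
            if not all(matches(metadata, sub) for sub in cond):
                return False
            continue
        value = metadata.get(key)
        # Treat a scalar as a one-element list so both shapes are handled alike.
        values = value if isinstance(value, list) else [value]
        if isinstance(cond, dict) and "$in" in cond:
            if not any(v in cond["$in"] for v in values):
                return False
        elif cond not in values:
            return False
    return True

record = {"category": ["game", "adult"]}  # list value instead of a joined string
both = {"$and": [{"category": "game"}, {"category": "adult"}]}
either = {"category": {"$in": ["game", "music"]}}
```

Here `both` requires the record to carry both categories, while `either` matches a record with at least one of the listed values.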
I used this:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
)