How is metadata used during similarity search and querying?

Open mzhadigerov opened this issue 1 year ago • 2 comments

I have 3 PDF files in my directory. I loaded them as documents, added metadata, split them, embedded them, and stored them in Pinecone, like this:

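# Import paths assume the early-2023 LangChain and pinecone-client releases used in this thread
from langchain.document_loaders import DirectoryLoader, UnstructuredPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
import pinecone

loader = DirectoryLoader('data/dir', glob="**/*.pdf", loader_cls=UnstructuredPDFLoader)
data = loader.load()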

# I added company names explicitly for now
data[0].metadata["company"] = "Apple"
data[1].metadata["company"] = "Microsoft"
data[2].metadata["company"] = "Tesla"

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
texts = text_splitter.split_documents(data)

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

pinecone.init(
    api_key=PINECONE_API_KEY,  
    environment=PINECONE_API_ENV  
)

metadatas = []
for text in texts:
    metadatas.append({
        "company": text.metadata["company"]
    })

Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name, metadatas=metadatas)

I want to build a Q&A system in which I mention a company name in my query and Pinecone looks only for documents that have that company in their metadata. Here is what I have:

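# Import paths assume the early-2023 LangChain and pinecone-client releases used in this thread
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
import pinecone

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)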
index_name = "index"
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

docsearch = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)

llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

query = "What is the total revenue of Apple?"
docs = docsearch.similarity_search(query, include_metadata=True)

res = chain.run(input_documents=docs, question=query)
print(res)

However, the output of docs still contains chunks from non-Apple documents. What am I doing wrong here, and how do I use the metadata both in the docsearch similarity search and in the ChatGPT query (if possible)? Thanks

mzhadigerov avatar Mar 21 '23 01:03 mzhadigerov

You can pass a metadata filter as a dictionary, using the filter keyword:

docs = docsearch.similarity_search(query, filter={"company": "Apple"})

egils-mtx avatar Mar 23 '23 23:03 egils-mtx

@egils-mtx For the following code,

docs = docsearch.similarity_search(query, {"url": "langchain.readthedocs.io"})

I am getting this error:

TypeError: '>' not supported between instances of 'dict' and 'int'

I have upgraded the chroma library.

saxenarajat avatar Mar 24 '23 11:03 saxenarajat
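
The TypeError above is consistent with the filter dictionary being bound to the second positional parameter, which in LangChain vector stores is typically k (the number of results) rather than filter. A minimal sketch of that failure mode, using a hypothetical stand-in function instead of the real Chroma or Pinecone class, and assuming the parameter order (query, k, filter):

```python
# Hypothetical stand-in for a vector store's similarity_search; the parameter
# order (query, k, filter) is an assumption modeled on LangChain vector stores.
DOCS = [
    {"text": "LangChain docs page", "url": "langchain.readthedocs.io"},
    {"text": "Unrelated page", "url": "example.com"},
]


def similarity_search(query, k=4, filter=None):
    # Comparing k against the collection size fails if k is a dict.
    if k > len(DOCS):
        k = len(DOCS)
    hits = [
        d for d in DOCS
        if not filter or all(d.get(field) == value for field, value in filter.items())
    ]
    return hits[:k]


def positional_call_error():
    # Passing the dict positionally binds it to k, not to filter.
    try:
        similarity_search("query", {"url": "langchain.readthedocs.io"})
    except TypeError as exc:
        return str(exc)
    return None


# Passing the dict by keyword binds it to filter as intended.
filtered = similarity_search("query", filter={"url": "langchain.readthedocs.io"})
```

With the real library, the equivalent change would be to pass the dictionary as filter={...}, which matches the keyword form used further down this thread.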

@egils-mtx The thing is, I don't know in advance which company will be used in a query. I provided the code above just as an example. Moreover, I don't even know whether a query will contain a company name or not. I was expecting the query --> filter_by_metadata type of behavior to happen under the hood, without my intervention. It seems there is no such functionality so far.

mzhadigerov avatar Mar 26 '23 19:03 mzhadigerov
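
One way to approximate the query --> filter_by_metadata behavior by hand is to derive the filter from the query text before searching. A minimal sketch, assuming the company names written into the metadata are known in advance; KNOWN_COMPANIES and build_filter are hypothetical names for illustration, not part of LangChain or Pinecone:

```python
import re

# Hypothetical: the company names that were written into the metadata at indexing time.
KNOWN_COMPANIES = ["Apple", "Microsoft", "Tesla"]


def build_filter(query):
    """Return a metadata filter dict if the query names a known company, else None."""
    for company in KNOWN_COMPANIES:
        # Whole-word, case-insensitive match, so "apple's" matches but "pineapple" does not.
        if re.search(rf"\b{re.escape(company)}\b", query, flags=re.IGNORECASE):
            return {"company": company}
    return None
```

The result could then be passed along as docsearch.similarity_search(query, filter=build_filter(query)). When the query mentions no known company, the filter is None and the search runs unfiltered.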

Hello @mzhadigerov, can you share how you solved your problem?

nikitacorp avatar Aug 28 '23 17:08 nikitacorp

I am also curious! I don't see any solution to this in the thread. Could you please provide information on how to deal with this?

ksarang90 avatar Sep 02 '23 06:09 ksarang90

+1

quantuan125 avatar Sep 02 '23 22:09 quantuan125

+1

analyticanna avatar Sep 06 '23 15:09 analyticanna

Check this: https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/

I have not tried it yet, but it looks like it solves the problem you are facing.

raman-mt avatar Sep 15 '23 05:09 raman-mt

@egils-mtx what's the best way to pass on multiple "values" for metadata?

For example take this metadata key/value pair:

{"category": "game"}

Now what if I want to add two categories? With or without a comma?

{"category": "game, adult"}

or

{"category": "game adult"}

And can I query them later using "eq" or a similar filter to find metadata that contains BOTH game AND adult?

pooriaarab avatar Sep 28 '23 21:09 pooriaarab
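
If the vector store is Pinecone, its metadata filtering documentation describes list-of-strings values and an $and operator, so one option is to store the categories as a list rather than as a delimited string. A minimal sketch; the matches helper is a hypothetical local emulation of that matching rule for illustration only, not Pinecone's implementation:

```python
# Store multiple categories as a list of strings rather than "game, adult".
metadata = {"category": ["game", "adult"]}

# Filter for vectors whose category contains BOTH values.
both_filter = {"$and": [{"category": "game"}, {"category": "adult"}]}


def matches(meta, flt):
    """Local emulation (assumption): equality on a list field means 'list contains value'."""
    if "$and" in flt:
        return all(matches(meta, clause) for clause in flt["$and"])
    for field, wanted in flt.items():
        stored = meta.get(field)
        values = stored if isinstance(stored, list) else [stored]
        if wanted not in values:
            return False
    return True
```

A single string such as "game, adult" would be treated as one opaque value, so an equality filter on "game" would not match it. Other vector stores may handle list values differently, so it is worth checking their documentation.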

I used this:

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},
)

halfbug avatar Mar 27 '24 12:03 halfbug