langchain How metadata is being used during similarity search and query?

Open mzhadigerov opened this issue 1 year ago • 2 comments

I have 3 PDF files in my directory. I load them as documents, add metadata, split, embed, and store them in Pinecone, like this:

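# Imports assume the early-2023 LangChain package layout
import pinecone
from langchain.document_loaders import DirectoryLoader, UnstructuredPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone

loader = DirectoryLoader('data/dir', glob="**/*.pdf", loader_cls=UnstructuredPDFLoader)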
data = loader.load()

# Company names are set explicitly for now
data[0].metadata["company"] = "Apple"
data[1].metadata["company"] = "Microsoft"
data[2].metadata["company"] = "Tesla"

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
texts = text_splitter.split_documents(data)

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

pinecone.init(
    api_key=PINECONE_API_KEY,  
    environment=PINECONE_API_ENV  
)

# Keep only the "company" field from each chunk's metadata
metadatas = [{"company": text.metadata["company"]} for text in texts]

Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name, metadatas=metadatas)

I want to build a Q&A system in which I mention a company name in my query and Pinecone searches only the documents that have that company in their metadata. Here is what I have:

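# Imports assume the early-2023 LangChain package layout
import pinecone
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

pinecone.init(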
    api_key=PINECONE_API_KEY, 
    environment=PINECONE_API_ENV  
)
index_name = "index"
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

docsearch = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)

llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

query = "What is the total revenue of Apple?"
docs = docsearch.similarity_search(query, include_metadata=True)

res = chain.run(input_documents=docs, question=query)
print(res)

However, docs still contains chunks from non-Apple documents. What am I doing wrong here, and how do I use the metadata both in the docsearch similarity search and in the ChatGPT query (if possible)? Thanks

Mar 21 '23 01:03 mzhadigerov

You can pass a metadata filter as a dictionary:

docs = docsearch.similarity_search(query, {"company":"Apple"})

Mar 23 '23 23:03 egils-mtx

@egils-mtx For the following code,

docs = docsearch.similarity_search(query, {"url": "langchain.readthedocs.io"})

I am getting this error:

TypeError: '>' not supported between instances of 'dict' and 'int'

I have upgraded the chroma library.

Mar 24 '23 11:03 saxenarajat
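
A possible explanation for the TypeError above, offered as an assumption rather than something confirmed in this thread: in LangChain vector stores the second positional parameter of similarity_search is k (the number of results), so a dictionary passed positionally is treated as k and later compared with an integer. The sketch below uses a hypothetical in-memory stand-in (DOCS and this similarity_search are illustrative, not LangChain or Chroma code) to reproduce that failure and show the keyword form.

```python
from typing import Optional

# Hypothetical in-memory "index"; each entry carries a metadata field.
DOCS = [
    {"text": "Apple annual report excerpt", "company": "Apple"},
    {"text": "Microsoft annual report excerpt", "company": "Microsoft"},
    {"text": "Tesla annual report excerpt", "company": "Tesla"},
]


def similarity_search(query: str, k: int = 4, filter: Optional[dict] = None) -> list:
    """Illustrative stand-in: the second positional parameter is k, not the filter."""
    # Cap k at the number of stored items; this comparison fails if k is a dict.
    if k > len(DOCS):
        k = len(DOCS)
    # Keep only documents whose metadata matches every key in the filter.
    matches = [
        d for d in DOCS
        if filter is None or all(d.get(key) == value for key, value in filter.items())
    ]
    return matches[:k]


def try_positional() -> str:
    """Pass the filter positionally and return the resulting error message."""
    try:
        similarity_search("total revenue", {"company": "Apple"})
    except TypeError as exc:
        return str(exc)
    return "no error"


positional_error = try_positional()
keyword_results = similarity_search("total revenue", filter={"company": "Apple"})
```

If the same holds for the store in use, passing the dictionary by keyword, for example docsearch.similarity_search(query, filter={"company": "Apple"}), avoids the clash with k.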

@egils-mtx The thing is, I don't know in advance which company will be used in a query. I provided the code above just as an example. Moreover, I don't even know whether a query will contain a company name at all. I was expecting a query --> filter_by_metadata step to happen under the hood, without my intervention. It seems there is no such functionality so far.

Mar 26 '23 19:03 mzhadigerov
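
One way to approximate the query --> filter_by_metadata behavior by hand is to derive the filter from the query text before searching. The following is a minimal sketch, assuming the set of company names is known up front; KNOWN_COMPANIES and build_company_filter are hypothetical names introduced for illustration, and the "$in" operator follows Pinecone's metadata filter syntax. It is plain keyword matching, not a LangChain feature.

```python
import re
from typing import Optional

# Assumption: the companies stored in the index metadata are known in advance.
KNOWN_COMPANIES = ["Apple", "Microsoft", "Tesla"]


def build_company_filter(query: str, companies=KNOWN_COMPANIES) -> Optional[dict]:
    """Return a metadata filter for companies named in the query, or None."""
    # Case-insensitive whole-word match for each known company name.
    found = [
        c for c in companies
        if re.search(rf"\b{re.escape(c)}\b", query, flags=re.IGNORECASE)
    ]
    if not found:
        return None  # no company mentioned: search without a filter
    if len(found) == 1:
        return {"company": found[0]}
    # Several companies mentioned: match any of them.
    return {"company": {"$in": found}}


single = build_company_filter("What is the total revenue of Apple?")
several = build_company_filter("Compare apple and Tesla margins")
nothing = build_company_filter("What was the total revenue last year?")

# Assumed usage with the objects from the question:
# docs = docsearch.similarity_search(query, filter=build_company_filter(query))
```

Because the function returns None when no company is named, the same call covers queries with and without a company; it will miss aliases and misspellings, which would need a lookup table or an LLM-based extraction step.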

Hello @mzhadigerov, can you share how you solved your problem?