Pinecone retriever throwing: KeyError: 'text'

Open kindbuds opened this issue 1 year ago • 10 comments

My query code is below:

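import os

import pinecone
from langchain.agents import Tool, initialize_agent
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferWindowMemory
from langchain.vectorstores import Pinecone

# index_name and tool_desc are defined elsewhere (not shown)
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY'),  # app.pinecone.io
    environment=os.environ.get('PINECONE_ENV')  # next to API key in console
)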
index = pinecone.Index(index_name)
embeddings = OpenAIEmbeddings(openai_api_key=os.environ.get('OPENAI_API_KEY'))

vectordb = Pinecone(
    index=index,
    embedding_function=embeddings.embed_query,
    text_key="text",
)
llm=ChatOpenAI(
    openai_api_key=os.environ.get('OPENAI_API_KEY'),
    temperature=0,
    model_name='gpt-3.5-turbo'
)
retriever = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectordb.as_retriever()
)
tools = [Tool(
    func=retriever.run,
    description=tool_desc,
    name='Product DB'
)]
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",  # important to align with agent prompt (below)
    k=5,
    return_messages=True
)
agent = initialize_agent(
    agent='chat-conversational-react-description', 
    tools=tools, 
    llm=llm,
    verbose=True,
    max_iterations=3,
    early_stopping_method="generate",
    memory=memory,
)

If I run: agent({'chat_history':[], 'input':'What is a product?'})

It throws:

File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\vectorstores\pinecone.py", line 160, in similarity_search
    text = metadata.pop(self._text_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'text'

This is the offending block in site-packages/langchain/vectorstores/pinecone.py:

for res in results["matches"]:
    # print('metadata.pop(self._text_key) = ' + metadata.pop(self._text_key))
    metadata = res["metadata"]
    text = metadata.pop(self._text_key)
    docs.append(Document(page_content=text, metadata=metadata))
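
The failure does not depend on Pinecone itself: dict.pop without a default raises KeyError whenever the key is absent. Below is a minimal sketch, assuming a match whose metadata has no "text" entry. The Document dataclass, the matches_to_docs helper, and the sample data are hypothetical stand-ins for illustration, not LangChain's actual code:

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for langchain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def matches_to_docs(matches, text_key="text"):
    """Convert query matches to documents, skipping any without text_key."""
    docs, skipped = [], []
    for res in matches:
        metadata = dict(res.get("metadata") or {})  # copy so the input is not mutated
        if text_key not in metadata:
            # metadata.pop(text_key) would raise KeyError at this point
            skipped.append(res["id"])
            continue
        text = metadata.pop(text_key)
        docs.append(Document(page_content=text, metadata=metadata))
    return docs, skipped


# Assumed example data in the shape of a query response's "matches" list.
matches = [
    {"id": "a", "metadata": {"text": "A product is ...", "source": "faq"}},
    {"id": "b", "metadata": {"source": "faq"}},  # no "text" key
]
docs, skipped = matches_to_docs(matches)
```

Any ids that end up in skipped point at vectors that were stored without the text under the expected key, which is the likely place to look when this error appears.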

If I remove my tool like the line below, everything executes (just not my tool):

tools = []

Can anyone help me fix this KeyError: 'text' issue? My versions of langchain, pinecone-client and python are 0.0.147, 2.2.1 and 3.11.3 respectively.

kindbuds avatar Apr 24 '23 18:04 kindbuds

From a first look (without knowing more about the setup): might the Pinecone index be empty? Depending on your setup (whether you have the documents locally or a Pinecone index already created), you might want to try

vectordb = Pinecone.from_existing_index(
    index_name=index_name,  # the index name (a string), not the pinecone.Index object
    embedding=embeddings,
)

Or Pinecone.from_documents. What you are currently doing is only initializing the Pinecone wrapper. If the Pinecone client then tries to perform a similarity search, perhaps there is nothing to be found?

christianwarmuth avatar Apr 25 '23 13:04 christianwarmuth

Thanks for the response @christianwarmuth! There are 13 documents in the vectordb. I tried your change and it is still throwing:

File "C:\Users\xxx_\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\chains\base.py", line 116, in call
    raise e
File "C:\Users\xxx_\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\chains\base.py", line 113, in call
    outputs = self.call(inputs)
              ^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\agents\agent.py", line 792, in call
    next_step_output = self.take_next_step(
                       ^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\agents\agent.py", line 695, in take_next_step
    observation = tool.run(
                  ^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\tools\base.py", line 107, in run
    raise e
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\tools\base.py", line 104, in run
    observation = self.run(*tool_args, **tool_kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\agents\tools.py", line 31, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\chains\base.py", line 213, in run
    return self(args[0])[self.output_keys[0]]
           ^^^^^^^^^^^^^
File "C:\Users\xxx_\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\chains\base.py", line 116, in call
    raise e
File "C:\Users\xxx_\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\chains\base.py", line 113, in call
    outputs = self.call(inputs)
              ^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\chains\retrieval_qa\base.py", line 109, in call
    docs = self.get_docs(question)
           ^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\chains\retrieval_qa\base.py", line 166, in get_docs
    return self.retriever.get_relevant_documents(question)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\vectorstores\base.py", line 279, in get_relevant_documents
    docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain\vectorstores\pinecone.py", line 160, in similarity_search
    text = metadata.pop(self._text_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'text'

kindbuds avatar Apr 25 '23 15:04 kindbuds

same problem here, any solution?

alarcon7a avatar May 03 '23 04:05 alarcon7a

Same problem here!! I'm using from_documents

berengamble avatar May 05 '23 14:05 berengamble

+1 here. Subscribing.

aringer117 avatar May 08 '23 02:05 aringer117

+1 Pinecone retrieval seems broken. It either flips out with the 'text' error or hangs indefinitely.

verveguy avatar May 17 '23 00:05 verveguy

Pinecone doesn't store documents explicitly; it only stores ids, embeddings, and metadata. So, if when querying Pinecone you'd like to have access to the documents themselves, you should add them to the metadata, as illustrated here. If you add them under a text key in the metadata, the KeyError should resolve.

Alternatively, you can specify your own key name when you initialize Pinecone, via text_key. Code reference is here.
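
A minimal sketch of the first option, assuming the records are upserted by hand: each vector carries its source text in the metadata under the same key the vector store is configured with. The names build_vectors and fake_embed are hypothetical, and fake_embed is only a placeholder for a real embedding model:

```python
def fake_embed(text):
    """Placeholder embedding (assumption): not a real model, just deterministic numbers."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def build_vectors(texts, embed, text_key="text"):
    """Build records in the id/values/metadata shape, storing the text in metadata."""
    return [
        {
            "id": f"doc-{i}",
            "values": embed(text),
            "metadata": {text_key: text},  # the key the retriever later looks up
        }
        for i, text in enumerate(texts)
    ]


vectors = build_vectors(["A product is an item for sale."], fake_embed)
# With a real index these records would then be upserted, e.g. index.upsert(vectors=vectors)
```

If the text is stored under a different key (say "content"), the same name would need to be passed as text_key when constructing the vector store, so that both sides agree.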

PlatosTwin avatar Jun 08 '23 22:06 PlatosTwin

This explains a lot - thank you. However, wouldn't storing the entire text of a document in addition to the vector drastically increase the storage requirements? I believe Pinecone also suggests avoiding pieces of metadata that are very unique, which this certainly would be.

Is it not possible to use the vector data itself as an input to ChatGPT for answering questions on the data? From above, it sounds like you need to do as follows:

  1. Embed the text and store both the vector and original text at Pinecone.
  2. Perform a similarity search on the Pinecone index to find relevant documents.
  3. Take text from Pinecone metadata and feed that back into the LLM for QA.

Appreciate any additional ideas. Thanks.

PeakProsperityDotCom avatar Jun 15 '23 19:06 PeakProsperityDotCom

Your alternative is to store the document text elsewhere and then store a reference / URL to it in the pinecone metadata. This is what I do with Tana notes, for example. Pinecone is the search index - Tana is the content repository.
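
A rough sketch of that pattern, assuming the index metadata holds only a reference and the text lives in a separate store. The content_store dict, the content_ref key, and the sample matches are all hypothetical; in practice the store could be a SQL table or a notes application:

```python
# Hypothetical external content store, keyed by a document reference.
content_store = {
    "doc-1": "A product is an item offered for sale.",
    "doc-2": "Refunds are processed within 14 days.",
}

# Assumed query results: metadata carries a reference instead of the text.
matches = [
    {"id": "vec-9", "score": 0.91, "metadata": {"content_ref": "doc-1"}},
    {"id": "vec-4", "score": 0.55, "metadata": {"content_ref": "doc-2"}},
]


def resolve_texts(matches, store, ref_key="content_ref"):
    """Look up the full text for each match via the reference in its metadata."""
    texts = []
    for match in matches:
        ref = match.get("metadata", {}).get(ref_key)
        if ref in store:  # silently skip matches with a missing or unknown reference
            texts.append(store[ref])
    return texts


context = "\n".join(resolve_texts(matches, content_store))
```

Note that the LangChain wrapper quoted earlier in this thread still expects the text key in the metadata, so a lookup like this would sit in your own retrieval code rather than inside that wrapper.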

verveguy avatar Jun 16 '23 01:06 verveguy

Thanks, @verveguy. I created a table in our SQL database that references the vector ID and it seems to be working well.

PeakProsperityDotCom avatar Jun 16 '23 10:06 PeakProsperityDotCom

same problem here, any solution?

text = metadata.pop(self._text_key) KeyError: 'text'

kishorgujjar avatar Jun 30 '23 08:06 kishorgujjar

In my situation, I encountered an issue due to the metadata associated with the vectors, specifically the need to include a key labeled "text" (e.g., "text": text). I found the solution to this problem in the comment at this GitHub link, which proved to be extremely helpful for me.

AMRedichkina avatar Aug 07 '23 19:08 AMRedichkina

Hi, @kindbuds. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on the information provided, it seems that the issue you reported is related to a KeyError: 'text' when trying to retrieve the 'text' key from the metadata dictionary. Removing the tool resolves the issue, but you were seeking help to fix the error while still using the tool.

@christianwarmuth suggested trying Pinecone.from_existing_index or Pinecone.from_documents to resolve the issue, but it seems that the error still persists. Other users have also reported experiencing the same problem and are looking for a solution. @PlatosTwin suggested adding the document text to the metadata under a 'text' key or specifying a custom key name when initializing Pinecone. @verveguy mentioned storing the document text elsewhere and referencing it in the Pinecone metadata. @PeakProsperityDotCom found success by creating a table in their SQL database that references the vector ID.

If this issue is still relevant to the latest version of the LangChain repository, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project. Let us know if you have any further questions or concerns.

dosubot[bot] avatar Nov 06 '23 16:11 dosubot[bot]

Is there any update on this issue? I am also getting this error:

from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
vectorstore = Pinecone(
    index, embed.embed_query, text_key=text_field
)
query = "when "

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

Output:

Found document with no `text` key. Skipping.
Found document with no `text` key. Skipping.
Found document with no `text` key. Skipping.
[]

rpalsaxena avatar Mar 14 '24 20:03 rpalsaxena

Same issue, any updates?

zubairahmed-ai avatar Mar 17 '24 18:03 zubairahmed-ai

I solved the problem. Here is a detailed explanation, followed by a question for @langchainTeam. When you create a record in Pinecone yourself, for example (Pinecone documentation):

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key='YOUR_API_KEY')
index_name = "docs-quickstart-index"

# Create a small serverless index with 8-dimensional vectors
pc.create_index(
    name=index_name,
    dimension=8,
    metric="cosine",
    spec=ServerlessSpec(
        cloud='aws',
        region='us-east-1'
    )
)

index = pc.Index(index_name)

# Upsert raw vectors; note that none of the metadata contains a "text" field
index.upsert(
    vectors=[
        {"id": "vec1", "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], "metadata": {"type": "doc1"}},
        {"id": "vec2", "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2], "metadata": {"type": "doc2"}},
        {"id": "vec3", "values": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3], "metadata": {"type": "doc3"}},
        {"id": "vec4", "values": [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]}
    ],
    namespace="ns1"
)

This does not add a "text" metadata field, only your custom metadata if you supply any, and you can still run a similarity search against those vectors without any problem (Pinecone documentation):

index.query(
    namespace="ns1",  # same namespace the vectors above were upserted into
    vector=[0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7],
    top_k=3,
    include_values=True
)

When you create a vector in Pinecone through LangChain, for example (LangChain documentation):

from langchain_pinecone import PineconeVectorStore

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)

vectorstore.add_texts(["More text!"])

It adds to the vector a metadata field named "text" that holds the actual text that was embedded.

And then, with that metadata field, you can do:

vectorstore.similarity_search(query="More Text")

Or alternatively, with the retriever interface:

retriever = vectorstore.as_retriever()
results = retriever.invoke("More Text")

My question is: which concept am I missing? Why maintain a vectorstore with the vector values (the vector representation of the text) if queries ultimately have to be run against the "text" metadata field containing the actual text? Thank you for your time; I would appreciate a response to help resolve my doubt.

CessarGL avatar May 06 '24 01:05 CessarGL

@CessarGL I had the same issue, since I stored vectors as shown on the Pinecone website. I think the point of including "text" as metadata is to give the language model context it can interpret. Vector search only retrieves the most similar vectors, but the vectors themselves are just a bunch of numbers. For the LLM to understand the semantic meaning behind each vector, I think you need to attach the content to the vectors under the text key.
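
A toy sketch of that division of labour, assuming hand-made 2-d vectors in place of real embeddings: the numbers are used only to rank records, and the stored text is what ends up in the prompt. The records, the retrieve_context helper, and the prompt wording are all hypothetical:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy records (assumed data) in the id/values/metadata shape used above.
records = [
    {"id": "vec1", "values": [1.0, 0.0], "metadata": {"text": "Our product is a tea kettle."}},
    {"id": "vec2", "values": [0.0, 1.0], "metadata": {"text": "Shipping takes three days."}},
]


def retrieve_context(query_vector, records, k=1):
    """Rank records by vector similarity, then return the stored text of the top k."""
    ranked = sorted(records, key=lambda r: cosine(query_vector, r["values"]), reverse=True)
    return [r["metadata"]["text"] for r in ranked[:k]]


# Ranking uses only the numbers; the prompt uses only the text.
context = retrieve_context([0.9, 0.1], records)
prompt = "Answer using this context:\n" + "\n".join(context) + "\n\nQuestion: What is the product?"
```

In other words, the query is never run against the "text" field; the field is only read after the vector comparison has already picked the nearest records.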

Mizuki8783 avatar May 19 '24 00:05 Mizuki8783