PineconeDocumentStore raises an error on the metadata produced by DocumentSplitter

Open bilgeyucel opened this issue 7 months ago • 3 comments

Describe the bug

PineconeDocumentStore raises an error when I try to index a document that was split by DocumentSplitter. Error message 👇

PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 23 Jul 2024 12:46:03 GMT', 'Content-Type': 'application/json', 'Content-Length': '160', 'Connection': 'keep-alive', 'x-pinecone-request-latency-ms': '903', 'x-pinecone-request-id': '2298458388900737762', 'x-envoy-upstream-service-time': '37', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"Metadata value must be a string, number, boolean or list of strings, got '[{\"doc_id\":\"22e0...' for field '_split_overlap'","details":[]}

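The Document object that raises the error is below. Its "_split_overlap" field is a list of dicts, which is not one of the metadata value types Pinecone accepts (string, number, boolean, or list of strings).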

Document(id=37fa03ca409f457046696a3bec987d5cb627f655cbcf0c019f7334bc170da4b8, content: 'Vegan Persimmon Flan

Recipe  by Tilde Thurium

This makes 2 servings. Why did I write a recipe that...', meta: {'file_path': '/content/recipe_files/vegan_flan_recipe.md', 'source_id': 'a01a0ae2f396930e9cd3475986ae716cb26c554f6b49d4c61dfeb473ddeb7ced', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0, '_split_overlap': [{'doc_id': '0520d3c17150c5fd057a19bdc796e9f9c3a632f1d9acf730154d888ee3fc86be', 'range': (0, 305)}]})
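To see which fields of a meta dict fall outside the types named in the error message, a check along these lines may help. This is only a sketch: the helper names `is_pinecone_compatible` and `find_unsupported_fields` are hypothetical, and the type rule is taken from the error text above rather than from the Pinecone client itself.

```python
def is_pinecone_compatible(value):
    """True if value is a string, number, boolean, or list of strings."""
    if isinstance(value, (str, int, float, bool)):
        return True
    if isinstance(value, list):
        return all(isinstance(item, str) for item in value)
    return False


def find_unsupported_fields(meta):
    """Return the keys whose values do not match the rule in the error message."""
    return [key for key, value in meta.items() if not is_pinecone_compatible(value)]


# Meta copied from the failing Document above.
meta = {
    "file_path": "/content/recipe_files/vegan_flan_recipe.md",
    "source_id": "a01a0ae2f396930e9cd3475986ae716cb26c554f6b49d4c61dfeb473ddeb7ced",
    "page_number": 1,
    "split_id": 0,
    "split_idx_start": 0,
    "_split_overlap": [
        {
            "doc_id": "0520d3c17150c5fd057a19bdc796e9f9c3a632f1d9acf730154d888ee3fc86be",
            "range": (0, 305),
        }
    ],
}

unsupported = find_unsupported_fields(meta)
print(unsupported)  # ['_split_overlap']
```

Only `_split_overlap` is flagged, which matches the field named in the 400 response.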

To Reproduce

import os

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
from haystack_integrations.document_stores.pinecone import PineconeDocumentStore

# Placeholder; replace with a real Pinecone API key.
os.environ["PINECONE_API_KEY"] = "PINECONE-KEY"

document_store = PineconeDocumentStore(
    index="<ENTER_PINECONE_INDEX_NAME>",
    namespace="<ENTER_PINECONE-PROJECT-NAME>",
    dimension=1536,
    spec={"serverless": {"region": "us-east-1", "cloud": "aws"}},
)

source_docs = [Document(content="""
Vegan Persimmon Flan
Recipe by Tilde Thurium
This makes 2 servings. Why did I write a recipe that only makes 2 servings? It was the height of COVID, okay, don't judge me.
Tools:
2 ramekins
Blender
Ingredients:
½ cup persimmon pulp, strained. This takes 2 average sized fuyu persimmons. If they have seeds, remove them.
1 tbsp cornstarch
½ tsp agar agar
1 tbsp agave nectar, or to taste
2 tbsp granulated sugar
¼ cup coconut creme
½ cup almond milk
½ tsp vanilla
Steps
I tried making caramel with the [Full Of Plants](https://www.google.com/url?q=https%3A%2F%2Ffullofplants.com%2Feasy-vegan-caramel-sauce%2F) method but it was a pain in the ass and I burned myself.
For this recipe, just put the sugar at the bottom of the cup and it somehow magically turns into sauce. Lifehack!
Combine the cornstarch with the almond milk and stir it in.
whisk persimmon pulp, milk/cornstarch, agar agar, coconut creme, and agave in a saucepan. Bring to a boil.
The persimmon pulp got a little congealed, so I mixed it with an immersion blender. But you do you, boo.
Let the persimmon mixture cool a bit, for maybe 5 minutes. Stir in the vanilla. Pour it in to your ramekins or what have you.
Don’t forget and let it cool to room temperature. Agar agar waits for no man.
Refrigerate for at least 4 hours, or overnight.
To remove from ramekin, try the hot water bath method (didn’t work for me, maybe the water wasn’t hot enough.) Or just run a knife along the edges of the ramekin and jiggle it out.""")]

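# Split into 40-word chunks with a 10-word overlap; chunks then carry "_split_overlap" in their meta.
document_splitter = DocumentSplitter(split_by="word", split_length=40, split_overlap=10)
split_docs = document_splitter.run(documents=source_docs)
# This call raises PineconeApiException (400) because of the "_split_overlap" field.
document_store.write_documents(documents=split_docs["documents"])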
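One possible workaround until the integration handles this, sketched here on plain dicts: JSON-encode any meta value that is not a string, number, boolean, or list of strings before writing. The helper name `sanitize_meta` is hypothetical and not part of Haystack or the Pinecone integration, and encoding as JSON is just one option (dropping the field is another).

```python
import json


def sanitize_meta(meta):
    """Return a copy of meta with unsupported values encoded as JSON strings."""
    clean = {}
    for key, value in meta.items():
        is_scalar = isinstance(value, (str, int, float, bool))
        is_str_list = isinstance(value, list) and all(
            isinstance(item, str) for item in value
        )
        # Keep supported values as they are; serialize everything else.
        clean[key] = value if (is_scalar or is_str_list) else json.dumps(value)
    return clean


# Reduced version of the meta from the failing Document.
meta = {
    "split_id": 0,
    "_split_overlap": [
        {
            "doc_id": "0520d3c17150c5fd057a19bdc796e9f9c3a632f1d9acf730154d888ee3fc86be",
            "range": (0, 305),
        }
    ],
}

clean = sanitize_meta(meta)
print(type(clean["_split_overlap"]).__name__)  # str
```

In the script above, this could be applied as `doc.meta = sanitize_meta(doc.meta)` for each document in `split_docs["documents"]` before calling `write_documents`, assuming `meta` is a plain dict on the Document. Note that the tuple in `range` comes back as a list after `json.loads`.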

Describe your environment (please complete the following information):

  • OS: Colab
  • Haystack version: 2.3
  • Integration version: 1.2.1

bilgeyucel, Jul 23 '24 13:07