
Pinecone.from_texts()'s added 'text' metadata field modifies the data object passed in as an argument AND errors if text size exceeds Pinecone's per-vector metadata limit

Open andrewreece opened this issue 1 year ago • 5 comments

The Pinecone.from_documents() embeddings-creation/upsert (based on this example) produces two unexpected behaviors:

  1. Mutates the original docs object in place, such that each entry's Document.metadata dict now has a 'text' key assigned the value of Document.page_content.
  2. Does not check whether adding this metadata['text'] entry pushes the metadata past the maximum allowable bytes per vector set by Pinecone (40960 bytes), so the API call throws an error in cases where that limit is exceeded.

I was sending a batch of documents to Pinecone using Pinecone.from_documents(), and was surprised to see the operation fail on this error:

ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'date': 'Sat, 29 Apr 2023 04:19:35 GMT', 'x-envoy-upstream-service-time': '5', 'content-length': '115', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"metadata size is 41804 bytes, which exceeds the limit of 40960 bytes per vector","details":[]}

This was surprising because there was only minimal metadata per record, and I'd already sent about 50,000 records successfully to Pinecone in the same call. Then, when I inspected the docs list of Documents (compiled from DataFrameLoader().load()), I was surprised to see the extra 'text' field in the metadata. It wasn't until I went poking around in pinecone.py that I found this field was being added by updating the passed-in Document.metadata dicts in place (because of Python's pass-by-sharing semantics for mutable objects).
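For anyone who wants to see this for themselves, here's a minimal sketch. The embeddings object and "my-index" are placeholders for your own setup, and docs is the output of DataFrameLoader(...).load() as above:

# Hypothetical repro sketch; embeddings and "my-index" are placeholders.
print(docs[0].metadata.keys())   # original metadata keys only

Pinecone.from_documents(docs, embeddings, index_name="my-index")

# The caller's objects were mutated: each metadata dict now also has a 'text'
# key holding a copy of page_content.
print(docs[0].metadata.keys())   # now includes 'text'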

Suggestions:

  • If metadata['text'] is required (I'm not sure it is for Pinecone upserts?), then make the user add it themselves (error and refuse to upsert if it's not there) rather than modifying the metadata silently in .from_texts().
  • It would also be great to test for the metadata limit before making the API's HTTP request: a quick check on the user's metadata input could confirm it won't be rejected by Pinecone (and warn the user otherwise). In my case, I don't want to make smaller chunks of text (my use case involves a certain number of turns of dialogue in each embedded chunk), but I may just write a check for overflow and truncate the 'text' metadata accordingly (see the sketch after this list).
  • Fail gracefully by catching all ApiException errors so that the embeddings-creation and upsert process isn't interrupted.
  • Maybe consider something like an add_text_metadata flag in the call to from_documents() so users have the option to have it done automatically for them?
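Here's a rough sketch of the kind of pre-flight check I have in mind. The 40960-byte figure comes from the error above; I'm assuming the limit applies roughly to the JSON-serialized metadata including the added 'text' field, and the truncation is deliberately crude:

import json

PINECONE_METADATA_LIMIT = 40960  # bytes per vector, per the error message above

def metadata_bytes(metadata: dict) -> int:
    # Rough estimate of the serialized metadata size Pinecone will see.
    return len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))

def truncate_text_to_fit(doc, text_key="text", limit=PINECONE_METADATA_LIMIT):
    # Simulate what from_texts() will upsert: existing metadata plus the text field.
    candidate = {**doc.metadata, text_key: doc.page_content}
    overflow = metadata_bytes(candidate) - limit
    if overflow > 0:
        # Dropping `overflow` characters removes at least `overflow` bytes,
        # so this is conservative but crude (it can cut mid-word).
        doc.page_content = doc.page_content[:-overflow]
    return doc

docs = [truncate_text_to_fit(doc) for doc in docs]

This only approximates whatever Pinecone counts internally, so it may need a safety margin.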

I'm pretty new to LangChain and Pinecone, so if I'm missing something or doing it wrong, apologies - otherwise, hope this is useful feedback!

andrewreece avatar Apr 29 '23 19:04 andrewreece

One other item to consider: because ids_batch assigns random UUIDs, if the upsert fails I'm not sure how to restart at the right spot (or how to ensure there are no collisions if I try to restart in the middle).
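One possible workaround, sketched only: derive deterministic IDs from the chunk content so that re-running an interrupted upsert overwrites the same vectors instead of creating duplicates. This assumes the ids kwarg is forwarded through from_documents() to from_texts(), which accepts an optional ids list:

import hashlib

def stable_id(doc) -> str:
    # Deterministic ID derived from the chunk's content, so a restart upserts
    # to the same IDs rather than minting new random UUIDs.
    return hashlib.sha256(doc.page_content.encode("utf-8")).hexdigest()

ids = [stable_id(doc) for doc in docs]
# Pinecone.from_documents(docs, embeddings, index_name="my-index", ids=ids)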

andrewreece avatar Apr 29 '23 22:04 andrewreece

I'm having the same issue, thanks for bringing this up

FayZ676 avatar May 02 '23 18:05 FayZ676

I used this filter, based on byte length, which helped prevent the API error.

byte_max = 30000  # well under Pinecone's 40960-byte metadata limit, leaving room for other metadata fields

def utf8len(s):
    # Byte length of the string as UTF-8, which is what the limit is measured in.
    return len(s.encode('utf-8'))

# Keep only documents whose page_content stays within the byte budget.
screened_docs = [doc for doc in loaded_docs if utf8len(doc.page_content) <= byte_max]

andrewreece avatar May 03 '23 22:05 andrewreece

I also ran into this issue and as a workaround just filtered texts that are too big after chunking:

texts = list(filter(lambda x: len(x) <= 40960, texts))

dalberto avatar May 04 '23 23:05 dalberto

+1 all of this, had to truncate the upsert documents.

asidapara avatar May 09 '23 17:05 asidapara

This happens because of this line in the langchain.vectorstores.Pinecone from_texts class method: for j, line in enumerate(lines_batch): metadata[j][text_key] = line

From what I can see in the docs, it's intentional behaviour, so that one can have their text along with the embeddings on retrieval, but I think it should be optional: vector stores impose a limit, and you don't always need the embedded text returned to you. The retrievers use it for ease of access.
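For illustration only, a variant of that loop which copies each metadata dict and makes the text field opt-in (add_text_metadata is an invented flag here, not an existing parameter):

metadata_batch = []
for j, line in enumerate(lines_batch):
    # Copy the per-line dict instead of mutating the caller's object in place.
    meta = dict(metadata[j])
    if add_text_metadata:  # hypothetical opt-in flag
        meta[text_key] = line
    metadata_batch.append(meta)
# metadata_batch would then replace metadata in the upsert that follows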

ipriyam26 avatar May 17 '23 21:05 ipriyam26

Hi, @andrewreece! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you raised is about the Pinecone.from_documents() function in the langchain repository. It seems that the function mutates the original docs object and adds a 'text' key to each entry's Document.metadata dict, without checking if it exceeds the maximum allowable metadata bytes per vector. Some suggestions for improvement have been made, such as making the addition of metadata['text'] optional, testing the metadata limit before making the API request, and catching all ApiException errors. Other users have also shared workarounds and suggestions for filtering or truncating texts to avoid the issue.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Sep 16 '23 16:09 dosubot[bot]