langchain
Pinecone.from_texts() adds a 'text' metadata field that mutates the data object passed in as an argument AND errors if text size exceeds Pinecone's per-vector metadata limit
The Pinecone.from_documents() embeddings-creation/upsert (based on this example) produces two unexpected behaviors:
- Mutates the original docs object in place, such that each entry's Document.metadata dict now has a 'text' key that is assigned the value of Document.page_content.
- Does not check whether the addition of this metadata['text'] entry exceeds the maximum allowable metadata bytes per vector set by Pinecone (40960 bytes per vector), allowing the API call to throw an error in cases where this limit is exceeded.
I was sending a batch of documents to Pinecone using Pinecone.from_documents(), and was surprised to see the operation fail with this error:
ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'date': 'Sat, 29 Apr 2023 04:19:35 GMT', 'x-envoy-upstream-service-time': '5', 'content-length': '115', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"metadata size is 41804 bytes, which exceeds the limit of 40960 bytes per vector","details":[]}
This was surprising because there was only minimal metadata on each record, and I'd already sent about 50,000 records successfully to Pinecone in the same call. When I inspected the docs list of Documents (compiled from DataFrameLoader().load()), I was surprised to see the extra 'text' field in the metadata. It wasn't until I went poking around in pinecone.py that I found this was an added field that updates the passed-in Document.metadata dicts in place (because of Python's pass-by-sharing rules for mutable arguments).
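The pass-by-sharing behaviour mentioned above is easy to demonstrate in plain Python, independent of LangChain or Pinecone:

```python
# Python passes object references, not copies: a function that writes
# into a dict it received mutates the caller's dict too.
def add_text_key(metadata, text):
    metadata["text"] = text  # mutates the shared dict

record = {"source": "row-0"}
add_text_key(record, "hello world")
print(record)  # {'source': 'row-0', 'text': 'hello world'}
```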
Suggestions:
- If metadata['text'] is required (I'm not sure it is for Pinecone upserts?), then make the user supply it (error and refuse to upsert if it's not there) rather than modifying it silently in .from_texts().
- It would be great to test the metadata limit ahead of the API's HTTP request: a quick check on the user's metadata input could confirm it won't be rejected by Pinecone (and warn the user otherwise). In my case, I don't want to make smaller chunks of text (my use case involves a certain number of turns of dialogue in each embedded chunk), but I may just write in a check for overflow and truncate the 'text' metadata accordingly.
- Fail gracefully by catching ApiException errors so that the embeddings-creation and upsert process isn't interrupted.
- Maybe consider something like an add_text_metadata flag in the call to from_documents() so users have the option to have it done automatically for them?
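A pre-flight size check along those lines could look something like this. This is only a sketch: the 40960-byte limit is taken from the error message above, and serializing the whole metadata dict as JSON is an approximation of how Pinecone measures the payload.

```python
import json

PINECONE_METADATA_LIMIT = 40960  # bytes per vector, per the API error above

def metadata_bytes(metadata: dict) -> int:
    # Approximate the serialized size Pinecone will see.
    return len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))

def check_metadata(metadatas: list) -> None:
    # Fail fast, before any HTTP request, if a record would be rejected.
    for i, md in enumerate(metadatas):
        size = metadata_bytes(md)
        if size > PINECONE_METADATA_LIMIT:
            raise ValueError(
                f"metadata at index {i} is {size} bytes, "
                f"which exceeds the {PINECONE_METADATA_LIMIT}-byte limit"
            )

check_metadata([{"text": "short", "source": "row-0"}])  # passes silently
```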
I'm pretty new to LangChain and Pinecone, so if I'm missing something or doing it wrong, apologies - otherwise, hope this is useful feedback!
One other item to consider: because ids_batch assigns a random UUID to each vector, if the upsert fails I'm not sure how to restart at the right spot (or how to ensure no collisions if I try to restart in the middle).
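On the random-ID point: one way to make retries idempotent (a caller-side sketch, not part of the library) is to derive each vector ID deterministically from the chunk's content, e.g. with uuid5 over a content hash, so re-upserting the same document overwrites the existing vector instead of creating a duplicate:

```python
import hashlib
import uuid

def stable_id(text: str) -> str:
    # Same text -> same ID on every run, so a restarted upsert
    # overwrites rather than duplicating under a fresh random UUID.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return str(uuid.uuid5(uuid.NAMESPACE_URL, digest))

a = stable_id("turn 1: hello")
b = stable_id("turn 1: hello")
assert a == b  # deterministic across runs
```

Note this assumes chunk texts are unique; if two chunks can be byte-identical, the hash input would need a disambiguating prefix (e.g. the source row index).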
I'm having the same issue, thanks for bringing this up
I used this filter, based on byte length, which helped prevent the API error.
```python
byte_max = 30000

def utf8len(s):
    return len(s.encode("utf-8"))

screened_docs = [doc for doc in loaded_docs if utf8len(doc.page_content) <= byte_max]
```
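One caveat with screening on page_content alone: Pinecone's limit applies to the whole metadata dict, including the 'text' key the library adds plus any other fields. A stricter screen could measure the metadata each document will end up with (a sketch, assuming JSON serialization approximates Pinecone's accounting):

```python
import json

def full_metadata_bytes(page_content: str, metadata: dict) -> int:
    # Include the 'text' key the library will add, plus the existing metadata.
    combined = dict(metadata, text=page_content)
    return len(json.dumps(combined, ensure_ascii=False).encode("utf-8"))

print(full_metadata_bytes("hello", {"source": "row-0"}))
```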
I also ran into this issue and as a workaround just filtered texts that are too big after chunking:
```python
texts = list(filter(lambda x: len(x) <= 40960, texts))
```
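For anyone truncating instead of dropping: len(x) counts characters, not bytes, and a naive byte slice can split a multibyte UTF-8 character. A byte-safe truncation sketch:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    # Encode, cut at the byte budget, then decode ignoring any partial
    # trailing character so the result is still valid UTF-8.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("héllo", 3))  # 'h' (1 byte) + 'é' (2 bytes) -> 'hé'
```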
+1 all of this, had to truncate the upsert documents.
Happens because of this line in the langchain.vectorstores.Pinecone from_texts class method:

```python
for j, line in enumerate(lines_batch):
    metadata[j][text_key] = line
```
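Until the behaviour changes, callers can protect their own objects by handing the vector store copies instead of the originals. A workaround sketch, where Document is a minimal stand-in for the real class so the example is self-contained:

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Document:  # minimal stand-in for langchain's Document
    page_content: str
    metadata: dict = field(default_factory=dict)

docs = [Document("some text", {"source": "row-0"})]

# Pass deep copies so the originals stay pristine.
safe_docs = [copy.deepcopy(d) for d in docs]
safe_docs[0].metadata["text"] = safe_docs[0].page_content  # simulate the library's mutation

assert "text" not in docs[0].metadata  # original untouched
```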
From what I can see in the docs, it's intentional behaviour, so that one can have their text along with the embeddings. But I think it should be optional: vector stores impose a limit, and it isn't always necessary to have the embedded text returned on retrieval. The retrievers use it for ease of access.
Hi, @andrewreece! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you raised is about the Pinecone.from_documents() function in the langchain repository. It seems that the function mutates the original docs object and adds a 'text' key to each entry's Document.metadata dict, without checking if it exceeds the maximum allowable metadata bytes per vector. Some suggestions for improvement have been made, such as making the addition of metadata['text'] optional, testing the metadata limit before making the API request, and catching all ApiException errors. Other users have also shared workarounds and suggestions for filtering or truncating texts to avoid the issue.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!