
Including text metadata in pinecone upsert.

Open SnoozingSimian opened this issue 2 years ago • 4 comments

While using langchain for pinecone upsert I was frequently running into an error discussed here. I found that langchain adds the text as metadata alongside whatever metadata the user sends here. What is the purpose of doing this? In many cases it will easily produce metadata that exceeds the limit set by pinecone.

SnoozingSimian avatar Feb 28 '23 15:02 SnoozingSimian

Even though this might still cause high cardinality for metadata, I guess this step is essential in order to actually fetch the text.
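To make the concern concrete, here is a minimal sketch, assuming the chunk text is merged into each vector's metadata under a `text` key as described in this thread. The helper names and the 40 KB figure are assumptions for illustration only; check Pinecone's docs for the actual per-vector metadata limit.

```python
import json

# Assumed per-vector metadata limit (illustrative; verify against Pinecone's docs).
ASSUMED_LIMIT_BYTES = 40 * 1024


def build_metadata(user_metadata, text, text_key="text"):
    """Merge the chunk text into the user's metadata, as this thread describes."""
    merged = dict(user_metadata)
    merged[text_key] = text
    return merged


def metadata_size_bytes(metadata):
    """Approximate the payload size as UTF-8 encoded JSON."""
    return len(json.dumps(metadata).encode("utf-8"))


small = build_metadata({"source": "doc.pdf", "page": 3}, "short chunk")
large = build_metadata({"source": "doc.pdf", "page": 3}, "x" * 50_000)

for name, md in [("small", small), ("large", large)]:
    size = metadata_size_bytes(md)
    status = "over assumed limit" if size > ASSUMED_LIMIT_BYTES else "ok"
    print(name, size, status)
```

The text stays retrievable from the metadata after a query, which is the benefit; the cost is that long chunks can dominate the metadata size.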

SnoozingSimian avatar Feb 28 '23 16:02 SnoozingSimian

TL;DR: when you create your index, specify for indexing only the metadata fields other than "text" (the field where the text will be put by default).

I also had the same question. I did find that pinecone has many examples of putting the text in the metadata (e.g. https://docs.pinecone.io/docs/semantic-text-search#creating-an-index or https://docs.pinecone.io/docs/gen-qa-openai#indexing-data-in-vector-db etc.). They do say that putting high cardinality data in the metadata for filtering is not a good idea (https://docs.pinecone.io/docs/troubleshooting#pods-are-full), but they also have a solution for storing metadata that is not filterable: selective metadata indexing (https://docs.pinecone.io/docs/manage-indexes#selective-metadata-indexing). That should be fine for us: the langchain code never creates the index, so one should be able to specify that the text metadata field should not be indexed when creating the index.

I haven't seen this pointed out in any pinecone docs, though it surely must cause issues for anyone just working from their examples. It might also be a good idea to note this somewhere in the langchain documentation specific to pinecone, that the index should be set up to not index the metadata field where the text will be stored.

This might become more of a concern if langchain ever decides to create indexes, or more deeply integrate searching the metadata - would then need to ensure metadata_config was set up correctly.
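The setup described above could look roughly like this. It is a sketch only: the field names are hypothetical, and the commented-out call assumes the 2023-era pinecone-client, where `create_index` accepted a `metadata_config` argument; newer clients may differ.

```python
# Fields you actually want to filter on (hypothetical names).
FILTERABLE_FIELDS = ["source", "page"]
TEXT_KEY = "text"  # where the chunk text is stored, per this thread


def make_metadata_config(filterable_fields, text_key=TEXT_KEY):
    """Build a selective-indexing config that leaves the text field out."""
    indexed = [f for f in filterable_fields if f != text_key]
    return {"indexed": indexed}


metadata_config = make_metadata_config(FILTERABLE_FIELDS)
print(metadata_config)

# Assumed usage with the 2023-era client (not run here; the index name
# and dimension are placeholders):
#
#   import pinecone
#   pinecone.init(api_key="...", environment="...")
#   pinecone.create_index("example-index", dimension=1536,
#                         metadata_config=metadata_config)
```

Only the fields listed under `"indexed"` are filterable; the text is still stored and returned with query results, it is just not indexed.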

(I've combined my two previous comments, posted without fully investigating, into this summary one)

kitfit-dave avatar Apr 18 '23 02:04 kitfit-dave

@kitfit-dave thanks so much for this explanation. My index of embeddings was created elsewhere and I'm not keen on rebuilding it, so I think I understand now how I can modify my index to have metadata. It may not be recommended, but it'll suit my purpose, especially as I'm just learning and testing for now.

amuhareb avatar May 02 '23 07:05 amuhareb

Hi, @SnoozingSimian! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about the inclusion of text metadata in pinecone upsert using langchain. It seems that the issue has been resolved by suggesting to specify only the metadata fields other than "text" when creating the index to avoid exceeding the permissible limit set by pinecone. This solution was found helpful by another user as it suited their purpose for learning and testing.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Sep 20 '23 16:09 dosubot[bot]