llama_index
Example using an existing Pinecone index & namespace
Hey, is it possible to use an existing Pinecone index? In this example, you create a new one & index it
https://github.com/jerryjliu/gpt_index/blob/main/examples/vector_indices/PineconeIndexDemo.ipynb
Also, it's not clear whether we can use specific namespaces in Pinecone?
Thanks for the great lib anyway 🎸
You could try using the PineconeReader (not the Pinecone Index) to load docs from an existing Pinecone index (https://gpt-index.readthedocs.io/en/latest/how_to/vector_stores.html), and then feed those into a GPT index (e.g. a GPTSimpleVectorIndex or GPTListIndex).
The PineconeReader isn't perfect though; let me know your feedback on it.
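A rough sketch of that flow (the import path and keyword names follow the docs linked above and may differ slightly across versions; the ids, texts, index name, and query vector are placeholders):
from gpt_index import GPTListIndex, PineconeReader

# connect the reader to the existing Pinecone index
reader = PineconeReader(api_key="...", environment="us-east1-gcp")

# the reader needs a map from vector ids to the text you want back,
# since Pinecone itself only returns ids, scores, and metadata
id_to_text_map = {
    "doc-1": "text for doc-1",
    "doc-2": "text for doc-2",
}

# pull the top-k matches for a query embedding out of the existing index
documents = reader.load_data(
    index_name="your_index_name",
    id_to_text_map=id_to_text_map,
    top_k=3,
    vector=[0.1, 0.2, 0.3],  # a query embedding of your index's dimension
    separate_documents=True,
)

# feed the loaded documents into a GPT index and query it
gpt_index = GPTListIndex(documents)
response = gpt_index.query("What is the future of AI?")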
@louis030195 did you have a chance to try this out?
Side question:
I'd like to insert documents one by one into my Pinecone index, but all I see in the examples is SimpleDirectoryReader(...).load_data(). My source data isn't a directory of text files; it's a string that comes from a web (POST) request.
Here's my current code (simplified):
pinecone.init(api_key="...", environment="...")
pinecone_index = pinecone.Index("...")
index = GPTPineconeIndex([], pinecone_index=pinecone_index)
newtext = """
removed to keep the code short
"""
index.insert(Document(text=newtext))
Is what I'm trying to do even possible?
Should I use Pinecone directly to create/update the index, and use gpt_index for querying exclusively?
It may be obvious but I'm very new to all of this, sorry if it sounds dumb.
See, I can make it work like this:
import pinecone
from gpt_index import GPTPineconeIndex, Document

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)
index = GPTPineconeIndex(documents=[], pinecone_index=pinecone.Index('...'))
index.load_from_disk('../my_index.json')
index.insert(document=Document("document contents"))
index.save_to_disk('../my_index.json')
response = index.query("My question ?", verbose=True)
But I'm relying on a JSON file, the whole point of using Pinecone for me was to keep the data somewhere else, without touching the local filesystem. Is that possible?
from gpt_index import Document, GPTListIndex

docs = []
for m in r.matches:
    docs.append(Document(
        text=m.metadata["text"],
        doc_id=m.id,
        embedding=m.vector,
    ))

gpt_index = GPTListIndex(docs)
gpt_index.query("What is the future of AI?")
r is the result of a Pinecone query. Unfortunately, I don't know how to fetch all documents.
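For context, a minimal sketch of the kind of Pinecone query that could produce r above, using standard pinecone-client options (the index name, dimension, and top_k are placeholders; depending on the client version, the raw vectors may come back as m.values rather than m.vector):
import pinecone

pinecone.init(api_key="...", environment="us-east1-gcp")
pinecone_index = pinecone.Index("your_index_name")

# the query embedding must match the index dimension
query_embedding = [0.0] * 1536

# ask for metadata (to recover the stored text) and the raw vectors
r = pinecone_index.query(
    vector=query_embedding,
    top_k=100,
    include_metadata=True,
    include_values=True,
)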
It's a feature not a bug: ended up using gpt index on a subset of my pinecone index, potentially interesting
This works for me. This allows you to do both - use an existing pinecone index and namespace.
import os

import pinecone
from gpt_index import GPTPineconeIndex
from gpt_index.data_structs.data_structs import PineconeIndexStruct
from IPython.display import Markdown, display

api_key = os.getenv("PINECONE_API_KEY")
pinecone.init(api_key=api_key, environment="us-east1-gcp")
# debug - verify connection
# print(pinecone.list_indexes())
# replace with your index and namespace
index_name = "your_index_name"
namespace = "your_namespace"
index = pinecone.Index(index_name)
# debug - verify index stats
# print(index.describe_index_stats())
# passing index_struct bypasses the 'creation' of index and sets it up for use
index = GPTPineconeIndex(pinecone_index=index,
                         index_struct=PineconeIndexStruct())
# (optional) required only if you want to query a specific namespace
query_kwargs = {
    "pinecone_kwargs": {"namespace": namespace}
}
response = index.query("What did the author do growing up?", verbose=True, **query_kwargs)
display(Markdown(f"<b>{response}</b>"))
Just tried it and it works. I believe index_struct=PineconeIndexStruct() was the missing piece. Thank you so much!
This almost works for me 😛 The problem is that my text isn't stored under the "text" metadata key.
It's this line: https://github.com/jerryjliu/gpt_index/blob/cab30c4aec7b94c6d12a6efe3fb6b91a605f3869/gpt_index/indices/query/vector_store/pinecone.py#L79
Maybe there could be an option to customize which metadata key the text is read from, and to handle the case where the text isn't stored in the Pinecone metadata at all?
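Until that's configurable, one workaround (a sketch along the lines of the GPTListIndex snippet earlier in this thread; "body" is a hypothetical metadata key standing in for wherever your text actually lives) is to query Pinecone directly and build the Documents yourself:
from gpt_index import Document, GPTListIndex

# r is a Pinecone query result obtained with include_metadata=True
docs = [
    Document(
        text=m.metadata["body"],  # read from your own metadata key
        doc_id=m.id,
    )
    for m in r.matches
]

gpt_index = GPTListIndex(docs)
response = gpt_index.query("What did the author do growing up?")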
Hey @bouiboui, sorry I missed this. Just a heads up that by using the PineconeIndex, your data is stored in Pinecone, not in the .json file. In fact, doing save_to_disk and load_from_disk on the PineconeIndex doesn't really do anything, and doing index.insert should work!
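In other words, a minimal sketch of the disk-free flow (the index and environment names are placeholders):
import pinecone
from gpt_index import GPTPineconeIndex, Document

pinecone.init(api_key="...", environment="us-east1-gcp")
pinecone_index = pinecone.Index("your_index_name")

# the data lives in Pinecone, so no save_to_disk / load_from_disk is needed
index = GPTPineconeIndex([], pinecone_index=pinecone_index)

# insert a document built from any string, e.g. the body of a POST request
index.insert(Document(text="document contents"))

response = index.query("My question?", verbose=True)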
going to close for now, let me know if anything specific pops up
Hey @mahpat16, does this still work for you? I need to use namespaces from now on, and it doesn't seem to work.
query_kwargs = {
    "pinecone_kwargs": {
        # "namespace": namespace
    }
}
response = index.query(message, verbose=True, **query_kwargs)
This works great; it generates a natural language response, like "I can't answer with the documents provided". But if I uncomment the # "namespace": namespace line, it either returns something like "Empty Response" or it crashes:
text = match.metadata["text"]
TypeError: 'NoneType' object is not subscriptable
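That traceback suggests the matches coming back from the namespace have no metadata at all (match.metadata is None), so gpt_index can't find the "text" key it expects. One way to check, sketched with standard pinecone-client calls (the dimension and names are placeholders):
import pinecone

pinecone.init(api_key="...", environment="us-east1-gcp")
pinecone_index = pinecone.Index("your_index_name")

# sample a few vectors from the namespace and inspect their metadata
r = pinecone_index.query(
    vector=[0.0] * 1536,
    top_k=5,
    namespace="your_namespace",
    include_metadata=True,
)
for m in r.matches:
    print(m.id, m.metadata)  # expect a dict containing a "text" key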
@bouiboui I'm facing the same/a similar problem trying to get a multi-namespace index going.
- If I attempt to pass pinecone_kwargs in the index function, records are added but no namespace is applied.
- If I query with the namespace option I get "Empty Response" back
- Commenting out the namespace option means I get results based on the records that are in the index without a namespace
Resolved
You now pass pinecone_kwargs when creating the index rather than in the query/index methods:
index = GPTPineconeIndex(pinecone_index=index,
                         index_struct=PineconeIndexStruct(),
                         pinecone_kwargs={"namespace": namespace})
index.insert(document=Document(text, doc_id=doc_id))
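With the namespace fixed at construction time, a query (in this sketch) no longer needs per-call pinecone_kwargs:
# queries (and inserts) now go to the configured namespace automatically
response = index.query("What did the author do growing up?", verbose=True)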
It looks like there's been a refactor of the IndexStruct code and PineconeIndexStruct no longer exists, so this method doesn't work on the latest version of LlamaIndex.
hopefully this link could help: https://discord.com/channels/1059199217496772688/1059200010622873741/1079094097916211270
@jerryjliu I've attempted to make a branch that uses what's suggested there, but without success.
I've resolved this temporarily by pinning to gpt-index==0.4.5 in my requirements.txt
Happy to contribute to this feature once the refactoring has stabilized!