Example using an existing Pinecone index & namespace

Open louis030195 opened this issue 2 years ago • 9 comments

Hey, is it possible to use an existing Pinecone index? This example creates a new one and indexes into it:

https://github.com/jerryjliu/gpt_index/blob/main/examples/vector_indices/PineconeIndexDemo.ipynb

Also, it's not clear whether we can use specific namespaces in Pinecone.

Thanks for the great lib anyway 🎸

louis030195 avatar Jan 20 '23 15:01 louis030195

You could try using the PineconeReader (not the Pinecone Index) to load docs from an existing Pinecone index (https://gpt-index.readthedocs.io/en/latest/how_to/vector_stores.html), and then feed those into a GPT index (e.g. a GPTSimpleVectorIndex or GPTListIndex).

The PineconeReader isn't perfect though, so let me know your feedback on it.

jerryjliu avatar Jan 22 '23 00:01 jerryjliu

@louis030195 did you have a chance to try this out?

jerryjliu avatar Jan 22 '23 19:01 jerryjliu

Side question: I'd like to insert documents one by one to my Pinecone index, but all I see in the examples is SimpleDirectoryReader(...).load_data(). But my source data is not a directory of text files, it's a string that comes from a web (POST) request.

Here's my current code (simplified):

import pinecone
from gpt_index import Document, GPTPineconeIndex

pinecone.init(api_key="...", environment="...")
pinecone_index = pinecone.Index("...")
index = GPTPineconeIndex([], pinecone_index=pinecone_index)

newtext = """
removed to keep the code short
"""

index.insert(Document(text=newtext))

Is what I'm trying to do even possible? Should I use Pinecone directly to create/update the index, and use gpt_index for querying exclusively?

It may be obvious but I'm very new to all of this, sorry if it sounds dumb.

bouiboui avatar Jan 25 '23 10:01 bouiboui

See, I can make it work like this:

import pinecone
from gpt_index import Document, GPTPineconeIndex

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)
index = GPTPineconeIndex(documents=[], pinecone_index=pinecone.Index('...'))

index.load_from_disk('../my_index.json')

index.insert(document=Document("document contents"))

index.save_to_disk('../my_index.json')

response = index.query("My question ?", verbose=True)

But I'm relying on a JSON file, the whole point of using Pinecone for me was to keep the data somewhere else, without touching the local filesystem. Is that possible?

bouiboui avatar Jan 25 '23 16:01 bouiboui

from gpt_index import Document, GPTListIndex
docs = []
for m in r.matches:
    docs.append(Document(
        text=m.metadata["text"],
        doc_id=m.id,
        embedding=m.vector,
    ))
gpt_index = GPTListIndex(docs)
gpt_index.query("What is the future of AI?")

Here r is the result of a Pinecone query. Unfortunately, I don't know how to fetch all documents.

It's a feature, not a bug: I ended up using gpt_index on a subset of my Pinecone index, which is potentially interesting.

louis030195 avatar Jan 25 '23 18:01 louis030195

This works for me. This allows you to do both - use an existing pinecone index and namespace.

import os

import pinecone
from gpt_index import GPTPineconeIndex
from gpt_index.data_structs.data_structs import PineconeIndexStruct
from IPython.display import Markdown, display

api_key = os.getenv("PINECONE_API_KEY")

pinecone.init(api_key=api_key, environment="us-east1-gcp")

# debug - verify connection
# print(pinecone.list_indexes())

# replace with your index and namespace
index_name = "your_index_name"
namespace = "your_namespace"

index = pinecone.Index(index_name)

# debug - verify index stats
# print(index.describe_index_stats())

# passing index_struct bypasses the 'creation' of index and sets it up for use
index = GPTPineconeIndex(pinecone_index=index,
                         index_struct=PineconeIndexStruct())

# (optional) required only if you want to query a specific namespace
query_kwargs = {
    "pinecone_kwargs": {"namespace": namespace}
}
response = index.query("What did the author do growing up?", verbose=True, **query_kwargs)
display(Markdown(f"<b>{response}</b>"))

mahpat16 avatar Jan 26 '23 06:01 mahpat16

Just tried it and it works. I believe index_struct=PineconeIndexStruct() was the missing piece. Thank you so much!

bouiboui avatar Jan 26 '23 07:01 bouiboui

This almost works for me 😛 The problem is that my text isn't stored under the "text" key of the metadata, which is where the query code reads it from:

https://github.com/jerryjliu/gpt_index/blob/cab30c4aec7b94c6d12a6efe3fb6b91a605f3869/gpt_index/indices/query/vector_store/pinecone.py#L79

Could the metadata key be made configurable? It would also help to handle the case where the text isn't stored in the Pinecone index metadata at all.
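
A minimal sketch of the customization being asked for, not the library's actual code: a helper that reads the text from a configurable metadata key and tolerates matches that have no metadata. The name text_from_match, the text_key parameter, and the stand-in match objects are assumptions made for illustration.

```python
from types import SimpleNamespace
from typing import Any, Optional


def text_from_match(match: Any, text_key: str = "text") -> Optional[str]:
    """Return the text stored under `text_key` in a match's metadata, or None."""
    metadata = getattr(match, "metadata", None)
    if not metadata:
        # No metadata at all, e.g. vectors that were upserted without it.
        return None
    return metadata.get(text_key)


# Stand-in objects shaped like query matches (id + metadata); hypothetical data.
matches = [
    SimpleNamespace(id="a", metadata={"content": "hello"}),
    SimpleNamespace(id="b", metadata=None),
]
texts = [text_from_match(m, text_key="content") for m in matches]
print(texts)
```

Returning None instead of raising lets the caller decide whether to skip the match or look the text up somewhere else by id.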

louis030195 avatar Jan 26 '23 09:01 louis030195

Hey @bouiboui, sorry I missed this. Just a heads up: when you use the PineconeIndex, your data is stored in Pinecone, not in the .json file. In fact, calling save_to_disk and load_from_disk on the PineconeIndex doesn't really do anything. Calling index.insert should work!
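
For the "string from a POST request" case, the insert step could be wrapped roughly as below. This is only a sketch: insert_text is a hypothetical helper, and the index and the document constructor are passed in as parameters (in the thread these would be the GPTPineconeIndex and gpt_index's Document), so the example runs here against a stand-in index.

```python
from typing import Any, Callable, List


def insert_text(index: Any, text: str, make_document: Callable[[str], Any]) -> bool:
    """Insert one piece of text into `index`; return False if it was empty."""
    cleaned = text.strip()
    if not cleaned:
        # Nothing to embed, so skip the call entirely.
        return False
    index.insert(make_document(cleaned))
    return True


class FakeIndex:
    """Stand-in that only records what was inserted (illustration only)."""

    def __init__(self) -> None:
        self.inserted: List[Any] = []

    def insert(self, document: Any) -> None:
        self.inserted.append(document)


fake = FakeIndex()
ok = insert_text(fake, "  body of the POST request  ", make_document=str)
skipped = insert_text(fake, "   ", make_document=str)
```

With the real library, make_document would be Document and no save_to_disk call is involved.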

jerryjliu avatar Jan 27 '23 22:01 jerryjliu

going to close for now, let me know if anything specific pops up

jerryjliu avatar Feb 01 '23 01:02 jerryjliu

Hey @mahpat16, does this still work for you? I need to use namespaces from now on, and it doesn't seem to work.

query_kwargs = {
    "pinecone_kwargs": {
        # "namespace": namespace
    }
}
response = index.query(message, verbose=True, **query_kwargs)

works great: it generates a natural-language response, like "I can't answer with the documents provided". But if I uncomment the # "namespace": namespace line, it either returns something like "Empty Response" or it crashes:

text = match.metadata["text"]
TypeError: 'NoneType' object is not subscriptable
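
One way to narrow this down, sketched under assumptions: check whether the namespace actually contains vectors before querying it. The helper below works on a plain dict shaped like the output of describe_index_stats() (a "namespaces" mapping with a "vector_count" per namespace); the helper name and the sample numbers are made up, and converting the real response to a dict first is assumed.

```python
from typing import Any, Dict


def namespace_vector_count(stats: Dict[str, Any], namespace: str) -> int:
    """Return how many vectors `namespace` holds according to `stats` (0 if absent)."""
    namespaces = stats.get("namespaces") or {}
    summary = namespaces.get(namespace) or {}
    return int(summary.get("vector_count", 0))


# Hypothetical stats: everything was upserted into the default ("") namespace.
example_stats = {
    "dimension": 1536,
    "namespaces": {"": {"vector_count": 120}},
}

default_count = namespace_vector_count(example_stats, "")
named_count = namespace_vector_count(example_stats, "your_namespace")
```

A count of 0 for the named namespace would be consistent with an "Empty Response", since the query has nothing to match there.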

bouiboui avatar Feb 11 '23 23:02 bouiboui

@bouiboui I'm facing the same/a similar problem trying to get a multi-namespace index going.

  • If I attempt to pass pinecone_kwargs in the index function, records are added but no namespace is applied.
  • If I query with the namespace option I get "Empty Response" back
  • Commenting out the namespace option means I get results based on the records that are in the index without a namespace

stefl avatar Feb 16 '23 10:02 stefl

Resolved

You now pass pinecone_kwargs when creating the index rather than in the query/index methods:

index = GPTPineconeIndex(pinecone_index=index,
                         index_struct=PineconeIndexStruct(),
                         pinecone_kwargs={"namespace": namespace})
index.insert(document=Document(text, doc_id=doc_id))
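
Since the namespace is now fixed when the index object is created, a multi-namespace setup needs one index object per namespace. A rough sketch of that idea follows; NamespacedIndexes and build_index are hypothetical names, and the stand-in factory returns a dict so the example runs without Pinecone.

```python
from typing import Any, Callable, Dict


class NamespacedIndexes:
    """Lazily builds and caches one index wrapper per namespace."""

    def __init__(self, build_index: Callable[[str], Any]) -> None:
        self._build_index = build_index
        self._cache: Dict[str, Any] = {}

    def get(self, namespace: str) -> Any:
        # Build the wrapper on first use, then reuse it for later calls.
        if namespace not in self._cache:
            self._cache[namespace] = self._build_index(namespace)
        return self._cache[namespace]


# Stand-in factory; in the thread's setup this would call GPTPineconeIndex
# with pinecone_kwargs={"namespace": ns} as shown above.
indexes = NamespacedIndexes(lambda ns: {"namespace": ns})
first = indexes.get("customer_a")
second = indexes.get("customer_a")
other = indexes.get("customer_b")
```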

stefl avatar Feb 18 '23 09:02 stefl

It looks like the IndexStruct code has been refactored and PineconeIndexStruct no longer exists, so this method no longer works on the latest version of LlamaIndex.

stefl avatar Feb 26 '23 09:02 stefl

hopefully this link could help: https://discord.com/channels/1059199217496772688/1059200010622873741/1079094097916211270

jerryjliu avatar Feb 26 '23 09:02 jerryjliu

@jerryjliu I tried to make a branch that uses what's suggested there, but without success.

I've resolved this temporarily by pinning to gpt-index==0.4.5 in my requirements.txt

Happy to contribute to this feature once the refactoring has stabilized!
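
If the code depends on that pin, a small startup check can make a mismatch obvious. This is a sketch using the standard library's importlib.metadata; the helper names are made up, and only the package name and version come from the pin above.

```python
from importlib import metadata
from typing import Optional

PINNED_VERSION = "0.4.5"  # the version pinned in requirements.txt above


def installed_version(dist_name: str = "gpt-index") -> Optional[str]:
    """Return the installed version of `dist_name`, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pinned: str = PINNED_VERSION) -> bool:
    """True only when a version is installed and equals the pin exactly."""
    return installed is not None and installed == pinned


if not matches_pin(installed_version()):
    print(f"Warning: expected gpt-index=={PINNED_VERSION}")
```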

stefl avatar Feb 26 '23 17:02 stefl