llama_index
error creating GPTListIndex
Hi everyone, I was trying to replicate this example with my own data, but it's failing to create the GPTListIndex. For the embeddings I'm using Azure:
```python
from gpt_index import GPTListIndex
from gpt_index import download_loader, GPTSimpleVectorIndex
from langchain.llms import AzureOpenAI
from llama_index import LangchainEmbedding
from langchain.embeddings import OpenAIEmbeddings
from llama_index import (
    GPTSimpleVectorIndex,
    SimpleDirectoryReader,
    LLMPredictor,
    PromptHelper
)
from pathlib import Path

UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

topics = ['topic1', 'topic2', 'topic3', 'topic4']
loader = UnstructuredReader()
doc_set = {}
all_docs = []
for topic in topics:
    topic_docs = loader.load_data(file=Path(f'../data/{topic}.txt'), split_documents=False)
    # insert topic metadata into each document
    for d in topic_docs:
        d.extra_info = {"topic": topic}
    doc_set[topic] = topic_docs
    all_docs.extend(topic_docs)

llm_predictor = LLMPredictor(llm=AzureOpenAI(deployment_name="text-ada-001", model_name="text-ada-001", temperature=0))
embedding_llm = LangchainEmbedding(OpenAIEmbeddings(
    document_model_name="text-search-ada-doc-001",
    query_model_name="text-search-ada-query-001"
))

index_set = {}
for topic in topics:
    cur_index = GPTSimpleVectorIndex(doc_set[topic], chunk_size_limit=512, llm_predictor=llm_predictor, embed_model=embedding_llm)
    index_set[topic] = cur_index
    cur_index.save_to_disk(f'../indices/index_{topic}.json')

list_index = GPTListIndex([index_set[y] for y in topics], llm_predictor=llm_predictor, embed_model=embedding_llm)
```
I'm getting the following error:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[33], line 3
      1 # define a list index over the vector indices
      2 # allows us to synthesize information across each index
----> 3 list_index = GPTListIndex([index_set[y] for y in topics])

File ~/miniconda3/envs/chatbot/lib/python3.8/site-packages/gpt_index/indices/list/base.py:57, in GPTListIndex.__init__(self, documents, index_struct, text_qa_template, llm_predictor, text_splitter, **kwargs)
     55 """Initialize params."""
     56 self.text_qa_template = text_qa_template or DEFAULT_TEXT_QA_PROMPT
---> 57 super().__init__(
     58     documents=documents,
     59     index_struct=index_struct,
     60     llm_predictor=llm_predictor,
     61     text_splitter=text_splitter,
     62     **kwargs,
     63 )

File ~/miniconda3/envs/chatbot/lib/python3.8/site-packages/gpt_index/indices/base.py:109, in BaseGPTIndex.__init__(self, documents, index_struct, llm_predictor, embed_model, docstore, index_registry, prompt_helper, text_splitter, chunk_size_limit, include_extra_info, llama_logger)
    107 else:
    108     documents = cast(Sequence[DOCUMENTS_INPUT], documents)
--> 109 documents = self._process_documents(
    110     documents, self._docstore, self._index_registry
    111 )
    112 self._validate_documents(documents)
    113 # TODO: introduce document store outside __init__ function

File ~/miniconda3/envs/chatbot/lib/python3.8/site-packages/gpt_index/indices/base.py:187, in BaseGPTIndex._process_documents(self, documents, docstore, index_registry)
    185     results.append(doc)
    186 else:
--> 187     raise ValueError(f"Invalid document type: {type(doc)}.")
    188 return cast(List[BaseDocument], results)

ValueError: Invalid document type: .
```
I'm passing GPTListIndex a list of `llama_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex` objects.
Environment: Python 3.8, llama_index 0.4.36, gpt_index 0.4.36, langchain 0.0.121.
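(Aside: since both `gpt_index` and `llama_index` are installed and the snippet mixes imports from the two, one *possible* source of an "Invalid document type" error is that an `isinstance` check inside one package rejects classes defined in the other. This is speculation, not a confirmed diagnosis; the sketch below illustrates the general Python pitfall with hypothetical stand-in modules `pkg_a`/`pkg_b`, not the real libraries.)

```python
import sys
import types

# Hypothetical stand-ins: simulate one library published under two package
# names, each defining its own (distinct) document base class.
pkg_a = types.ModuleType("pkg_a")   # stands in for gpt_index
pkg_b = types.ModuleType("pkg_b")   # stands in for llama_index

class DocA:
    """Stand-in for pkg_a's document base class."""

class DocB:
    """Stand-in for pkg_b's document class."""

pkg_a.BaseDocument = DocA
pkg_b.Document = DocB
sys.modules["pkg_a"] = pkg_a
sys.modules["pkg_b"] = pkg_b

doc = pkg_b.Document()
# An isinstance check inside pkg_a rejects pkg_b's class, even though the
# two packages are nominally "the same" library under two names.
print(isinstance(doc, pkg_a.BaseDocument))  # False -> "Invalid document type"
```

In other words, classes are compared by identity, not by name, so importing the "same" class from two package aliases defeats `isinstance` checks.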
Looking forward to receiving your comments.
Hey @mrcmoresi, it's hard to help debug without more information.
The ValueError seems to indicate the document is None.
Some follow-up questions:
- Did you load the `GPTSimpleVectorIndex` from disk after saving it?
- Could you try using the `ComposableGraph` object to handle save/load instead?
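For reference, a minimal sketch of what the `ComposableGraph` save/load flow looked like in the 0.4.x-era tutorials. This is untested against the versions above; the import path and method names (`build_from_index`, `save_to_disk`, `load_from_disk`) are assumptions based on that era's docs and may differ between releases:

```python
from gpt_index.composability import ComposableGraph

# Wrap the list index (which composes the per-topic vector indices) in a
# graph, then persist and reload the whole structure in one step.
graph = ComposableGraph.build_from_index(list_index)
graph.save_to_disk('../indices/topic_graph.json')

graph = ComposableGraph.load_from_disk(
    '../indices/topic_graph.json', llm_predictor=llm_predictor
)
```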
Hi @Disiok, thanks for your answer.
- I tested two scenarios: using the GPTSimpleVectorIndex directly from memory, and loading it from disk. Both ended in the same error.
- I will try ComposableGraph.
Going to close this issue for now. The Chatbot SEC tutorial should be updated to work with the latest version of llama-index.
Feel free to re-open if this is still an issue!