
error creating GPTListIndex


Hi everyone, I was trying to replicate this example with my own data, but it fails when creating the GPTListIndex. For the embeddings I'm using Azure OpenAI.

from gpt_index import GPTListIndex, download_loader
from langchain.llms import AzureOpenAI
from llama_index import LangchainEmbedding
from langchain.embeddings import OpenAIEmbeddings
from llama_index import (
    GPTSimpleVectorIndex,
    SimpleDirectoryReader, 
    LLMPredictor,
    PromptHelper
)
from pathlib import Path

UnstructuredReader = download_loader("UnstructuredReader", refresh_cache=True)

topics = ['topic1', 'topic2', 'topic3', 'topic4']

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for topic in topics:
    topic_docs = loader.load_data(file=Path(f'../data/{topic}.txt'), split_documents=False)
    # insert topic metadata into each document
    for d in topic_docs:
        d.extra_info = {"topic": topic}
    doc_set[topic] = topic_docs
    all_docs.extend(topic_docs)




llm_predictor = LLMPredictor(llm = AzureOpenAI(deployment_name="text-ada-001", model_name="text-ada-001", temperature=0))
embedding_llm = LangchainEmbedding(OpenAIEmbeddings(
    document_model_name="text-search-ada-doc-001",
    query_model_name="text-search-ada-query-001"
))

index_set = {}
for topic in topics:
    cur_index = GPTSimpleVectorIndex(doc_set[topic], chunk_size_limit=512, llm_predictor=llm_predictor, embed_model=embedding_llm)
    index_set[topic] = cur_index
    cur_index.save_to_disk(f'../indices/index_{topic}.json')


list_index = GPTListIndex([index_set[y] for y in topics], llm_predictor=llm_predictor, embed_model=embedding_llm)

I'm getting the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[33], line 3
      1 # define a list index over the vector indices
      2 # allows us to synthesize information across each index
----> 3 list_index = GPTListIndex([index_set[y] for y in topics])

File ~/miniconda3/envs/chatbot/lib/python3.8/site-packages/gpt_index/indices/list/base.py:57, in GPTListIndex.__init__(self, documents, index_struct, text_qa_template, llm_predictor, text_splitter, **kwargs)
     55 """Initialize params."""
     56 self.text_qa_template = text_qa_template or DEFAULT_TEXT_QA_PROMPT
---> 57 super().__init__(
     58     documents=documents,
     59     index_struct=index_struct,
     60     llm_predictor=llm_predictor,
     61     text_splitter=text_splitter,
     62     **kwargs,
     63 )

File ~/miniconda3/envs/chatbot/lib/python3.8/site-packages/gpt_index/indices/base.py:109, in BaseGPTIndex.__init__(self, documents, index_struct, llm_predictor, embed_model, docstore, index_registry, prompt_helper, text_splitter, chunk_size_limit, include_extra_info, llama_logger)
    107 else:
    108     documents = cast(Sequence[DOCUMENTS_INPUT], documents)
--> 109     documents = self._process_documents(
    110         documents, self._docstore, self._index_registry
    111     )
    112     self._validate_documents(documents)
    113     # TODO: introduce document store outside __init__ function

File ~/miniconda3/envs/chatbot/lib/python3.8/site-packages/gpt_index/indices/base.py:187, in BaseGPTIndex._process_documents(self, documents, docstore, index_registry)
    185         results.append(doc)
    186     else:
--> 187         raise ValueError(f"Invalid document type: {type(doc)}.")
    188 return cast(List[BaseDocument], results)

ValueError: Invalid document type: .

I'm passing a list of llama_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex objects to the GPTListIndex.

Environment: Python 3.8, llama_index 0.4.36, gpt_index 0.4.36, langchain 0.0.121

Looking forward to your comments.

mrcmoresi · Mar 24 '23

Hey @mrcmoresi, hard to help debug without more information.

The ValueError seems to indicate the document is None.

Some follow-up questions:

  1. Did you load the GPTSimpleVectorIndex from disk after saving it?
  2. Could you try using the ComposableGraph object to handle save/load instead? (A rough sketch follows below.)
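
For reference, here is what both suggestions could look like on the 0.4.x line. The method names (load_from_disk, set_text, build_from_index) and the kwargs forwarded to the sub-indices are taken from the composability docs of that era and are assumptions on my part, not verified against 0.4.36:

# reload each per-topic index from disk with the same predictor/embed model
# (assumes load_from_disk forwards these kwargs to the index constructor)
from gpt_index import GPTSimpleVectorIndex, GPTListIndex
from gpt_index.composability import ComposableGraph

index_set = {}
for topic in topics:
    index_set[topic] = GPTSimpleVectorIndex.load_from_disk(
        f'../indices/index_{topic}.json',
        llm_predictor=llm_predictor,
        embed_model=embedding_llm,
    )
    # give each sub-index a short summary so the outer index can route to it
    index_set[topic].set_text(f"Documents about {topic}")

# compose a list index over the sub-indices and let ComposableGraph handle save/load
list_index = GPTListIndex(
    [index_set[topic] for topic in topics],
    llm_predictor=llm_predictor,
    embed_model=embedding_llm,
)
graph = ComposableGraph.build_from_index(list_index)
graph.save_to_disk('../indices/graph.json')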

Disiok · Mar 25 '23

Hi @Disiok, thanks for your answer.

  1. I tested two scenarios, using the GPTSimpleVectorIndex directly from memory and loading it from disk; both ended with the same error.

  2. I will try that.

mrcmoresi · Mar 27 '23

Going to close this issue for now. The ChatbotSEC tutorial should be updated to work with the latest version of llama-index.
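
For anyone finding this later: on more recent releases, sub-indices are composed through ComposableGraph.from_indices rather than by passing them into GPTListIndex directly. A minimal sketch, assuming the API of the 0.5/0.6 line (exact import paths and class names have since changed again, e.g. GPTListIndex was later renamed):

from llama_index import GPTListIndex
from llama_index.indices.composability import ComposableGraph

# index_set holds the per-topic vector indices built earlier;
# index_summaries tells the root list index what each child covers
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [index_set[topic] for topic in topics],
    index_summaries=[f"Documents about {topic}" for topic in topics],
)

# on 0.6+ the composed graph is queried through a query engine
query_engine = graph.as_query_engine()
response = query_engine.query("Compare topic1 and topic2")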

Feel free to re-open if this is still an issue!

logan-markewich · Jun 06 '23