
Index list out of range with too many PDFs in source_documents?

Open PierrickLozach opened this issue 1 year ago • 5 comments

First, thanks for your work. It's amazing!

Running on a Mac M1, when I add more than 7-8 PDFs to the source_documents folder, I get this error:

% python ingest.py

llama.cpp: loading model from models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1000
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size  = 1000.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
Traceback (most recent call last):
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 36, in <module>
    main()
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 31, in main
    db = Chroma.from_documents(texts, llama, persist_directory=PERSIST_DIRECTORY, client_settings=CHROMA_SETTINGS)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 413, in from_documents
    return cls.from_texts(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 381, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 159, in add_texts
    self._collection.add(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 97, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 340, in _validate_embedding_set
    ids = validate_ids(maybe_cast_one_to_many(ids))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/chromadb/api/types.py", line 75, in maybe_cast_one_to_many
    if isinstance(target[0], (int, float)):
IndexError: list index out of range

Is this due to a memory/file limit?

PierrickLozach avatar May 12 '23 09:05 PierrickLozach

Not sure. Maybe you can run a couple of tests with fewer docs to check whether that's the case. Thanks for your help!

imartinez avatar May 12 '23 09:05 imartinez

That's what I did. It works with 7-8 PDFs but gives this error when I add more.

PierrickLozach avatar May 12 '23 10:05 PierrickLozach

Ok great, thanks for sharing. Does it fail right away? That's interesting; I'll need to look into it. Please share your findings!

imartinez avatar May 12 '23 10:05 imartinez

It fails right at the beginning; see the output above. Let me know if you need anything else.

PierrickLozach avatar May 12 '23 10:05 PierrickLozach

@imartinez While digging into this issue, I just realized that ingest.py currently loads only a single file, no matter how many documents are in the source_documents directory. The loop picks a document loader for each file, but load() is only called once, after the loop, so only the very last file's content is ingested:

import os

from langchain.document_loaders import CSVLoader, PDFMinerLoader, TextLoader

# Load document and split in chunks
for root, dirs, files in os.walk("source_documents"):
    for file in files:
        if file.endswith(".txt"):
            loader = TextLoader(os.path.join(root, file), encoding="utf8")
        elif file.endswith(".pdf"):
            loader = PDFMinerLoader(os.path.join(root, file))
        elif file.endswith(".csv"):
            loader = CSVLoader(os.path.join(root, file))
# `loader` is rebound on every iteration, so only the last one survives
documents = loader.load()  # loads only the last file's content!

I am working on a fix right now.
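Roughly what I have in mind, as a sketch (not the final patch; it uses the same langchain loader classes ingest.py already imports): keep a running list and call load() for every file, instead of once after the loop.

import os

from langchain.document_loaders import CSVLoader, PDFMinerLoader, TextLoader

# Load every supported document and accumulate the results
documents = []
for root, dirs, files in os.walk("source_documents"):
    for file in files:
        path = os.path.join(root, file)
        if file.endswith(".txt"):
            loader = TextLoader(path, encoding="utf8")
        elif file.endswith(".pdf"):
            loader = PDFMinerLoader(path)
        elif file.endswith(".csv"):
            loader = CSVLoader(path)
        else:
            continue  # skip unsupported file types
        # load() returns a list of Document objects; extend, don't overwrite
        documents.extend(loader.load())

With this, documents holds the content of every file under source_documents, not just the last one visited by os.walk.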

andreakiro avatar May 12 '23 12:05 andreakiro

I pulled the repo today and still get the same issue. I am not even uploading multiple files; I am just testing this on Colab with state_of_the_union.txt:

Loading documents from source_documents
Loaded 0 documents from source_documents
Split into 0 chunks of text (max. 500 tokens each)
llama.cpp: loading model from llm/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1000
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size  = 1000.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
Traceback (most recent call last):
  File "/content/privateGPT/ingest.py", line 96, in <module>
    main()
  File "/content/privateGPT/ingest.py", line 90, in main
    db = Chroma.from_documents(texts, llama, persist_directory=persist_directory, client_settings=CHROMA_SETTINGS)
  File "/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py", line 413, in from_documents
    return cls.from_texts(
  File "/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py", line 381, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/usr/local/lib/python3.10/dist-packages/langchain/vectorstores/chroma.py", line 159, in add_texts
    self._collection.add(
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py", line 97, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/models/Collection.py", line 340, in _validate_embedding_set
    ids = validate_ids(maybe_cast_one_to_many(ids))
  File "/usr/local/lib/python3.10/dist-packages/chromadb/api/types.py", line 75, in maybe_cast_one_to_many
    if isinstance(target[0], (int, float)):
IndexError: list index out of range
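Note the "Loaded 0 documents from source_documents" line: nothing is ingested, so Chroma ends up passing an empty ids list down to chromadb. A minimal sketch of why that produces this exact error, based on the maybe_cast_one_to_many line quoted in the traceback:

# The check from chromadb/api/types.py indexes target[0] without first
# verifying the list is non-empty, so an empty ids list raises IndexError.
target = []  # what the ids list looks like when zero documents are ingested
try:
    if isinstance(target[0], (int, float)):
        pass
except IndexError as err:
    print(err)  # list index out of range

So the IndexError is a symptom; the underlying problem is that zero documents were loaded in the first place.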

rkrkrediffmail avatar May 17 '23 02:05 rkrkrediffmail