Index list out of range with too many PDFs in source_documents?
First, thanks for your work. It's amazing!
Running on a Mac M1, when I put more than 7-8 PDFs in the source_documents
folder, I get this error:
% python ingest.py
llama.cpp: loading model from models/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1000
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size = 1000.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
Traceback (most recent call last):
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 36, in <module>
    main()
  File "/Users/pierrick.lozach/Documents/privateGPT/ingest.py", line 31, in main
    db = Chroma.from_documents(texts, llama, persist_directory=PERSIST_DIRECTORY, client_settings=CHROMA_SETTINGS)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 413, in from_documents
    return cls.from_texts(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 381, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/vectorstores/chroma.py", line 159, in add_texts
    self._collection.add(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 97, in add
    ids, embeddings, metadatas, documents = self._validate_embedding_set(
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 340, in _validate_embedding_set
    ids = validate_ids(maybe_cast_one_to_many(ids))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/chromadb/api/types.py", line 75, in maybe_cast_one_to_many
    if isinstance(target[0], (int, float)):
IndexError: list index out of range
Is this due to a memory/file limit?
Not sure; maybe you could run a couple of tests with fewer docs to check whether that's the case. Thanks for your help!
That's what I did. It works with 7-8 PDFs but then gives this error when I add more.
Ok great, thanks for sharing. Does it fail right away? Interesting; I'll need to look into it. Please share your findings!
It fails right at the beginning; see the output above. Let me know if you need anything else.
@imartinez While digging into this issue, I just realized that the ingest
script currently loads only a single file, regardless of how many documents are in the source_documents
directory. The loop reassigns the loader for each file, but load() is only called once, after the loop, so only the very last file's content is ingested:
# Load document and split in chunks
for root, dirs, files in os.walk("source_documents"):
    for file in files:
        if file.endswith(".txt"):
            loader = TextLoader(os.path.join(root, file), encoding="utf8")
        elif file.endswith(".pdf"):
            loader = PDFMinerLoader(os.path.join(root, file))
        elif file.endswith(".csv"):
            loader = CSVLoader(os.path.join(root, file))
documents = loader.load()  # loads only the last file's content!
I am working on a fix right now.
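In the meantime, here is a minimal sketch of one possible fix, under the assumption that the right behavior is to call load() once per file and accumulate the results (the load_documents helper name is mine, not necessarily what the actual patch will use):

import os
from langchain.document_loaders import CSVLoader, PDFMinerLoader, TextLoader

def load_documents(source_dir="source_documents"):
    # Collect the output of every loader instead of keeping only
    # the last one assigned by the loop.
    documents = []
    for root, dirs, files in os.walk(source_dir):
        for file in files:
            path = os.path.join(root, file)
            if file.endswith(".txt"):
                loader = TextLoader(path, encoding="utf8")
            elif file.endswith(".pdf"):
                loader = PDFMinerLoader(path)
            elif file.endswith(".csv"):
                loader = CSVLoader(path)
            else:
                continue  # skip unsupported file types
            documents.extend(loader.load())  # load each file as we go
    return documents

Note that if no files match, this returns an empty list, and passing an empty texts list down to Chroma would likely still hit the same target[0] IndexError, so a guard for zero loaded documents is probably worth adding as well.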
I pulled the repo today and still get the same issue. I am not even uploading multiple files; I am just trying to test this on Colab with state of the union.txt.
Loading documents from source_documents
Loaded 0 documents from source_documents
Split into 0 chunks of text (max. 500 tokens each)
llama.cpp: loading model from llm/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format = 'ggml' (old version with low tokenizer quality and no mmap support)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1000
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required = 5809.33 MB (+ 2052.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size = 1000.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
Traceback (most recent call last):
File "/content/privateGPT/ingest.py", line 96, in