private-gpt
Progress Bar for ingest
Having a progress bar or a percentage displayed would be helpful in ingest.py.
I am up to ~500 CPU hours on my Xeon server (32 cores) ingesting 100 MB of text, with no idea what's happening or whether anything is actually being done...
There is a new update that significantly reduces the time to ingest.
Was the update in the last 25 hours? That's when I started.
I believe it was introduced when this PR was merged. (I can't remember if it was this exact one, but it was around that time.) If you are on the version with this update, I would recommend leaving it for a few more hours (just to see if it actually does finish), and if it doesn't, I would redownload the repo and the model.
https://github.com/imartinez/privateGPT/commit/355b4be7c0972f71208251a14f47d739f8456fb5
Something like this works in main():

if len(texts) > 100:
    # Split into ~100 batches, so the progress bar ticks in roughly 1% increments
    batch_size = int(len(texts) / 100)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    for batch in tqdm(batches, desc="Processing batches"):
        db = Chroma.from_documents(
            batch, ef, persist_directory=persist_directory)
else:
    db = Chroma.from_documents(
        texts, ef, persist_directory=persist_directory)
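A variant that creates the store once and appends to it, instead of re-initializing it on every batch — just a sketch, assuming ef is the same LangChain embeddings object passed to Chroma.from_documents above, and that your langchain version exposes Chroma.add_documents and persist():

from langchain.vectorstores import Chroma
from tqdm import tqdm

db = Chroma(embedding_function=ef, persist_directory=persist_directory)
batch_size = max(1, len(texts) // 100)  # ~100 batches -> ~1% progress steps
for i in tqdm(range(0, len(texts), batch_size), desc="Indexing"):
    db.add_documents(texts[i:i + batch_size])
db.persist()  # flush to disk (needed on older duckdb+parquet backends)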
I also noticed ingest.py would load all of the documents into memory before starting to add them to the index and create embeddings. I modified it to load documents and add them to the index one by one. This reduces the memory overhead when importing large amounts of data. My use case only loads .txt files. This is what my ingest.py looks like now:
#!/usr/bin/env python3
import os
import glob
from typing import List
from multiprocessing import Pool

import nltk
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv
from tqdm import tqdm
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader

load_dotenv()

# Load environment variables
persist_directory = os.environ.get('PERSIST_DIRECTORY')
source_directory = os.environ.get('SOURCE_DIRECTORY', 'source_documents')
embeddings_model_name = os.environ.get('EMBEDDINGS_MODEL_NAME')
chunk_size = 500
chunk_overlap = 50

nltk.download('punkt')


def load_single_document(file_path: str) -> List[Document]:
    loader = TextLoader(file_path)
    return loader.load()


source_dir = 'pdf'  # hard-coded source folder (used instead of SOURCE_DIRECTORY)


def main():
    # Create embeddings
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="thenlper/gte-base", device='cuda:0')
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    client = chromadb.PersistentClient(path=persist_directory)
    collection = client.get_or_create_collection(name="tenancy", embedding_function=ef)
    all_files = glob.glob(os.path.join(source_dir, "**/*.txt"), recursive=True)

    # Load documents in 4 worker processes; split and index each one as soon
    # as it arrives instead of holding the whole corpus in memory.
    with Pool(processes=4) as pool:
        for docs in tqdm(pool.imap_unordered(load_single_document, all_files),
                         total=len(all_files)):
            texts = text_splitter.split_documents(docs)
            for i, text in enumerate(texts):
                doc_id = text.metadata['source'] + '-' + str(i)
                collection.add(
                    ids=[doc_id],
                    metadatas=[text.metadata],
                    documents=[text.page_content],
                )
    client = None  # drop the reference; PersistentClient writes to disk as it goes
    print("Done!")


if __name__ == "__main__":
    main()
Please be sure to use the same embedding function when creating the index and when querying the index.
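For example, on the query side of the script above — a minimal sketch; the collection name "tenancy", the model "thenlper/gte-base", and persist_directory match the ingest script, and the query text is just a placeholder:

import chromadb
from chromadb.utils import embedding_functions

# Must match the embedding function used at ingest time, otherwise the
# query vectors live in a different space than the indexed ones.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="thenlper/gte-base")
client = chromadb.PersistentClient(path=persist_directory)
collection = client.get_collection(name="tenancy", embedding_function=ef)
results = collection.query(query_texts=["example question about tenancy"], n_results=4)
print(results["documents"])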