
Progress bar for ingest

Open apcameron opened this issue 2 years ago • 4 comments

Having a progress bar or a percentage display would be helpful in ingest.py.
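As a minimal sketch of the idea, a percentage can be printed without any extra dependencies; `documents` and `process` here are hypothetical stand-ins for the document list and per-document work in ingest.py (tqdm, used later in this thread, gives the same effect with less code):

```python
def ingest_with_progress(documents, process):
    """Run `process` on each document, printing percentage progress."""
    total = len(documents)
    for i, doc in enumerate(documents, start=1):
        process(doc)
        # \r rewrites the same terminal line in place
        print(f"\rIngesting: {i}/{total} ({100 * i // total}%)", end="", flush=True)
    print()

# Example (hypothetical): ingest_with_progress(["a.txt", "b.txt"], load_and_embed)
```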

apcameron avatar May 17 '23 17:05 apcameron

I am up to ~500 CPU hours on my Xeon server (32 cores) ingesting 100 MB of text, with no idea what's happening or whether anything is actually being done...

scorpion44 avatar May 18 '23 00:05 scorpion44

> I am up to ~500 CPU hours on my Xeon server (32 cores) ingesting 100 MB of text, with no idea what's happening or whether anything is actually being done...

There is a new update that significantly reduces the time to ingest.

isthisausername2 avatar May 18 '23 00:05 isthisausername2

> I am up to ~500 CPU hours on my Xeon server (32 cores) ingesting 100 MB of text, with no idea what's happening or whether anything is actually being done...
>
> There is a new update that significantly reduces the time to ingest.

Was the update in the last 25 hours? That's when I started.

scorpion44 avatar May 18 '23 01:05 scorpion44

I believe it was introduced when this PR was merged (can't remember if it was this exact one, but it was around this time). If you are on the version with this update, I would recommend leaving it for a few more hours (just to see if it actually finishes), and if it doesn't, I would re-download the repo and model.

https://github.com/imartinez/privateGPT/commit/355b4be7c0972f71208251a14f47d739f8456fb5

isthisausername2 avatar May 19 '23 00:05 isthisausername2

Something like this works in main()

if len(texts) > 100:
    # Split into ~100 batches so each tqdm tick advances the bar by roughly 1%.
    batch_size = len(texts) // 100
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    for batch in tqdm(batches, desc="Processing batches"):
        db = Chroma.from_documents(
            batch, ef, persist_directory=persist_directory)
else:
    db = Chroma.from_documents(
        texts, ef, persist_directory=persist_directory)
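The batching arithmetic above can be pulled out into a small helper; this is a self-contained sketch (`make_batches` is a hypothetical name, not part of privateGPT) showing that integer division leaves a shorter final batch rather than dropping documents:

```python
def make_batches(items, n_batches=100):
    """Split `items` into roughly `n_batches` slices.

    The last slice may be shorter; no items are dropped. With ~100 batches,
    each tqdm tick corresponds to about 1% of the work.
    """
    batch_size = max(1, len(items) // n_batches)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

For example, 10 items split into 3 batches yields slices of 3, 3, 3 and a trailing slice of 1.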

gardner avatar Aug 05 '23 22:08 gardner

I also noticed that ingest.py would load all of the documents into memory before starting to add them to the index and create embeddings. I modified it to load documents and add them to the index one by one, which reduces the memory overhead when importing large amounts of data. My use case only loads .txt files. This is what my ingest.py looks like now:

#!/usr/bin/env python3
import nltk
import os
import glob
from typing import List
from dotenv import load_dotenv
from multiprocessing import Pool
from tqdm import tqdm
import chromadb
from chromadb.utils import embedding_functions

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader


load_dotenv()

#  Load environment variables
persist_directory = os.environ.get('PERSIST_DIRECTORY')
source_directory = os.environ.get('SOURCE_DIRECTORY', 'source_documents')
embeddings_model_name = os.environ.get('EMBEDDINGS_MODEL_NAME')
chunk_size = 500
chunk_overlap = 50

nltk.download('punkt')


def load_single_document(file_path: str) -> List[Document]:
    loader = TextLoader(file_path)
    return loader.load()

# Note: hard-coded source directory; this overrides SOURCE_DIRECTORY above.
source_dir = 'pdf'

def main():
    # Create embeddings
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="thenlper/gte-base", device='cuda:0')

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    client = chromadb.PersistentClient(path=persist_directory)
    collection = client.get_or_create_collection(name="tenancy", embedding_function=ef)

    all_files = glob.glob(os.path.join(source_dir, "**/*.txt"), recursive=True)

    with Pool(processes=4) as pool:

        for docs in tqdm(pool.imap_unordered(load_single_document, all_files), total=len(all_files)):
            texts = text_splitter.split_documents(docs)

            for i, text in enumerate(texts):
                doc_id = text.metadata['source'] + '-' + str(i)

                collection.add(
                    ids=[doc_id], metadatas=[text.metadata], documents=[text.page_content]
                )

    client = None  # drop the reference so the client can be released

    print("Done!")


if __name__ == "__main__":
    main()

Please be sure to use the same embedding function when creating the index and when querying the index.
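To illustrate why this matters, here is a toy, dependency-free sketch; `embed_a` and `embed_b` are hypothetical stand-ins for two different embedding models, not real APIs. Vectors produced by different models live in different spaces, so the index-time and query-time functions must be the same:

```python
def embed_a(text: str) -> list:
    # Hypothetical "model A": counts of each vowel -> a 5-dimensional vector
    return [float(text.count(c)) for c in "aeiou"]

def embed_b(text: str) -> list:
    # Hypothetical "model B": a completely different 2-dimensional representation
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

doc_vec = embed_a("the quick brown fox")
good_query_vec = embed_a("quick fox")  # same space as the index: comparable
bad_query_vec = embed_b("quick fox")   # different space: dimensions don't even match

assert len(doc_vec) == len(good_query_vec)
assert len(doc_vec) != len(bad_query_vec)
```

With real models the dimensions may even happen to match, but the coordinates still mean different things, so similarity scores against the index become meaningless.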

gardner avatar Aug 06 '23 19:08 gardner