
Issue 72: Refactor document loading, add device selection, and update README

Open teleprint-me opened this issue 1 year ago • 5 comments

This pull request addresses Issue #72 and includes the following changes:

  1. Refactor document loading in ingest.py to utilize the DOCUMENT_MAP dictionary from constants.py. This improves modularity and allows for easier addition of new document loaders. (Commit: a4c0fee)

  2. Add the option to select the device type when running ingest.py. The default device is set to "cuda". Users can now specify the device using the --device_type argument. The README.md file has been updated to include instructions on device selection and how to see the full list of supported devices. (Commit: a4c0fee)

  3. Update the README.md file to include instructions on using the --device_type argument and provide details on the supported devices. This enhances the documentation and helps users understand the device selection functionality. (Commit: fb244cb)

  4. Ignore generated database results by adding appropriate rules to the gitignore file. (Commit: 3d1806d)

  5. Add git ignore rules for Python files to ensure generated files and temporary files are not committed. (Commit: 3ebb267)

These changes improve the flexibility and modularity of the codebase, enhance the documentation, and ensure unwanted files are excluded from version control.
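For illustration, here is a minimal sketch of how a `--device_type` flag like the one described above might be parsed. This is a hypothetical sketch, not the actual ingest.py code; the real argument choices may differ.

```python
import argparse

def parse_device_args(argv=None):
    """Parse a hypothetical --device_type flag; defaults to "cuda"."""
    parser = argparse.ArgumentParser(description="Ingest documents")
    parser.add_argument(
        "--device_type",
        default="cuda",
        choices=["cpu", "cuda", "mps"],
        help="Device to run on (default: cuda)",
    )
    return parser.parse_args(argv)
```

With no arguments the parser falls back to "cuda"; passing `--device_type cpu` overrides it.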

teleprint-me avatar Jun 04 '23 07:06 teleprint-me

This looks good; however, I would like to request a couple more updates.

  1. Can you please also update the run_localGPT.py file so that it has the same device_type options as ingest.py? This will ensure consistency across the codebase.
  2. Support for other file types was added; see #44. Can you please integrate those changes as part of this PR? Thanks

PromtEngineer avatar Jun 05 '23 20:06 PromtEngineer

Why not use ConfluenceLoader or UnstructuredExcelLoader? Couldn't we use those to load Excel-based files?

teleprint-me avatar Jun 06 '23 06:06 teleprint-me

You are right, it's a much better approach. We can also add loaders for other file types (DOC, EPub, etc.) that are directly supported by langchain. Can you please add those as part of the PR? Thanks

PromtEngineer avatar Jun 06 '23 06:06 PromtEngineer

Technically, utils.py is a terrible module name. A few more steps to support DOCUMENT_MAP would yield the following (this duplicates code because we can't inherit from csv_loader in a sane way):

import csv
import tempfile
from typing import Dict, List, Optional

import openpyxl
from langchain.docstore.document import Document
from langchain.document_loaders import CSVLoader
from langchain.document_loaders.base import BaseLoader


# https://learn.microsoft.com/en-us/deployoffice/compat/office-file-format-reference
def xlsx_to_csv(file_path: str, sheet_name: Optional[str] = None) -> List[str]:
    """
    Convert a workbook into a list of CSV files.

    :param file_path: the path to the workbook
    :param sheet_name: the name of the sheet to convert; all sheets if None
    :return: a list of temporary file names
    """
    # Load the workbook; select a single sheet if one was requested,
    # otherwise convert every worksheet
    wb = openpyxl.load_workbook(file_path)
    worksheets = wb.worksheets if sheet_name is None else [wb[sheet_name]]
    temp_file_names = []
    # Write each worksheet out to its own temporary CSV file
    for ws in worksheets:
        with tempfile.NamedTemporaryFile(mode='w+', newline='', suffix='.csv', delete=False) as f:
            c = csv.writer(f)
            for r in ws.rows:
                c.writerow([cell.value for cell in r])
            temp_file_names.append(f.name)
    return temp_file_names


class XLSXLoader(BaseLoader):
    """Loads an XLSX file into a list of documents.

    Each document represents one row of a CSV file converted from the XLSX
    file. Every row is converted into a key/value pair and written to a new
    line in the document's page_content.

    The source for each document loaded from the converted CSV is set to the
    value of the `file_path` argument for all documents by default.
    You can override this by setting the `source_column` argument to the
    name of a column in the CSV file.
    The source of each document will then be set to the value of the column
    with the name specified in `source_column`.

    Output Example:
        .. code-block:: txt

            column1: value1
            column2: value2
            column3: value3
    """

    def __init__(
        self,
        file_path: str,
        source_column: Optional[str] = None,
        csv_args: Optional[Dict] = None,
        encoding: Optional[str] = None,
    ):
        self.file_path = file_path
        self.source_column = source_column
        self.encoding = encoding
        self.csv_args = csv_args or {}

    def load(self) -> List[Document]:
        """Load data into document objects."""

        docs = []
        csv_files = xlsx_to_csv(self.file_path)
        for csv_file in csv_files:
            with open(csv_file, newline="", encoding=self.encoding) as csvfile:
                csv_reader = csv.DictReader(csvfile, **self.csv_args)  # type: ignore
                for i, row in enumerate(csv_reader):
                    content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
                    try:
                        source = (
                            row[self.source_column]
                            if self.source_column is not None
                            else self.file_path
                        )
                    except KeyError:
                        raise ValueError(
                            f"Source column '{self.source_column}' not found in CSV file."
                        )
                    metadata = {"source": source, "row": i}
                    doc = Document(page_content=content, metadata=metadata)
                    docs.append(doc)

        return docs

This isn't really necessary, though, and is overkill. It also feels like something langchain should already handle for us (which, as it turns out, it does).

Otherwise, it would look like this:

import os

# from dotenv import load_dotenv
from chromadb.config import Settings
from langchain.document_loaders import CSVLoader, PDFMinerLoader, TextLoader

from xlsx_loader import XLSXLoader

# load_dotenv()
ROOT_DIRECTORY = os.path.dirname(os.path.realpath(__file__))

# Define the folder for storing database
SOURCE_DIRECTORY = f"{ROOT_DIRECTORY}/SOURCE_DOCUMENTS"

PERSIST_DIRECTORY = f"{ROOT_DIRECTORY}/DB"

# Define the Chroma settings
CHROMA_SETTINGS = Settings(
        chroma_db_impl='duckdb+parquet',
        persist_directory=PERSIST_DIRECTORY,
        anonymized_telemetry=False
)

DOCUMENT_MAP = {
    ".txt": TextLoader,
    ".pdf": PDFMinerLoader,
    ".csv": CSVLoader,
    ".xlsx": XLSXLoader,
}
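As an aside, a DOCUMENT_MAP like the one just above could be consumed by ingest.py roughly as follows. This is a hypothetical sketch (the function name and signature are illustrative, not the project's actual code); it just looks up the loader class by file extension and instantiates it.

```python
import os

def load_single_document(file_path: str, document_map: dict):
    """Instantiate and run the loader registered for file_path's extension."""
    ext = os.path.splitext(file_path)[1].lower()
    loader_class = document_map.get(ext)
    if loader_class is None:
        raise ValueError(f"Unsupported file extension: {ext}")
    # Every loader in the map is assumed to take the path in its
    # constructor and expose a .load() method returning documents
    return loader_class(file_path).load()
```

Because the loader classes all share the same constructor/`.load()` shape, adding a new file type becomes a one-line change to the map.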

I'm in the middle of merging the conflicts and cleaning things up before the commit. I'm just running through my options. I really don't like the above as an option, and I'm leaning towards just doing the following instead:

import os

# from dotenv import load_dotenv
from chromadb.config import Settings
# https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/excel.html?highlight=xlsx#microsoft-excel
from langchain.document_loaders import (
    UnstructuredExcelLoader,
    CSVLoader,
    PDFMinerLoader,
    TextLoader,
)

# load_dotenv()
ROOT_DIRECTORY = os.path.dirname(os.path.realpath(__file__))

# Define the folder for storing database
SOURCE_DIRECTORY = f"{ROOT_DIRECTORY}/SOURCE_DOCUMENTS"

PERSIST_DIRECTORY = f"{ROOT_DIRECTORY}/DB"

# Define the Chroma settings
CHROMA_SETTINGS = Settings(
        chroma_db_impl='duckdb+parquet',
        persist_directory=PERSIST_DIRECTORY,
        anonymized_telemetry=False
)

# https://python.langchain.com/en/latest/_modules/langchain/document_loaders/excel.html#UnstructuredExcelLoader
DOCUMENT_MAP = {
    ".txt": TextLoader,
    ".pdf": PDFMinerLoader,
    ".csv": CSVLoader,
    ".xlsx": UnstructuredExcelLoader,
}

The issue with the last example is that it requires langchain version 0.0.191, or at least a version that supports UnstructuredExcelLoader.
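Since that constraint is just a comparison against a version string, a tiny helper (hypothetical, not part of the PR) could gate the import before falling back to a custom loader:

```python
def supports_excel_loader(langchain_version: str) -> bool:
    """Return True if the given langchain version is at least 0.0.191,
    the version that (per the discussion above) added UnstructuredExcelLoader.
    Assumes a plain "X.Y.Z" version string with integer components."""
    parts = tuple(int(p) for p in langchain_version.split("."))
    return parts >= (0, 0, 191)
```

Tuple comparison handles the ordering, so "0.1.0" counts as newer than "0.0.191"; pre-release suffixes would need extra handling.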

teleprint-me avatar Jun 06 '23 08:06 teleprint-me

I also like the last option. It will be more standardized. We can experiment with 0.0.191 to ensure it doesn't break anything else.

PromtEngineer avatar Jun 07 '23 01:06 PromtEngineer

@teleprint-me Thanks for this PR and for adding my requests. Looking forward to your contributions to this project.

PromtEngineer avatar Jun 07 '23 18:06 PromtEngineer

The funny part is I can't run the code. My hardware is out of date. 😉😇

teleprint-me avatar Jun 07 '23 19:06 teleprint-me

I tested it and it worked :)

PromtEngineer avatar Jun 08 '23 04:06 PromtEngineer

Is it alright if I use ingest.py as a basis in other code?

I was able to get ingest.py to work on CPU. I don't have enough RAM for run_localGPT.py with the HF models, but I can run a GGML 7B model (inference is slow).

If I come up with anything useful that's within scope, I can probably use it as a basis to contribute in the future. Local embeddings would also be cheaper than using OpenAI Embeddings.
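Local embeddings as mentioned here boil down to producing vectors on your own machine and comparing them, typically by cosine similarity; the sketch below shows just the comparison step with no model dependency (the embedding model itself, e.g. a sentence-transformers model via langchain, is out of scope here).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors to avoid division by zero
    return dot / norm if norm else 0.0
```

Identical directions score 1.0, orthogonal vectors score 0.0, which is what a vector store like Chroma relies on when ranking retrieved chunks.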

teleprint-me avatar Jun 08 '23 05:06 teleprint-me

Yes, please feel free to do that. Looking forward to your contribution.

PromtEngineer avatar Jun 08 '23 05:06 PromtEngineer