Issue 72: Refactor document loading, add device selection, and update README
This pull request addresses Issue #72 and includes the following changes:
- Refactor document loading in `ingest.py` to utilize the `DOCUMENT_MAP` dictionary from `constants.py`. This improves modularity and allows for easier addition of new document loaders. (Commit: a4c0fee)
- Add the option to select the device type when running `ingest.py`. The default device is set to "cuda". Users can now specify the device using the `--device_type` argument (see the sketch below). The README.md file has been updated to include instructions on device selection and how to see the full list of supported devices. (Commit: a4c0fee)
- Update the README.md file to include instructions on using the `--device_type` argument and provide details on the supported devices. This enhances the documentation and helps users understand the device selection functionality. (Commit: fb244cb)
- Ignore generated database results by adding appropriate rules to the `.gitignore` file. (Commit: 3d1806d)
- Add `.gitignore` rules for Python files to ensure generated files and temporary files are not committed. (Commit: 3ebb267)
These changes improve the flexibility and modularity of the codebase, enhance the documentation, and ensure unwanted files are excluded from version control.
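For reference, the argument handling could look roughly like the sketch below. This uses `argparse` purely for illustration; the actual code may use a different CLI library, and the names and defaults here are assumptions, not the PR's exact implementation:

```python
import argparse


def parse_arguments() -> argparse.Namespace:
    # Hypothetical sketch; the real ingest.py may differ in names and defaults.
    parser = argparse.ArgumentParser(description="Ingest documents for localGPT.")
    parser.add_argument(
        "--device_type",
        default="cuda",
        help='Device to run on, e.g. "cpu" or "cuda" (default: "cuda").',
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_arguments()
    print(f"Using device: {args.device_type}")
```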
This looks good; however, I would like to request a couple more updates.
- Can you please also update the `run_localGPT.py` file so it has similar options for `device_type` as `ingest.py`? This will ensure consistency across the codebase.
- Support for other file types was added, see #44. Can you please integrate those as part of this PR? Thanks
Why not use `ConfluenceLoader` or `UnstructuredExcelLoader`? Couldn't we use those to load Excel-based files?
You are right, it's a much better approach. We can also add loaders for other file types (DOC, EPub, etc.) directly supported by langchain. Can you please add those as part of the PR? Thanks
Technically, `utils.py` is a terrible module name. A few more steps to support `DOCUMENT_MAP` would yield the following (this duplicates code because we can't inherit from `csv_loader` in a sane way):
```python
import csv
import tempfile
from typing import Dict, List, Optional

import openpyxl
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


# https://learn.microsoft.com/en-us/deployoffice/compat/office-file-format-reference
def xlsx_to_csv(file_path: str, sheet_name: Optional[str] = None) -> List[str]:
    """
    Convert a workbook into a list of csv files.

    :param file_path: the path to the workbook
    :param sheet_name: the name of the sheet to convert; all sheets are
        converted when this is None
    :return: a list of temporary file names
    """
    # Load the workbook and select the worksheets to convert
    wb = openpyxl.load_workbook(file_path)
    worksheets = wb.worksheets if sheet_name is None else [wb[sheet_name]]
    temp_file_names = []
    # Write the contents of each worksheet to its own temporary CSV file
    for ws in worksheets:
        with tempfile.NamedTemporaryFile(
            mode="w+", newline="", suffix=".csv", delete=False
        ) as f:
            writer = csv.writer(f)
            for row in ws.rows:
                writer.writerow([cell.value for cell in row])
            temp_file_names.append(f.name)
    return temp_file_names


class XLSXLoader(BaseLoader):
    """Loads an XLSX file into a list of documents.

    Each document represents one row of the CSV file converted from the
    XLSX file. Every row is converted into a key/value pair and outputted
    to a new line in the document's page_content.

    The source for each document loaded from csv is set to the value of the
    `file_path` argument for all documents by default.
    You can override this by setting the `source_column` argument to the
    name of a column in the CSV file. The source of each document will then
    be set to the value of the column with the name specified in
    `source_column`.

    Output Example:
        .. code-block:: txt

            column1: value1
            column2: value2
            column3: value3
    """

    def __init__(
        self,
        file_path: str,
        source_column: Optional[str] = None,
        csv_args: Optional[Dict] = None,
        encoding: Optional[str] = None,
    ):
        self.file_path = file_path
        self.source_column = source_column
        self.encoding = encoding
        self.csv_args = csv_args or {}

    def load(self) -> List[Document]:
        """Load data into document objects."""
        docs = []
        csv_files = xlsx_to_csv(self.file_path)
        for csv_file in csv_files:
            with open(csv_file, newline="", encoding=self.encoding) as csvfile:
                csv_reader = csv.DictReader(csvfile, **self.csv_args)  # type: ignore
                for i, row in enumerate(csv_reader):
                    content = "\n".join(f"{k.strip()}: {v.strip()}" for k, v in row.items())
                    try:
                        source = (
                            row[self.source_column]
                            if self.source_column is not None
                            else self.file_path
                        )
                    except KeyError:
                        raise ValueError(
                            f"Source column '{self.source_column}' not found in CSV file."
                        )
                    metadata = {"source": source, "row": i}
                    docs.append(Document(page_content=content, metadata=metadata))
        return docs
```
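For illustration, usage would look something like this ("spreadsheet.xlsx" is a placeholder path, not a file from the repo):

```python
# Hypothetical usage; "spreadsheet.xlsx" is a placeholder path.
loader = XLSXLoader("spreadsheet.xlsx")
docs = loader.load()
for doc in docs:
    print(doc.metadata["row"], doc.page_content[:80])
```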
This isn't really necessary though and is overkill. It also feels like something that `langchain` should take care of for us already (which it does). Otherwise, it would result in the following:
```python
import os

# from dotenv import load_dotenv
from chromadb.config import Settings
from langchain.document_loaders import CSVLoader, PDFMinerLoader, TextLoader

from xlsx_loader import XLSXLoader

# load_dotenv()
ROOT_DIRECTORY = os.path.dirname(os.path.realpath(__file__))

# Define the folder for storing database
SOURCE_DIRECTORY = f"{ROOT_DIRECTORY}/SOURCE_DOCUMENTS"
PERSIST_DIRECTORY = f"{ROOT_DIRECTORY}/DB"

# Define the Chroma settings
CHROMA_SETTINGS = Settings(
    chroma_db_impl='duckdb+parquet',
    persist_directory=PERSIST_DIRECTORY,
    anonymized_telemetry=False
)

DOCUMENT_MAP = {
    ".txt": TextLoader,
    ".pdf": PDFMinerLoader,
    ".csv": CSVLoader,
    ".xlsx": XLSXLoader,
}
```
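For context, the point of `DOCUMENT_MAP` is that `ingest.py` can dispatch on the file extension. A minimal sketch of that dispatch (the helper name is hypothetical; the actual `ingest.py` may be structured differently):

```python
import os

from constants import DOCUMENT_MAP


def load_single_document(file_path: str):
    # Pick the loader class based on the file extension; hypothetical helper,
    # the real ingest.py may differ.
    ext = os.path.splitext(file_path)[1].lower()
    loader_class = DOCUMENT_MAP.get(ext)
    if loader_class is None:
        raise ValueError(f"Unsupported file extension: {ext}")
    return loader_class(file_path).load()
```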
I'm in the middle of resolving the merge conflicts before committing. I'm just running through my options. I really don't like the above as an option and I'm leaning towards just doing the following instead:
```python
import os

# from dotenv import load_dotenv
from chromadb.config import Settings

# https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/excel.html?highlight=xlsx#microsoft-excel
from langchain.document_loaders import (
    UnstructuredExcelLoader,
    CSVLoader,
    PDFMinerLoader,
    TextLoader,
)

# load_dotenv()
ROOT_DIRECTORY = os.path.dirname(os.path.realpath(__file__))

# Define the folder for storing database
SOURCE_DIRECTORY = f"{ROOT_DIRECTORY}/SOURCE_DOCUMENTS"
PERSIST_DIRECTORY = f"{ROOT_DIRECTORY}/DB"

# Define the Chroma settings
CHROMA_SETTINGS = Settings(
    chroma_db_impl='duckdb+parquet',
    persist_directory=PERSIST_DIRECTORY,
    anonymized_telemetry=False
)

# https://python.langchain.com/en/latest/_modules/langchain/document_loaders/excel.html#UnstructuredExcelLoader
DOCUMENT_MAP = {
    ".txt": TextLoader,
    ".pdf": PDFMinerLoader,
    ".csv": CSVLoader,
    ".xlsx": UnstructuredExcelLoader,
}
```
The issue with the last example is that it requires `langchain` version `0.0.191`, or at least a version that supports `UnstructuredExcelLoader`.
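If pinning the version is a concern, one option would be a guarded import. This is just a sketch, reusing the custom `xlsx_loader` module from above as the fallback:

```python
# Sketch of a guarded import: prefer langchain's built-in loader when the
# installed version provides it, otherwise fall back to the custom loader.
try:
    from langchain.document_loaders import UnstructuredExcelLoader as ExcelLoader
except ImportError:
    from xlsx_loader import XLSXLoader as ExcelLoader

DOCUMENT_MAP = {
    # ... other loaders ...
    ".xlsx": ExcelLoader,
}
```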
I also like the last option. It will be more standardized. We can experiment with `0.0.191` to ensure it doesn't break anything else.
@teleprint-me Thanks for this PR and for adding my requests. Looking forward to your contributions to this project.
The funny part is I can't run the code. My hardware is out of date. 😉😇
I tested it and it worked :)
Is it alright if I use `ingest.py` as a basis in other code?
I was able to get `ingest.py` to work with CPU. I don't have enough RAM for `run_localGPT.py` using the HF models, but I can run a GGML 7B model (inference is slow).
If I come up with anything useful that's within scope, I can probably use it as a basis to contribute in the future. It would also be cheaper to do local embeddings than to use OpenAI Embeddings.
Yes, please feel free to do that. Looking forward to your contribution.