private-gpt
Any chance of PDF ingestion?
Try using a PDF loader from LangChain instead of the text loader currently used in the ingest script.
Personally, I've been looking at text extraction from PDFs, and unfortunately it doesn't seem to be easy. In academic articles you often have text split into columns, headers, footers, page numbers, tables, text stored as an image (with no OCR layer), etc., all laid out very, uh, 'concretely': elements are placed on the page by coordinates rather than through any form of markup.
If someone knows some magic sauce that'd be amazing.
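To illustrate the coordinate-placement problem above: PDF libraries typically report text as positioned spans, and reading them strictly top-to-bottom interleaves multi-column text. This is a minimal pure-Python sketch with made-up spans and a hypothetical column threshold, not any particular library's output:

```python
# Hypothetical extracted text spans: (x, y, text), roughly what a PDF
# library reports. Coordinates are page points; y grows downward here.
spans = [
    (72, 100, "Left column, line 1"),
    (300, 100, "Right column, line 1"),
    (72, 115, "Left column, line 2"),
    (300, 115, "Right column, line 2"),
]

# Naive extraction: read strictly top-to-bottom, which interleaves
# the two columns and breaks the reading order.
naive = [t for _, _, t in sorted(spans, key=lambda s: (s[1], s[0]))]

def by_columns(spans, column_split=200):
    """Column-aware extraction: split spans at an assumed x threshold,
    then read each column top-to-bottom. Real layout analysis has to
    infer this split per page, which is where it gets hard."""
    left = [s for s in spans if s[0] < column_split]
    right = [s for s in spans if s[0] >= column_split]
    ordered = sorted(left, key=lambda s: s[1]) + sorted(right, key=lambda s: s[1])
    return [t for _, _, t in ordered]

print(naive)             # columns interleaved
print(by_columns(spans)) # left column first, then right
```

The hard part in practice is that the column split, header/footer bands, and table regions all vary per document, so there's no single threshold that works everywhere.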
Yeah, it's a pain. HTML can be a pain too, at least the stuff I'm looking at (SEC filings), which have XBRL embedded in them. Data prep is typically one of the biggest efforts in AI/ML work.
https://gpt-index.readthedocs.io/en/latest/how_to/data_connectors.html
Also see the ChromaReader as a possible option: from llama_index.readers.chroma import ChromaReader
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import Chroma
from llama_index import download_loader

# Load the PDF with llama_index's PDFReader
PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path('./article.pdf'))

# Convert llama_index documents to LangChain format,
# since split_documents expects LangChain Documents
documents = [doc.to_langchain_format() for doc in documents]

# Split into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

# Create embeddings
llama = LlamaCppEmbeddings(model_path="./models/ggml-model-q4_0.bin")

# Create the vector store and persist it locally
persist_directory = 'db'
db = Chroma.from_documents(texts, llama, persist_directory=persist_directory)
db.persist()
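To see what chunk_size and chunk_overlap mean in the splitter above, here's a minimal sliding-window sketch. It's not LangChain's actual algorithm (RecursiveCharacterTextSplitter prefers to break on paragraph/sentence separators), just the fixed-window idea:

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Naive fixed-window chunking: each chunk starts chunk_size -
    chunk_overlap characters after the previous one, so adjacent
    chunks share chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# A 1200-character document yields 3 chunks with this window/overlap.
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)
print(len(chunks))  # 3
```

The overlap is what keeps a sentence that straddles a chunk boundary from being cut off in both chunks at retrieval time.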
Also take a look at this: LangChain supports several PDF loaders.
You can change the TextLoader in ingest.py and use any PDF loader as described in that link. Feel free to open a PR!
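One way to do that swap in ingest.py is to pick the loader by file extension, so .txt files keep working alongside PDFs. A hedged sketch: the loader class names (TextLoader, PyPDFLoader, etc.) come from the LangChain docs linked above, but the dispatch helper itself is hypothetical, not code from this repo:

```python
from pathlib import Path

# Extension -> LangChain loader class name (see the linked LangChain docs).
# In real code you'd import and instantiate the classes from
# langchain.document_loaders instead of using name strings.
LOADERS = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",  # alternatives: PDFMinerLoader, UnstructuredPDFLoader
}

def pick_loader(path):
    """Return the loader class name for a file, defaulting to TextLoader."""
    return LOADERS.get(Path(path).suffix.lower(), "TextLoader")

print(pick_loader("docs/article.pdf"))  # PyPDFLoader
```

The rest of the ingest pipeline (splitting, embedding, persisting) stays the same regardless of which loader produced the documents.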
I think you'd have to write a script to make it read only the actual content, though.