private-gpt

Any chance of PDF ingestion?

Open RuairiSpain opened this issue 1 year ago • 6 comments

RuairiSpain avatar May 08 '23 21:05 RuairiSpain

Try using a PDF loader from LangChain instead of the text loader that's used in the ingest script.

dennydream avatar May 09 '23 22:05 dennydream

Personally, I've been looking at text extraction from PDFs and unfortunately it doesn't seem to be an easy thing. In academic articles you often have text split into columns, headers, footers, page numbers, tables, text as an image (no OCR), etc., all created very, uh, 'concretely' by placing elements on the page by their coordinates rather than through some form of markup.

If someone knows some magic sauce that'd be amazing.
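Not magic sauce, but as a rough sketch of handling just the header/footer/page-number part: assuming you already have the extracted text one string per page, you can drop lines that repeat on most pages plus lines that are only a page number. The function name `strip_page_furniture`, the `min_fraction` threshold and the sample pages are all made up for illustration, and this does nothing for columns, tables or scanned images.

```python
import re
from collections import Counter

# Matches lines that are only a page number, e.g. "3", "Page 3", "3 / 12", "Page 3 of 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(/|of)\s*\d+)?\s*$", re.IGNORECASE)


def strip_page_furniture(pages, min_fraction=0.6):
    """Drop bare page numbers and lines that repeat on most pages."""
    # Count on how many pages each distinct non-empty line appears.
    seen_on = Counter()
    for page in pages:
        for line in {ln.strip() for ln in page.splitlines() if ln.strip()}:
            seen_on[line] += 1

    # A line counts as a running header/footer if it shows up on enough pages
    # (always at least 2, so single-page documents lose nothing).
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in seen_on.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            ln for ln in page.splitlines()
            if ln.strip()
            and ln.strip() not in repeated
            and not PAGE_NUMBER.match(ln)
        ]
        cleaned.append("\n".join(kept))
    return cleaned


# Made-up sample pages, standing in for per-page PDF extraction output.
pages = [
    "Journal of Examples\nIntroduction text on page one.\n1",
    "Journal of Examples\nMethods text on page two.\n2",
    "Journal of Examples\nResults text on page three.\nPage 3 of 3",
]
cleaned = strip_page_furniture(pages)
```

The threshold is a blunt heuristic: a sentence that genuinely repeats across pages would be removed too, so it is worth eyeballing the output on a few real documents.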

JavaGT avatar May 09 '23 22:05 JavaGT

Yeah, it's a pain. HTML can be a pain too, at least the stuff I'm looking at (SEC filings), which has XBRL in it. Data prep is typically one of the biggest efforts with AI/ML work.

dennydream avatar May 09 '23 23:05 dennydream

https://gpt-index.readthedocs.io/en/latest/how_to/data_connectors.html

also see possible from llama_index.readers.chroma import ChromaReader

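from pathlib import Path

from langchain.embeddings import LlamaCppEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from llama_index import download_loader

# Fetch the PDF reader from the llama_index loader hub
PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data(file=Path('./article.pdf'))

# The splitter expects LangChain documents, so convert the llama_index ones first
# (assumes a llama_index version whose Document has to_langchain_format())
lc_documents = [doc.to_langchain_format() for doc in documents]

# Split into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(lc_documents)

# Create embeddings
llama = LlamaCppEmbeddings(model_path="./models/ggml-model-q4_0.bin")

# Create the vector store and persist it locally
persist_directory = 'db'
db = Chroma.from_documents(texts, llama, persist_directory=persist_directory)
db.persist()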
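
To make the chunk_size=500 / chunk_overlap=50 settings a bit more concrete, here is a simplified stand-in that cuts fixed-size character windows where each chunk starts 450 characters after the previous one. This is not how RecursiveCharacterTextSplitter works internally (it prefers to break on separators such as paragraphs, newlines and spaces); `chunk_text` is a hypothetical helper that only shows what the two numbers mean.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 1200 characters of made-up sample text.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
```

With the 1200-character sample this gives three chunks, and the last 50 characters of one chunk are the first 50 of the next, so a sentence cut at a boundary still appears whole in at least one chunk if it is short enough.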

su77ungr avatar May 10 '23 03:05 su77ungr

Also take a look at this: LangChain has support for several PDF Loaders

You can change the TextLoader in ingest.py and use any PDF loader as described in that link. Feel free to open a PR!
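
A minimal sketch of what that change could look like, picking the loader from the file extension. The names `LOADER_BY_SUFFIX`, `pick_loader_name` and `load_documents` are hypothetical and not taken from ingest.py. It assumes a 2023-era LangChain where `TextLoader` and `PyPDFLoader` live in `langchain.document_loaders` (newer releases moved loaders to `langchain_community`), and `PyPDFLoader` also needs the `pypdf` package installed.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a LangChain loader class name.
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
}


def pick_loader_name(path):
    """Return the loader class name for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No loader configured for '{suffix}' files") from None


def load_documents(path):
    """Load one file into LangChain documents with the matching loader."""
    # Imported lazily so the dispatch above works without LangChain installed.
    # Assumption: loaders are exposed under langchain.document_loaders.
    from langchain import document_loaders

    loader_cls = getattr(document_loaders, pick_loader_name(path))
    return loader_cls(str(path)).load()
```

The idea is that `load_documents("./article.pdf")` would then stand in for the single TextLoader call, with the rest of the ingest steps (splitting, embedding, storing) left as they are.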

imartinez avatar May 10 '23 18:05 imartinez

I think you have to write a script that makes it read ONLY the content.

d2rgaming-9000 avatar May 14 '23 23:05 d2rgaming-9000