private-gpt
Any chance of PDF ingestion?
Try using a PDF loader from LangChain instead of the text loader currently used in the ingest script.
Personally, I've been looking at text extraction from PDFs, and unfortunately it doesn't seem to be easy. In academic articles you often have text split into columns, headers, footers, page numbers, tables, text stored as an image (with no OCR layer), etc., all laid out very, uh, 'concretely': elements are placed on the page by coordinates rather than through any form of markup.
If someone knows some magic sauce that'd be amazing.
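To illustrate the coordinate-placement problem above: PDF libraries typically report text as positioned spans, and reading them strictly top-to-bottom interleaves multi-column text. This is a minimal pure-Python sketch with made-up spans and a hypothetical column threshold, not any particular library's output:

```python
# Hypothetical extracted text spans: (x, y, text), roughly what a PDF
# library reports. Coordinates are page points; y grows downward here.
spans = [
    (72, 100, "Left column, line 1"),
    (300, 100, "Right column, line 1"),
    (72, 115, "Left column, line 2"),
    (300, 115, "Right column, line 2"),
]

# Naive extraction: read strictly top-to-bottom, which interleaves
# the two columns and breaks the reading order.
naive = [t for _, _, t in sorted(spans, key=lambda s: (s[1], s[0]))]

def by_columns(spans, column_split=200):
    """Column-aware extraction: split spans at an assumed x threshold,
    then read each column top-to-bottom. Real layout analysis has to
    infer this split per page, which is where it gets hard."""
    left = [s for s in spans if s[0] < column_split]
    right = [s for s in spans if s[0] >= column_split]
    ordered = sorted(left, key=lambda s: s[1]) + sorted(right, key=lambda s: s[1])
    return [t for _, _, t in ordered]

print(naive)             # columns interleaved
print(by_columns(spans)) # left column first, then right
```

The hard part in practice is that the column split, header/footer bands, and table regions all vary per document, so there's no single threshold that works everywhere.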
Yeah, it's a pain. HTML can be a pain too, at least the stuff I'm looking at (SEC filings), which have XBRL embedded in them. Data prep is typically one of the biggest efforts in AI/ML work.
https://gpt-index.readthedocs.io/en/latest/how_to/data_connectors.html
Also see the ChromaReader as a possible option: from llama_index.readers.chroma import ChromaReader
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import LlamaCppEmbeddings
from langchain.vectorstores import Chroma
from llama_index import download_loader

# Load the PDF with llama_index's PDFReader
PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path('./article.pdf'))

# Convert llama_index documents to LangChain format,
# since split_documents expects LangChain Documents
documents = [doc.to_langchain_format() for doc in documents]

# Split into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

# Create embeddings
llama = LlamaCppEmbeddings(model_path="./models/ggml-model-q4_0.bin")

# Create the vector store and persist it locally
persist_directory = 'db'
db = Chroma.from_documents(texts, llama, persist_directory=persist_directory)
db.persist()
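To see what chunk_size and chunk_overlap mean in the splitter above, here's a minimal sliding-window sketch. It's not LangChain's actual algorithm (RecursiveCharacterTextSplitter prefers to break on paragraph/sentence separators), just the fixed-window idea:

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Naive fixed-window chunking: each chunk starts chunk_size -
    chunk_overlap characters after the previous one, so adjacent
    chunks share chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# A 1200-character document yields 3 chunks with this window/overlap.
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)
print(len(chunks))  # 3
```

The overlap is what keeps a sentence that straddles a chunk boundary from being cut off in both chunks at retrieval time.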
Also take a look at this: LangChain supports several PDF loaders.
You can change the TextLoader in ingest.py and use any PDF loader as described in that link. Feel free to open a PR!
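One way to do that swap in ingest.py is to pick the loader by file extension, so .txt files keep working alongside PDFs. A hedged sketch: the loader class names (TextLoader, PyPDFLoader, etc.) come from the LangChain docs linked above, but the dispatch helper itself is hypothetical, not code from this repo:

```python
from pathlib import Path

# Extension -> LangChain loader class name (see the linked LangChain docs).
# In real code you'd import and instantiate the classes from
# langchain.document_loaders instead of using name strings.
LOADERS = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",  # alternatives: PDFMinerLoader, UnstructuredPDFLoader
}

def pick_loader(path):
    """Return the loader class name for a file, defaulting to TextLoader."""
    return LOADERS.get(Path(path).suffix.lower(), "TextLoader")

print(pick_loader("docs/article.pdf"))  # PyPDFLoader
```

The rest of the ingest pipeline (splitting, embedding, persisting) stays the same regardless of which loader produced the documents.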
I think you'd have to write a script to make it read only the actual content, though.