
CORE: Sophisticated PDFReader with Image and Table extraction

Open · janaka opened this issue 2 years ago · 1 comment

Current

The LlamaIndex PDFReader (part of the SimpleDirectoryReader) currently only handles simple (naive) text extraction. It uses the pypdf package: it iterates through the pages (pypdf.PdfReader.pages) and calls page.extract_text() on each one to grab the text for the document.
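For reference, the naive behaviour described above amounts to roughly the following. This is a minimal sketch, not the actual LlamaIndex source: the function names `extract_page_texts` and `read_pdf_naively` are made up for illustration, and only `pypdf.PdfReader`, its `pages` attribute, and `page.extract_text()` are real pypdf calls.

```python
from typing import Any, Iterable, List


def extract_page_texts(pages: Iterable[Any]) -> List[str]:
    """Return the plain text of each page, in page order.

    Any object with an extract_text() method works here, which is
    all the naive reader relies on. Images, tables, metadata and
    headings are never looked at.
    """
    return [page.extract_text() or "" for page in pages]


def read_pdf_naively(path: str) -> List[str]:
    """Hypothetical helper: open a PDF with pypdf and pull text per page."""
    # Imported lazily so extract_page_texts can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return extract_page_texts(reader.pages)
```

Calling `read_pdf_naively("some.pdf")` (path is a placeholder) gives one string per page and nothing else, which is the gap this issue is about.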

The following are ignored:

  • Images
  • Tables
  • PDF Metadata
  • Document structure such as headings

We should be able to improve retrieval by extracting the information present in these components.

Solution

Fork the standard LlamaIndex PDFReader and customise it. Look into the various LlamaIndex Image readers.
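As a starting point, a customised reader could collect PDF metadata and embedded image names alongside the page text. The sketch below is only an assumption about what such a fork might look like: `PageRecord` and `build_page_records` are hypothetical names, table and heading extraction are not covered, and the code relies only on pypdf's `reader.metadata`, `reader.pages`, `page.extract_text()` and `page.images` (image access in pypdf may need Pillow installed).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PageRecord:
    """Hypothetical per-page container for text plus extra signals."""
    page_number: int
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    image_names: List[str] = field(default_factory=list)


def build_page_records(reader: Any) -> List[PageRecord]:
    """Build one record per page from a pypdf.PdfReader-like object."""
    # reader.metadata can be None when the PDF has no info dictionary.
    doc_meta = dict(reader.metadata or {})
    records = []
    for number, page in enumerate(reader.pages, start=1):
        records.append(
            PageRecord(
                page_number=number,
                text=page.extract_text() or "",
                # Copy so each record owns its metadata dict.
                metadata=dict(doc_meta),
                # Only names are kept here; image bytes could be passed
                # to an image reader in a later step.
                image_names=[image.name for image in page.images],
            )
        )
    return records
```

With pypdf installed, `build_page_records(PdfReader("some.pdf"))` (path is a placeholder) would give records that a forked reader could turn into documents with the metadata attached.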

Alternatives

Use readers from Unstructured.io

janaka · Oct 16 '23 10:10

Also look into LayoutPDFReader by LLMSherpa

janaka · Oct 23 '23 13:10