CORE: Sophisticated PDFReader with Image and Table extraction

Open janaka opened this issue 2 years ago • 1 comments

Current

The LlamaIndex PDFReader (part of the SimpleDirectoryReader) currently handles only simple (naive) text extraction. It uses the pypdf package: it iterates through the pages (pypdf.PdfReader.pages) and calls page.extract_text() on each one to grab the text for the document.
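For reference, a minimal sketch of that naive approach. This is not the LlamaIndex source; `naive_extract` and `read_pdf_text` are hypothetical names, and only `pypdf.PdfReader`, its `pages` attribute, and `page.extract_text()` are taken from pypdf itself.

```python
from typing import Iterable, List, Protocol


class TextPage(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page object."""

    def extract_text(self) -> str: ...


def naive_extract(pages: Iterable[TextPage]) -> List[str]:
    """Return one plain-text string per page; images, tables, and layout are dropped."""
    return [page.extract_text() or "" for page in pages]


def read_pdf_text(path: str) -> List[str]:
    """Hypothetical helper: open a PDF with pypdf and run the naive extraction."""
    from pypdf import PdfReader  # imported lazily so the sketch loads without pypdf

    return naive_extract(PdfReader(path).pages)
```

Everything a page contains beyond its text layer is lost at the `extract_text()` call, which is the gap this issue is about.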

The following are ignored:

Images
Tables
PDF Metadata
Document structure such as headings

We should be able to improve retrieval by extracting the information present in these components.
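As a rough starting point, two of the ignored items (PDF metadata and embedded images) can already be reached through pypdf. The sketch below is an assumption about how a customised reader might bundle them, not an existing LlamaIndex API: `describe_page` and `read_pdf_enriched` are hypothetical names, and the output dictionary shape is made up for illustration. It relies on pypdf's `reader.metadata` and `page.images`; image access in pypdf may need Pillow installed. Tables and headings are not handled here.

```python
from typing import Any, Dict, List


def describe_page(page: Any, page_number: int, doc_info: Dict[str, str]) -> Dict[str, Any]:
    """Bundle a page's text with document metadata and the names of its embedded images."""
    images = getattr(page, "images", [])  # pypdf pages expose .images
    return {
        "text": page.extract_text() or "",
        "metadata": {
            **doc_info,
            "page": page_number,
            "image_names": [image.name for image in images],
        },
    }


def read_pdf_enriched(path: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: one enriched record per page of the PDF at `path`."""
    from pypdf import PdfReader  # imported lazily so the sketch loads without pypdf

    reader = PdfReader(path)
    info = reader.metadata  # may be None when the PDF carries no metadata
    doc_info: Dict[str, str] = {}
    if info is not None:
        if info.title:
            doc_info["title"] = info.title
        if info.author:
            doc_info["author"] = info.author
    return [describe_page(page, i + 1, doc_info) for i, page in enumerate(reader.pages)]
```

The image bytes themselves are available as `image.data` on each entry of `page.images`, so they could be handed to one of the image readers rather than just being listed by name.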

Solution

Fork the standard LlamaIndex PDFReader and customise it. Look into the various LlamaIndex image readers.

Alternatives

Use readers from Unstructured.io

Oct 16 '23 10:10 janaka

Also look into LayoutPDFReader by LLMSherpa

Oct 23 '23 13:10 janaka