docq
CORE: Sophisticated PDFReader with Image and Table extraction
Current
The LlamaIndex PDFReader (used by SimpleDirectoryReader) currently performs only simple (naive) text extraction. It uses the pypdf package: it iterates through the pages (pypdf.PdfReader.pages) and calls page.extract_text() on each one to get the text for the document.
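For reference, the current behaviour can be approximated with the sketch below. It is not the LlamaIndex source; the function names (pages_to_records, load_pdf_text) and the record layout are assumptions made for illustration. Only PdfReader, its pages attribute, and page.extract_text() come from pypdf.

```python
from typing import Any, Dict, Iterable, List


def pages_to_records(pages: Iterable[Any], source: str) -> List[Dict[str, Any]]:
    """Build one text-only record per page (hypothetical record layout)."""
    records = []
    for number, page in enumerate(pages, start=1):
        # Guard against a page that yields no text at all.
        text = page.extract_text() or ""
        records.append({"text": text, "source": source, "page": number})
    return records


def load_pdf_text(path: str) -> List[Dict[str, Any]]:
    """Naive extraction: text only; images, tables and metadata are dropped."""
    # Imported here so pages_to_records can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return pages_to_records(reader.pages, source=path)
```

Calling load_pdf_text("report.pdf") (the file name is a placeholder) returns a list of dicts with only text, source and page, which is why everything in the list below is lost.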
The following are ignored:
- Images
- Tables
- PDF Metadata
- Document structure such as headings
We should be able to improve retrieval by also extracting the information held in these components.
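Two of the ignored items are already reachable through pypdf, so a first step could look like the sketch below. It assumes a recent pypdf release that exposes reader.metadata and page.images; the helper names and the output layout are hypothetical. Table extraction and heading detection are not covered here, since pypdf has no table API.

```python
from typing import Any, Dict, List


def metadata_to_dict(info: Any) -> Dict[str, str]:
    """Copy common document-information fields, skipping missing ones."""
    if info is None:  # reader.metadata is None when the PDF has no info dict
        return {}
    fields = ("title", "author", "subject", "creator", "producer")
    return {
        name: str(getattr(info, name))
        for name in fields
        if getattr(info, name, None)
    }


def list_page_images(page: Any, page_number: int) -> List[Dict[str, Any]]:
    """Describe each embedded image on a page (name and byte size)."""
    return [
        {"page": page_number, "name": image.name, "size_bytes": len(image.data)}
        for image in page.images
    ]


def inspect_pdf(path: str) -> Dict[str, Any]:
    """Collect metadata and image descriptors for a whole document."""
    # Imported here so the helpers above can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    images: List[Dict[str, Any]] = []
    for number, page in enumerate(reader.pages, start=1):
        images.extend(list_page_images(page, number))
    return {"metadata": metadata_to_dict(reader.metadata), "images": images}
```

The image bytes (image.data) could then be passed to an image reader or captioning step, and the metadata dict attached to each extracted document.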
Solution
Fork the standard LlamaIndex PDFReader and customise it. Look into the various LlamaIndex image readers.
Alternatives
Use readers from Unstructured.io
Also look into LayoutPDFReader by LLMSherpa