langchain
Feature: add pdfplumber-based PDF loader, with a loader method utilizing the visual debugger
Add a new PDF loader based on pdfplumber, and an additional loader method for visual debugging
This is my first time contributing to an open-source project, so any suggestions would be greatly appreciated. Thank you.
- Finding out whether PDFs are loaded cleanly and correctly into Documents can be significant work when dealing with many PDFs.
- pdfplumber provides a visual debugger for checking the PDF parse.
- Goal: allow users to take advantage of pdfplumber's visual debugger when desired while keeping the `load` method unchanged.
Integration includes
- A `load` method nearly identical to that of `PyMuPDFLoader`.
- An `annotate_and_load` method that takes advantage of pdfplumber's visual debugging to save an annotated version of the PDF being loaded to a user-defined directory.
- Example usage in `pdf.ipynb`
- Integration tests in `test_pdf.py`
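For readers unfamiliar with pdfplumber's visual debugger, the idea behind an annotating load can be sketched roughly as follows. This is not the code in this PR: the function names `annotated_image_path` and `annotate_and_extract`, the file naming scheme, and the returned dict layout are all hypothetical, and the sketch saves one PNG per page rather than an annotated PDF. It assumes pdfplumber is installed and relies only on its `open`, `Page.to_image`, `Page.extract_words`, `Page.extract_text`, `PageImage.draw_rects`, and `PageImage.save` calls.

```python
from pathlib import Path
from typing import Dict, List


def annotated_image_path(pdf_path: str, out_dir: str, page_number: int) -> Path:
    """Build the output path for one annotated page image (hypothetical naming scheme)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_page{page_number}.png"


def annotate_and_extract(pdf_path: str, out_dir: str, resolution: int = 72) -> List[Dict]:
    """Extract text page by page and save an image of each page with word boxes drawn."""
    # Third-party dependency, imported lazily so the path helper works without it.
    import pdfplumber

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Render the page and outline every word pdfplumber detected on it.
            image = page.to_image(resolution=resolution)
            image.draw_rects(page.extract_words())
            target = annotated_image_path(pdf_path, out_dir, page.page_number)
            image.save(str(target), format="PNG")
            records.append(
                {
                    # extract_text() may return None for pages without text.
                    "page_content": page.extract_text() or "",
                    "metadata": {
                        "source": pdf_path,
                        "page": page.page_number,
                        "annotated": str(target),
                    },
                }
            )
    return records
```

Opening the saved images next to the extracted text makes it quick to spot pages where word boxes are missing or misplaced, which is the kind of check the visual debugger is meant to support.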
Who can review?
@hwchase17 @eyurtsev
Thank you @lesterpjy !! I'll try to review in the evening.
I'm in the process of reworking the loader abstraction to decouple parsing from loading of content. This should make it easier to reuse a parser with different types of loaders (e.g., content loaded from S3 or the web, and not only from the file system).
Here are the changes for PDFs:
https://github.com/hwchase17/langchain/pull/4356/files
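As a rough illustration of what decoupling parsing from loading can look like, here is a minimal standard-library sketch. It is not the abstraction in the linked PR: `Blob`, `Document`, `Parser`, `Utf8TextParser`, `blobs_from_directory`, `blobs_from_memory`, and `load` are hypothetical names chosen for this example, and a plain-text parser stands in for a PDF parser.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Iterable, Iterator, List, Protocol


@dataclass
class Blob:
    """Raw bytes plus a label for where they came from (hypothetical type)."""
    data: bytes
    source: str


@dataclass
class Document:
    """Parsed text with metadata (hypothetical stand-in for a document type)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class Parser(Protocol):
    """Anything that turns a blob into documents, regardless of the blob's origin."""
    def parse(self, blob: Blob) -> Iterator[Document]: ...


class Utf8TextParser:
    """Stand-in parser; a PDF parser would implement the same method."""
    def parse(self, blob: Blob) -> Iterator[Document]:
        yield Document(blob.data.decode("utf-8"), {"source": blob.source})


def blobs_from_directory(root: str, pattern: str = "*.txt") -> Iterator[Blob]:
    """Read matching files from the local file system."""
    for path in sorted(Path(root).glob(pattern)):
        yield Blob(path.read_bytes(), str(path))


def blobs_from_memory(items: Dict[str, bytes]) -> Iterator[Blob]:
    """Wrap bytes already fetched elsewhere, e.g. from object storage or the web."""
    for source, data in items.items():
        yield Blob(data, source)


def load(blobs: Iterable[Blob], parser: Parser) -> List[Document]:
    """Combine any blob source with any parser."""
    return [doc for blob in blobs for doc in parser.parse(blob)]
```

With this split, the same parser works unchanged whether the bytes come from disk or from a remote fetch, for example `load(blobs_from_memory({"s3://bucket/a.txt": b"hello"}), Utf8TextParser())`.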