Feature: add pdfplumber based PDF loader, with loader utilizing visual debugger

Open lesterpjy opened this issue 2 years ago • 1 comments

Add a new PDF loader based on pdfplumber, plus an additional loader method for visual debugging

This is my first time contributing to an open-source project, so any suggestions would be greatly appreciated. Thank you.

  • Checking whether PDFs are loaded cleanly and correctly into Documents can be significant work when dealing with many PDFs.
  • pdfplumber provides a visual debugger for checking how a PDF was parsed.
  • Goal: allow users to take advantage of pdfplumber's visual debugger when desired, while keeping the load method unchanged.

Integration includes

  1. A load method nearly identical to that of PyMuPDFLoader.
  2. An annotate_and_load method that uses pdfplumber's visual debugging to save an annotated version of the PDF being loaded to a user-defined directory.
  3. Example usage in pdf.ipynb
  4. Integration tests in test_pdf.py
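For readers unfamiliar with pdfplumber's visual debugger, here is a rough sketch of what an annotate-while-loading step could look like. It is not the code from this PR: `PageDocument`, `annotated_page_path`, the function signature, and the one-PNG-per-page naming scheme are all assumptions made for illustration. Only `pdfplumber.open`, `extract_text`, `extract_words`, `to_image`, `draw_rects`, and `save` are pdfplumber calls.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PageDocument:
    """Hypothetical stand-in for a LangChain Document (one per PDF page)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def annotated_page_path(pdf_path, output_dir, page_number):
    """Build the output file name for one annotated page image.

    The "<stem>_page<N>.png" naming scheme is an assumption, not the PR's.
    """
    stem = Path(pdf_path).stem
    return Path(output_dir) / f"{stem}_page{page_number}.png"


def annotate_and_load(pdf_path, output_dir, resolution=100):
    """Extract text per page and save a debug image with word boxes drawn on it."""
    # Imported lazily so the helpers above can be used without pdfplumber installed.
    import pdfplumber

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    docs = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for pages without a text layer.
            text = page.extract_text() or ""

            # Visual debugging: render the page and outline every extracted word.
            image = page.to_image(resolution=resolution)
            image.draw_rects(page.extract_words())
            image.save(str(annotated_page_path(pdf_path, out, page.page_number)))

            docs.append(
                PageDocument(
                    page_content=text,
                    metadata={"source": str(pdf_path), "page": page.page_number},
                )
            )
    return docs
```

Opening the saved images next to the extracted text makes it easier to spot words that were missed or merged during parsing.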

Who can review?

@hwchase17 @eyurtsev

lesterpjy avatar May 08 '23 18:05 lesterpjy

Thank you @lesterpjy !! I'll try to review in the evening.

I'm in the process of reworking the loader abstraction to decouple parsing from the loading of content. This should make it easier to reuse a parser with different types of loaders (e.g., content loaded from S3 or the web, and not only from the file system).
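As a rough illustration of that idea (not the code in the linked PR), the sketch below separates "where the bytes come from" from "how the bytes become documents". All names here (`RawBlob`, `blobs_from_files`, `blobs_from_memory`, `parse_text`, `load`) are hypothetical, and the parser is a toy UTF-8 decoder standing in for a real PDF parser.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Iterator, Mapping


@dataclass
class RawBlob:
    """Hypothetical container for raw content plus a label for its origin."""
    data: bytes
    source: str


def blobs_from_files(paths: Iterable[str]) -> Iterator[RawBlob]:
    """Loading step for the local file system."""
    for path in paths:
        yield RawBlob(data=Path(path).read_bytes(), source=str(path))


def blobs_from_memory(items: Mapping[str, bytes]) -> Iterator[RawBlob]:
    """Loading step for bytes already fetched elsewhere (e.g. from S3 or the web)."""
    for source, data in items.items():
        yield RawBlob(data=data, source=source)


def parse_text(blob: RawBlob) -> dict:
    """Toy parsing step: decode UTF-8. A PDF parser would take its place."""
    return {
        "page_content": blob.data.decode("utf-8"),
        "metadata": {"source": blob.source},
    }


def load(blobs: Iterable[RawBlob], parser: Callable[[RawBlob], dict]) -> list:
    """Combine any blob source with any parser."""
    return [parser(blob) for blob in blobs]


# The same parser works regardless of where the bytes came from.
docs = load(blobs_from_memory({"s3://bucket/a.txt": b"hello"}), parse_text)
```

Because the parser only ever sees a `RawBlob`, a pdfplumber-based parser written once could be paired with a file-system, S3, or web source without changes.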

Here are the changes for PDFs:

https://github.com/hwchase17/langchain/pull/4356/files

eyurtsev avatar May 08 '23 19:05 eyurtsev