Refactoring PDF loaders: all
This PR is a composition of many other PRs. Modifications will be published one after the other, to facilitate analysis and integration into langchain.
Refactoring all PDF loaders and parsers: community

- Description: refactoring of PDF parsers and loaders. See below.
- Issue: missing locks, parameter inconsistency, missing lazy approach, need to split loaders and parsers, etc.
- Twitter handle: pprados
- [X] Add tests and docs:
  - Add tests to check the consistency of the different implementations
  - Add tests to check table and image extraction
  - Update or add notebooks in the docs/docs/integrations directory
- [X] Lint and test: done
Rationale
Even though `Document` has a `page_content` parameter (rather than `text` or `body`), we believe it’s not good practice to work page by page. Indeed, this approach creates memory gaps in RAG projects. If a paragraph spans two pages, the beginning of the paragraph is at the end of one page, while the rest is at the start of the next. With a page-based approach, there will be two separate chunks, each containing part of a sentence. The corresponding vectors won’t be relevant. These chunks are unlikely to be selected when there’s a question specifically about the split paragraph, and if one of them is selected, there’s little chance the LLM can answer the question. This issue is worsened by the injection of headers, footers (if parsers haven’t properly removed them), images, or tables at the end of a page, as most current implementations tend to do.
Why is it important to unify the different parsers? Each has its own characteristics and strategies, more or less effective depending on the family of PDF files. One strategy is to identify the family of the PDF file (by inspecting the metadata or the content of the first page) and then select the most efficient parser in that case. By unifying parsers, the following code doesn't need to deal with the specifics of different parsers, as the result is similar for each. We'll propose a Parser using this strategy in another PR.
The PR
We propose a substantial PR to improve the different PDF parser integrations. All my clients struggle with PDFs. I took the initiative to address this issue at its root by refactoring the various integrations of Python PDF parsers. The goal is to standardize a minimum set of parameters and metadata and bring improvements to each one (bug fixes, feature additions).
Don't worry about the size of the PR. In the end, there are only two modified files. The rest is just updating unit tests and docs.
| source | what |
|---|---|
| langchain_community/document_loaders/pdf.py<br>langchain_community/document_loaders/parsers/pdf.py | Modified source code |
| langchain_community/tests/integration_tests/document_loaders/pdf.py<br>langchain_community/tests/integration_tests/document_loaders/parsers/pdf.py | Modified tests |
| docs/docs/integrations/document_loaders/*pdf*.ipynb | A replication of one notebook |
| docs/docs/how_to/document_loader_pdf.ipynb | An overview of PDF parsing |
| docs/docs/how_to/document_loader_custom.ipynb | Enhanced separation of loaders and parsers |
In order to qualify all the code, we worked in a separate project that mirrors the `langchain-common` structure. This way, we can compare the results of the historical implementations with the new ones. We understand that it's important to ensure that changes don't have a significant impact on existing code; this parallel project lets us test PDF readings before and after the modifications and compare the results. The only difference is the import path of the classes. You'll find all the files here.

This parallel project is available here. Open the `compare_old_new` directory in your development environment and use a DIFF tool to identify the differences.
```bash
git clone https://github.com/pprados/patch_langchain_common.git
cd patch_langchain_common/compare_old_new
```
Metadata
All parsers use lowercase keys for PDF file metadata, except `PDFPlumberParser`. For this particular case, we've added a dictionary wrapper that warns when keys with uppercase letters are used.
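For illustration, here is a minimal sketch of the idea behind such a wrapper; the class name and exact message are ours, not necessarily the PR's:

```python
import warnings


class LowercaseKeyWarningDict(dict):
    """Hypothetical sketch of a metadata wrapper that warns when a key
    containing uppercase letters is accessed, to ease the migration to
    lowercase metadata keys."""

    def __getitem__(self, key):
        if isinstance(key, str) and key != key.lower():
            warnings.warn(
                f"Metadata key {key!r} is deprecated; use {key.lower()!r} instead.",
                DeprecationWarning,
                stacklevel=2,
            )
        return super().__getitem__(key)


metadata = LowercaseKeyWarningDict({"CreationDate": "2024-01-01"})
metadata["CreationDate"]  # emits a DeprecationWarning
```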
Images
The current implementation in LangChain involves asking each parser for the text on a page, then retrieving images to apply OCR. The text extracted from images is then appended to the end of the page text, which may split paragraphs across pages, worsening the RAG model’s performance.
To avoid this, we modified the strategy for injecting OCR results from images. Now, the result is inserted between two paragraphs of text (`\n\n` or `\n`), just before the end of the page. This allows a half-paragraph to be combined with the first paragraph of the following page.
Currently, the LangChain implementation uses RapidOCR to analyze images and extract any text. This algorithm is designed to work with Chinese and English, not other languages. Since the implementation uses a function rather than a method, it’s not possible to modify it. We have modified the various parsers to allow for selecting the algorithm to analyze images. Now, it’s possible to use RapidOCR, Tesseract, or invoke a multimodal LLM to get a description of the image.
For converting images to text, the possible formats are: text, markdown, and HTML. Why is this important? If it’s necessary to split a result based on the origin of the text fragments, it’s possible to do so at the level of image translations. An identification rule such as `![…](…)` (Markdown) or `<img …/>` (HTML) allows us to identify text fragments originating from an image.
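To illustrate the injection strategy, here is a minimal sketch; the helper name is ours and the PR's internal code may differ:

```python
def merge_text_and_image_text(page_text: str, image_text: str) -> str:
    """Hypothetical sketch: insert the text extracted from an image at the
    last paragraph break of the page, so that a paragraph split across two
    pages stays adjacent to its continuation on the next page."""
    for delimiter in ("\n\n", "\n"):
        position = page_text.rfind(delimiter)
        if position != -1:
            return page_text[:position] + delimiter + image_text + page_text[position:]
    # No paragraph break on the page: fall back to appending.
    return page_text + "\n\n" + image_text
```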
Tables
Tables present in PDF files are another challenge. Some algorithms can detect part of them. This typically involves a specialized process, separate from the text flow. That is, the text extracted from the page includes each cell's content, sometimes in columns, sometimes in rows. This text is challenging for the LLM to interpret. Depending on the capabilities of the libraries, it may be possible to detect tables, then identify the cell boxes during text extraction to inject the table in its entirety. This way, the flow remains coherent. It’s even possible to add a few paragraphs before and after the table to prompt an LLM to describe it. Only the description of the table will be used for embedding.
Tables identified in PDF pages can be translated into markdown (if there are no merged cells) or HTML (which consumes more tokens). LLMs can then make use of them.
Unfortunately, this approach isn’t always feasible. In such cases, we can apply the approach used for images, by injecting tables and images between two paragraphs in the page’s text flow. This is always better than placing them at the end of the page.
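For illustration, a minimal sketch of rendering a detected table (without merged cells) as Markdown; the helper is hypothetical and not part of the PR's public API:

```python
def table_to_markdown(rows: list[list[str]]) -> str:
    """Hypothetical sketch: render a detected table, given as a list of rows
    of cell strings (no merged cells), as a Markdown table an LLM can read."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("|" + "|".join(["---"] * len(header)) + "|")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


print(table_to_markdown([["name", "pages"], ["report.pdf", "12"]]))
```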
Combining Pages
As mentioned, in a RAG project, we want to work with the text flow of a document, rather than by page. A mode is dedicated to this, which can be configured to specify the character to use as page delimiter in the flow. This could simply be `\n`, `------\n` or `\f` to clearly indicate a page change, or `<!-- PAGE BREAK -->` for seamless injection in a Markdown viewer without a visual effect.
Why is it important to identify page breaks when retrieving the full document flow? Because we generally want to provide a URL with the chunk’s location when the LLM answers. While it’s possible to reference the entire PDF, this isn’t practical if it’s more than two pages long. It’s better to indicate the specific page to display in the URL. Therefore, assistance is needed so that chunking algorithms can add the page metadata to each chunk. The choice of delimiter helps the algorithm calculate this parameter.
Similarly, we’ve added metadata in all parsers with the total number of pages in the document. Why is this important? If we want to reference a document, we need to determine if it’s relevant. A reference is valid if it helps the user quickly locate the fragment within the document (using the page and/or a chunk excerpt). But if the URL points to a PDF file without a page number (for various reasons) and the file has a large number of pages, we want to remove the reference that doesn’t assist the user. There’s no point in referencing a 100-page document! The `total_pages` metadata can then be used. We recommend this approach in an extension to LangChain that we propose for managing document references: langchain-reference.
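As an illustration of how a chunking step can exploit the delimiter, here is a minimal sketch, assuming the chunk's start offset in the single-document flow is known (for instance, recorded by the text splitter):

```python
def page_of_offset(full_text: str, offset: int, pages_delimiter: str = "\f") -> int:
    """Hypothetical sketch: recover the 0-based page number of a chunk that
    starts at `offset` in the single-document text flow, by counting the
    page delimiters that occur before it."""
    return full_text.count(pages_delimiter, 0, offset)
```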
Compatibility
We have tried, as much as possible, to maintain compatibility with the previous version. This is reflected in preserving the order of parameters and using the default values for each implementation so that the results remain similar. The unit and integration tests for the various parsers have not been modified; they are still valid.
Ideally, we would prefer an interface like:
```python
class XXXLoader(...):
    def __init__(self, file_path, *, ...):
        ...
```
but this could break compatibility for positional arguments.
Perhaps it would be feasible to plan a migration for LangChain v1.0 by modifying the default parameters to make them mandatory during the transition to v1.0. At that point, we could reintroduce default values.
Normalisation
The `AzureAIDocumentIntelligenceParser` class introduces the `mode` parameter, which accepts the values `single`, `page`, and `markdown`.

The deprecated `UnstructuredPDFLoader` class introduces the `mode` parameter, which accepts the values `single`, `paged`, and `markdown`.

Based on this model, we are extending the `mode` parameter to most parsers, with the values `single`, `page`, and `markdown`. The value `paged` is declared deprecated.
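For example, switching between one document per page and one document per file becomes a simple parameter change. A sketch assuming the new `mode` parameter described here; the file name is illustrative:

```python
from langchain_community.document_loaders import PyMuPDFLoader

# One Document per page versus one Document for the whole file.
docs_per_page = PyMuPDFLoader("example.pdf", mode="page").load()
whole_document = PyMuPDFLoader("example.pdf", mode="single").load()

assert len(whole_document) == 1
```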
The different `Loader` and `BlobParser` classes now offer the following parameters:

- `file_path`: `str` or `PurePath` with the file name.
- `password`: `str` with the file password, if needed.
- `mode`: return a single document per file or one document per page (extended with `elements` in the case of Unstructured or other specific parsers).
- `pages_delimiter`: specify how to join pages (`\f` by default).
- `extract_images`: enable image extraction (already present in most loaders/parsers).
- `images_parser`: specify how to handle images (invoking OCR, an LLM, etc.).
- `images_inner_format`: specify how to inject the image text (`text`, `markdown-img`, `html-img`).
- `extract_tables`: allow extraction of tables detected by the underlying libraries, for certain parsers.
- Other parameters are specific to each parser.

The text extracted from images is now integrated between two paragraphs.
For the `images_parser` parameter, we propose three parsers:

- `RapidOCRBlobParser()`
- `TesseractBlobParser()`
- `LLMImageBlobParser()`
Here’s how it’s used:
```python
XXXLoader(
    file_path,
    images_parser=LLMImageBlobParser(
        model=ChatOpenAI(model="gpt-4o", max_tokens=1024),
    ),
    images_inner_format="markdown-img",
)
```
Tables
Some parsers are able to extract tables, but this is not integrated into LangChain. We've added the necessary features to take this into account in the following loaders (a usage sketch follows the list):

- `PyMuPDFLoader`
- `PDFPlumberLoader`
- `ZeroxPDFLoader`
- `UnstructuredPDFLoader`
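A minimal usage sketch, assuming the new `extract_tables` parameter described above; the file name is illustrative:

```python
from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader(
    "example.pdf",
    mode="single",
    extract_tables="markdown",  # or "html", which consumes more tokens
)
docs = loader.load()
```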
Metadata
The different parsers offer a minimum set of common metadata:

- `source`
- `page`
- `total_pages`
- `creationdate`
- `creator`
- `producer`
- and whatever additional metadata the modules can extract from PDF files.

Dates are converted to ISO 8601 format for easier handling and consistency with other file formats. All keys are lowercase.
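For illustration, the metadata of a document loaded in page mode could look like this (the values are invented):

```python
{
    "source": "example.pdf",
    "page": 3,
    "total_pages": 12,
    "creationdate": "2024-05-17T09:30:00+00:00",
    "creator": "LibreOffice Writer",
    "producer": "LibreOffice 7.6",
}
```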
Tests
We propose matrix tests to validate all parsers compatible with the new approach:

- `test_standard_parameters()`
- `test_parser_with_table()`

To validate all the parsers, we retrieved all the PDF files used by each parser for its own tests and ran every LangChain parser against all of these files. This ensures that no parser crashes when parsing any of these PDF files.
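A minimal sketch of the matrix-test idea (illustrative only, not the PR's actual test code): every compatible parser is run against the same expectations, here the presence of the standardized `total_pages` metadata. The fixture file name is an assumption.

```python
import pytest
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.parsers import (
    PDFPlumberParser,
    PyMuPDFParser,
    PyPDFium2Parser,
)


@pytest.mark.parametrize(
    "parser", [PyMuPDFParser(), PDFPlumberParser(), PyPDFium2Parser()]
)
def test_standard_parameters(parser):
    # "sample.pdf" is an illustrative fixture name.
    docs = list(parser.lazy_parse(Blob.from_path("sample.pdf")))
    assert docs
    assert all("total_pages" in doc.metadata for doc in docs)
```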
New features of parsers
We summarize the modifications for each parser below.
| | metadata | images | table | password | parser | deprecated | lazy_load | lock |
|---|---|---|---|---|---|---|---|---|
| PyPDF | ✔ | ✔ | | | | | | |
| PyPDFium2 | ✔ | ✔ | ✔ | ✔ | | | | |
| PyPDFMiner | ✔ | ✔ | ✔ | ✔ | | | | |
| PyMuPDF | ✔ | ✔ | ✔ | ✔ | ✔ | | | |
| PDFPlumber | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | | |
| OnlinePDF | ✔ | ✔ | | | | | | |
| ZeroxPDF | ✔ | ✔ | ✔ | ✔ | | | | |
| UnstructuredPDF | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | | |
| UnstructuredPDF | ✔ | | | | | | | |
| PyPDFDirectory | ✔ | ✔ | ✔ | | | | | |
| PagedPDFSplitter | ✔ | | | | | | | |
| OnlinePDF | ✔ | | | | | | | |
| UnstructuredPDF | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | | |
- PyPDFMinerLoader: when the `extract_images` parameter is set to `True`, the current implementation does not respect the `concatenate_pages` parameter. It returns multiple pages instead of a single one, as specified by default.
- OnlinePDFLoader: this class is a poorly implemented wrapper around `UnstructuredPDFLoader` (lacking `lazy_load()`).
- parser: split the loader and parser. As discussed in the LangChain documentation, it can be useful to decouple analysis logic from loading logic, making it easier to reuse a given analyzer regardless of how the data has been loaded. Where necessary, we have split the two logics.
New loader / parsers
New parsers will be introduced in a separate pull request.
| Classes | What |
|---|---|
| UnstructuredPDF | Extend unstructured to conform to the new specification |
| LlamaIndexPDF | Integration of the online LlamaIndex API |
| PyMuPDF4LLM | Integration of PyMuPDF4LLM |
| PDFRouter | Dynamically selects the parser |
| DoclingPDF | Use Docling |
| PDFMulti | Use multiple parsers and select the best |
For example, with the unification of parsers, it will be possible to choose the parser according to the characteristics of the PDF file.
```python
routes = [
    # Name, keys with regex, parser
    ("Microsoft", {"producer": "Microsoft", "creator": "Microsoft"}, PyMuPDFParser()),
    ("LibreOffice", {"producer": "LibreOffice"}, PDFPlumberParser()),
    ("Xdvipdfmx", {"producer": "xdvipdfmx.*", "page1": "Hello"}, PDFPlumberParser()),
    ("default", {}, PyPDFium2Parser()),
]
loader = PDFRouterLoader(filename, routes=routes)
loader.load()
```
This will be present in other PRs.
Succession of PRs
| Step | What? |
|---|---|
| 01 | ✅ Prepare the upcoming PRs |
| 02 | ✅ PyMuPDF |
| 03 | ✅ PyPDF |
| 04 | ✅ PDFMiner |
| 05 | ✅ PyPDFium2 |
| 06 | ⌛ PDFPlumber |
| 07 | ZeroxPDF |
| 08 | unstructured-inference, unstructured, Unstructured |
| 09 | how_to |
| 10 | deprecated |
| 11 | PDFRouter |
| 12 | LlamaIndex |
| 13 | Docling |
| ... | ... |