
Is it able to work on large PDF files?

Open BrunoVox opened this issue 4 months ago • 2 comments

I'm trying to extract key-value pairs from PDFs that all follow a very similar pattern. These PDFs are quite large (around 20-30 pages), and I'm getting errors that make me wonder whether LangExtract is ready to work with such documents. For now I have been extracting the raw text from these PDFs, and I have noticed two issues:

  1. The resulting string is VERY large
  2. The string contains loads of useless information
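Both points can sometimes be reduced by cleaning the text before it is passed to the extractor. The following is a minimal sketch, not part of LangExtract: it assumes the PDF text is available as one string per page, and the function name `clean_pages` and the `repeat_ratio` parameter are hypothetical. It drops lines that repeat on most pages (typical headers and footers) and collapses runs of whitespace.

```python
import re
from collections import Counter


def clean_pages(pages, repeat_ratio=0.6):
    """Drop lines repeated on most pages and collapse extra whitespace.

    `pages` is a list of strings, one per PDF page (an assumption about
    how the text was extracted). `repeat_ratio` is a hypothetical knob.
    """
    # Count on how many pages each stripped, non-empty line appears.
    line_pages = Counter()
    for page in pages:
        unique_lines = {line.strip() for line in page.splitlines() if line.strip()}
        line_pages.update(unique_lines)

    # A line seen on at least this many pages is treated as boilerplate.
    # The floor of 2 keeps single-page documents untouched.
    threshold = max(2, int(len(pages) * repeat_ratio))
    boilerplate = {line for line, n in line_pages.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            re.sub(r"\s+", " ", line.strip())
            for line in page.splitlines()
            if line.strip() and line.strip() not in boilerplate
        ]
        cleaned.append("\n".join(kept))

    # Separate the surviving pages with a blank line and skip empty ones.
    return "\n\n".join(p for p in cleaned if p)
```

Lines that differ from page to page, such as "Page 3 of 27", are not caught by this heuristic and would need a separate pattern.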

Also, it takes a long time to analyze even part of a single PDF. Have you used this library for a similar use case? If so, could you share how you did it?

In my most recent implementation I was able to work around some issues, but I still can't get past an error that has occurred ever since I moved from the examples to the extracted PDF text. The traceback is:

LangExtract: Processing, current=9,533 chars, processed=9,533 chars:  [00:33]
Traceback (most recent call last):
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/resolver.py", line 224, in resolve
    extraction_data = self.string_to_extraction_data(input_text)
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/resolver.py", line 392, in string_to_extraction_data
    raise ResolverParsingError("Content must contain an 'extractions' key.")
langextract.resolver.ResolverParsingError: Content must contain an 'extractions' key.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/---/Projects/---/lx_experimenting/main.py", line 119, in <module>
    result = lx.extract(
        text_or_documents=pdf_to_text("./data/labeled/docs/a-b.pdf"),
    ...<9 lines>...
        max_char_buffer=1000    # Smaller contexts for better accuracy
    )
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/__init__.py", line 291, in extract
    return annotator.annotate_text(
           ~~~~~~~~~~~~~~~~~~~~~~~^
        text=text_or_documents,
        ^^^^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        max_workers=max_workers,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/annotation.py", line 506, in annotate_text
    annotations = list(
        self.annotate_documents(
    ...<7 lines>...
        )
    )
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/annotation.py", line 236, in annotate_documents
    yield from self._annotate_documents_single_pass(
        documents, resolver, max_char_buffer, batch_length, debug, **kwargs
    )
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/annotation.py", line 356, in _annotate_documents_single_pass
    annotated_chunk_extractions = resolver.resolve(
        top_inference_result, debug=debug, **kwargs
    )
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/resolver.py", line 233, in resolve
    raise ResolverParsingError("Failed to parse content.") from e
langextract.resolver.ResolverParsingError: Failed to parse content.

I find this error strange because the same script runs successfully when I give it very simple strings instead of the raw text extracted from the PDF. Also, the 'extractions' key does exist: in the verbose output I can see that the script manages to extract the information I'm asking for.
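One way to narrow this down is to run the extraction on one piece of the text at a time and record which pieces fail, rather than letting a single bad response abort the whole document. The sketch below is an illustration only: `extract_per_chunk` and `extract_fn` are hypothetical names, and `extract_fn` stands in for whatever call does the extraction (for example a small wrapper around `lx.extract` with the same arguments used in `main.py`).

```python
def extract_per_chunk(chunks, extract_fn):
    """Run extract_fn on each chunk, keeping successes and failures apart.

    `extract_fn` is a hypothetical callable taking one string and
    returning an extraction result.
    """
    results, failures = [], []
    for index, chunk in enumerate(chunks):
        try:
            results.append((index, extract_fn(chunk)))
        except Exception as exc:
            # Deliberately broad: the goal is diagnosis, so the failing
            # chunk and its error are kept and the loop continues.
            failures.append((index, chunk, exc))
    return results, failures
```

Inspecting the chunks in `failures` should show whether the parsing error is tied to specific content (for example a table or a page of boilerplate) or occurs at random.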

I hope the information I provided wasn't too messy, and I'm available to add more context as needed.

BrunoVox, Aug 22 '25 16:08

PDF support is not available yet, but it's something I'm interested in. I have a small prototype running locally, and once the library's core aspects have stabilized I will work on adding it.

aksg87, Aug 23 '25 14:08

Much needed

vanshavenger, Aug 29 '25 19:08