Is it able to work on large PDF files?
I'm trying to extract key-value pairs from PDFs that follow a very similar pattern. These PDFs are quite large (around 20-30 pages), and I'm getting errors that make me wonder whether LangExtract is ready to work with such documents. So far I have been extracting raw text from these PDFs myself, and there are two issues I detected:
- The resulting string is VERY large
- The string contains loads of useless information
On top of that, the run time is long even to partially analyze a single PDF. Have you used this library on a similar use case? If so, could you share some information on how you did it?
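For context, my text-extraction step is essentially the following sketch (using pypdf; the `pdf_to_text` name matches my script, but the whitespace cleanup shown here is a rough heuristic, not anything from LangExtract):

```python
import re


def clean_text(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines to shrink the string."""
    lines = [line.strip() for line in raw.splitlines()]
    lines = [line for line in lines if line]
    return "\n".join(re.sub(r"[ \t]+", " ", line) for line in lines)


def pdf_to_text(path: str) -> str:
    """Extract and lightly clean the text of every page in a PDF."""
    from pypdf import PdfReader  # deferred import; assumes pypdf is installed

    reader = PdfReader(path)
    pages = (page.extract_text() or "" for page in reader.pages)
    return "\n".join(clean_text(p) for p in pages)
```

Even with this cleanup the string stays large, since `extract_text` keeps headers, footers, and other page furniture.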
In my most recent implementation I managed to work around some of these issues, but I still can't get past an error that has occurred ever since I switched from the examples to the extracted PDF text. Here it is:
```
LangExtract: Processing, current=9,533 chars, processed=9,533 chars: [00:33]
Traceback (most recent call last):
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/resolver.py", line 224, in resolve
    extraction_data = self.string_to_extraction_data(input_text)
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/resolver.py", line 392, in string_to_extraction_data
    raise ResolverParsingError("Content must contain an 'extractions' key.")
langextract.resolver.ResolverParsingError: Content must contain an 'extractions' key.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/---/Projects/---/lx_experimenting/main.py", line 119, in <module>
    result = lx.extract(
        text_or_documents=pdf_to_text("./data/labeled/docs/a-b.pdf"),
        ...<9 lines>...
        max_char_buffer=1000  # Smaller contexts for better accuracy
    )
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/__init__.py", line 291, in extract
    return annotator.annotate_text(
           ~~~~~~~~~~~~~~~~~~~~~~~^
        text=text_or_documents,
        ^^^^^^^^^^^^^^^^^^^^^^^
        ...<6 lines>...
        max_workers=max_workers,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/annotation.py", line 506, in annotate_text
    annotations = list(
        self.annotate_documents(
            ...<7 lines>...
        )
    )
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/annotation.py", line 236, in annotate_documents
    yield from self._annotate_documents_single_pass(
        documents, resolver, max_char_buffer, batch_length, debug, **kwargs
    )
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/annotation.py", line 356, in _annotate_documents_single_pass
    annotated_chunk_extractions = resolver.resolve(
        top_inference_result, debug=debug, **kwargs
    )
  File "/Users/---/Projects/---/lx_experimenting/.venv/lib/python3.13/site-packages/langextract/resolver.py", line 233, in resolve
    raise ResolverParsingError("Failed to parse content.") from e
langextract.resolver.ResolverParsingError: Failed to parse content.
```
I find this error very strange because the same script runs successfully on simple strings; it only fails on the raw text extracted from the PDF. Also, the 'extractions' key does actually exist: in the verbose output I can even see that the script extracts the information I'm asking for.
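In case it helps anyone debugging the same thing, here is a small helper I've been using to narrow down which slice of the PDF text trips the resolver. This is a hypothetical helper of my own, not part of LangExtract; it just re-runs extraction over fixed-size slices and records the offsets of the ones that raise, so I can inspect the raw text there by hand:

```python
def find_failing_spans(text, try_extract, span=1000):
    """Run try_extract on fixed-size slices of text and return the
    (start, end) offsets of every slice that raised an exception."""
    failures = []
    for start in range(0, len(text), span):
        chunk = text[start:start + span]
        try:
            try_extract(chunk)
        except Exception:  # e.g. langextract.resolver.ResolverParsingError
            failures.append((start, start + span))
    return failures


# Usage sketch (assuming langextract is installed; the lx.extract
# arguments are the same ones used in my main script):
# import langextract as lx
# failing = find_failing_spans(
#     pdf_text,
#     lambda chunk: lx.extract(text_or_documents=chunk, ...),
# )
# for start, end in failing:
#     print(repr(pdf_text[start:end]))
```

The span size mirrors my `max_char_buffer=1000` setting so each slice roughly matches one chunk the annotator sends to the model.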
I hope the information I provided wasn't too messy; I'm happy to add more context as needed.
PDF support is not available yet, but it's something I'm interested in. I have a small prototype running locally, and once the library's core aspects have stabilized I'll work on adding this.
Much needed