open-parse
Some PDF documents cannot be parsed
Initial Checks
- [X] I confirm that I'm on the latest version
Description
Example Code
import openparse
from openparse import DocumentParser
from IPython.display import display
pdf_path = "/Users/tjk/Desktop/ceshi_pdf/example1.pdf"
parser = DocumentParser(
    table_args={"parsing_algorithm": "pymupdf"}
)
parsed_content = parser.parse(pdf_path)
Python, open-parse & OS Version
python_version: 3.8.18
operating_system: Darwin
os_version: 23.0.0
open-parse version: 0.5.7
install path: /Users/tjk/miniconda3/envs/pytorch/lib/python3.8/site-packages/openparse
python version: 3.8.18 (default, Sep 11 2023, 08:17:16) [Clang 14.0.6 ]
platform: macOS-14.0-arm64-arm-64bit
related packages: tokenizers-0.19.1 PyMuPDF-1.24.9 torchvision-0.18.1 transformers-4.43.1 torch-2.3.1 pydantic-2.8.2
waiting
+1 on this. pdfminer struggles with a large number of the documents I'm testing with. pymupdf, on the other hand, opens anything I throw at it flawlessly. Setting ocr=true will flip parsing over to pymupdf, but it also brings in additional logic that is meant for OCR.
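To check whether a given file fails only under pdfminer, something like the sketch below might help. It is not part of open-parse: `probe_pdf`, `open_with_pdfminer` and `open_with_pymupdf` are hypothetical helper names, and it assumes pdfminer.six and PyMuPDF are installed in the same environment. Each backend is imported lazily, so the file loads even when one of them is missing.

```python
import sys
from typing import Callable, Dict, Optional


def open_with_pdfminer(path: str) -> None:
    """Walk every page with pdfminer.six; raises if the file cannot be parsed."""
    from pdfminer.high_level import extract_pages  # requires pdfminer.six

    for _ in extract_pages(path):
        pass


def open_with_pymupdf(path: str) -> None:
    """Extract text from every page with PyMuPDF; raises if the file cannot be read."""
    import fitz  # requires PyMuPDF

    with fitz.open(path) as doc:
        for page in doc:
            page.get_text()


def probe_pdf(
    path: str, backends: Dict[str, Callable[[str], None]]
) -> Dict[str, Optional[str]]:
    """Run each backend on the file and map its name to None (ok) or an error string."""
    results: Dict[str, Optional[str]] = {}
    for name, opener in backends.items():
        try:
            opener(path)
            results[name] = None
        except Exception as exc:  # diagnostic tool: record any failure, don't re-raise
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    report = probe_pdf(
        sys.argv[1],
        {"pdfminer": open_with_pdfminer, "pymupdf": open_with_pymupdf},
    )
    for name, error in report.items():
        print(f"{name}: {'ok' if error is None else error}")
```

Run it as a script with the PDF path as the only argument. If pdfminer reports an error and pymupdf reports ok, the failure is probably in the pdfminer text-extraction path rather than in the file itself.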
This seems to be a pdfminer issue: https://github.com/pdfminer/pdfminer.six/issues/1004 https://github.com/NixOS/nixpkgs/pull/339919
There's a fix now, but you'll have to wait until it gets released, which could be a while.