
TypeError: 'PDFObjRef' object is not iterable

Open corobin opened this issue 7 months ago • 4 comments

After updating to version 20240706, extract_text() on a PDF throws TypeError: 'PDFObjRef' object is not iterable.

This did not occur on the previous version, 20231228.

Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> from pdfminer.high_level import extract_text
>>> text = extract_text("Working.pdf")
>>> text = extract_text("Error.pdf")
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    text = extract_text(path)
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
  File "C:\Program Files\Python312\Lib\site-packages\pdfminer\pdfpage.py", line 63, in __init__
    mediabox_params: List[Any] = [
TypeError: 'PDFObjRef' object is not iterable
>>>

Working.pdf - a newly created blank page made with Acrobat.

Error.pdf - downloaded; I cannot change how it is created. I deleted all visible text on the page, which did not change the error.
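
The traceback stops at the mediabox_params list in PDFPage.__init__, which suggests (an assumption from the traceback, not confirmed against the pdfminer source) that the page's MediaBox in Error.pdf is stored as an indirect reference and is iterated before being resolved. The sketch below reproduces that failure mode with plain Python only. FakeObjRef, resolve_all, mediabox_unresolved and mediabox_resolved are hypothetical stand-ins for illustration and are not pdfminer's classes or functions; in pdfminer itself the helper for resolving references is resolve1 in pdfminer.pdftypes.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect PDF reference (not pdfminer's PDFObjRef)."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve_all(obj):
    """Follow stand-in references until a direct object is reached."""
    while isinstance(obj, FakeObjRef):
        obj = obj.resolve()
    return obj


def mediabox_unresolved(attrs):
    # Iterates the raw attribute: works for a direct array,
    # raises TypeError if the attribute itself is a reference.
    return [resolve_all(value) for value in attrs["MediaBox"]]


def mediabox_resolved(attrs):
    # Resolves the attribute first, then each element.
    return [resolve_all(value) for value in resolve_all(attrs["MediaBox"])]


# Assumed shapes: a direct array (like Working.pdf) and an indirect one (like Error.pdf).
direct_page = {"MediaBox": [0, 0, 612, 792]}
indirect_page = {"MediaBox": FakeObjRef([0, 0, 612, 792])}

print(mediabox_unresolved(direct_page))
print(mediabox_resolved(indirect_page))
try:
    mediabox_unresolved(indirect_page)
except TypeError as exc:
    print("TypeError:", exc)
```

If this reading is right, resolving the MediaBox entry before iterating over it would make both files behave the same way.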

corobin, Jul 10 '24 00:07