pypdf 'IndexError: list index out of range' when extracting text

Open MartinThoma opened this issue 1 year ago • 6 comments

I've got an IndexError when extracting text. The file opens fine in Chrome.

Environment

$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2

Code + PDF

The file: pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf

>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf')
/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py:1229: PdfReadWarning: incorrect startxref pointer(1)
  warnings.warn(
>>> for page in reader.pages: print(page.extract_text())
[...]
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
[...]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1507, in extract_text
    return self._extract_text(
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1441, in _extract_text
    process_operation(operator, operands)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1301, in process_operation
    float(operands[5]),
IndexError: list index out of range

It's print(reader.pages[10].extract_text()) to be exact.
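For anyone trying to pin down which page fails in a similar file, a small helper along these lines may be handy. It is only a sketch: find_failing_pages is a hypothetical name, not part of PyPDF2, and it assumes nothing more than that each page object has an extract_text() method.

```python
from typing import Iterable, List, Tuple


def find_failing_pages(pages: Iterable) -> List[Tuple[int, str]]:
    """Call extract_text() on every page and collect (index, error) pairs."""
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: we only want to locate the page
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return failures


# Possible usage with the reader from the report (not run here):
# from PyPDF2 import PdfReader
# reader = PdfReader("pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf")
# print(find_failing_pages(reader.pages))
```

Catching Exception broadly is deliberate here, since the goal is only to find the page index, not to handle the error.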

Jul 10 '22 09:07 MartinThoma

The same file gives a ValueError("invalid literal for int() with base 10: b'7267753-726774'") when trying to make an overlay.

Jul 10 '22 09:07 MartinThoma

fwiw, i'm seeing a similar error with dump.pdf which is generated during the test suite of xml2rfc.

Python 3.10.5 (main, Jun  8 2022, 09:26:22) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from PyPDF2 import PdfReader

In [2]: r = PdfReader('../dump.pdf')

In [3]: r.pages[0].extract_text()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-2445f91b85f4> in <module>
----> 1 r.pages[0].extract_text()

/usr/lib/python3/dist-packages/PyPDF2/_page.py in extract_text(self, Tj_sep, TJ_sep, space_width)
   1314         :return: The extracted text
   1315         """
-> 1316         return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
   1317 
   1318     def extract_xform_text(

/usr/lib/python3/dist-packages/PyPDF2/_page.py in _extract_text(self, obj, pdf, space_width, content_key)
   1127         if "/Font" in resources_dict:
   1128             for f in cast(DictionaryObject, resources_dict["/Font"]):
-> 1129                 cmaps[f] = build_char_map(f, space_width, obj)
   1130         cmap: Tuple[
   1131             Union[str, Dict[int, str]], Dict[str, str], str

/usr/lib/python3/dist-packages/PyPDF2/_cmap.py in build_char_map(font_name, space_width, obj)
     19     space_code = 32
     20     encoding, space_code = parse_encoding(ft, space_code)
---> 21     map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
     22 
     23     # encoding can be either a string for decode (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)

/usr/lib/python3/dist-packages/PyPDF2/_cmap.py in parse_to_unicode(ft, space_code)
    244                         "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
    245                     )
--> 246                 ] = unhexlify(lst[1]).decode(
    247                     "utf-16-be", "surrogatepass"
    248                 )  # join is here as some cases where the code was split

IndexError: list index out of range

In [4]:

Jul 14 '22 17:07 dkg

I'm not sure that bb2d1dbf20dbe6a77d60be46cbd8646fde6b418c resolves this issue. looking at 966635.pdf (from the original report), and working from bb2d1dbf20dbe6a77d60be46cbd8646fde6b418c, when i do:

r = PdfReader('966635.pdf')
p = r.pages[10].extract_text()

I get this crash (ipython3 backtrace):

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-14-182fc7811fdb> in <module>
----> 1 p = r.pages[10].extract_text()

~/src/pypdf2/PyPDF2/PyPDF2/_page.py in extract_text(self, Tj_sep, TJ_sep, space_width)
   1424         :return: The extracted text
   1425         """
-> 1426         return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
   1427 
   1428     def extract_xform_text(

~/src/pypdf2/PyPDF2/PyPDF2/_page.py in _extract_text(self, obj, pdf, space_width, content_key)
   1402                     text = ""
   1403             else:
-> 1404                 process_operation(operator, operands)
   1405         output += text  # just in case of
   1406         return output

~/src/pypdf2/PyPDF2/PyPDF2/_page.py in process_operation(operator, operands)
   1269                     float(operands[3]),
   1270                     float(operands[4]),
-> 1271                     float(operands[5]),
   1272                 ]
   1273             elif operator == b"T*":

sorry for having commented here just because i also got an IndexError on extract_text! The issue i'd found is probably better characterized by #1111, and it is distinct from this one.

I think this report should be re-opened.

Jul 14 '22 20:07 dkg

Thank you for letting me know 🤗

Jul 14 '22 20:07 MartinThoma

By the way, this is what the page causing the issue looks like (screenshot not preserved):

Aug 06 '22 06:08 MartinThoma

Trying it via https://www.pdf-online.com/osa/validate.aspx :

Validating file "non-compliant.pdf" for conformance level pdf1.3

The 'xref' keyword was not found or the xref table is malformed.
The file trailer dictionary is missing or invalid.
The "Length" key of the stream object is wrong.
Error in Flate stream: data error.
The embedded ICC profile couldn't be read.
The embedded font program 'JNLDEF+TimesNewRoman' cannot be read.
The "Length" key of the stream object is wrong.
Error in Flate stream: data error.
The "Length" key of the stream object is wrong.
The operator has an invalid number of operands.
Error in Flate stream: data error.
The "Length" key of the stream object is wrong.
The operator has an invalid number of operands.
A path start operator was missing.
Error in Flate stream: data error.
Graphics operator m is not allowed in page description.
The "Length" key of the stream object is wrong.
The operator has an invalid number of operands.
A path start operator was missing.
Error in Flate stream: data error.
The "Length" key of the stream object is wrong.
The operator has an invalid number of operands.
Error in Flate stream: data error.
Graphics operator l is not allowed in text object.
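The "invalid number of operands" lines fit the crash at float(operands[5]): the code there expects six numeric operands (in the PDF specification, the text matrix operator Tm takes six), and the broken stream supplies fewer. A defensive check could look roughly like the sketch below. safe_text_matrix is a hypothetical helper for illustration, not the change that went into PyPDF2.

```python
from typing import List, Optional, Sequence


def safe_text_matrix(operands: Sequence) -> Optional[List[float]]:
    """Return six floats for a text matrix, or None if the operands are unusable."""
    if len(operands) < 6:
        return None  # malformed content stream: too few operands
    try:
        return [float(value) for value in operands[:6]]
    except (TypeError, ValueError):
        return None  # e.g. fused numbers such as '71.5131592.8861'
```

A caller would skip the operator (or keep the previous matrix) when None comes back, instead of letting the IndexError propagate.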

Aug 06 '22 06:08 MartinThoma

Similar exception (v3.0.1):

  File "C:\Python38\lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
    return self._extract_text(
  File "C:\Python38\lib\site-packages\PyPDF2\_page.py", line 1342, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 28, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 196, in parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 264, in process_cm_line
    multiline_rg = parse_bfrange(l, map_dict, int_entry, multiline_rg)
  File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 278, in parse_bfrange
    nbi = max(len(lst[0]), len(lst[1]))
IndexError: list index out of range

At post-mortem, lst has only one element:

>>> lst
[b'fffd']
>>>
>>> PyPDF2.__version__
'3.0.1'

(unfortunately I cannot publish the PDF)
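To illustrate the failure mode: with a single token on the line, lst[1] does not exist, so any code that reads both range bounds unconditionally raises. A minimal sketch of a guard, assuming tokens arrive as bare hex bytes as in the post-mortem above (parse_bfrange_bounds is a hypothetical name, not a PyPDF2 function):

```python
from typing import List, Optional, Tuple


def parse_bfrange_bounds(line: bytes) -> Optional[Tuple[int, int]]:
    """Return (first, last) codes of a bfrange line, or None if it is too short."""
    tokens: List[bytes] = [t for t in line.split(b" ") if t]
    if len(tokens) < 2:
        # A lone token such as b"fffd" is probably a continuation of a
        # range that started on an earlier line, not a new range.
        return None
    return int(tokens[0], 16), int(tokens[1], 16)
```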

Jan 02 '23 18:01 kxrob

@kxrob the whole PDF is not required for the analysis; can you locate the failing page and extract the font data with this script:

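import pypdf

failing_pdf = "xxxx.pdf"  # to be updated
failing_page = 0  # to be updated

# Copy only the failing page, then drop its content stream so that
# just the resources (including the fonts) remain.
w = pypdf.PdfWriter()
w.add_page(pypdf.PdfReader(failing_pdf).pages[failing_page])
del w.pages[0]["/Contents"]
w.write("cleaned_page.pdf")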

Jan 03 '23 08:01 pubpub-zz

can you locate the failing page and extract the fonts data with this script

Here is the stripped page (cleaned_page.pdf). It causes the same error with PdfReader("cleaned_page.pdf").pages[0].extract_text().

Jan 03 '23 19:01 kxrob

@kxrob can you retry after replacing the function parse_bfrange in _cmap.py (around line 270) with the following code:

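# Names used below; _cmap.py is expected to import them already.
from binascii import unhexlify
from math import ceil
from typing import Any, Dict, List, Tuple, Union

def parse_bfrange(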
    l: bytes,
    map_dict: Dict[Any, Any],
    int_entry: List[int],
    multiline_rg: Union[None, Tuple[int, int]],
) -> Union[None, Tuple[int, int]]:
    lst = [x for x in l.split(b" ") if x]
    closure_found = False
    if multiline_rg is not None:
        fmt = b"%%0%dX" % (map_dict[-1] * 2)
        a = multiline_rg[0]  # a, b not in the current line
        b = multiline_rg[1]
        for sq in lst[1:]:
            if sq == b"]":
                closure_found = True
                break
            map_dict[
                unhexlify(fmt % a).decode(
                    "charmap" if map_dict[-1] == 1 else "utf-16-be",
                    "surrogatepass",
                )
            ] = unhexlify(sq).decode("utf-16-be", "surrogatepass")
            int_entry.append(a)
            a += 1
    else:
        a = int(lst[0], 16)
        b = int(lst[1], 16)
        nbi = max(len(lst[0]), len(lst[1]))
        map_dict[-1] = ceil(nbi / 2)
        fmt = b"%%0%dX" % (map_dict[-1] * 2)
        if lst[2] == b"[":
            for sq in lst[3:]:
                if sq == b"]":
                    closure_found = True
                    break
                map_dict[
                    unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be",
                        "surrogatepass",
                    )
                ] = unhexlify(sq).decode("utf-16-be", "surrogatepass")
                int_entry.append(a)
                a += 1
        else:  # case without list
            c = int(lst[2], 16)
            fmt2 = b"%%0%dX" % max(4, len(lst[2]))
            closure_found = True
            while a <= b:
                map_dict[
                    unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be",
                        "surrogatepass",
                    )
                ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
                int_entry.append(a)
                a += 1
                c += 1
    return None if closure_found else (a, b)
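For readers unfamiliar with the fmt / unhexlify idiom used above, here is a small standalone sketch of that one step. code_to_key is a hypothetical name introduced only for illustration; the logic mirrors the expressions in the snippet, where map_dict[-1] holds the number of bytes per character code.

```python
from binascii import unhexlify


def code_to_key(code: int, nbytes: int) -> str:
    """Turn a character code into the string key used for the mapping."""
    fmt = b"%%0%dX" % (nbytes * 2)  # nbytes=2 gives b"%04X"
    hex_digits = fmt % code         # 0x41 becomes b"0041"
    encoding = "charmap" if nbytes == 1 else "utf-16-be"
    return unhexlify(hex_digits).decode(encoding, "surrogatepass")
```

For example, code_to_key(0x41, 2) pads the code to four hex digits, converts them to two bytes, and decodes those bytes as UTF-16-BE.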

Jan 04 '23 20:01 pubpub-zz

@kxrob a PR has been opened. Can you confirm that it fixes your issue too?

Jan 09 '23 21:01 pubpub-zz

@kxrob I am closing this issue, as it should now be fixed. Feel free to ask for it to be reopened if you have new input.

Feb 05 '23 15:02 pubpub-zz
