pypdf
'IndexError: list index out of range' when extracting text
I've got an IndexError when extracting text. The file opens fine in Chrome.
Environment
$ python -m platform
Linux-5.4.0-121-generic-x86_64-with-glibc2.31
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.4.2
Code + PDF
The file: pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf
>>> from PyPDF2 import PdfReader
>>> reader = PdfReader('pdf/5cf3eb1c20fb4bea8654f2a9b64b5a62.pdf')
/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py:1229: PdfReadWarning: incorrect startxref pointer(1)
warnings.warn(
>>> for page in reader.pages: print(page.extract_text())
[...]
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
Invalid FloatObject b'71.5131592.8861'
Invalid FloatObject b'58.1.5131592.63'
[...]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1507, in extract_text
return self._extract_text(
File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1441, in _extract_text
process_operation(operator, operands)
File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1301, in process_operation
float(operands[5]),
IndexError: list index out of range
To be exact, it's print(reader.pages[10].extract_text()).
The same file gives a ValueError("invalid literal for int() with base 10: b'7267753-726774'") when trying to make an overlay.
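For anyone trying to narrow a similar crash down to a single page, a minimal sketch follows. It assumes only that each page object has an extract_text() method, as in the session above; the helper name find_failing_pages is hypothetical and not part of PyPDF2.

```python
from typing import Any, Iterable, List, Tuple


def find_failing_pages(pages: Iterable[Any]) -> List[Tuple[int, str]]:
    """Return (page_index, error summary) for each page whose extract_text() raises.

    Hypothetical helper, not part of PyPDF2: it relies only on every page
    object exposing an extract_text() method.
    """
    failures: List[Tuple[int, str]] = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: malformed PDFs raise many error types
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return failures
```

It could be called as find_failing_pages(reader.pages), with reader being the PdfReader from the session above.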
FWIW, I'm seeing a similar error with dump.pdf, which is generated during the test suite of xml2rfc.
Python 3.10.5 (main, Jun 8 2022, 09:26:22) [GCC 11.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from PyPDF2 import PdfReader
In [2]: r = PdfReader('../dump.pdf')
In [3]: r.pages[0].extract_text()
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-3-2445f91b85f4> in <module>
----> 1 r.pages[0].extract_text()
/usr/lib/python3/dist-packages/PyPDF2/_page.py in extract_text(self, Tj_sep, TJ_sep, space_width)
1314 :return: The extracted text
1315 """
-> 1316 return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
1317
1318 def extract_xform_text(
/usr/lib/python3/dist-packages/PyPDF2/_page.py in _extract_text(self, obj, pdf, space_width, content_key)
1127 if "/Font" in resources_dict:
1128 for f in cast(DictionaryObject, resources_dict["/Font"]):
-> 1129 cmaps[f] = build_char_map(f, space_width, obj)
1130 cmap: Tuple[
1131 Union[str, Dict[int, str]], Dict[str, str], str
/usr/lib/python3/dist-packages/PyPDF2/_cmap.py in build_char_map(font_name, space_width, obj)
19 space_code = 32
20 encoding, space_code = parse_encoding(ft, space_code)
---> 21 map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
22
23 # encoding can be either a string for decode (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)
/usr/lib/python3/dist-packages/PyPDF2/_cmap.py in parse_to_unicode(ft, space_code)
244 "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
245 )
--> 246 ] = unhexlify(lst[1]).decode(
247 "utf-16-be", "surrogatepass"
248 ) # join is here as some cases where the code was split
IndexError: list index out of range
In [4]:
I'm not sure that bb2d1dbf20dbe6a77d60be46cbd8646fde6b418c resolves this issue. Looking at 966635.pdf (from the original report) and working from bb2d1dbf20dbe6a77d60be46cbd8646fde6b418c, when I do:
r = PdfReader('966635.pdf')
p = r.pages[10].extract_text()
I get this crash (ipython3 backtrace):
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-14-182fc7811fdb> in <module>
----> 1 p = r.pages[10].extract_text()
~/src/pypdf2/PyPDF2/PyPDF2/_page.py in extract_text(self, Tj_sep, TJ_sep, space_width)
1424 :return: The extracted text
1425 """
-> 1426 return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
1427
1428 def extract_xform_text(
~/src/pypdf2/PyPDF2/PyPDF2/_page.py in _extract_text(self, obj, pdf, space_width, content_key)
1402 text = ""
1403 else:
-> 1404 process_operation(operator, operands)
1405 output += text # just in case of
1406 return output
~/src/pypdf2/PyPDF2/PyPDF2/_page.py in process_operation(operator, operands)
1269 float(operands[3]),
1270 float(operands[4]),
-> 1271 float(operands[5]),
1272 ]
1273 elif operator == b"T*":
Sorry for having commented here just because I also got an IndexError on extract_text! The issue I'd found is probably better characterized by #1111, and it is distinct from this one.
I think this report should be re-opened.
Thank you for letting me know 🤗
By the way, this is what the page causing the issues looks like:
Trying it via https://www.pdf-online.com/osa/validate.aspx :
Validating file "non-compliant.pdf" for conformance level pdf1.3
- The 'xref' keyword was not found or the xref table is malformed.
- The file trailer dictionary is missing or invalid.
- The "Length" key of the stream object is wrong.
- Error in Flate stream: data error.
- The embedded ICC profile couldn't be read.
- The embedded font program 'JNLDEF+TimesNewRoman' cannot be read.
- The "Length" key of the stream object is wrong.
- Error in Flate stream: data error.
- The "Length" key of the stream object is wrong.
- The operator has an invalid number of operands.
- Error in Flate stream: data error.
- The "Length" key of the stream object is wrong.
- The operator has an invalid number of operands.
- A path start operator was missing.
- Error in Flate stream: data error.
- Graphics operator m is not allowed in page description.
- The "Length" key of the stream object is wrong.
- The operator has an invalid number of operands.
- A path start operator was missing.
- Error in Flate stream: data error.
- The "Length" key of the stream object is wrong.
- The operator has an invalid number of operands.
- Error in Flate stream: data error.
- Graphics operator l is not allowed in text object.
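The "invalid number of operands" findings match the traceback above, where operands[5] is read without checking how many operands the operator actually received. The sketch below shows one defensive way to handle this. It assumes the failing operator is Tm, which takes six numeric operands in the PDF specification. The function name text_matrix_from_operands is hypothetical and this is not the fix that went into PyPDF2.

```python
from typing import List, Optional, Sequence


def text_matrix_from_operands(operands: Sequence[object]) -> Optional[List[float]]:
    """Return the six text-matrix values as floats, or None if the operands are malformed.

    Hypothetical helper: it assumes a Tm-style operator with exactly six
    numeric operands and refuses anything else instead of indexing blindly.
    """
    if len(operands) != 6:
        return None  # wrong operand count, e.g. from a corrupted content stream
    try:
        return [float(op) for op in operands]
    except (TypeError, ValueError):
        return None  # an operand could not be converted to a number
```

A caller could skip the operator when None comes back, so that one broken operator does not abort extraction for the whole page.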
Similar exception (v3.0.1):
File "C:\Python38\lib\site-packages\PyPDF2\_page.py", line 1851, in extract_text
return self._extract_text(
File "C:\Python38\lib\site-packages\PyPDF2\_page.py", line 1342, in _extract_text
cmaps[f] = build_char_map(f, space_width, obj)
File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 28, in build_char_map
map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 196, in parse_to_unicode
process_rg, process_char, multiline_rg = process_cm_line(
File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 264, in process_cm_line
multiline_rg = parse_bfrange(l, map_dict, int_entry, multiline_rg)
File "C:\Python38\lib\site-packages\PyPDF2\_cmap.py", line 278, in parse_bfrange
nbi = max(len(lst[0]), len(lst[1]))
IndexError: list index out of range
At post-mortem, lst has only one element:
>>> lst
[b'fffd']
>>>
>>> PyPDF2.__version__
'3.0.1'
(Unfortunately, I cannot publish the PDF.)
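To illustrate the failure mode: a line that opens a new bfrange needs at least three tokens (low code, high code, then a destination or an opening bracket), so a one-token line such as b'fffd' cannot be indexed with lst[1]. Below is a minimal sketch of a guard, using the same space-based split that appears in the traceback. The helper name bfrange_tokens is hypothetical. A one-token line may still be a valid continuation of a multi-line array, so the guard only applies to lines that open a new range.

```python
from typing import List, Optional


def bfrange_tokens(line: bytes) -> Optional[List[bytes]]:
    """Split a line that opens a bfrange entry; return None if it has fewer than three tokens.

    Hypothetical helper, not part of PyPDF2. Lines that merely continue a
    multi-line array should not be passed through this check.
    """
    tokens = [t for t in line.split(b" ") if t]
    if len(tokens) < 3:
        return None
    return tokens
```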
@kxrob The whole PDF is not required for the analysis. Can you locate the failing page and extract the font data with this script?
import pypdf

failing_pdf = "xxxx.pdf"  # to be updated
failing_page = 0  # to be updated
w = pypdf.PdfWriter()
w.add_page(pypdf.PdfReader(failing_pdf).pages[failing_page])
del w.pages[0]["/Contents"]  # drop the content stream; the page resources (fonts) stay
w.write("cleaned_page.pdf")
can you locate the failing page and extract the fonts data with this script
Here is the stripped page - it causes the same error with PdfReader("cleaned_page.pdf").pages[0].extract_text()
cleaned_page.pdf
@kxrob Can you retry after replacing the code of the function parse_bfrange in _cmap.py (about line 270) with the following code?
def parse_bfrange(
    l: bytes,
    map_dict: Dict[Any, Any],
    int_entry: List[int],
    multiline_rg: Union[None, Tuple[int, int]],
) -> Union[None, Tuple[int, int]]:
    lst = [x for x in l.split(b" ") if x]
    closure_found = False
    if multiline_rg is not None:
        fmt = b"%%0%dX" % (map_dict[-1] * 2)
        a = multiline_rg[0]  # a, b not in the current line
        b = multiline_rg[1]
        for sq in lst[1:]:
            if sq == b"]":
                closure_found = True
                break
            map_dict[
                unhexlify(fmt % a).decode(
                    "charmap" if map_dict[-1] == 1 else "utf-16-be",
                    "surrogatepass",
                )
            ] = unhexlify(sq).decode("utf-16-be", "surrogatepass")
            int_entry.append(a)
            a += 1
    else:
        a = int(lst[0], 16)
        b = int(lst[1], 16)
        nbi = max(len(lst[0]), len(lst[1]))
        map_dict[-1] = ceil(nbi / 2)
        fmt = b"%%0%dX" % (map_dict[-1] * 2)
        if lst[2] == b"[":
            for sq in lst[3:]:
                if sq == b"]":
                    closure_found = True
                    break
                map_dict[
                    unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be",
                        "surrogatepass",
                    )
                ] = unhexlify(sq).decode("utf-16-be", "surrogatepass")
                int_entry.append(a)
                a += 1
        else:  # case without list
            c = int(lst[2], 16)
            fmt2 = b"%%0%dX" % max(4, len(lst[2]))
            closure_found = True
            while a <= b:
                map_dict[
                    unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be",
                        "surrogatepass",
                    )
                ] = unhexlify(fmt2 % c).decode("utf-16-be", "surrogatepass")
                int_entry.append(a)
                a += 1
                c += 1
    return None if closure_found else (a, b)
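For readers following along, the repeated key expression in the function above can be tried in isolation. This is only a small worked sketch: build_key is a hypothetical name, and code_bytes stands in for map_dict[-1], the number of bytes per character code.

```python
from binascii import unhexlify


def build_key(code: int, code_bytes: int) -> str:
    """Mirror the map_dict key expression from parse_bfrange for a single character code.

    Hypothetical helper for illustration; code_bytes plays the role of map_dict[-1].
    """
    fmt = b"%%0%dX" % (code_bytes * 2)  # e.g. b"%04X" when code_bytes == 2
    hex_code = fmt % code               # zero-padded uppercase hex, e.g. b"0041"
    return unhexlify(hex_code).decode(
        "charmap" if code_bytes == 1 else "utf-16-be",
        "surrogatepass",
    )


print(build_key(0x41, 2))  # two-byte code 0x0041, decoded as UTF-16-BE
print(build_key(0x41, 1))  # one-byte code 0x41, decoded via charmap
```

Both calls should yield "A". The point is that the same integer code becomes a different byte string, and is decoded differently, depending on the code width.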
@kxrob A PR has been issued. Can you confirm that it fixes your issue too?
@kxrob I am closing this issue, as it should now be fixed. Feel free to ask for it to be reopened if you have new input.