pypdf ValueError: invalid literal for int() with base 16: b'F:'

ValueError: invalid literal for int() with base 16: b'F:'

Open sureshkvl opened this issue 2 months ago • 2 comments

pypdf version: 4.2.0
platform: Linux-6.5.0-1018-oem-x86_64-with-glibc2.35
Python: 3.10.12

Traceback error

  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 2083, in extract_text
    return self._extract_text(
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 1804, in _extract_text
    for operands, operator in content.operations:
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1245, in operations
    self._parse_content_stream(BytesIO(b_(self._data)))
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1135, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1286, in read_object
    return read_hex_string_from_stream(stream, forced_encoding)
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_utils.py", line 29, in read_hex_string_from_stream
    txt += chr(int(x, base=16))
ValueError: invalid literal for int() with base 16: b'F:'

Below is the Python script:

from pypdf import PdfReader

reader = PdfReader("biology/lebo102.pdf")
# Extract and print the text of the first three pages, one after another.
for index in range(3):
    page = reader.pages[index]
    print(page.extract_text())

The PDF file is attached: lebo102.pdf
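
To narrow down which page triggers the exception, one option is to call extract_text() on each page separately and record the failures. The helper below is a hypothetical sketch, not part of pypdf: the name find_failing_pages is an assumption, and it only relies on each page object exposing extract_text().

```python
def find_failing_pages(pages):
    """Return (index, exception) pairs for pages whose extract_text() raises."""
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: the goal is only to locate failures
            failures.append((index, exc))
    return failures


# Possible usage with pypdf (needs the attached file on disk):
# from pypdf import PdfReader
# reader = PdfReader("biology/lebo102.pdf")
# for index, exc in find_failing_pages(reader.pages):
#     print(f"page index {index}: {exc!r}")
```

The returned indices are zero-based, so index 1 corresponds to the second page of the document.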

Apr 15 '24 17:04 sureshkvl

The issue is on page 2. Because the parser peeks <F at that position, it treats the corresponding part of the stream as a hexadecimal string. The data actually starts with <F\x00\x00:, and : is not a valid hexadecimal character.

I am not sure where this actually originates, so further analysis is required.
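
The failure mode can be reproduced in isolation with a simplified hex-string reader. This is a minimal sketch, not pypdf's actual code: read_hex_string and PDF_WHITESPACE are hypothetical names, and it assumes NUL bytes are skipped as PDF whitespace, which would explain why the error shows b'F:' rather than b'F\x00'.

```python
from io import BytesIO

# Whitespace bytes as defined by the PDF format (NUL counts as whitespace).
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def read_hex_string(stream):
    """Decode a <...> hex string two digits at a time (simplified sketch)."""
    stream.read(1)  # skip the opening "<"
    txt = ""
    pair = b""
    while True:
        tok = stream.read(1)
        if not tok:
            raise EOFError("stream ended inside hex string")
        if tok == b">":
            break
        if tok in PDF_WHITESPACE:
            continue  # whitespace between digits is ignored
        pair += tok
        if len(pair) == 2:
            # Raises ValueError if the pair is not valid hexadecimal.
            txt += chr(int(pair, base=16))
            pair = b""
    return txt


# Well-formed input decodes normally.
decoded = read_hex_string(BytesIO(b"<48 65>"))

# Binary data that merely starts with "<F" fails on the first pair.
try:
    read_hex_string(BytesIO(b"<F\x00\x00:"))
    message = None
except ValueError as exc:
    message = str(exc)

print(decoded)
print(message)
```

Under these assumptions, the second call fails with the same message as in the traceback, which suggests the bytes were never meant to be parsed as a hex string in the first place.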

Apr 16 '24 08:04 stefan6419846

I've started the analysis: the issue comes from the handling of EI (the operator that ends an inline image) during inline image extraction. I've found an approach in pdf.js for isolating the image data. Work in progress.
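
For context, an inline image sits in the content stream as BI <parameters> ID <raw bytes> EI, and the raw bytes can contain anything, including a < that looks like the start of a hex string. The sketch below shows one simple heuristic for isolating that data: treat EI as the end marker only when it is surrounded by whitespace. It is an illustration under that assumption, not the pdf.js approach and not pypdf's fix; split_inline_image is a hypothetical name.

```python
import re

# "EI" preceded by one whitespace byte and followed by whitespace or end of data.
_EI_RE = re.compile(rb"[\x00\t\n\x0c\r ]EI(?=[\x00\t\n\x0c\r ]|$)")


def split_inline_image(content: bytes, id_pos: int):
    """Return (image_bytes, rest) for the data after the ID operator at id_pos.

    Assumes exactly one whitespace byte follows "ID" and the image data is non-empty.
    """
    start = id_pos + 3  # skip "ID" plus the single whitespace byte after it
    match = _EI_RE.search(content, start)
    if match is None:
        raise ValueError("no EI marker found")
    image = content[start:match.start()]  # excludes the whitespace before EI
    rest = content[match.end():]  # parsing would resume here
    return image, rest


# Five bytes of 8-bit gray data that happen to begin with "<F".
content = b"BI /W 5 /H 1 /BPC 8 /CS /G ID <F\x00\x00: EI Q"
id_pos = content.index(b" ID ") + 1
image, rest = split_inline_image(content, id_pos)
print(image, rest)
```

This heuristic can still be fooled when the image bytes themselves contain EI surrounded by whitespace, so a robust parser needs additional checks, for example against the expected data length.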

Apr 16 '24 11:04 pubpub-zz
