
ValueError: invalid literal for int() with base 16: b'F:'

Open sureshkvl opened this issue 2 months ago • 2 comments

pypdf version: 4.2.0
platform: Linux-6.5.0-1018-oem-x86_64-with-glibc2.35
Python: 3.10.12

Traceback error

  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 2083, in extract_text
    return self._extract_text(
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 1804, in _extract_text
    for operands, operator in content.operations:
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1245, in operations
    self._parse_content_stream(BytesIO(b_(self._data)))
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1135, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1286, in read_object
    return read_hex_string_from_stream(stream, forced_encoding)
  File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_utils.py", line 29, in read_hex_string_from_stream
    txt += chr(int(x, base=16))
ValueError: invalid literal for int() with base 16: b'F:'


Below is the Python script:

from pypdf import PdfReader

reader = PdfReader("biology/lebo102.pdf")

# Extract and print the text of the first three pages.
for index in range(3):
    page = reader.pages[index]
    print(page.extract_text())

The PDF file is attached: lebo102.pdf

sureshkvl, Apr 15 '24 17:04

The issue is on page 2. Because the parser peeks at <F, it treats the corresponding part of the stream as a hexadecimal string. However, the data actually starts with <F\x00\x00:, and the : is not a valid hexadecimal character.

I am not sure where this actually originates, so further analysis is required.

stefan6419846, Apr 16 '24 08:04
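
The failure described in the comment above can be reproduced without pypdf. The following is only a minimal sketch, not pypdf's actual code: decode_hex_string and PDF_WHITESPACE are hypothetical names, and the sketch assumes that NUL bytes are skipped as whitespace (the PDF specification lists NUL among its white-space characters). Under that assumption, the bytes <F\x00\x00: collapse to the pair F:, which matches the value in the reported ValueError.

```python
from io import BytesIO

# White-space characters as listed in the PDF specification (includes NUL).
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def decode_hex_string(data: bytes) -> str:
    """Decode a PDF-style hex string such as b'<48 69>'.

    Hypothetical helper for illustration only; not part of pypdf.
    """
    stream = BytesIO(data)
    if stream.read(1) != b"<":
        raise ValueError("a hex string must start with '<'")
    text = ""
    pair = b""
    while True:
        byte = stream.read(1)
        if not byte:
            raise ValueError("stream ended before the closing '>'")
        if byte == b">":
            break
        if byte in PDF_WHITESPACE:
            continue  # assumption: whitespace (including NUL) is ignored
        pair += byte
        if len(pair) == 2:
            # int() raises ValueError here if the pair is not valid hex.
            text += chr(int(pair, base=16))
            pair = b""
    if len(pair) == 1:
        # A missing final digit is treated as 0.
        text += chr(int(pair + b"0", base=16))
    return text


print(decode_hex_string(b"<48 69>"))

try:
    # Binary data that merely looks like the start of a hex string.
    decode_hex_string(b"<F\x00\x00:\x01>")
except ValueError as exc:
    print(f"ValueError: {exc}")
```

The second call fails because the data is not a hex string at all: it is binary content that happens to begin with <F.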

I've started the analysis: the issue comes from EI and inline image extraction. I've found an approach in pdf.js for isolating the data. Work in progress.

pubpub-zz, Apr 16 '24 11:04
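
To illustrate why isolating inline image data matters, here is a rough sketch of one possible heuristic for locating the EI operator that ends the image data. It is an assumption for illustration only: it is neither the pdf.js approach mentioned above nor pypdf's eventual fix, and find_inline_image_end is a hypothetical name. The idea is to accept an EI only when it is surrounded by whitespace, so that an EI byte pair occurring inside the binary data is skipped.

```python
# White-space characters as listed in the PDF specification (includes NUL).
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def find_inline_image_end(content: bytes, data_start: int) -> int:
    """Return the index of the 'EI' that ends the image data, or -1.

    Hypothetical helper for illustration only; not part of pypdf or pdf.js.
    """
    position = content.find(b"EI", data_start)
    while position != -1:
        before = content[position - 1 : position]
        after = content[position + 2 : position + 3]
        # 'EI' must follow whitespace and be followed by whitespace or EOF.
        before_ok = position > data_start and before in PDF_WHITESPACE
        after_ok = after == b"" or after in PDF_WHITESPACE
        if before_ok and after_ok:
            return position
        position = content.find(b"EI", position + 1)
    return -1


# Made-up content stream: the image data contains '<F', ':' and a stray 'EI'.
content = b"BI /W 1 /H 1 /BPC 8 /CS /G ID <F\x00\x00:EIx\x01 EI Q"
data_start = content.index(b"ID ") + 3
end = find_inline_image_end(content, data_start)
image_data = content[data_start : end - 1]  # drop the whitespace before EI
print(end, image_data)
```

Once the image bytes are isolated, the parser can skip them as a whole instead of tokenizing them, so <F is never mistaken for a hex string. This heuristic can still be fooled when the binary data itself contains EI surrounded by whitespace, so a robust implementation needs further checks.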