pypdf
ValueError: invalid literal for int() with base 16: b'F:'
pypdf version: 4.2.0 platform: Linux-6.5.0-1018-oem-x86_64-with-glibc2.35 Python: 3.10.12
Traceback error
File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 2083, in extract_text
return self._extract_text(
File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/_page.py", line 1804, in _extract_text
for operands, operator in content.operations:
File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1245, in operations
self._parse_content_stream(BytesIO(b_(self._data)))
File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1135, in _parse_content_stream
operands.append(read_object(stream, None, self.forced_encoding))
File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1286, in read_object
return read_hex_string_from_stream(stream, forced_encoding)
File "/home/suresh/venv-lanchain/lib/python3.10/site-packages/pypdf/generic/_utils.py", line 29, in read_hex_string_from_stream
txt += chr(int(x, base=16))
ValueError: invalid literal for int() with base 16: b'F:'
Below is the Python script:
from pypdf import PdfReader
reader = PdfReader("biology/lebo102.pdf")
page = reader.pages[0]
print(page.extract_text())
page = reader.pages[1]
print(page.extract_text())
page = reader.pages[2]
print(page.extract_text())
The PDF file is attached: lebo102.pdf
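To narrow down which page triggers the error, the script above could be wrapped in a loop that catches the exception per page. This is only a sketch: `find_failing_pages` and `report` are hypothetical helper names, not part of pypdf, and the only pypdf calls used are `PdfReader`, `reader.pages` and `page.extract_text()` from the script above.

```python
def find_failing_pages(pages):
    """Return (index, exception) pairs for pages whose text extraction fails.

    `pages` is any iterable of objects with an extract_text() method,
    for example `PdfReader(path).pages`.
    """
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: this is a diagnostic
            failures.append((index, exc))
    return failures


def report(path):
    """Print the zero-based index of every page that fails (hypothetical helper)."""
    from pypdf import PdfReader  # imported lazily so the helper above has no dependency

    reader = PdfReader(path)
    for index, exc in find_failing_pages(reader.pages):
        print(f"page index {index}: {type(exc).__name__}: {exc}")
```

Calling `report("biology/lebo102.pdf")` would then list the failing page indices instead of stopping at the first traceback.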
The issue is on page 2. Due to peeking with <F, the corresponding stream part is considered hexadecimal, but it starts with <F\x00\x00:, where the : is not a valid hexadecimal character.
I am not sure where this actually originates from, thus further analysis is required here.
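The failure can be reproduced in isolation with a simplified hex-string reader. This is a sketch written for illustration, not pypdf's actual `read_hex_string_from_stream`; it assumes that PDF whitespace (including NUL bytes) is skipped between hex digits, which is why the two `\x00` bytes disappear and `F` ends up paired with `:`.

```python
from io import BytesIO

# PDF whitespace characters: NUL, tab, line feed, form feed, carriage return, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def read_hex_string(stream):
    """Simplified hex-string reader (illustration only, not pypdf's code)."""
    if stream.read(1) != b"<":
        raise ValueError("hex string must start with '<'")
    out = bytearray()
    pair = b""
    while True:
        tok = stream.read(1)
        if not tok:
            raise ValueError("stream ended before '>'")
        if tok == b">":
            break
        if tok in PDF_WHITESPACE:
            continue  # whitespace between hex digits is ignored
        pair += tok
        if len(pair) == 2:
            out.append(int(pair, 16))  # raises ValueError on non-hex input
            pair = b""
    if len(pair) == 1:
        out.append(int(pair + b"0", 16))  # a missing final digit counts as 0
    return bytes(out)


print(read_hex_string(BytesIO(b"<48656C6C6F>")))  # a well-formed hex string

try:
    read_hex_string(BytesIO(b"<F\x00\x00:\x01\x02>"))  # binary data, not hex
except ValueError as exc:
    print(exc)
```

The second call raises the same kind of `ValueError` as in the traceback, because `b'F:'` is passed to `int(..., 16)`.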
I've started the analysis: the issue comes from EI and inline image extraction. I've found an approach in pdf.js to isolate the data. Work in progress.
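For illustration, one simple way to isolate inline image data is to scan for an EI operator that is delimited by whitespace, so that binary bytes between ID and EI are never handed to the object parser. This is a naive heuristic sketched under that assumption, not the pdf.js approach and not pypdf's implementation; `find_inline_image_end` is a hypothetical name, and binary image data can itself contain a whitespace-delimited EI, so a robust solution needs further checks.

```python
# PDF whitespace characters: NUL, tab, line feed, form feed, carriage return, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def find_inline_image_end(data: bytes, start: int = 0) -> int:
    """Return the index of the first whitespace-delimited b"EI" at or after
    `start`, or -1 if there is none (naive heuristic, illustration only)."""
    pos = start
    while True:
        pos = data.find(b"EI", pos)
        if pos == -1:
            return -1
        # EI must be preceded by whitespace ...
        before_ok = pos > 0 and data[pos - 1] in PDF_WHITESPACE
        # ... and followed by whitespace or the end of the stream.
        after_ok = pos + 2 == len(data) or data[pos + 2] in PDF_WHITESPACE
        if before_ok and after_ok:
            return pos
        pos += 2  # skip this candidate and keep searching


# Image bytes that contain both a stray "EI" and the problematic "<F\x00\x00:".
image_and_rest = b"\x01EIx\x02<F\x00\x00:\x03\nEI\nQ"
end = find_inline_image_end(image_and_rest)
image_bytes = image_and_rest[:end]       # everything before the terminating EI
remaining = image_and_rest[end + 2:]     # content stream continues after EI
```

With the image bytes split off this way, the `<F` sequence stays inside `image_bytes` and is not interpreted as the start of a hex string.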