Raises "EI stream not found" while reading RunLengthDecode (RL) inline image
I am trying to read the text content of a PDF, but pypdf raises PdfReadError: EI stream not found.
Environment
$ python -m platform
Linux-6.16.12+deb14+1-amd64-x86_64-with-glibc2.36
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.2.0, crypt_provider=('cryptography', '3.4.8'), PIL=9.0.1
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader

reader = PdfReader("/path-to-file.pdf")
for page in reader.pages:
    text = page.extract_text()
The PDF is cedolini_esempio-1.pdf.
While debugging, I found out that the image it is trying to parse is:
\x00\xf8\xff\x00\x00\x02\xfe\xff\x00\x80\xff\x00\x00?\x00\xff\x00\xfe\xfe\x00\xfc\xff\x00\x80\xff\x00\x00[...]\xfbU\x00\x7f\x80\r\nEI
The problem seems to be that https://github.com/py-pdf/pypdf/blob/85b53d8eb014d1c6363a71401cebfadd9d7300b0/pypdf/generic/_image_inline.py#L131 finds a \x80 byte inside the image data and treats it as the end of the image, so the bytes that follow are not the expected EI operator.
I read the PDF documentation (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf) and it says:
The value 128 is placed at the end of the compressed data, as an EOD marker.
but I can't see such a value 128 at the end of the data.
Why aren't we looking for the EI directly like it is done in the default handler https://github.com/py-pdf/pypdf/blob/85b53d8eb014d1c6363a71401cebfadd9d7300b0/pypdf/generic/_image_inline.py#L199 ?
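To make the suspected failure mode concrete, here is a minimal sketch, not pypdf's actual code. It assumes the usual RunLengthDecode layout (a length byte followed either by literal bytes or by one byte to repeat) and compares a plain search for the first \x80 with a walk over the runs. The sample reuses the first ten bytes quoted above; the trailing EOD marker and EI are appended purely for illustration, and both function names are hypothetical.

```python
def naive_eod_position(data: bytes) -> int:
    """Return the index of the first byte with value 128, wherever it occurs."""
    return data.find(b"\x80")


def run_aware_eod_position(data: bytes) -> int:
    """Return the index of the first *length* byte with value 128, or -1."""
    pos = 0
    while pos < len(data):
        length = data[pos]
        if length == 128:
            return pos  # 128 in length position: EOD marker
        if length < 128:
            pos += 1 + (length + 1)  # length byte plus length + 1 literal bytes
        else:
            pos += 2  # length byte plus the single byte to repeat
    return -1


# First ten bytes from the report, then a constructed EOD marker and "EI".
sample = b"\x00\xf8\xff\x00\x00\x02\xfe\xff\x00\x80" + b"\x80\r\nEI"

print(naive_eod_position(sample))      # 9: a literal data byte that happens to be 128
print(run_aware_eod_position(sample))  # 10: the first 128 in length position
```

Under these assumptions, the \x80 at index 9 is preceded by the length byte \x00 ("copy one literal byte"), so it would be image data rather than an EOD marker.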
Traceback
This is the relevant part of the traceback I see:
...
for value in page.extract_text().split():
File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 2043, in extract_text
return self._extract_text(
File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 1726, in _extract_text
for operands, operator in content.operations:
File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1406, in operations
self._parse_content_stream(BytesIO(self._data))
File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1285, in _parse_content_stream
ii = self._read_inline_image(stream)
File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1328, in _read_inline_image
data = extract_inline_RL(stream)
File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_image_inline.py", line 142, in extract_inline_RL
raise PdfReadError("EI stream not found")
pypdf.errors.PdfReadError: EI stream not found
Thanks for the report. Unfortunately, without a PDF file, we are most likely going to close this issue.
I read the PDF documentation (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf) and it says:
The value 128 is placed at the end of the compressed data, as an EOD marker.
but I can't see such a value 128 at the end of the data.
This is the documentation for PDF version 1.0, which is most likely not the version of your file. Nevertheless, 128 in decimal equals 80 in hexadecimal, which you can quickly verify with ord(b'\x80'), so our interpretation is correct here.
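The conversion can be checked in a Python shell. This is plain arithmetic using built-ins only; the variable name is just for illustration.

```python
# The byte \x80 and the decimal number 128 are the same value.
eod = ord(b"\x80")
print(eod)       # 128
print(hex(eod))  # 0x80

# Indexing a bytes object yields the same integer.
print(b"\x80"[0] == 128)  # True
```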
Why aren't we looking for the EI directly like it is done in the default handler [...]?
Because the corresponding standard says (at least in the specifications for PDF versions 1.7 and 2.0):
A length value of 128 shall denote EOD.
Thus your stream seems to violate the PDF specification here.
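For reference, the decoding rule that this sentence belongs to can be sketched as follows. It is a minimal illustration of the RunLengthDecode description (length 0 to 127: copy the next length + 1 bytes; length 129 to 255: repeat the next byte 257 - length times; length 128: EOD), not pypdf's implementation, and the function name is made up.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data, stopping at the first length byte 128."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        length = data[pos]
        pos += 1
        if length == 128:
            break  # EOD: ignore everything after it
        if length < 128:
            # Literal run: copy the next length + 1 bytes unchanged.
            out += data[pos : pos + length + 1]
            pos += length + 1
        else:
            # Repeated run: the next byte occurs 257 - length times.
            out += data[pos : pos + 1] * (257 - length)
            pos += 1
    return bytes(out)


# "abc" literally, "z" four times, one literal byte 128, then EOD.
encoded = b"\x02abc\xfdz\x00\x80\x80"
print(run_length_decode(encoded))  # b'abczzzz\x80'
```

Note that only a 128 in length position ends the data; the 128 that follows the length byte \x00 in the example is copied to the output as a literal.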
Thanks for the quick and useful feedback 🤗, I'll try to create an anonymized PDF that reproduces this issue.
Do you have a link to the newer PDF specification? I only found the one I linked (that is for version 1.0).
Do you have a link to the newer PDF specification? I only found the one I linked (that is for version 1.0).
See https://pdfa.org/resource/pdf-specification-archive/
An update: PyPDF2==3.0.1 can read the PDF correctly.
Still working on an anonymized version of the file to add to this issue.
@stefan6419846 I included the PDF 🚀 please remove the label https://github.com/py-pdf/pypdf/issues?q=label%3Aneeds-pdf
An update: PyPDF2==3.0.1 can read the PDF correctly.
Its inline image handling worked differently and looked for the EI marker directly. I have implemented this as a fallback, under the assumption that the RL-encoded inline image does not contain an EI marker in its stream, and changed the handling of the EOD marker in the filter itself as well. These changes are not yet committed, and I am not sure whether this is the correct approach, as it will simply stop reading the final image at the first byte with value 128.
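The fallback idea can be sketched roughly as below. This is a simplified illustration on an in-memory bytes object, not the uncommitted change itself; the function names, the use of ValueError, and the exact whitespace check are assumptions. It shares the limitation stated above: image data that happens to contain something that looks like an EI marker would be cut short.

```python
WHITESPACE = b" \t\r\n\x0c\x00"  # PDF whitespace characters


def _is_ei_at(data: bytes, pos: int) -> bool:
    """True if "EI" starts at pos and is followed by whitespace or end of data."""
    if data[pos : pos + 2] != b"EI":
        return False
    following = data[pos + 2 : pos + 3]
    return following == b"" or following in WHITESPACE


def extract_rl_strict(data: bytes) -> bytes:
    """Cut at the first byte 128 and require "EI" right after it."""
    eod = data.find(b"\x80")
    if eod < 0:
        raise ValueError("EOD marker not found")
    rest = data[eod + 1 :].lstrip(WHITESPACE)
    if not _is_ei_at(rest, 0):
        raise ValueError("EI stream not found")
    return data[: eod + 1]


def extract_rl_with_fallback(data: bytes) -> bytes:
    """Try the strict EOD-based extraction, then search for "EI" directly."""
    try:
        return extract_rl_strict(data)
    except ValueError:
        pos = data.find(b"EI")
        while pos >= 0:
            preceded_by_ws = pos > 0 and data[pos - 1 : pos] in WHITESPACE
            if preceded_by_ws and _is_ei_at(data, pos):
                # Drop the whitespace that separates the data from "EI".
                return data[:pos].rstrip(WHITESPACE)
            pos = data.find(b"EI", pos + 1)
        raise  # no usable "EI" either: re-raise the original error


# A constructed stream in which the first 128 is not followed by "EI".
tricky = b"\x00\xf8\xff\x00\x00\x02\xfe\xff\x00\x80" + b"\x80\r\nEI"
print(extract_rl_with_fallback(tricky) == tricky[:11])  # True
```

Because the fallback only runs after the strict path has failed, streams that the strict path already handles are returned unchanged.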
It remains mysterious to me what the actual reason for having multiple (51 in one case!) EOD marker bytes in the image streams is and how an application is expected to actually handle this. In case you are in control of the generator, you might want to check if this can be fixed there as well.
Thanks for analyzing this issue so carefully!
In case you are in control of the generator, you might want to check if this can be fixed there as well.
I am not in control: I waited 3 weeks to have an anonymized version of the PDF so I could attach it here.
Unfortunately, I understood only part of your analysis because I don't know this library or PDFs well enough, but may I suggest using the fallback method only when the image parsing fails? That shouldn't do any harm.