Raises "EI stream not found" while reading RunLengthDecode (RL) inline image

Open SirPyTech opened this issue 1 month ago • 7 comments

I am trying to extract the text of a PDF, but pypdf raises PdfReadError: EI stream not found.

Environment

$ python -m platform
Linux-6.16.12+deb14+1-amd64-x86_64-with-glibc2.36

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.2.0, crypt_provider=('cryptography', '3.4.8'), PIL=9.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader("/path-to-file.pdf")

for page in reader.pages:
    text = page.extract_text()

The PDF is cedolini_esempio-1.pdf.

While debugging, I found out that the image it is trying to parse is:

\x00\xf8\xff\x00\x00\x02\xfe\xff\x00\x80\xff\x00\x00?\x00\xff\x00\xfe\xfe\x00\xfc\xff\x00\x80\xff\x00\x00[...]\xfbU\x00\x7f\x80\r\nEI

The problem seems to be that https://github.com/py-pdf/pypdf/blob/85b53d8eb014d1c6363a71401cebfadd9d7300b0/pypdf/generic/_image_inline.py#L131 stops at the first \x80 byte it finds inside the image data, so the tokens that follow are not the expected EI.

I read the PDF documentation (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf) and it says:

> The value 128 is placed at the end of the compressed data, as an EOD marker.

but I can't see such value 128.

Why aren't we looking for the EI directly, like it is done in the default handler (https://github.com/py-pdf/pypdf/blob/85b53d8eb014d1c6363a71401cebfadd9d7300b0/pypdf/generic/_image_inline.py#L199)?

Traceback

This is the relevant part of the traceback I see:

...
    for value in page.extract_text().split():
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 2043, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.10/site-packages/pypdf/_page.py", line 1726, in _extract_text
    for operands, operator in content.operations:
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1406, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1285, in _parse_content_stream
    ii = self._read_inline_image(stream)
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1328, in _read_inline_image
    data = extract_inline_RL(stream)
  File "/usr/local/lib/python3.10/site-packages/pypdf/generic/_image_inline.py", line 142, in extract_inline_RL
    raise PdfReadError("EI stream not found")
pypdf.errors.PdfReadError: EI stream not found

SirPyTech avatar Nov 12 '25 10:11 SirPyTech

Thanks for the report. Unfortunately, without a PDF file, we are most likely going to close this issue.

> I read the PDF documentation (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf) and it says:
>
> The value 128 is placed at the end of the compressed data, as an EOD marker.
>
> but I can't see such value 128.

This is the documentation for PDF version 1.0, which your file most likely does not use. Nevertheless, 128 in decimal equals 80 in hexadecimal, which you can quickly verify with ord(b'\x80'), so our interpretation is correct here.
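
As a quick check of that equivalence, here is a small sketch that uses only the trailing bytes quoted in the report. Everything before them is elided in the report, so nothing is assumed about the rest of the stream:

```python
# 128 in decimal and 0x80 in hexadecimal are the same byte value.
assert ord(b"\x80") == 128 == 0x80

# Tail of the inline image data as quoted in the report.
tail = b"\xfbU\x00\x7f\x80\r\nEI"

eod_index = tail.index(b"\x80")
print(eod_index)             # 4: position of the 0x80 byte within the quoted tail
print(tail[eod_index + 1:])  # b'\r\nEI': only a line break and EI follow it
```

So the quoted tail does contain a byte with value 128 immediately before the EI operator.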

> Why aren't we looking for the EI directly like it is done in the default handler [...]?

Because the corresponding standard says (as per the specification for at least PDF version 1.7 and 2.0):

> A length value of 128 shall denote EOD.

Thus your stream seems to violate the PDF specification here.
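
For illustration, a minimal sketch of the run-length rules from the specification: a length byte of 0 to 127 means "copy the next length + 1 bytes literally", 129 to 255 means "repeat the next byte 257 - length times", and 128 denotes EOD. `run_length_decode` is a hypothetical helper and not pypdf's code. The sample is the visible start of the data quoted in the report, assuming that excerpt begins at the first encoded byte, with an EOD byte appended for the demonstration:

```python
def run_length_decode(data: bytes) -> tuple[bytes, int]:
    """Decode run-length data; return (decoded bytes, offset of the EOD byte)."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        length = data[pos]
        if length == 128:
            # EOD, but only because we are at a length-byte position.
            return bytes(out), pos
        if length < 128:
            # Literal run: copy the next length + 1 bytes unchanged.
            out += data[pos + 1 : pos + 2 + length]
            pos += 2 + length
        else:
            # Repeated run: the next byte occurs 257 - length times.
            out += data[pos + 1 : pos + 2] * (257 - length)
            pos += 2
    raise ValueError("no EOD marker found")


# First 26 bytes quoted in the report (assumed to start the encoded data).
prefix = (
    b"\x00\xf8\xff\x00\x00\x02\xfe\xff\x00\x80\xff\x00\x00?"
    b"\x00\xff\x00\xfe\xfe\x00\xfc\xff\x00\x80\xff\x00"
)
sample = prefix + b"\x80"  # EOD appended for this demonstration only

decoded, eod_offset = run_length_decode(sample)
first_0x80 = sample.find(b"\x80")

print(first_0x80)  # 9: first byte with value 128 anywhere in the sample
print(eod_offset)  # 26: first byte with value 128 at a length-byte position
```

Under these assumptions, the byte 0x80 at offset 9 is a literal data byte that follows a length byte of 0, so a byte with value 128 acts as EOD only where a length byte is expected.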

stefan6419846 avatar Nov 12 '25 10:11 stefan6419846

Thanks for the quick and useful feedback 🤗, I'll try to create an anonymized PDF that reproduces this issue.

Do you have a link to the newer PDF specification? I only found the one I linked (that is for version 1.0).

SirPyTech avatar Nov 12 '25 10:11 SirPyTech

> Do you have a link to the newer PDF specification? I only found the one I linked (that is for version 1.0).

See https://pdfa.org/resource/pdf-specification-archive/

stefan6419846 avatar Nov 12 '25 11:11 stefan6419846

An update: PyPDF2==3.0.1 can read the PDF correctly.

Still working on an anonymized version of the file to add to this issue.

SirPyTech avatar Dec 01 '25 09:12 SirPyTech

@stefan6419846 I included the PDF 🚀 please remove the label https://github.com/py-pdf/pypdf/issues?q=label%3Aneeds-pdf

SirPyTech avatar Dec 03 '25 08:12 SirPyTech

> An update: PyPDF2==3.0.1 can read the PDF correctly.

Its inline image handling worked differently and looked for the EI marker directly. I have implemented this as a fallback, under the assumption that the RL-encoded inline image does not contain an EI marker in its stream, and have adjusted the EOD handling in the filter itself as well. These changes are not yet committed, and I am not sure whether this is the correct approach, as it will simply stop reading the final image at the first byte with value 128.
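
A minimal sketch of what such a direct search for EI could look like, assuming the operator is delimited by PDF whitespace and never occurs inside the image data. `read_until_ei` is a hypothetical helper and not the actual pypdf or PyPDF2 implementation:

```python
import re
from io import BytesIO

# "EI" preceded by a PDF whitespace byte and followed by one (or by end of data).
# PDF whitespace: NUL, tab, line feed, form feed, carriage return, space.
_EI_PATTERN = re.compile(rb"[\x00\t\n\x0c\r ]EI(?=[\x00\t\n\x0c\r ]|$)")


def read_until_ei(stream: BytesIO) -> bytes:
    """Return the bytes before the first delimited EI and leave the stream after it."""
    start = stream.tell()
    buffer = stream.read()
    match = _EI_PATTERN.search(buffer)
    if match is None:
        stream.seek(start)  # leave the stream where we found it
        raise ValueError("EI marker not found")
    stream.seek(start + match.end())
    return buffer[: match.start()]


# Made-up run-length data (one literal 0x80, two zero bytes, EOD), then EI and a Q operator.
content = BytesIO(b"\x00\x80\xff\x00\x80\nEI\nQ\n")
image_data = read_until_ei(content)
rest = content.read()
```

The weakness is the stated assumption: binary image data can contain a whitespace-delimited "EI" by chance, in which case this search stops too early.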

It remains a mystery to me why the image streams contain multiple EOD marker bytes (51 in one case!) and how an application is expected to handle this. In case you are in control of the generator, you might want to check if this can be fixed there as well.

stefan6419846 avatar Dec 03 '25 15:12 stefan6419846

Thanks for analyzing this issue so carefully!

> In case you are in control of the generator, you might want to check if this can be fixed there as well.

I am not in control: I waited 3 weeks to have an anonymized version of the PDF so I could attach it here.

Unfortunately, I understood only part of your analysis because I don't know this library or PDFs well enough, but may I suggest using the fallback method only when the regular image parsing fails? That shouldn't do any harm.
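
A rough sketch of that idea, with stand-in functions and no real pypdf calls: `extract_with_fallback`, `strict`, and `lenient` are hypothetical names, and ValueError stands in for pypdf's PdfReadError to keep the example self-contained:

```python
from io import BytesIO
from typing import Callable


def extract_with_fallback(
    stream: BytesIO,
    primary: Callable[[BytesIO], bytes],
    fallback: Callable[[BytesIO], bytes],
) -> bytes:
    """Try the primary parser; on failure, rewind and use the fallback parser."""
    start = stream.tell()
    try:
        return primary(stream)
    except ValueError:
        # The primary parser may have consumed bytes before failing.
        stream.seek(start)
        return fallback(stream)


def strict(stream: BytesIO) -> bytes:
    # Stand-in for the EOD-based parser: reads some bytes, then gives up.
    stream.read(3)
    raise ValueError("EI stream not found")


def lenient(stream: BytesIO) -> bytes:
    # Stand-in for the fallback parser: takes whatever is left.
    return stream.read()


stream = BytesIO(b"ID data EI")
stream.seek(3)
result = extract_with_fallback(stream, strict, lenient)
```

Rewinding before the fallback matters because a failed first attempt usually leaves the stream position somewhere in the middle of the image data.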

SirPyTech avatar Dec 03 '25 16:12 SirPyTech