pypdf
BUG: Fix Parsing of Inline Images
The inline image parser does not look for whitespace before the EI keyword as it should. Thus, if you have a content stream as follows, the parser crashes:
BI [inline image dictionary]
ID
asfASF213ad>]asf
213lkasdf9as12EI
QsdkfjasdfkjfdiI
EI
Q
Notice that the sequence of EI on one line and Q on the following line occurs in two places. To check properly, we need to make sure the EI is preceded by whitespace.
Also, added a protection against infinite loops in case the PDF is corrupt and the inline image never ends.
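For readers following along, here is a minimal sketch of the check described above. It is not the code from this PR or from pypdf; the function name `find_inline_image_end` and the requirement that `EI` also be followed by whitespace (or end of data) are assumptions made for illustration. The search always advances, so it ends with an error on a corrupt stream instead of looping forever.

```python
# PDF whitespace bytes: NUL, tab, line feed, form feed, carriage return, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def find_inline_image_end(data: bytes, start: int = 0) -> int:
    """Return the index of the EI keyword that ends the inline image data.

    Hypothetical helper: an EI only counts when it is preceded by
    whitespace and followed by whitespace or the end of the data.
    Raises ValueError if no such EI exists (e.g. a corrupt PDF).
    """
    pos = start
    while True:
        idx = data.find(b"EI", pos)
        if idx == -1:
            # No terminator left to try: stop rather than loop forever.
            raise ValueError("inline image data is not terminated by EI")
        end = idx + 2
        preceded = idx > start and data[idx - 1] in PDF_WHITESPACE
        followed = end == len(data) or data[end] in PDF_WHITESPACE
        if preceded and followed:
            return idx
        # False positive inside the image bytes; continue past it.
        pos = idx + 1


# The image data from the example above (everything after ID).
stream = b"asfASF213ad>]asf\n213lkasdf9as12EI\nQsdkfjasdfkjfdiI\nEI\nQ"
end_index = find_inline_image_end(stream)
image_bytes = stream[:end_index]
```

With the example stream, the `EI` at the end of the second line is skipped because it is preceded by `2`, and the standalone `EI` line is taken as the terminator. This remains a heuristic: binary image data can still contain whitespace, `EI`, whitespace by coincidence.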
#331 also implements protection against incorrect images. It also makes parsing of inline images a lot faster.
The current solution is not compatible with the recent BytesIO implementation. Would you mind adjusting your PR?
I fixed the merge conflict. I'm not sure what you're referring to re BytesIO.
CI is failing:
@speedplane We made some pretty heavy changes to PyPDF2 recently. If you search for if tok2 == b"I": in generic.py, you can see the section that you adjusted. Do you want to adjust the PR / open a new PR?
Do you have an example PDF where this adjustment is necessary? Does it close one of the open issues?
It would help me a lot if we had an image that shows the described issue.
Sorry, this is all I have. I can't remember what this fixed or how it fixes it.
@speedplane The issue you addressed was fixed via #1327.
May I add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html ? Your PR was not merged, but you did make a valuable contribution with this PR. It was just me not being able to understand it at the time.