pypdf
BUG: Fix argument invalid seek when startxref is zero
Hi!
When a PDF file ends with the lines below, the stream.seek(-11, 1) call at line 1854 raises an exception with value 22 (Invalid argument).
xref
0 85
0000000000 65535 f
....
0000000000 00000 n
trailer
<<
/Size 85
/Root 84 0 R
/Info 83 0 R
startxref
0
%%EOF
So, when startxref is zero, this commit expands the search to try to find the xref table in the last 2 KB of the file.
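The fallback described above can be sketched roughly as follows. This is not the code from the commit: the function name `find_xref_offset` and its `window` parameter are hypothetical, and the sketch only assumes that the `xref` keyword starts a line somewhere in the tail of the file.

```python
import io


def find_xref_offset(stream, window=2048):
    """Return the offset of the last line-initial b"xref" in the final
    `window` bytes of `stream`, or None if there is none.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    start = max(0, size - window)  # clamp so we never seek before byte 0
    stream.seek(start)
    tail = stream.read()

    end = len(tail)
    while True:
        pos = tail.rfind(b"xref", 0, end)
        if pos == -1:
            return None
        # Skip the "xref" inside "startxref": accept only a keyword that
        # begins the file or directly follows a line break.
        at_file_start = pos == 0 and start == 0
        after_newline = pos > 0 and tail[pos - 1:pos] in (b"\n", b"\r")
        if at_file_start or after_newline:
            return start + pos
        end = pos  # keep searching further back


pdf = (
    b"%PDF-1.4\n"
    b"1 0 obj << >> endobj\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n0\n%%EOF"
)
offset = find_xref_offset(io.BytesIO(pdf))
```

Clamping the start of the window with `max(0, ...)` avoids seeking to a negative position, which is the kind of seek that fails on short files.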
@VictorCarlquist, it sounds like this is a valuable addition to PyPDF2! Would you add a unit test to your commit (like an example file) that demonstrates the problem that you're fixing?
Thanks for this merge request!
Hi @kurtmckee, sorry for the long delay in answering.
With 2 KB, PyPDF2 kept raising the exception on some PDF files, so we raised the search window to 8 KB, which solved the problem.
Does 8 KB seem fine to you? Could we keep this value?
Codecov Report
Merging #493 (435c9d6) into main (f3cb316) will increase coverage by 0.07%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## main #493 +/- ##
==========================================
+ Coverage 82.47% 82.54% +0.07%
==========================================
Files 16 16
Lines 3777 3793 +16
Branches 802 806 +4
==========================================
+ Hits 3115 3131 +16
Misses 495 495
Partials 167 167
Impacted Files | Coverage Δ | |
---|---|---|
PyPDF2/_reader.py | 81.59% <100.00%> (+0.38%) | :arrow_up: |
@VictorCarlquist Would you mind creating a minimal example that shows the issue? Something like this:
import io

from PyPDF2 import PdfFileReader


def test_get_images_raw():
    strict = True
    with_prev_0 = False
    # Template for a minimal one-page PDF; the placeholders are filled below.
    pdf_data = (
        b"%%PDF-1.7\n"
        b"1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj\n"
        b"2 0 obj << >> endobj\n"
        b"3 0 obj << >> endobj\n"
        b"4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0]"
        b" /MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R"
        b" /Resources << /Font << >> >>"
        b" /Rotate 0 /Type /Page >> endobj\n"
        b"5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj\n"
        b"xref 1 5\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"trailer << %s/Root 5 0 R /Size 6 >>\n"
        b"startxref %d\n"
        b"%%%%EOF"
    )
    # Fill in the object offsets, the optional /Prev entry and the xref offset.
    pdf_data = pdf_data % (
        pdf_data.find(b"1 0 obj"),
        pdf_data.find(b"2 0 obj"),
        pdf_data.find(b"3 0 obj"),
        pdf_data.find(b"4 0 obj"),
        pdf_data.find(b"5 0 obj"),
        b"/Prev 0 " if with_prev_0 else b"",
        pdf_data.find(b"xref"),
    )
    pdf_stream = io.BytesIO(pdf_data)
    PdfFileReader(pdf_stream, strict=strict)  # just shows that nothing crashes
It should be a test that currently fails, but with your PR it doesn't.
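For the case this PR targets, a test input could be built along these lines. This is only a sketch: `build_pdf_with_zero_startxref` is a hypothetical helper, the object layout is an assumption, and it has not been checked against any particular PyPDF2 version. It produces a small PDF whose xref table is well formed but whose startxref value is 0.

```python
def build_pdf_with_zero_startxref():
    """Return (pdf_bytes, real_xref_offset) for a tiny one-page PDF whose
    startxref value is 0 instead of the real xref offset.

    Hypothetical helper for illustration only.
    """
    objects = [
        b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n",
        b"2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj\n",
        b"3 0 obj << /Type /Page /Parent 2 0 R"
        b" /MediaBox [0 0 612 792] >> endobj\n",
    ]
    body = b"%PDF-1.4\n"
    offsets = []
    for obj in objects:
        offsets.append(len(body))  # byte offset where this object starts
        body += obj

    xref_offset = len(body)
    xref = b"xref\n0 %d\n" % (len(objects) + 1)
    xref += b"0000000000 65535 f \n"  # entry for the free object 0
    for off in offsets:
        xref += b"%010d 00000 n \n" % off

    trailer = b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    # Deliberately wrong: a valid file would write xref_offset here.
    tail = b"startxref\n0\n%%EOF"
    return body + xref + trailer + tail, xref_offset


pdf_bytes, real_xref_offset = build_pdf_with_zero_startxref()
```

A test could then wrap `pdf_bytes` in `io.BytesIO` and pass it to `PdfFileReader`, asserting that reading does not crash.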
Hi @MartinThoma, thank you for your feedback.
I added the tests and I changed the code a little.
Is it possible to install the mock package?
Sure - for testing on Py27:
- Add it to `requirements/ci.in`
- Run `pip-compile requirements/ci.in` to generate the `ci.txt` (pip-compile comes from pip-tools)
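Once the package is installed, test code that has to run on both Python 2.7 and Python 3 commonly imports mock with a fallback, as in the sketch below. The `read_header` function is a hypothetical stand-in for code under test and is not taken from this PR.

```python
try:
    from unittest import mock  # standard library on Python 3.3+
except ImportError:
    import mock  # Python 2.7: the backport installed from PyPI


def read_header(stream):
    """Hypothetical function under test: read the first five bytes."""
    return stream.read(5)


# Replace a real file object with a mock so no file is needed.
fake_stream = mock.Mock()
fake_stream.read.return_value = b"%PDF-"
header = read_header(fake_stream)
```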
I did that, but it keeps failing; it seems that the CI doesn't pick up the new ci.txt file.
Damn, sorry, I forgot the hack we have for Python 2.7.
For Python 2.7 you need to add mock here: https://github.com/py-pdf/PyPDF2/blob/main/.github/workflows/github-ci.yaml#L45
Thanks Martin!
I'm closing this PR now as there hasn't been any activity for several months. The issue addressed here might already be solved.
If you disagree, I can re-open it.