
BUG: Fix argument invalid seek when startxref is zero

Open VictorCarlquist opened this issue 5 years ago • 9 comments

Hi!

When a PDF file ends with the lines below, the stream.seek(-11, 1) call at line 1854 raises an exception with value 22 (Invalid argument).

xref
0 85
0000000000 65535 f
....
0000000000 00000 n
trailer
<<
/Size 85
/Root 84 0 R
/Info 83 0 R

startxref
0
%%EOF

So, when startxref is zero, this commit expands the search and tries to find the xref table in the last 2KB of the file.
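
For readers following along, here is a minimal sketch of the fallback idea described above. It is not the code from this PR: the helper name `find_xref_offset` and its `window` parameter are hypothetical, and it only assumes a seekable binary stream. It reads the last `window` bytes and returns the offset of the last `xref` keyword that is not part of `startxref`.

```python
import io
import re


def find_xref_offset(stream, window=2048):
    """Hypothetical helper: locate the last 'xref' keyword near the end of a stream.

    Returns the absolute byte offset of the keyword, or None if it is not
    found within the last `window` bytes.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    start = max(0, size - window)  # never seek before the start of the file
    stream.seek(start)
    tail = stream.read(size - start)

    # 'startxref' also contains 'xref', so skip matches preceded by 'start'.
    matches = list(re.finditer(rb"(?<!start)xref", tail))
    if not matches:
        return None
    return start + matches[-1].start()
```

A keyword that straddles the start of the window is missed by this sketch, which is one reason a larger window can help with some files.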

VictorCarlquist avatar Mar 21 '19 13:03 VictorCarlquist

@VictorCarlquist, it sounds like this is a valuable addition to PyPDF2! Would you add a unit test to your commit (like an example file) that demonstrates the problem that you're fixing?

Thanks for this merge request!

kurtmckee avatar May 07 '19 02:05 kurtmckee

Hi @kurtmckee, sorry for the long delay in answering.

With 2KB, PyPDF2 kept raising the exception for some PDF files, so we raised the search window to 8KB, which solved the problem.

Does 8KB seem fine to you? Could we keep this value?

VictorCarlquist avatar Aug 19 '19 18:08 VictorCarlquist

Codecov Report

Merging #493 (435c9d6) into main (f3cb316) will increase coverage by 0.07%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #493      +/-   ##
==========================================
+ Coverage   82.47%   82.54%   +0.07%     
==========================================
  Files          16       16              
  Lines        3777     3793      +16     
  Branches      802      806       +4     
==========================================
+ Hits         3115     3131      +16     
  Misses        495      495              
  Partials      167      167              
Impacted Files       Coverage Δ
PyPDF2/_reader.py    81.59% <100.00%> (+0.38%)


codecov-commenter avatar Apr 17 '22 16:04 codecov-commenter

@VictorCarlquist Would you mind creating a minimal example that shows the issue? Something like this:

import io

from PyPDF2 import PdfFileReader


def test_get_images_raw():
    strict = True
    with_prev_0 = False
    pdf_data = (
        b"%%PDF-1.7\n"
        b"1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj\n"
        b"2 0 obj << >> endobj\n"
        b"3 0 obj << >> endobj\n"
        b"4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0]"
        b" /MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R"
        b" /Resources << /Font << >> >>"
        b" /Rotate 0 /Type /Page >> endobj\n"
        b"5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj\n"
        b"xref 1 5\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"trailer << %s/Root 5 0 R /Size 6 >>\n"
        b"startxref %d\n"
        b"%%%%EOF"
    )
    pdf_data = pdf_data % (
        pdf_data.find(b"1 0 obj"),
        pdf_data.find(b"2 0 obj"),
        pdf_data.find(b"3 0 obj"),
        pdf_data.find(b"4 0 obj"),
        pdf_data.find(b"5 0 obj"),
        b"/Prev 0 " if with_prev_0 else b"",
        pdf_data.find(b"xref"),
    )
    pdf_stream = io.BytesIO(pdf_data)
    PdfFileReader(pdf_stream, strict=strict)  # just shows that nothing crashes

It should be a test that currently fails, but with your PR it doesn't.
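
One thing to keep in mind when writing such a test: the relative seek from the original report may behave differently on an in-memory stream than on a real file. The sketch below is an illustration only, with no PyPDF2 involved; the helper name `seek_back_from_start` is hypothetical. It runs the same seek(-11, 1) from offset 0 on an `io.BytesIO` object and on a temporary file.

```python
import io
import os
import tempfile


def seek_back_from_start(stream):
    """Go to offset 0, then try to move 11 bytes back from the current position.

    Returns the new position, or the OSError if the seek is rejected.
    """
    stream.seek(0, io.SEEK_SET)  # where a startxref value of 0 points
    try:
        return stream.seek(-11, io.SEEK_CUR)
    except OSError as exc:
        return exc


# In-memory stream: CPython's BytesIO clamps the position at 0 instead of failing.
in_memory = seek_back_from_start(io.BytesIO(b"%PDF-1.7\n%%EOF"))

# Real file: the operating system rejects a seek to a negative position.
fd, path = tempfile.mkstemp(suffix=".pdf")
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"%PDF-1.7\n%%EOF")
    with open(path, "rb") as on_disk_stream:
        on_disk = seek_back_from_start(on_disk_stream)
finally:
    os.remove(path)
```

If that holds on your platform, a test built only on `io.BytesIO` would probably not reproduce the reported error, so a temporary file or a mocked stream may be needed.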

MartinThoma avatar Apr 17 '22 16:04 MartinThoma

Hi @MartinThoma, thank you for your feedback.

I added the tests and I changed the code a little.

Is it possible to install the mock package?

VictorCarlquist avatar Apr 30 '22 03:04 VictorCarlquist

> Is it possible to install the mock package?

Sure - for testing on Py27:

  1. Add it to requirements/ci.in
  2. Run pip-compile requirements/ci.in to generate the ci.txt (pip-compile comes from pip-tools)

MartinThoma avatar Apr 30 '22 16:04 MartinThoma

> Is it possible to install the mock package?
>
> Sure - for testing on Py27:
>
> 1. Add it to `requirements/ci.in`
> 2. Run `pip-compile requirements/ci.in` to generate the `ci.txt` (pip-compile comes from pip-tools)

I did that, but it keeps failing; it seems that the CI doesn't pick up the new ci.txt file.

VictorCarlquist avatar Apr 30 '22 16:04 VictorCarlquist

Damn, sorry, I forgot the hack we have for Python 2.7.

For Python 2.7 you need to add mock here: https://github.com/py-pdf/PyPDF2/blob/main/.github/workflows/github-ci.yaml#L45

MartinThoma avatar Apr 30 '22 18:04 MartinThoma

> Damn, sorry, I forgot the hack we have for Python 2.7.
>
> For Python 2.7 you need to add mock here: https://github.com/py-pdf/PyPDF2/blob/main/.github/workflows/github-ci.yaml#L45

Thanks Martin!

VictorCarlquist avatar Apr 30 '22 18:04 VictorCarlquist

I'm closing this PR now as there hasn't been any activity for several months. The issue addressed here might already be solved.

If you disagree, I can re-open it.

MartinThoma avatar Dec 10 '22 18:12 MartinThoma