pypdf Splitting PDF files resulting in larger-than-expected output PDF files

I've noticed odd behavior when splitting some PDF files whose first pages contain a table of contents with outlines. The output file that contains the table of contents has the same file size as the input PDF, even though it has fewer pages.

For example, splitting a 9.6 MB PDF with 199 pages using the code snippet below, I get two output files: one with 100 pages and 9.6 MB, and another with 99 pages and 3 MB.

Running similar code with pikepdf, I don't have this issue: I get one file with 100 pages and 6.7 MB and another with 99 pages and 3 MB.

Environment

$ python -m platform
macOS-12.5.1-x86_64-i386-64bit

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.4

Code

from PyPDF2 import PdfReader, PdfMerger, PageRange

reader = PdfReader("sample.pdf")

num_pages = len(reader.pages)
page_ranges = [PageRange(slice(n, n + 100)) for n in range(0, num_pages, 100)]

for n, page_range in enumerate(page_ranges, 1):
    merger = PdfMerger()
    merger.append(reader, pages=page_range)
    merger.write(f"sample [part {n}].pdf")
    merger.close()
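
As a quick check of the chunking arithmetic, here is a minimal sketch that uses only a built-in range object as a stand-in for reader.pages (no PDF library involved). The helper name chunk_lengths is hypothetical, and the sketch assumes PageRange follows ordinary Python slice semantics, where a slice that runs past the end is clamped.

```python
def chunk_lengths(num_pages, chunk_size):
    """Return how many pages each slice(n, n + chunk_size) selects."""
    pages = range(num_pages)  # stand-in for reader.pages (assumption)
    return [
        len(pages[slice(n, n + chunk_size)])
        for n in range(0, num_pages, chunk_size)
    ]


print(chunk_lengths(199, 100))  # [100, 99]
print(chunk_lengths(34, 10))    # [10, 10, 10, 4]
```

The last slice is simply shorter than the others, so the page counts of the output files are as expected and only the file size is off.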

PDF

Unfortunately, I can't share the PDF file.

Sep 03 '22 17:09 xilopaint

It's interesting that this might be connected to outlines. Thank you for the hint!

I don't know when anybody will pick this one up, but in the meantime you might be interested in https://pypdf2.readthedocs.io/en/latest/user/file-size.html

Sep 04 '22 15:09 MartinThoma

HowtoMakeAccessiblePDF.pdf

Hey @MartinThoma, I just found the above PDF file, which you can use to reproduce the issue. The file is 6.8 MB and has 34 pages. Splitting it into files of at most 10 pages each with PyPDF2, I get:

File 1: 10 pages / 6.8 MB
File 2: 10 pages / 2 MB
File 3: 10 pages / 3 MB
File 4: 4 pages / 1.7 MB

Using pikepdf I get:

File 1: 10 pages / 588 KB
File 2: 10 pages / 2 MB
File 3: 10 pages / 3 MB
File 4: 4 pages / 1.7 MB

As you can see I don't get the issue with the first output file using pikepdf.

Sep 04 '22 18:09 xilopaint

Hey @MartinThoma, could you reproduce the issue with the PDF file I provided?

Sep 09 '22 14:09 xilopaint

Reproduced! I did some analysis, and the output contains some objects, such as pages, that are not expected. I'm still analyzing which part of the code or which parameters bring in the extra pages.

Sep 09 '22 18:09 pubpub-zz

Thanks for looking into it!

Sep 09 '22 19:09 xilopaint

I think I've got it. When writing, there is a step that resolves and adjusts the object references in the writer. During this step, PyPDF2 walks through the pages to be written and "collects" all the indirect objects they reference. In HowtoMakeAccessiblePDF.pdf, the fifth page contains some (internet) link annotations. In these annotations, the "/P" field references the reader's page and not the copied page. This causes the reader's page to be collected as well and, through its "/Parent", the other objects of the source document. Those objects are collected but are not listed in the "/Pages" tree and are therefore not displayed. To fix this, I've started working on a "cloning" capability (identified in #1194). Work is in progress.

Note: I did not check which other fields would induce the same effect as the annotation.
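
The effect can be pictured with a toy model. The following is a minimal sketch using plain dictionaries instead of real PDF objects; it is not PyPDF2 code. The object ids, the shared id space between source and writer, and the helper name collect are all assumptions made for illustration.

```python
# Toy object graph: object id -> dict of fields. Integer values (or lists of
# integers) stand for indirect references. Ids below 100 belong to the source
# document, ids from 100 up to the writer. This layout is illustrative only.

def collect(objects, roots):
    """Return the ids of all objects reachable from the given roots."""
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        for value in objects[oid].values():
            refs = value if isinstance(value, list) else [value]
            stack.extend(r for r in refs if isinstance(r, int))
    return seen


objects = {
    # Source document: page tree 1 with pages 2, 3, 4 and their content streams.
    1: {"/Type": "/Pages", "/Kids": [2, 3, 4]},
    2: {"/Type": "/Page", "/Parent": 1, "/Contents": 5, "/Annots": [8]},
    3: {"/Type": "/Page", "/Parent": 1, "/Contents": 6},
    4: {"/Type": "/Page", "/Parent": 1, "/Contents": 7},
    5: {"/Type": "/Stream"},
    6: {"/Type": "/Stream"},
    7: {"/Type": "/Stream"},
    # Link annotation whose /P still points at the SOURCE page 2.
    8: {"/Type": "/Annot", "/Subtype": "/Link", "/P": 2},
    # Writer: new page tree 100 holding only a copy (102) of page 2.
    100: {"/Type": "/Pages", "/Kids": [102]},
    102: {"/Type": "/Page", "/Parent": 100, "/Contents": 5, "/Annots": [8]},
}

stale = collect(objects, [100])
print(len(stale))  # 10: the whole source page tree is dragged in

# Re-point /P at the copied page, as a cloning step would.
objects[8] = {"/Type": "/Annot", "/Subtype": "/Link", "/P": 102}
fixed = collect(objects, [100])
print(len(fixed))  # 4: only the copied page, its content and its annotation
```

In the stale case, pages 3 and 4 and their content streams are written even though the writer's page tree lists a single page, which matches the symptom of a small page count with a large file.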

Sep 11 '22 08:09 pubpub-zz

Please, let me know when I can test the fix in other PDF files I have.

Sep 11 '22 14:09 xilopaint

Any progress here @pubpub-zz?

Sep 24 '22 00:09 xilopaint

Work in progress... The cloning is not so easy...🤔

Sep 24 '22 07:09 pubpub-zz

@xilopaint A PR (still a draft) is available. I did some tests on HowtoMakeAccessiblePDF.pdf, and in this file the problem is linked to "/Annots" entries that contain links to other pages. Passing ["/Annots"] to add_page prevents "/Annots" from being copied and avoids the increase in size. An easy way to get a rough idea of the size is to monitor len(w._objects) and check that it increases slowly. This is just a first draft, but a good basis for improvement, isn't it?
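
What excluding keys means for a copied page can be sketched with a plain dictionary. This is only an illustration of the idea, not the code of the draft PR; the helper name clone_without and the placeholder values are hypothetical.

```python
def clone_without(page, excluded_keys=()):
    """Shallow-copy a page dictionary, skipping the excluded keys."""
    return {key: value for key, value in page.items() if key not in excluded_keys}


# Placeholder values stand in for references to other objects (assumption).
page = {
    "/Type": "/Page",
    "/Contents": "content-stream",
    "/Annots": ["link-pointing-at-source-page"],
    "/B": ["article-bead"],
}

clone = clone_without(page, ("/Annots", "/B"))
print(sorted(clone))  # ['/Contents', '/Type']
```

The copy no longer references anything through "/Annots" or "/B", so nothing can be pulled in along those paths, but it also no longer carries its links.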

ps: @MartinThoma some advice in order to clear the mypy errors would be appreciated.

Sep 27 '22 18:09 pubpub-zz

A PR still draft is available.

I still have the issue using your fork with my code. Should I change anything to make it work?

Sep 27 '22 19:09 xilopaint

You should use PdfWriter (PdfMerger has not been modified yet):

import PyPDF2

r = PyPDF2.PdfReader("e:/HowtoMakeAccessiblePDF.pdf")
w = PyPDF2.PdfWriter()
for i in range(10):
    # Second argument (draft PR only): keys that are not copied with the page.
    _ = w.add_page(r.pages[i], ("/Annots", "/B"))  # _= is not required if this coded in
w.write("e:/extract1-10.pdf")

Sep 27 '22 19:09 pubpub-zz

w.add_page(r.pages[i],("/Annots","/B"))

This feels a bit weird to me. In pikepdf we don't need any extra parameter to make it work. It just works.

Sep 27 '22 20:09 xilopaint

As said, this is a first draft. We now have a solution where the objects can be modified in a proper manner. We now need to find the best way to encapsulate it.

Sep 27 '22 20:09 pubpub-zz

This issue is not yet closed😉

Sep 27 '22 20:09 pubpub-zz

I'm away only with my phone until 5th of October. I'll look into it after that (please remind me if I don't answer on 6th 😅)

Sep 27 '22 21:09 MartinThoma

Mistake in the issue referenced. This issue should stay open for the moment

Sep 28 '22 09:09 pubpub-zz

I'm away only with my phone until 5th of October. I'll look into it after that (please remind me if I don't answer on 6th 😅)

@MartinThoma

Done!

Oct 09 '22 07:10 xilopaint

@xilopaint, if you want to try it, I've completed the PR with all the functions from PdfMerger. You just need to replace PdfMerger with PdfWriter (no other change required):

import PyPDF2

reader = PyPDF2.PdfReader("e:/HowtoMakeAccessiblePDF.pdf")
num_pages = len(reader.pages)
page_ranges = [PyPDF2.PageRange(slice(n, n + 10)) for n in range(0, num_pages, 10)]

for n, page_range in enumerate(page_ranges, 1):
    merger = PyPDF2.PdfWriter()  # PdfWriter used in place of PdfMerger
    merger.append(reader, pages=page_range)
    merger.write(f"e:/Downloads/sample [part {n}].pdf")

Result of dir:

10/10/2022 19:29    580 716 sample [part 1].pdf
10/10/2022 19:29  1 963 843 sample [part 2].pdf
10/10/2022 19:29  2 996 937 sample [part 3].pdf
10/10/2022 19:29  1 680 228 sample [part 4].pdf
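
For anyone comparing output sizes on another platform, a minimal sketch with the standard library gives the same kind of listing as dir. The helper name list_sizes and the choice of the current directory are assumptions.

```python
from pathlib import Path


def list_sizes(folder, pattern="*.pdf"):
    """Return (file name, size in bytes) pairs for matching files, sorted by name."""
    return sorted((p.name, p.stat().st_size) for p in Path(folder).glob(pattern))


# Print the sizes of all PDF files in the current directory.
for name, size in list_sizes("."):
    print(f"{size:>12,d}  {name}")
```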

Oct 10 '22 17:10 pubpub-zz

@xilopaint, If you want to try,I've completed the PR with all the functions from PdfMerger. You just need to change PdfMerger by PdfWriter (no other change required)

It worked! Will this PR deprecate PdfMerger as PdfWriter is covering all its methods?

Oct 10 '22 20:10 xilopaint

This should ease maintainability. For compatibility purposes, PdfMerger should be kept as a synonym of PdfWriter, maybe with a deprecation warning. @MartinThoma, your opinion?
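
One common way to keep an old class name as a synonym while warning its users is a thin subclass. The following is a minimal sketch with hypothetical stand-in classes Writer and Merger; it is not how PyPDF2 implements or will implement this.

```python
import warnings


class Writer:
    """Stand-in for the class that keeps all the functionality."""

    def __init__(self):
        self.pages = []

    def append(self, page):
        self.pages.append(page)


class Merger(Writer):
    """Old name kept for compatibility; behaves exactly like Writer."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "Merger is deprecated; use Writer instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        super().__init__(*args, **kwargs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merger = Merger()
    merger.append("page-1")

print(caught[0].category.__name__)  # DeprecationWarning
```

Existing code that instantiates the old name keeps working, and isinstance checks against the new class still pass.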

Oct 10 '22 20:10 pubpub-zz

Will this PR deprecate PdfMerger as PdfWriter is covering all its methods?

I would actually be super happy about deprecating PdfMerger 😄 I always thought that the PdfMerger is confusing.

I would need to check carefully if PdfMerger can be replaced easily by PdfWriter.

Oct 10 '22 20:10 MartinThoma

Before issuing, some extra tests should be done. @xilopaint, if you can, please carry on with your tests. Some cleanup (mypy) will also be required.

Oct 10 '22 20:10 pubpub-zz

Before Issuing, some extra test should be done. @xilopaint, If you can carry on your test. And some cleanup (mypy) will be required.

@pubpub-zz

It looks like the PR introduced a bug. Please, run the test suite of my project. One of the tests is failing since I pushed your PR. You just need to run python3 -m unittest discover tests -b .

Oct 10 '22 21:10 xilopaint

@xilopaint, with your project I'm getting ModuleNotFoundError: No module named 'fcntl'. I'm working under Windows. Can you please propose a workaround or, failing that, at least report the stack trace at the failure?

Oct 11 '22 16:10 pubpub-zz

@pubpub-zz you can reproduce the issue with the following sample code and PDF file:

foo.pdf

#!/usr/bin/env python3
from PyPDF2 import PageObject, PdfReader, PdfWriter

reader = PdfReader("foo.pdf")
writer = PdfWriter()

for page in reader.pages:
    out_page = PageObject.create_blank_page(None, 8.3 * 72, 11.7 * 72)
    out_page.merge_page(page)

    writer.add_page(out_page)

with open("bar.pdf", "wb") as f:
    writer.write(f)

reader = PdfReader("bar.pdf")

for n, page in enumerate(reader.pages, 1):
    print(int(page.extract_text()) == n)

The code works with the latest release but not with your fork.

Oct 16 '22 14:10 xilopaint

@xilopaint thanks for the trouble report. the problem seems to be solved, can you confirm?

Oct 16 '22 21:10 pubpub-zz

@xilopaint thanks for the trouble report. the problem seems to be solved, can you confirm?

@pubpub-zz yes, it's fixed.

Oct 16 '22 21:10 xilopaint

Is the PR ready to be merged now?

Oct 16 '22 21:10 xilopaint

Not yet, I need to fix a few points about merging annotations and articles

Oct 16 '22 21:10 pubpub-zz
