pypdf
Splitting PDF files resulting in larger-than-expected output PDF files
I've noticed a weird behavior when splitting some PDF files that contain a table of contents with outlines on their first pages. The output file containing the table of contents has the same file size as the input PDF, although it has fewer pages since it has been split.
For example, splitting a 9.6 MB PDF with 199 pages using the following code snippet, I get two output files: one with 100 pages and 9.6 MB, and another with 99 pages and 3 MB.
Running similar code with pikepdf I have no issues: I get one file with 100 pages and 6.7 MB and another with 99 pages and 3 MB.
Environment
$ python -m platform
macOS-12.5.1-x86_64-i386-64bit
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.4
Code
from PyPDF2 import PdfReader, PdfMerger, PageRange
reader = PdfReader("sample.pdf")
num_pages = len(reader.pages)
page_ranges = [PageRange(slice(n, n + 100)) for n in range(0, num_pages, 100)]
for n, page_range in enumerate(page_ranges, 1):
    merger = PdfMerger()
    merger.append(reader, pages=page_range)
    merger.write(f"{'sample'} [{'part'} {n}].pdf")
Unfortunately, I can't share the PDF file.
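The chunking arithmetic in the snippet above can be checked without a PDF at hand. The following is only a sketch: `chunk_bounds` is a hypothetical helper, not part of PyPDF2, and it assumes that the page ranges behave like ordinary Python slices, which clamp at the end of the sequence.

```python
def chunk_bounds(num_pages, chunk_size):
    """Return (start, stop) page-index pairs, stop exclusive, one per chunk."""
    return [
        # min() mirrors how a Python slice clamps at the end of a sequence.
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


bounds = chunk_bounds(199, 100)
pages_per_file = [stop - start for start, stop in bounds]
print(bounds)          # [(0, 100), (100, 199)]
print(pages_per_file)  # [100, 99]
```

So the split itself yields the expected 100 and 99 pages; the surprise is only in the size of the first file.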
It's interesting that this might be connected to outlines. Thank you for the hint!
I don't know when anybody will pick this one up, but in the meantime you might be interested in https://pypdf2.readthedocs.io/en/latest/user/file-size.html
Hey @MartinThoma I just found the above PDF file that you can use to reproduce the issue. The file is 6.8 MB and has 34 pages. Splitting it into files of at most 10 pages each using PyPDF2, I get:
File 1: 10 pages / 6.8 MB
File 2: 10 pages / 2 MB
File 3: 10 pages / 3 MB
File 4: 4 pages / 1.7 MB
Using pikepdf I get:
File 1: 10 pages / 588 KB
File 2: 10 pages / 2 MB
File 3: 10 pages / 3 MB
File 4: 4 pages / 1.7 MB
As you can see I don't get the issue with the first output file using pikepdf.
Hey @MartinThoma, could you reproduce the issue with the PDF file I provided?
Reproduced! I did some analysis, and we can detect some objects, such as pages, which are not expected in the output. I'm still analysing which part of the code or which parameters induce the extra pages.
Thanks for looking into it!
I think I've got it. When writing, there is a process that resolves and adjusts the object references in the writer object. During this step, PyPDF2 parses through the pages to write and "collects" all the referenced indirect objects. In HowtoMakeAccessiblePDF.pdf, the fifth page contains some (internet) link annotations. In these annotations, the "/P" field references the reader's page and not the modified page. This causes the reader's page to be collected too and, through its "/Parent", the other objects. Those objects are only collected: they are not listed in the "/Pages" tree and are therefore not displayed. To fix this I've started to work on a "cloning" capability (identified in #1194). Work is in progress.
Note: I did not check which other fields would induce the same effect as the annotation.
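The collection effect described above can be sketched with a toy object graph. This is not PyPDF2 code: the object ids, the `collect` function and the `build_graph` helper are all made up for illustration, and the sketch only assumes that the writer keeps every object reachable from the output page tree.

```python
# Toy model of a PDF object graph: every object is a dict and a reference is
# simply another object's id. None of these names are PyPDF2 API.
def collect(objects, roots):
    """Return the ids of every object reachable from the given roots."""
    seen = set()
    stack = list(roots)
    while stack:
        obj_id = stack.pop()
        if obj_id in seen:
            continue
        seen.add(obj_id)
        # Follow every reference held by this object.
        stack.extend(objects[obj_id]["refs"].values())
    return seen


def build_graph(annotation_target):
    """One copied page plus the three-page tree of the source document."""
    return {
        "src_tree": {"refs": {"/Kid0": "src_p0", "/Kid1": "src_p1", "/Kid2": "src_p2"}},
        "src_p0": {"refs": {"/Parent": "src_tree", "/Contents": "src_c0"}},
        "src_p1": {"refs": {"/Parent": "src_tree", "/Contents": "src_c1"}},
        "src_p2": {"refs": {"/Parent": "src_tree", "/Contents": "src_c2"}},
        "src_c0": {"refs": {}},
        "src_c1": {"refs": {}},
        "src_c2": {"refs": {}},
        # The copied page reuses the source content stream, carries one link
        # annotation and hangs off the output page tree.
        "out_tree": {"refs": {"/Kid0": "out_p0"}},
        "out_p0": {"refs": {"/Parent": "out_tree", "/Contents": "src_c0", "/Annot0": "link"}},
        "link": {"refs": {"/P": annotation_target}},
    }


# "/P" still points at the source page: the whole source tree is dragged in.
leaky = collect(build_graph("src_p0"), ["out_tree"])
# "/P" points at the copied page: only the objects that are really needed remain.
fixed = collect(build_graph("out_p0"), ["out_tree"])
print(len(leaky), len(fixed))  # 10 4
```

In the leaky case the extra pages are reachable only through the annotation, so they never appear under the output tree's kids, which matches the observation that they add size without being displayed.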
Please, let me know when I can test the fix in other PDF files I have.
Any progress here @pubpub-zz?
Work in progress... The cloning is not so easy...🤔
@xilopaint A PR, still in draft, is available. I did some tests on HowtoMakeAccessiblePDF.pdf; in this file the problem is linked to the /Annots entries that contain links to other pages. Passing ["/Annots"] to add_page prevents "/Annots" from being copied and avoids the size increase. An easy way to get a rough idea of the size is to monitor len(w._objects) and check that it increases slowly. This is just a first draft, but a good basis for improvement, isn't it?
ps: @MartinThoma some advice in order to clear the mypy errors would be appreciated.
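The idea behind excluding keys while copying a page can be illustrated with plain dictionaries. This is a minimal sketch, not the draft PR's implementation: `clone_page` and the sample page are hypothetical, and only the key names "/Annots" and "/B" come from the thread.

```python
def clone_page(page, excluded_keys=()):
    """Copy a page dictionary, skipping the excluded top-level keys."""
    return {key: value for key, value in page.items() if key not in excluded_keys}


source_page = {
    "/Type": "/Page",
    "/Contents": "content stream",
    # The link annotation is what holds the reference back to the reader's page.
    "/Annots": [{"/Subtype": "/Link", "/P": "reference to the reader's page"}],
    "/B": ["article bead"],
}

full_copy = clone_page(source_page)
trimmed_copy = clone_page(source_page, excluded_keys=("/Annots", "/B"))
print(sorted(trimmed_copy))  # ['/Contents', '/Type']
```

The trade-off is visible in the sketch: dropping "/Annots" removes the back-reference, but the links on the page are lost along with it.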
A PR still draft is available.
I still have the issue using your fork with my code. Should I change anything to make it work?
You should use PdfWriter (For the moment PdfMerger has not been modified yet):
import PyPDF2
r = PyPDF2.PdfReader("e:/HowtoMakeAccessiblePDF.pdf")
w = PyPDF2.PdfWriter()
for i in range(10):
    _ = w.add_page(r.pages[i], ("/Annots", "/B"))  # _= is not required if this coded in
w.write("e:/extract1-10.pdf")
w.add_page(r.pages[i],("/Annots","/B"))
This feels a bit weird to me. In pikepdf we don't need to pass any extra parameter to make it work. It just works.
As said, this is a first draft. We now have a solution where the objects can be modified in a proper manner. We now need to find the best way to encapsulate it.
This issue is not yet closed😉
I'm away only with my phone until 5th of October. I'll look into it after that (please remind me if I don't answer on 6th 😅)
Mistake in the issue referenced. This issue should stay open for the moment
@MartinThoma
Done!
@xilopaint, if you want to try, I've completed the PR with all the functions from PdfMerger. You just need to replace PdfMerger with PdfWriter (no other change required):
import PyPDF2
reader = PyPDF2.PdfFileReader("e:/HowtoMakeAccessiblePDF.pdf")
num_pages = len(reader.pages)
page_ranges = [PyPDF2.PageRange(slice(n, n + 10)) for n in range(0, num_pages, 10)]
for n, page_range in enumerate(page_ranges, 1):
    merger = PyPDF2.PdfWriter()
    merger.append(reader, pages=page_range)
    merger.write(f"e:/Downloads/{'sample'} [{'part'} {n}].pdf")
result of dir:
10/10/2022 19:29 580 716 sample [part 1].pdf
10/10/2022 19:29 1 963 843 sample [part 2].pdf
10/10/2022 19:29 2 996 937 sample [part 3].pdf
10/10/2022 19:29 1 680 228 sample [part 4].pdf
@xilopaint, If you want to try,I've completed the PR with all the functions from PdfMerger. You just need to change PdfMerger by PdfWriter (no other change required)
It worked! Will this PR deprecate PdfMerger as PdfWriter is covering all its methods?
This should ease maintainability. For compatibility purposes, PdfMerger should be kept as a synonym of PdfWriter, maybe with a deprecation warning. @MartinThoma, your opinion?
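A common way to keep an old class name as a synonym while warning its users is a thin subclass that emits a DeprecationWarning. The sketch below uses the hypothetical stand-ins `Writer` and `Merger`; it is not the code from the PR and says nothing about how PyPDF2 finally handles PdfMerger.

```python
import warnings


class Writer:
    """Hypothetical stand-in for the class that now holds all functionality."""

    def __init__(self):
        self.pages = []

    def append(self, pages):
        self.pages.extend(pages)


class Merger(Writer):
    """Hypothetical legacy name, kept only for backwards compatibility."""

    def __init__(self):
        warnings.warn(
            "Merger is deprecated; use Writer instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this class
        )
        super().__init__()


# Record warnings so the effect is visible even where they are filtered out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = Merger()
    legacy.append(["page 1", "page 2"])

print(caught[0].category.__name__, legacy.pages)
```

Existing code that instantiates the old name keeps working and inherits every method, while the warning nudges users towards the new name.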
Will this PR deprecate PdfMerger as PdfWriter is covering all its methods?
I would actually be super happy about deprecating PdfMerger :smile: I always thought that the PdfMerger is confusing.
I would need to check carefully if PdfMerger can be replaced easily by PdfWriter.
Before issuing, some extra tests should be done. @xilopaint, it would help if you could carry on with your tests. Some cleanup (mypy) will also be required.
@pubpub-zz
It looks like the PR introduced a bug. Please run the test suite of my project. One of the tests has been failing since I pushed your PR. You just need to run python3 -m unittest discover tests -b.
@xilopaint, with your project I'm getting ModuleNotFoundError: No module named 'fcntl'. I'm working under Windows; can you please propose a workaround or, failing that, at least report the stack trace at the failure?
@pubpub-zz you can reproduce the issue with the following sample code and PDF file:
#!/usr/bin/env python3
from PyPDF2 import PageObject, PdfReader, PdfWriter
reader = PdfReader("foo.pdf")
writer = PdfWriter()
for page in reader.pages:
    out_page = PageObject.create_blank_page(None, 8.3 * 72, 11.7 * 72)
    out_page.merge_page(page)
    writer.add_page(out_page)
with open("bar.pdf", "wb") as f:
    writer.write(f)
reader = PdfReader("bar.pdf")
for n, page in enumerate(reader.pages, 1):
    print(int(page.extract_text()) == n)
The code works with the latest release but not with your fork.
@xilopaint thanks for the trouble report. The problem seems to be solved, can you confirm?
@pubpub-zz yes, it's fixed.
Is the PR ready to be merged now?
Not yet, I need to fix a few points about merging annotations and articles