Extracted pages from some PDFs will have the same file size as the original PDF.

Open M47H3W opened this issue 5 years ago • 2 comments

I have some very large PDFs that I am trying to extract certain pages from and create a new PDF file with those specific pages only. So for example, if I have a 300 page PDF, I want to ultimately have a PDF of the first 10 pages only.

For some reason, some of the PDFs generated from the few extracted pages are the same size as the original PDF. A generated PDF containing five pages from a 150 MB PDF takes more time to generate and ends up being the same size.

This is the code I am currently using:

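from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
writer = PdfWriter()

# add_page() expects a page object, not an index; reader.pages is zero-based
writer.add_page(reader.pages[2])
writer.add_page(reader.pages[3])
writer.add_page(reader.pages[4])
writer.add_page(reader.pages[5])

with open("seperate.pdf", "wb") as fh:
    writer.write(fh)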

Initially, I thought it had to do with the PDF version, but I started using PDFElement 6 Pro to view some PDF metadata, and it seems that it's dependent on what the original PDF was made with. A PDF produced with Adobe PDF Librar...(c) 1T3XT BVBA won't retain the same file size, while a PDF produced with Adobe PDF Library 10.0.1 will.

M47H3W avatar Aug 08 '18 08:08 M47H3W

I had the same problem with PDFs containing links and managed to solve it with the remove_links() method. If that is your case, this should work:

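from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("test.pdf")
writer = PdfWriter()

# add_page() expects a page object, not an index; reader.pages is zero-based
writer.add_page(reader.pages[2])
writer.add_page(reader.pages[3])
writer.add_page(reader.pages[4])
writer.add_page(reader.pages[5])

# Strip link annotations from the copied pages before writing
writer.remove_links()

with open("seperate.pdf", "wb") as fh:
    writer.write(fh)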

xilopaint avatar Aug 17 '18 02:08 xilopaint

People who come here might be interested in https://pypdf2.readthedocs.io/en/latest/user/file-size.html

MartinThoma avatar Jun 29 '22 20:06 MartinThoma

For me the biggest challenge with size is not having the option to use lossy compression.

I have a 6-page file that's 3.5 MB in total. I'm breaking that file up into 3 files of 2 pages each. Each individual output file is at least 2.6 MB, or 8.2 MB in total. That's no fault of PyPDF; I get the same result in Acrobat Pro. It turns out there is one large texture image, shared across pages, that should really be compressed.

The interesting thing is that even when using Acrobat Pro, I can open a file, delete all the content, save it, and it will still have the same file size unless I save it with a different name or options. It seems that invisibly retaining the deleted content is just the norm for PDFs?

shawnCaza avatar Jan 03 '23 20:01 shawnCaza
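
As a rough, back-of-the-envelope illustration of why the split files above add up to more than the original: assume a simple model in which each file's size is one shared chunk (such as the texture image) plus a constant amount per page. The model and the helper name `estimate_shared_and_per_page` are assumptions made for this sketch, not something pypdf computes, and 2.6 MB is only the lower bound quoted in the comment.

```python
def estimate_shared_and_per_page(full_size, full_pages, part_size, part_pages):
    """Solve size = pages * per_page + shared for the two unknowns."""
    per_page = (full_size - part_size) / (full_pages - part_pages)
    shared = part_size - part_pages * per_page
    return shared, per_page


# Figures from the comment above, in kilobytes:
# 3.5 MB for all 6 pages, about 2.6 MB for a 2-page output (assumed exact here)
shared_kb, per_page_kb = estimate_shared_and_per_page(3500, 6, 2600, 2)

# Each of the three 2-page outputs carries its own copy of the shared data
split_total_kb = 3 * (2 * per_page_kb + shared_kb)
print(shared_kb, per_page_kb, split_total_kb)
```

Under these assumptions the shared data would account for most of each output file, which is consistent with the observation that splitting the document barely shrinks the individual files.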

@shawnCaza Due to linked objects such as articles, ... some objects were copied even though they are not displayed and are of no use. If you use append(), you should be able to fix this. Can you try it and give feedback?

pubpub-zz avatar Feb 09 '23 05:02 pubpub-zz

@pubpub-zz I tried everything before I realized it was an image compression issue.

shawnCaza avatar Feb 09 '23 13:02 shawnCaza

So we can close it?

pubpub-zz avatar Feb 09 '23 15:02 pubpub-zz

The original issue is 5 years old. Who knows if they solved their problem. I'm only here to add nuance to the thread for others who might run into frustrations.

shawnCaza avatar Feb 09 '23 15:02 shawnCaza

I'm closing this now as it's unclear what the state is / how to continue with this. Or if there is anything pypdf can do.

MartinThoma avatar Feb 09 '23 17:02 MartinThoma